DemoInfer

Documentation for DemoInfer.

Module to run demographic inference on diploid genomes, under the assumption of panmixia (i.e. the inferred effective population size is half the inverse of the observed mean coalescence rate). See this repo for a demo of how to use it.

Data form of input and output

The genome needs to be SNP-called and the genomic distance between consecutive heterozygous positions needs to be computed. Heterozygous positions are the ones with genotype 0/1 or 1/0 (Note that the phase is not important). The input is then a vector containind such distances. Additionally, mutation and recombination rates need to be chosen and passed as input as well.

For example, suppose you have a .vcf file with called variants you want to analyze. Then you may compute distances between heterozygous SNPs as follows:

using CSV
using DataFrames
using DataFramesMeta

f = "/myproject/myfavouritespecies.vcf"
df = CSV.read(f, DataFrame, 
    delim='\t', 
    comment="##",
    missingstring=[".", "NaN"],
    normalizenames=true,
    ntasks = 1,
    drop = [:INFO, :ID, :FILTER],
)

# remove homozygous variants
@chain df begin
    @rsubset! (:SampleName[1] == '1' && :SampleName[3] == '0') || (:SampleName[1] == '0' && :SampleName[3] == '1')
end

ils = df.POS[2:end] .- df.POS[1:end-1]
@assert all(ils .> 0)

The demographic model underlying the inference is composed of a variable number of epochs and the population size is constant along each epoch.

The output is a vector of parameters in the form [L, N0, T1, N1, T2, N2, ...] where L is the genome length, N0 is the ancestral population size in the furthermost epoch and extending to the infinite past, the subsequent pairs $(T_i, N_i)$ are the duration and size of following epochs going from past to present. This format is referred to as TN vector throughout.

DemoInfer.FitOptionsMethod
FitOptions(Ltot::Number; kwargs...)

Construct an an object of type FitOptions, requiring total genome length Ltot in base pairs.

Optional Arguments

  • Tlow::Number=10, Tupp::Number=1e7: The lower and upper bounds for the duration of epochs.
  • Nlow::Number=10, Nupp::Number=1e8: The lower and upper bounds for the population sizes.
  • level::Float64=0.95: The confidence level for the confidence intervals on the parameters estimates.
  • solver: The solver to use for the optimization, default is LBFGS().
  • opt: The optimization options, a named tuple which is passed to

Optim.jl. Default is has keywords: - iterations = 6000 - allow_f_increases = true - time_limit = 60 - g_tol = 5e-8 - show_warnings = false.

  • smallest_segment::Int=1: The smallest segment size present in the histogram to consider

for the signal search.

  • force::Bool=true: if true try to fit further epochs even when no signal is found.
  • maxnts::Int=10: The maximum number of new time splits to consider when adding a new epoch.
source
DemoInfer.adapt_histogramMethod
adapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=200)

Build an histogram from segments logbinned between lo and hi with nbins bins (see HistogramBinnings.jl).

The upper limit is adapted to the maximum observed length, so default value is on purpose high.

source
DemoInfer.compare_mldsMethod
compare_mlds(segs1::Vector{Int}, segs2::Vector{Int}; lo = 1, hi = 1_000_000, nbins = 200)

Build two histograms from segs1 and segs2, rescale them for number and average heterozygosity and return four vectors containing respectively common midpoints for bins, the two rescaled weights and variances of the difference between weights.

source
DemoInfer.compare_modelsMethod
compare_models(models::Vector{FitResult})

Compare the models parameterized by FitResults and return the best one.

Theoretical explanation

TBD

source
DemoInfer.compute_residualsMethod
compute_residuals(h1::Histogram, h2::Histogram)

When two histograms are given the std error of the difference is used for normalization.

source
DemoInfer.demoinferMethod
demoinfer(segments::AbstractVector{<:Integer}, epochrange::UnitRange{Int}, mu::Float64, rho::Float64; kwargs...)

Make an histogram with IBS segments and infer demographic histories with piece-wise constant epochs where the number of epochs is in epochrange.

Return a vector of FitResult of length smaller or equal to epochrange, see FitResult, mu and rho are respectively the mutation and recombination rates per base pair per generation.

Optional Arguments

  • fop::FitOptions = FitOptions(sum(segments)): the fit options, see FitOptions.
  • lo::Int=1: The lowest segment length to be considered in the histogram
  • hi::Int=50_000_000: The highest segment length to be considered in the histogram
  • nbins::Int=200: The number of bins to use in the histogram
  • iters::Int=8: The number of iterations to perform. Currently automatic check for convergence

is not implemented.

  • annealing=nothing: correction is computed by simulating a genome of length factor times

the length of the input genome. At each iteration the factor is changed according to the annealing function. It can be Flat(), Lin() or Sq(). It can be a user defined function with signature (L, it) -> factor with L the genome length and it the iteration index. By default it is computed adaptively based on the input data, such that the total expected volume of the histogram is 2e8.

  • s::Int=1234: The random seed for the random number generator, used to compute the correction.
  • restart::Int=100: The number of iterations after which the fit is restarted with

a different seed. Set to a default high number, it should not be needed.

  • top::Int=1: the number of fits at chain tail which is averaged for the final estimate.
source
DemoInfer.demoinferMethod
demoinfer(h_obs::Histogram, nepochs::Int, mu::Float64, rho::Float64, Ltot::Number, init::Vector{Float64}; kwargs...)

Fit iteratively histogram h_obs with a single demographic model of piece-wise constant nepochs starting from an initial parameter vector init.

Return a FitResult, see FitResult, above methods fall back to this, which is called on multiple threads if available.

source
DemoInfer.demoinferMethod
demoinfer(h, epochrange, mu, rho, Ltot; kwargs...)

Does the same as above, but takes a histogram as input and the total length of the IBS segments.

It is much lighter to distribute the histogram than the vector of segments which may also be streamed directly from disk into the histogram.

source
DemoInfer.get_chainMethod
get_chain(fit::FitResult)

Return two matrices containing the chain of fitted parameters and std errors respectively (both as columns).

source
DemoInfer.get_sim!Method
get_sim!(params::Vector, h::Histogram, mu::Float64, rho::Float64; factor = 1)

Simulate a population according to the demographic parameters in params and stores the IBS segments in the h.

Arguments

  • 'params': vector of the form [L, N0, T1, N1, T2, N2] where L is the genome length, N0 is the ancestral population size and successive pairs of Ti and Ni are the durations and sizes of subsequent epochs.
  • factor: determine how many genomes are simulated and averaged
source
DemoInfer.pre_fitMethod
pre_fit(h::Histogram, nfits::Int, mu::Float64, Ltot::Number; require_convergence=true)
pre_fit(h::Histogram, nfits::Int, mu::Float64, fop::FitOptions; require_convergence=true)

Preliminarily fit h with an approximate model of piece-wise constant epochs for each number of epochs from 1 to nfits.

If given the total length of the genome Ltot it initialize the fit options to default. See FitOptions for how to specify them. The mutation rate mu is assumed to be per base pairs per generation and the total length of the genome Ltot is in base pairs. Return a vector of FitResult, one for each number of epochs, see also FitResult.

source
DemoInfer.sdsMethod
sds(fit::FitResult)

Return the standard deviations of the parameters of the fit.

source