DemoInfer
Documentation for DemoInfer.
Module to run demographic inference on diploid genomes, under the assumption of panmixia (i.e. the inferred effective population size is half the inverse of the observed mean coalescence rate). See this repo for a demo of how to use it.
Data form of input and output
The genome needs to be SNP-called and the genomic distance between consecutive heterozygous positions needs to be computed. Heterozygous positions are the ones with genotype 0/1 or 1/0 (Note that the phase is not important). The input is then a vector containind such distances. Additionally, mutation and recombination rates need to be chosen and passed as input as well.
For example, suppose you have a .vcf file with called variants you want to analyze. Then you may compute distances between heterozygous SNPs as follows:
using CSV
using DataFrames
using DataFramesMeta
f = "/myproject/myfavouritespecies.vcf"
df = CSV.read(f, DataFrame,
delim='\t',
comment="##",
missingstring=[".", "NaN"],
normalizenames=true,
ntasks = 1,
drop = [:INFO, :ID, :FILTER],
)
# remove homozygous variants
@chain df begin
@rsubset! (:SampleName[1] == '1' && :SampleName[3] == '0') || (:SampleName[1] == '0' && :SampleName[3] == '1')
end
ils = df.POS[2:end] .- df.POS[1:end-1]
@assert all(ils .> 0)The demographic model underlying the inference is composed of a variable number of epochs and the population size is constant along each epoch.
The output is a vector of parameters in the form [L, N0, T1, N1, T2, N2, ...] where L is the genome length, N0 is the ancestral population size in the furthermost epoch and extending to the infinite past, the subsequent pairs $(T_i, N_i)$ are the duration and size of following epochs going from past to present. This format is referred to as TN vector throughout.
DemoInfer.FitOptionsDemoInfer.FitResultDemoInfer.adapt_histogramDemoInfer.compare_mldsDemoInfer.compare_modelsDemoInfer.compute_residualsDemoInfer.compute_residualsDemoInfer.demoinferDemoInfer.demoinferDemoInfer.demoinferDemoInfer.durationsDemoInfer.evdDemoInfer.get_chainDemoInfer.get_paraDemoInfer.get_sim!DemoInfer.pop_sizesDemoInfer.pre_fitDemoInfer.sds
DemoInfer.FitOptions — MethodFitOptions(Ltot::Number; kwargs...)Construct an an object of type FitOptions, requiring total genome length Ltot in base pairs.
Optional Arguments
Tlow::Number=10,Tupp::Number=1e7: The lower and upper bounds for the duration of epochs.Nlow::Number=10,Nupp::Number=1e8: The lower and upper bounds for the population sizes.level::Float64=0.95: The confidence level for the confidence intervals on the parameters estimates.solver: The solver to use for the optimization, default isLBFGS().opt: The optimization options, a named tuple which is passed to
Optim.jl. Default is has keywords: - iterations = 6000 - allow_f_increases = true - time_limit = 60 - g_tol = 5e-8 - show_warnings = false.
smallest_segment::Int=1: The smallest segment size present in the histogram to consider
for the signal search.
force::Bool=true: if true try to fit further epochs even when no signal is found.maxnts::Int=10: The maximum number of new time splits to consider when adding a new epoch.
DemoInfer.FitResult — Typestruct FitResultA data structure to store the results of a fit.
Some methods are defined for this type to get the vector of parameters, std errors, model evidence, etc. See get_para, sds, evd, pop_sizes, durations.
DemoInfer.adapt_histogram — Methodadapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=200)Build an histogram from segments logbinned between lo and hi with nbins bins (see HistogramBinnings.jl).
The upper limit is adapted to the maximum observed length, so default value is on purpose high.
DemoInfer.compare_mlds — Methodcompare_mlds(segs1::Vector{Int}, segs2::Vector{Int}; lo = 1, hi = 1_000_000, nbins = 200)Build two histograms from segs1 and segs2, rescale them for number and average heterozygosity and return four vectors containing respectively common midpoints for bins, the two rescaled weights and variances of the difference between weights.
DemoInfer.compare_models — Methodcompare_models(models::Vector{FitResult})Compare the models parameterized by FitResults and return the best one.
Theoretical explanation
TBD
DemoInfer.compute_residuals — Methodcompute_residuals(h::Histogram, mu::Float64, TN::Vector)Compute the residuals between the observed and expected weights.
DemoInfer.compute_residuals — Methodcompute_residuals(h1::Histogram, h2::Histogram)When two histograms are given the std error of the difference is used for normalization.
DemoInfer.demoinfer — Methoddemoinfer(segments::AbstractVector{<:Integer}, epochrange::UnitRange{Int}, mu::Float64, rho::Float64; kwargs...)Make an histogram with IBS segments and infer demographic histories with piece-wise constant epochs where the number of epochs is in epochrange.
Return a vector of FitResult of length smaller or equal to epochrange, see FitResult, mu and rho are respectively the mutation and recombination rates per base pair per generation.
Optional Arguments
fop::FitOptions = FitOptions(sum(segments)): the fit options, seeFitOptions.lo::Int=1: The lowest segment length to be considered in the histogramhi::Int=50_000_000: The highest segment length to be considered in the histogramnbins::Int=200: The number of bins to use in the histogramiters::Int=8: The number of iterations to perform. Currently automatic check for convergence
is not implemented.
annealing=nothing: correction is computed by simulating a genome of lengthfactortimes
the length of the input genome. At each iteration the factor is changed according to the annealing function. It can be Flat(), Lin() or Sq(). It can be a user defined function with signature (L, it) -> factor with L the genome length and it the iteration index. By default it is computed adaptively based on the input data, such that the total expected volume of the histogram is 2e8.
s::Int=1234: The random seed for the random number generator, used to compute the correction.restart::Int=100: The number of iterations after which the fit is restarted with
a different seed. Set to a default high number, it should not be needed.
top::Int=1: the number of fits at chain tail which is averaged for the final estimate.
DemoInfer.demoinfer — Methoddemoinfer(h_obs::Histogram, nepochs::Int, mu::Float64, rho::Float64, Ltot::Number, init::Vector{Float64}; kwargs...)Fit iteratively histogram h_obs with a single demographic model of piece-wise constant nepochs starting from an initial parameter vector init.
Return a FitResult, see FitResult, above methods fall back to this, which is called on multiple threads if available.
DemoInfer.demoinfer — Methoddemoinfer(h, epochrange, mu, rho, Ltot; kwargs...)Does the same as above, but takes a histogram as input and the total length of the IBS segments.
It is much lighter to distribute the histogram than the vector of segments which may also be streamed directly from disk into the histogram.
DemoInfer.durations — Methoddurations(fit::FitResult)Return the fitted durations of the epochs.
DemoInfer.evd — Methodevd(fit::FitResult)Return the evidence of the fit.
DemoInfer.get_chain — Methodget_chain(fit::FitResult)Return two matrices containing the chain of fitted parameters and std errors respectively (both as columns).
DemoInfer.get_para — Methodpars(fit::FitResult)Return the parameters of the fit.
DemoInfer.get_sim! — Methodget_sim!(params::Vector, h::Histogram, mu::Float64, rho::Float64; factor = 1)Simulate a population according to the demographic parameters in params and stores the IBS segments in the h.
Arguments
- 'params': vector of the form [L, N0, T1, N1, T2, N2] where L is the genome length, N0 is the ancestral population size and successive pairs of Ti and Ni are the durations and sizes of subsequent epochs.
factor: determine how many genomes are simulated and averaged
DemoInfer.pop_sizes — Methodpop_sizes(fit::FitResult)Return the fitted population sizes, from past to present.
DemoInfer.pre_fit — Methodpre_fit(h::Histogram, nfits::Int, mu::Float64, Ltot::Number; require_convergence=true)
pre_fit(h::Histogram, nfits::Int, mu::Float64, fop::FitOptions; require_convergence=true)Preliminarily fit h with an approximate model of piece-wise constant epochs for each number of epochs from 1 to nfits.
If given the total length of the genome Ltot it initialize the fit options to default. See FitOptions for how to specify them. The mutation rate mu is assumed to be per base pairs per generation and the total length of the genome Ltot is in base pairs. Return a vector of FitResult, one for each number of epochs, see also FitResult.
DemoInfer.sds — Methodsds(fit::FitResult)Return the standard deviations of the parameters of the fit.