DemoInfer

Documentation for DemoInfer.

Module to run demographic inference on diploid genomes, under the assumption of panmixia (i.e. the inferred effective population size is half the inverse of the observed mean coalescence rate). See this repo for a demo of how to use it.
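As a small worked illustration of the panmixia assumption, the sketch below converts a mean coalescence rate into an effective population size, $N_e = 1 / (2\lambda)$. The rate value is made up for illustration, and expressing it per generation is an assumption; `coal_rate` and `Ne` are hypothetical names, not part of the package.

```julia
# Hypothetical mean pairwise coalescence rate per generation (illustrative value only).
coal_rate = 5.0e-5

# Under panmixia, the effective size is half the inverse of the coalescence rate.
Ne = 1 / (2 * coal_rate)   # 10_000 for the value above
```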

Data form of input and output

The genome needs to be SNP-called, and the genomic distance between consecutive heterozygous positions needs to be computed. Heterozygous positions are those with genotype 0/1 or 1/0 (the phase is not important). The input is then a vector containing these distances. Mutation and recombination rates must also be chosen and passed as input.

For example, suppose you have a .vcf file with called variants you want to analyze. Then you may compute distances between heterozygous SNPs as follows:

using CSV
using DataFrames
using DataFramesMeta

f = "/myproject/myfavouritespecies.vcf"
df = CSV.read(f, DataFrame, 
    delim='\t', 
    comment="##",
    missingstring=[".", "NaN"],
    normalizenames=true,
    ntasks = 1,
    drop = [:INFO, :ID, :FILTER],
)

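# Keep only heterozygous variants (genotype 0/1 or 1/0).
# Replace `SampleName` with the name of the genotype column of your sample.
@chain df begin
    @rsubset! (:SampleName[1] == '1' && :SampleName[3] == '0') || (:SampleName[1] == '0' && :SampleName[3] == '1')
end

# Distances between consecutive heterozygous positions.
# The assertion assumes sorted positions from a single chromosome.
ils = df.POS[2:end] .- df.POS[1:end-1]
@assert all(ils .> 0)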
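The snippet above takes differences over the whole `POS` column, so its assertion would fail on a file containing several chromosomes. A minimal sketch of a per-chromosome variant, in plain Julia, might look like the following. The helper name `het_distances` and the toy data are hypothetical, and pooling the distances of all chromosomes into one vector is an assumption rather than something this documentation prescribes.

```julia
# Hypothetical helper: distances between consecutive heterozygous sites,
# computed separately per chromosome so that no distance spans two contigs.
function het_distances(chrom::AbstractVector, pos::AbstractVector{<:Integer})
    @assert length(chrom) == length(pos)
    ils = Int[]
    for c in unique(chrom)
        p = sort(pos[chrom .== c])   # positions on this chromosome, ascending
        append!(ils, diff(p))        # consecutive differences
    end
    return ils
end

# Toy data: three heterozygous sites on chr1, two on chr2.
chrom = ["chr1", "chr1", "chr1", "chr2", "chr2"]
pos   = [100, 250, 900, 40, 75]
ils = het_distances(chrom, pos)
```

With a DataFrame read as above, the two arguments would be the chromosome and position columns of the filtered table.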

The demographic model underlying the inference is composed of a variable number of epochs and the population size is constant along each epoch.

The output is a vector of parameters in the form [L, N0, T1, N1, T2, N2, ...], where L is the genome length, N0 is the ancestral population size in the oldest epoch, which extends into the infinite past, and the subsequent pairs $(T_i, N_i)$ are the duration and size of each following epoch, ordered from past to present. This format is referred to as the TN vector throughout.
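To make the layout concrete, the sketch below unpacks a made-up TN vector with two epochs after the ancestral one, using plain indexing. All values and variable names are illustrative assumptions; the package's own accessors are not used here.

```julia
# Hypothetical TN vector: [L, N0, T1, N1, T2, N2]
TN = [1.0e9, 10_000.0, 2_000.0, 5_000.0, 500.0, 20_000.0]

L     = TN[1]          # genome length
sizes = TN[2:2:end]    # N0, N1, N2 (ordered from past to present)
durs  = TN[3:2:end]    # T1, T2 (durations of the non-ancestral epochs)

# Epoch boundaries measured backward in time from the present.
boundaries = cumsum(reverse(durs))
```

Here the most recent epoch has size 20,000 and lasts 500 time units, and the ancestral size of 10,000 applies to everything older than 2,500 time units.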

DemoInfer.FitOptions
DemoInfer.FitResult
DemoInfer.adapt_histogram
DemoInfer.compare_mlds
DemoInfer.compare_models
DemoInfer.compute_residuals
DemoInfer.compute_residuals
DemoInfer.demoinfer
DemoInfer.demoinfer
DemoInfer.demoinfer
DemoInfer.durations
DemoInfer.evd
DemoInfer.get_chain
DemoInfer.get_para
DemoInfer.get_sim!
DemoInfer.pop_sizes
DemoInfer.pre_fit
DemoInfer.sds

DemoInfer.FitOptions — Method

FitOptions(Ltot::Number; kwargs...)

Construct an object of type FitOptions, requiring the total genome length Ltot in base pairs.

Optional Arguments

Tlow::Number=10, Tupp::Number=1e7: The lower and upper bounds for the duration of epochs.
Nlow::Number=10, Nupp::Number=1e8: The lower and upper bounds for the population sizes.
level::Float64=0.95: The confidence level for the confidence intervals on the parameter estimates.
solver: The solver to use for the optimization, default is LBFGS().
opt: The optimization options, a named tuple which is passed to Optim.jl. Default keywords: iterations = 6000, allow_f_increases = true, time_limit = 60, g_tol = 5e-8, show_warnings = false.
smallest_segment::Int=1: The smallest segment size present in the histogram to consider for the signal search.
force::Bool=true: If true, try to fit further epochs even when no signal is found.
maxnts::Int=10: The maximum number of new time splits to consider when adding a new epoch.
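
As a rough illustration of the `opt` keyword, the sketch below rebuilds the default named tuple listed above and overrides a single entry with Julia's `merge`. The names `default_opt` and `my_opt` are hypothetical. The commented-out call assumes that `FitOptions` accepts the complete tuple through `opt`; whether a partial tuple would be merged with the defaults is not stated here, so the full tuple is passed.

```julia
# Defaults as listed in this documentation (copied by hand, not read from the package).
default_opt = (
    iterations = 6000,
    allow_f_increases = true,
    time_limit = 60,
    g_tol = 5e-8,
    show_warnings = false,
)

# Override one entry; `merge` keeps every other key unchanged.
my_opt = merge(default_opt, (time_limit = 120,))

# Assumed usage, following the signature FitOptions(Ltot; kwargs...):
# using DemoInfer
# fop = FitOptions(1.0e9; Tlow = 10, Tupp = 1e7, level = 0.95, opt = my_opt)
```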