HetDister
Documentation for HetDister.
Module to run demographic inference on diploid genomes, under the assumption of panmixia.
Data form of input and output
The genome needs to be SNP-called and the genomic distance between consecutive heterozygous positions needs to be computed. Heterozygous positions are the ones with genotype 0/1 or 1/0 (Note that the phase is not important). The input is then a vector containind such distances. Additionally, mutation and recombination rates need to be chosen and passed as input as well. See Tutorial for more details on preparing input data.
The demographic model underlying the inference is composed of a variable number of epochs and the population size is constant along each epoch.
The output is a vector of parameters in the form [L, N0, T1, N1, T2, N2, ...] where L is the total sequence length, N0 is the ancestral population size in the furthermost epoch and extending to the infinite past, the subsequent pairs $(T_i, N_i)$ are the duration and size of following epochs going from past to present. This format is referred to as TN vector throughout. The length L should match the input sequence length and is floating to improve the fit.
HetDister.FitOptionsHetDister.FitResultHetDister.Spectra.laplacekingmanHetDister.Spectra.mldsmcpHetDister.adapt_histogramHetDister.compare_mldsHetDister.compare_modelsHetDister.compute_residualsHetDister.compute_residualsHetDister.demoinferHetDister.demoinferHetDister.durationsHetDister.evdHetDister.get_paraHetDister.pop_sizesHetDister.pre_fitHetDister.sds
HetDister.FitOptions — Method
FitOptions(Ltot, mu, rho; kwargs...)Construct an an object of type FitOptions, requiring total genome length Ltot in base pairs, mutation rate and recombination rate per base pair per generation.
Optional Arguments
Tlow::Number=10,Tupp::Number=1e7: The lower and upper bounds for the duration of epochs.Nlow::Number=10,Nupp::Number=1e8: The lower and upper bounds for the population sizes.level::Float64=0.95: The confidence level for the confidence intervals on the parameters estimates.solver: The solver to use for the optimization, default isLBFGS().smallest_segment::Int=1: The smallest segment size present in the histogram to consider for the signal search.force::Bool=true: if true try to fit further epochs even when no signal is found.maxnts::Int=10: The maximum number of new time splits to consider when adding a new epoch. Higher is greedier.naive::Bool=true: if true the expected weights are computed using the closed form integral, otherwise using higher order transition probabilities from SMC' theory (slower).order::Int=10: maximum number of higher order corrections to use whennaiveis false, i.e. number of intermediate recombination events plus one.ndt::Int=800: number of Legendre nodes to use whennaiveis false.locut::Int=1: index of the first histogram bin to consider in the fit.
Optim Arguments
Additional keywords are passed to Optimization.solve, see Optimization.jl. and the specific Optim.jl section, which is the default optimizer. Defaults are:
maxiters = 6000maxtime = 60g_tol = 5e-8
HetDister.FitResult — Type
struct FitResultA data structure to store the results of a fit.
See the introduction for how the model is parameterized Data form of input and output. Some methods are defined for this type to get the vector of parameters, std errors, model evidence, etc. See get_para, sds, evd, pop_sizes, durations.
HetDister.adapt_histogram — Method
adapt_histogram(segments::AbstractVector{<:Integer}; lo::Int=1, hi::Int=50_000_000, nbins::Int=800, tailthr::Int=1)Build an histogram from segments logbinned between lo and hi with nbins bins.
The upper limit is adapted to ensure logspacing with the requested nbins.
HetDister.compare_mlds — Method
compare_mlds(segs1::Vector{Int}, segs2::Vector{Int}; lo = 1, hi = 1_000_000, nbins = 200)Build two histograms from segs1 and segs2, rescale them for number and average heterozygosity and return four vectors containing respectively common midpoints for bins, the two rescaled weights and variances of the difference between weights.
HetDister.compare_models — Function
compare_models(models)Compare the models parameterized by FitResults and return the best one. Takes an iterable of FitResult as input.
Theoretical explanation
TBD
HetDister.compute_residuals — Method
compute_residuals(h::Histogram, mu, rho, TN::Vector; naive=false)Compute the residuals between the observed and expected weights.
Optional arguments
naive::Bool=false: if true the expected weights are computed using the closed form integral, otherwise using higher order transition probabilities from SMC' theory.order::Int=10: maximum number of higher order corrections to use whennaiveis false, i.e. number of intermediate recombination events plus one.ndt::Int=800: number of Legendre nodes to use whennaiveis false.
HetDister.compute_residuals — Method
compute_residuals(h1::Histogram, h2::Histogram)When two histograms are given the std error of the difference is used for normalization.
HetDister.demoinfer — Method
demoinfer(segments::AbstractVector{<:Integer}, epochrange::AbstractRange{<:Integer}, mu::Float64, rho::Float64; kwargs...)Make an histogram with IBS segments and infer demographic histories with piece-wise constant epochs where the number of epochs is epochrange.
Return a named tuple which contains a vector of FitResult in the field fits (see FitResult).
Optional Arguments
fop::FitOptions = FitOptions(sum(segments), mu, rho): the fit options, seeFitOptions.lo::Int=1: The lowest segment length to be considered in the histogramhi::Int=50_000_000: The highest segment length to be considered in the histogramnbins::Int=400: The number of bins to use in the histogramiters::Int=10: The number of iterations to perform. It might converge earliersetorder::Bool=true: the order at which the SMC' approximation is truncated will be set automatically according to the cutoffcutoff=2e-5: whensetordera fraction of segments smaller thancutoffwill be ignored to set the order
HetDister.demoinfer — Method
demoinfer(h::Histogram, epochrange, fop::FitOptions; iters=15, reltol=1e-2, corcut=20, finalize=false)
demoinfer(h, epochs, fop; iters=15, reltol=1e-2, corcut=20, finalize=false)Take an histogram of IBS segments, fit options, and infer demographic histories with piece-wise constant epochs where the number of epochs is epochrange. See FitOptions. Return a named tuple which contains a vector of FitResult in the field fits (see FitResult).
If epochrange is a integer, then it fits only the model with that number of epochs. Return a named tuple with a FitResult object in the field f.
HetDister.durations — Method
durations(fit::FitResult)Return the fitted durations of the epochs.
HetDister.evd — Method
evd(fit::FitResult)Return the evidence of the fit.
HetDister.get_para — Method
pars(fit::FitResult)Return the parameters of the fit.
HetDister.pop_sizes — Method
pop_sizes(fit::FitResult)Return the fitted population sizes, from past to present.
HetDister.pre_fit — Method
pre_fit(h::Histogram, nfits, mu, rho, order, Ltot; require_convergence=true)
pre_fit!(fop::FitOptions, h::Histogram, nfits; require_convergence=true)Preliminarily fit h with an approximate model of piece-wise constant epochs for each number of epochs from 1 to nfits.
With the first signature it initializes the fit options to default. See FitOptions for how to specify them. Otherwise it modifies fop in place to adapt it to all the requested epochs. The mutation rate mu and recombination rate rho are assumed to be per base pairs per generation and the total length of the genome Ltot is in base pairs. Return a vector of FitResult, one for each number of epochs, see also FitResult.
HetDister.sds — Method
sds(fit::FitResult)Return the standard deviations of the parameters of the fit.
HetDister.Spectra.laplacekingman — Method
laplacekingman(r, mu, TN)Compute the approximate number of segments of length r using the Laplace transform of the Kingman coalescent at frequency 2mu r, given mutation rate mu and population size history TN.
HetDister.Spectra.mldsmcp — Method
mldsmcp(rs, edges, mu, rho, TN; order = 10, ndt = 800)Compute the expected number of segments with length in each bin defined by edges, given the midpoints rs, mutation rate mu, recombination rate rho, and population size history TN.
The computation uses the SMC' higher order transition probabilities with order maximum number of intermediate recombination events plus one, and ndt Legendre nodes for the numerical integration.