sibreg package
Subpackages
- sibreg.bin package
- Submodules
- sibreg.bin.impute_from_sibs module
- sibreg.bin.impute_from_sibs_hdf5 module
- sibreg.bin.impute_from_sibs_setup module
- sibreg.bin.impute_po module
- sibreg.bin.impute_runner module
- sibreg.bin.make_rdr_grms module
- sibreg.bin.pGWAS module
- sibreg.bin.poGWAS module
- sibreg.bin.preprocess_data module
- sibreg.bin.sGWAS module
- sibreg.bin.triGWAS module
- Module contents
Submodules
sibreg.sibreg module
- sibreg.sibreg.compute_pgs(par_gts_f, gts_f, pgs, sib=False, compute_controls=False)[source]
Compute a polygenic score (PGS) for the individuals with observed genotypes and observed/imputed parental genotypes.
- Args:
- par_gts_f
str path to HDF5 file with imputed parental genotypes
- gts_f
str path to bed file with observed genotypes
- pgs
sibreg.pgs the PGS, defined by the weights for a set of SNPs and the alleles of those SNPs
- sib
bool Compute the PGS for genotyped individuals with at least one genotyped sibling and observed/imputed parental genotypes. Default False.
- compute_controls
bool Compute polygenic scores for control families (families with observed parental genotypes set to missing). Default False.
- Returns:
- pg
sibreg.gtarray Return the polygenic score as a genotype array with columns: individual’s PGS, mean of their siblings’ PGS, observed/imputed paternal PGS, observed/imputed maternal PGS
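The column layout of the returned PGS array can be illustrated with a minimal sketch. This is not the package's implementation: it only shows how a weighted sum over SNPs reduces a family genotype array of shape [N x k x L] to PGS values of shape [N x k]. The toy genotypes and weights are made up for the example.

```python
import numpy as np

# Hypothetical toy data: 2 individuals, 3 family slots
# (proband, father, mother), 4 SNPs.
G = np.array([
    [[0, 1, 2, 1], [1, 1, 2, 0], [0, 1, 1, 2]],
    [[2, 0, 1, 1], [1, 0, 1, 1], [2, 1, 1, 0]],
], dtype=float)
weights = np.array([0.5, -0.2, 0.1, 0.3])  # made-up SNP weights

# Weighted sum over the SNP axis (the last axis): result has shape [N x k],
# one PGS column per family slot.
pg = G @ weights
```

Each row of `pg` then holds the proband, paternal, and maternal PGS for one individual.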
- sibreg.sibreg.find_individuals_with_sibs(ids, ped, gts_ids, return_ids_only=False)[source]
Used in get_gts_matrix and get_fam_means to find the individuals in ids that have genotyped siblings.
- sibreg.sibreg.find_par_gts(pheno_ids, ped, fams, gts_id_dict)[source]
Used in get_gts_matrix to find whether individuals have imputed or observed parental genotypes, and to find the indices of the observed/imputed parents in the observed/imputed genotype arrays. ‘par_status’ codes whether an individual has parents that are observed, imputed, or neither. ‘gt_indices’ records the relevant index of the parent in the observed/imputed genotype arrays. ‘fam_labels’ records the family of the individual based on the pedigree.
- sibreg.sibreg.fit_sibreg_model(y, X, fam_labels, add_intercept=False, tau_init=1, return_model=True, return_vcomps=True, return_fixed=True)[source]
Compute the MLE for the fixed effects in a family-based linear mixed model.
- Args:
- y
array vector of phenotype values
- X
array regression design matrix for fixed effects
- fam_labels
array vector of family labels: residual correlations in y are modelled between family members (that share a fam_label)
- add_intercept
bool whether to add an intercept to the fixed effect design matrix
- Returns:
- model
sibreg.model the sibreg model object, if return_model=True
- vcomps
float the MLEs for the variance parameters: sigma2 (residual variance) and tau (ratio between sigma2 and family variance), if return_vcomps=True
- alpha
array MLE of fixed effects, if return_fixed=True
- alpha_cov
array sampling variance-covariance matrix for MLE of fixed effects, if return_fixed=True
- sibreg.sibreg.get_fam_means(ids, ped, gts, gts_ids, remove_proband=True, return_famsizes=False)[source]
Used in get_gts_matrix to find the mean genotype in each sibship (family) for each SNP or for a PGS. The gtarray that is returned is indexed based on the subset of ids provided from sibships of size 2 or greater. If remove_proband=True, then the genotype/PGS of the index individual is removed from the fam_mean given for that individual.
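The effect of remove_proband can be shown with a small leave-one-out calculation. The sketch below is an illustration only, with a hypothetical helper name (`sib_means`) and a simplification: individuals in sibships of size one receive NaN rather than being dropped from the output.

```python
import numpy as np

def sib_means(values, fam_labels, remove_proband=True):
    """Hypothetical helper: per-individual sibship mean of `values`.

    With remove_proband=True, each individual's own value is excluded
    from the mean reported for that individual (leave-one-out mean).
    """
    values = np.asarray(values, dtype=float)
    fam_labels = np.asarray(fam_labels)
    out = np.full(values.shape, np.nan)
    for fam in np.unique(fam_labels):
        idx = np.flatnonzero(fam_labels == fam)
        n = idx.size
        if n < 2:
            continue  # simplification: singletons are left as NaN
        total = values[idx].sum()
        if remove_proband:
            out[idx] = (total - values[idx]) / (n - 1)
        else:
            out[idx] = total / n
    return out

# Made-up PGS values for a sibship of three and a singleton.
means = sib_means([1.0, 3.0, 5.0, 2.0], ["f1", "f1", "f1", "f2"])
```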
- sibreg.sibreg.get_gts_matrix(par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, compute_controls=False, parsum=False, start=0, end=None, print_sample_info=False)[source]
Reads observed and imputed genotypes and constructs a family based genotype matrix for the individuals with observed/imputed parental genotypes, and if sib=True, at least one genotyped sibling.
- Args:
- par_gts_f
str path to HDF5 file with imputed parental genotypes
- gts_f
str path to bed file with observed genotypes
- snp_ids
numpy.ndarray If provided, only obtains the subset of SNPs specified that are present in both imputed and observed genotypes
- ids
numpy.ndarray If provided, only obtains the ids with observed genotypes and imputed/observed parental genotypes (and observed sibling genotypes if sib=True)
- sib
bool Retrieve genotypes for individuals with at least one genotyped sibling along with the average of their siblings’ genotypes and observed/imputed parental genotypes. Default False.
- compute_controls
bool Construct genotype arrays for control families (families with observed parental genotypes set to missing). Default False.
- parsum
bool Return the sum of maternal and paternal observed/imputed genotypes rather than separate maternal/paternal genotypes. Default False.
- Returns:
- G
sibreg.gtarray Genotype array for the subset of genotyped individuals with complete imputed/observed parental genotypes. The array is [N x k x L], where N is the number of individuals; k depends on whether sib=True and whether parsum=True; and L is the number of SNPs. If sib=False and parsum=False, then k=3 and this axis indexes the individual’s genotypes, the father’s imputed/observed genotypes, and the mother’s imputed/observed genotypes. If sib=True and parsum=False, then k=4, and this axis indexes the individual, the sibling, the paternal, and the maternal genotypes in that order. If parsum=True and sib=False, then k=2, and this axis indexes the individual and the sum of paternal and maternal genotypes; etc. If compute_controls=True, then a list is returned, where the first element is as above, and the following elements give equivalent genotype arrays for control families where the mother has been set to missing, the father has been set to missing, and both parents have been set to missing.
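The dependence of the second axis on sib and parsum can be summarised in a short sketch. The helper name `family_axis_labels` and the label strings are hypothetical. The case sib=True with parsum=True is inferred from the "etc." in the description above rather than stated there.

```python
def family_axis_labels(sib=False, parsum=False):
    """Hypothetical helper listing what each slot of the k axis holds."""
    labels = ["proband"]
    if sib:
        labels.append("sibling")
    if parsum:
        # Single slot holding paternal + maternal genotypes.
        labels.append("parental_sum")
    else:
        labels.extend(["paternal", "maternal"])
    return labels

# k is the number of slots on the second axis of the [N x k x L] array.
k_default = len(family_axis_labels())
k_sib = len(family_axis_labels(sib=True))
k_parsum = len(family_axis_labels(parsum=True))
```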
- sibreg.sibreg.get_gts_matrix_given_ped(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]
Used in get_gts_matrix: see get_gts_matrix for documentation
- sibreg.sibreg.get_gts_matrix_given_ped_bgen(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]
Used in get_gts_matrix: see get_gts_matrix for documentation
- sibreg.sibreg.get_indices_given_ped(ped, fams, gts_ids, ids=None, sib=False, verbose=False)[source]
Used in get_gts_matrix_given_ped to get the ids of individuals with observed/imputed parental genotypes and, if sib=True, at least one genotyped sibling. It returns those ids along with the indices of the relevant individuals and their first degree relatives in the observed genotypes (observed indices), and the indices of the imputed parental genotypes for those individuals.
- class sibreg.sibreg.gtarray(garray, ids, sid=None, alleles=None, pos=None, chrom=None, fams=None, par_status=None)[source]
Bases: object
Define a genotype or PGS array that stores individual IDs, family IDs, and SNP information.
- Args:
- garray
array 2 or 3 dimensional numpy array of genotypes/PGS values. First dimension is individuals. For a 2 dimensional array, the second dimension is SNPs or PGS values. For a 3 dimensional array, the second dimension indexes the individual and his/her relatives’ genotypes (for example: proband, paternal, and maternal); and the third dimension is the SNPs.
- ids
array vector of individual IDs
- sid
array vector of SNP ids, equal in length to the size of the last dimension of garray
- alleles
array [L x 2] matrix of ref and alt alleles for the SNPs. L must match size of sid
- pos
array vector of SNP positions; must match size of sid
- chrom
array vector of SNP chromosomes; must match size of sid
- fams
array vector of family IDs; must match size of ids
- par_status
array [N x 2] numpy matrix that records whether parents have observed or imputed genotypes/PGS, where N matches size of ids. The first column is for the father of that individual; the second column is for the mother of that individual. If the parent is neither observed nor imputed, the value is -1; if observed, 0; and if imputed, 1.
- Returns:
G :
sibreg.gtarray
- add(garray)[source]
Adds another gtarray of the same dimension to this array and returns the sum. It matches IDs before summing.
- diagonalise(inv_root)[source]
This will transform the genotype array based on the inverse square root of the phenotypic covariance matrix from the family based linear mixed model.
- fill_NAs()[source]
This normalises the SNP columns to have mean-zero, then fills in NA values with zero.
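What fill_NAs describes, mean-centring each SNP column and then setting missing entries to zero, amounts to imputing each missing genotype with its column mean. A minimal sketch on a plain 2D array follows. It assumes missing values are stored as NaN, which may differ from how the package represents them.

```python
import numpy as np

# Toy genotype matrix (individuals x SNPs) with one missing entry.
# Assumption: missing values are encoded as NaN.
G = np.array([[0.0, 2.0],
              [1.0, np.nan],
              [2.0, 1.0]])

col_means = np.nanmean(G, axis=0)  # per-SNP means, ignoring missing values
centred = G - col_means            # missing entries remain NaN
filled = np.where(np.isnan(centred), 0.0, centred)  # NaN -> 0 (the column mean)
```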
- sibreg.sibreg.make_gts_matrix(gts, imp_gts, par_status, gt_indices, parsum=False)[source]
Used in get_gts_matrix to construct the family based genotype matrix given observed/imputed genotypes. ‘gt_indices’ has the indices in the observed/imputed genotype arrays; and par_status codes whether the parents are observed (0) or imputed (1).
- sibreg.sibreg.make_id_dict(x, col=0)[source]
Make a dictionary that maps from the values in the given column (col) to their row-index in the input array
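A minimal sketch of such a mapping is given below. The name `make_id_dict_sketch` is hypothetical, and handling of 1D input is an assumption. If a value occurs in several rows, this sketch keeps the last row index.

```python
import numpy as np

def make_id_dict_sketch(x, col=0):
    """Hypothetical stand-in: map values in column `col` to row indices."""
    x = np.asarray(x)
    column = x if x.ndim == 1 else x[:, col]
    # tolist() converts numpy scalars to plain Python objects for use as keys.
    return {value: i for i, value in enumerate(column.tolist())}

# Toy pedigree-like array with columns: family ID, individual ID.
ped = np.array([["fam1", "id1"], ["fam1", "id2"], ["fam2", "id3"]])
id_dict = make_id_dict_sketch(ped, col=1)
```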
- sibreg.sibreg.match_observed_and_imputed_snps(gts_f, par_gts_f, bim, snp_ids=None, start=0, end=None)[source]
Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.
- sibreg.sibreg.match_observed_and_imputed_snps_bgen(gts_f, par_gts_f, snp_ids=None, start=0, end=None)[source]
Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.
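The matching step described for these two functions can be sketched with plain numpy arrays of SNP ids. This is an illustration under assumptions, not the package's code: in_obs_sid is treated here as a boolean mask over the imputed SNPs, and only duplicates among the observed ids are removed.

```python
import numpy as np

# Made-up SNP ids; "rs2" is duplicated in the observed data.
obs_sid = np.array(["rs1", "rs2", "rs2", "rs3", "rs5"])
imp_sid = np.array(["rs3", "rs1", "rs4", "rs5"])

# Drop SNP ids that appear more than once in the observed data.
unique_sid, counts = np.unique(obs_sid, return_counts=True)
duplicated = unique_sid[counts > 1]
keep_obs = ~np.isin(obs_sid, duplicated)

# Flag imputed SNPs that are also among the retained observed SNPs.
in_obs_sid = np.isin(imp_sid, obs_sid[keep_obs])

# For each shared SNP (in imputed order), look up its index in the observed ids.
obs_index = {sid: i for i, sid in enumerate(obs_sid.tolist()) if keep_obs[i]}
obs_sid_index = np.array([obs_index[sid] for sid in imp_sid[in_obs_sid].tolist()])
```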
- sibreg.sibreg.match_phenotype(G, y, pheno_ids)[source]
Match a phenotype to a genotype array by individual IDs.
- Args:
- G
gtarray genotype array to match phenotype to
- y
array vector of phenotype values
- pheno_ids
array vector of individual IDs corresponding to phenotype vector, y
- Returns:
- y
array vector of phenotype values matched by individual IDs to the genotype array
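Matching by individual ID amounts to reordering the phenotype vector to follow the order of the genotype array's IDs. A minimal sketch with plain arrays follows. It assumes every genotyped individual has a phenotype, which the package function may not require.

```python
import numpy as np

# Made-up IDs: genotype order differs from phenotype order.
geno_ids = np.array(["id3", "id1", "id2"])
pheno_ids = np.array(["id1", "id2", "id3", "id4"])
y = np.array([0.1, 0.2, 0.3, 0.4])

# Map each phenotype ID to its position, then reorder y to the genotype IDs.
pheno_index = {iid: i for i, iid in enumerate(pheno_ids.tolist())}
y_matched = np.array([y[pheno_index[iid]] for iid in geno_ids.tolist()])
```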
- class sibreg.sibreg.model(y, X, labels, add_intercept=False)[source]
Bases: object
Define a linear model with within-class correlations.
- Args:
- y
array 1D array of phenotype observations
- X
array Design matrix for the fixed mean effects.
- labels
array 1D array of sample labels
- Returns:
model :
sibreg.model
- alpha_mle(tau, sigma2, compute_cov=False, xtx_out=False)[source]
Compute the MLE of alpha given variance parameters
- Args:
- sigma2
float variance of model residuals
- tau
float ratio of variance of model residuals to variance explained by mean differences between classes
- Returns:
- alpha
array MLE of alpha
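For fixed variance parameters, the MLE of the fixed effects in a Gaussian linear model is the generalised least squares (GLS) estimate. The sketch below builds the covariance matrix densely from the parameter definitions above, assuming a shared class effect with variance sigma2/tau. The function name `gls_alpha` is hypothetical, and the package may compute this differently, for example by exploiting the block structure.

```python
import numpy as np

def gls_alpha(y, X, labels, sigma2, tau):
    """Hypothetical GLS sketch for a model with within-class correlation.

    Assumes Cov(y) = sigma2 * I + (sigma2 / tau) * S, where S[i, j] = 1
    when observations i and j share a class label.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    Sigma = sigma2 * np.eye(y.size) + (sigma2 / tau) * same_class
    Sigma_inv_X = np.linalg.solve(Sigma, X)  # Sigma^{-1} X
    xtx = X.T @ Sigma_inv_X                  # X' Sigma^{-1} X
    xty = Sigma_inv_X.T @ y                  # X' Sigma^{-1} y (Sigma is symmetric)
    alpha = np.linalg.solve(xtx, xty)
    alpha_cov = np.linalg.inv(xtx)           # sampling covariance of alpha
    return alpha, alpha_cov

# Noise-free toy data generated with alpha = (1, 2), two classes of size two.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = X @ np.array([1.0, 2.0])
alpha, alpha_cov = gls_alpha(y, X, labels=[0, 0, 1, 1], sigma2=1.0, tau=2.0)
```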
- likelihood_and_gradient(sigma2, tau)[source]
Compute the loss function, which is -2 times the log-likelihood, along with its gradient
- Args:
- sigma2
float variance of model residuals
- tau
float ratio of variance of model residuals to variance explained by mean differences between classes
- Returns:
- L, grad
float loss function and gradient, divided by sample size
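A loss of this kind can be sketched for a Gaussian model with within-class correlation. The version below is an assumption-laden illustration: it takes alpha as an explicit argument, keeps the constant term, uses a dense covariance matrix with class variance sigma2/tau, and omits the gradient. The name `neg2_loglik` is hypothetical.

```python
import numpy as np

def neg2_loglik(y, X, labels, alpha, sigma2, tau):
    """Hypothetical sketch: -2 x Gaussian log-likelihood, divided by sample size."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = y.size
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    # Assumed covariance: residual variance plus a shared class effect.
    Sigma = sigma2 * np.eye(n) + (sigma2 / tau) * same_class
    resid = y - X @ np.asarray(alpha, dtype=float)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = resid @ np.linalg.solve(Sigma, resid)
    return (n * np.log(2 * np.pi) + logdet + quad) / n

# Single observation with zero residual: Sigma = [[2]], so the loss is log(4*pi).
loss = neg2_loglik([0.0], [[1.0]], [0], alpha=[0.0], sigma2=1.0, tau=1.0)
```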
- optimize_model(init_params)[source]
Find the parameters that minimise the loss function for a given regularisation parameter
- Args:
- init_params
array initial values for residual variance (sigma^2_epsilon) followed by ratio of residual variance to within-class variance (tau)
- Returns:
- optim
dict dictionary with keys: ‘success’, whether optimisation was successful (bool); ‘warnflag’, output of L-BFGS-B algorithm giving warnings; ‘sigma2’, MLE of residual variance; ‘tau’, MLE of ratio of residual variance to within-class variance; ‘likelihood’, maximum of likelihood.
- class sibreg.sibreg.pgs(snp_ids, weights, alleles)[source]
Bases: object
Define a polygenic score based on a set of SNPs with weights and ref/alt allele pairs.
- Args:
- snp_ids
array vector of SNP ids
- weights
array vector of weights of equal length to snp_ids
- alleles
array [L x 2] matrix of ref and alt alleles for the SNPs. L must match size of snp_ids
- Returns:
pgs :
sibreg.pgs
- compute(garray, cols=None)[source]
Compute polygenic score values from a given genotype array. Finds the SNPs in the genotype array that have weights in the pgs and matching alleles, and computes the PGS based on these SNPs and the weights after allele-matching.
- Args:
- garray
sibreg.gtarray genotype array to compute PGS values for
- cols
array names to give the columns in the output gtarray
- Returns:
- pg
sibreg.gtarray 2d gtarray with PGS values. If a 3d gtarray is input, then each column corresponds to the second dimension on the input gtarray (for example, individual, paternal, maternal PGS). If a 2d gtarray is input, then there will be only one column in the output gtarray. The names given in ‘cols’ are stored in ‘sid’ attribute of the output.
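One common way to do allele-matching is sketched below, as an assumption about the general technique rather than a description of this method's internals. SNPs whose allele pair matches are used as they are, SNPs whose pair is reversed have the sign of their weight flipped (which changes the score only by an additive constant for uncentred genotypes), and SNPs with any other mismatch are dropped. All data are made up.

```python
import numpy as np

# Made-up allele pairs for 3 SNPs in the genotype data and in the PGS.
geno_alleles = np.array([["A", "G"], ["C", "T"], ["G", "A"]])
pgs_alleles = np.array([["A", "G"], ["T", "C"], ["G", "C"]])
weights = np.array([0.4, 0.3, 0.2])

same = np.all(geno_alleles == pgs_alleles, axis=1)
flipped = np.all(geno_alleles == pgs_alleles[:, ::-1], axis=1)

# Assumption: a reversed allele pair is handled by negating the weight.
matched_weights = np.where(flipped, -weights, weights)
usable = same | flipped  # SNPs that match in neither orientation are dropped

# Toy 2D genotype matrix (individuals x SNPs) and the resulting scores.
G = np.array([[0.0, 1.0, 2.0],
              [2.0, 2.0, 0.0]])
pg = G[:, usable] @ matched_weights[usable]
```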
- sibreg.sibreg.read_phenotype(phenofile, missing_char='NA', phen_index=1)[source]
Read a phenotype file and remove missing values.
- Args:
- phenofile
str path to plain text phenotype file with columns FID, IID, phenotype1, phenotype2, …
- missing_char
str The character that denotes a missing phenotype value; ‘NA’ by default.
- phen_index
int The index of the phenotype (counting from 1) if multiple phenotype columns present in phenofile
- Returns:
- y
array vector of non-missing phenotype values from specified column of phenofile
- pheno_ids
array corresponding vector of individual IDs (IID)
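The file layout and the meaning of phen_index can be illustrated with a small parser. This is a hypothetical sketch (`read_phenotype_sketch`) that reads from a file-like object and assumes a whitespace-delimited file with no header line. The package function takes a path and may handle these details differently.

```python
import io
import numpy as np

def read_phenotype_sketch(handle, missing_char="NA", phen_index=1):
    """Hypothetical parser for lines of: FID IID phenotype1 phenotype2 ..."""
    ids, values = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value = fields[1 + phen_index]  # phen_index counts from 1 after FID, IID
        if value == missing_char:
            continue  # drop individuals with a missing phenotype
        ids.append(fields[1])
        values.append(float(value))
    return np.array(values), np.array(ids)

# Made-up phenotype file with two phenotype columns.
text = "f1 id1 1.5 NA\nf1 id2 NA 0.2\nf2 id3 2.5 0.7\n"
y, pheno_ids = read_phenotype_sketch(io.StringIO(text))
```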
- sibreg.sibreg.simulate(n, alpha, sigma2, tau)[source]
Simulate from a linear model with correlated observations within-class. The mean for each class is drawn from a normal distribution.
- Args:
- n
int sample size
- alpha
array value of regression coefficients
- sigma2
float variance of residuals
- tau
float ratio of variance of residuals to variance of the distribution of class means
- Returns:
- model
sibreg.model linear model with repeated observations
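The data-generating process described above can be sketched directly with numpy. The function name, the `n_classes` and `seed` parameters, and the standard-normal design matrix are assumptions made for the example. The sketch returns raw arrays rather than a model object.

```python
import numpy as np

def simulate_sketch(n, alpha, sigma2, tau, n_classes=None, seed=0):
    """Hypothetical sketch: linear model with a normally distributed class mean."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    if n_classes is None:
        n_classes = max(n // 2, 1)  # assumption: about two observations per class
    labels = rng.integers(0, n_classes, size=n)
    X = rng.normal(size=(n, alpha.size))  # assumed standard-normal covariates
    # Class means have variance sigma2 / tau, since tau is the ratio of the
    # residual variance to the variance of the class means.
    class_means = rng.normal(scale=np.sqrt(sigma2 / tau), size=n_classes)
    resid = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ alpha + class_means[labels] + resid
    return y, X, labels

y_sim, X_sim, labels_sim = simulate_sketch(n=100, alpha=[0.5, -1.0], sigma2=1.0, tau=2.0)
```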