Usage

File formats

recombulator-x uses a file organized as a PED files ([PLINK](https://www.cog-genomics.org/plink/ pedigree file) as input. The pedigree file can be a .tsv (tab as separator value), a .xlsx or whatever format with a space as separator value. The PED file format stores sample pedigree information (i.e., the familial relationships between samples) and the genotypes. In particular, the first 6 mandatory columns contain:

  • Family ID
  • Individual ID
  • Paternal ID
  • Maternal ID
  • Sex
  • Phenotype

The "Sex" field may be coded as: 1=male/2=female; XY=male/XX=female; M=male/F=female; MALE=male/FEMALE=female. If you are using STR markers, you can store within this column the Amelogenin marker.

The "Phenotype" field refers to the use of PED files in medical research. In non-medical application, it may be -9 which means "unknown".

From the 7th column on, there are the markers genotypes (two columns for a genetic marker, each of the two storing an allele). In case of STRs, the columns contain numbers, which correspond to the STR repeats or "0" when missing.

:warning: Important: Genetic markers (from the 7th column on) must be provided according to their physical genomic position. Indeed, the algorithm will infer the recombination rate between A1 and A2, A2 and A3 and so on.

Example

Here is a family (each row is an individual):

FID IID PAT MAT SEX PHENO STR1-A1 STR1-A2 STR2-A1 STR2-A2 STR3-A1 STR3-A2
FAM_I GRANDFATHER 0 0 1 -9 12 0 29 0 39 0
FAM_I MOTHER GRANDFATHER 0 2 -9 12 16 27 29 34 39
FAM_I SON_1 0 MOTHER 1 -9 12 0 29 0 34 0
FAM_I FATHER_1 0 0 1 -9 14 0 21 0 37 0
FAM_I DAUGHTER_1 FATHER_1 MOTHER 2 -9 14 16 21 27 34 37
FAM_I FATHER_2 0 0 1 -9 18 0 25 0 36 0
FAM_I DAUGHTER_2 FATHER_2 MOTHER 2 -9 12 18 25 29 36 39

Python module and workflow

A detailed guide for the Python module usage can be found in the Jupyter Notebook Estimation Example.ipynb on GitHub.

The initial steps of the Python module recombulator-x consist in reading the PED file and identifying the informative families for the estimation of recombination rates using the function ped2graph. This function takes a ped file as input and build a graph with the relationships. It returns a list of tuples, each composed by the graph, a dictionary (with iid as key and their tab row as value) and the family identifier.

The pedigree file can be a .tsv (tab as separator value), a .xlsx or whatever format with a space as separator value.

For recombination, informative subfamilies are either those with:

  • a phased mother and at least one son or phased daughter, called type I families
  • an unphased mother and at least two between sons and phased daughters, called type II families

Notably, females can be phased when their father is available: in this way, they will be virtually transformed into males, thus being allowed to take part to informative families.

The function plot_family_graph can then be used to graphically represent the reported relationships between individuals within the same family.

family_graphs, marker_names = recombulatorx.ped2graph(ped_path)
xstr_recomb.families.plot_family_graph(family_graphs[0][1]) 

The fuction preprocess_families will then check the consistency of each family graph and raise errors whenever necessary. For instance, an error is raised when more than two parents or same-sex parents are present in the same family. Unconnected individuals are also flagged.

processed_families = recombulatorx.preprocess_families(family_graphs)

The estimation of recombination and mutation rates can be launched with the following line:

est_recomb_rates, est_mut_rates = recombulatorx.estimate_rates(processed_families, 0.1, 0.1, estimate_mutation_rates='all')

The function estimate_rates estimates recombination and mutation rates from a set of families and takes the following parameters:

  • the families,
  • the initial recombination rate,
  • the initial mutation rate,
  • which mutation rate needs to be estimated (no: no mutation rate estimation, one: just one mutation rate for all markers, all: a mutation rate for each marker),
  • the type of implementation (the default implementation is the one using dynamic programming).

An example of output generated by function \emph{estimate_rates} in Python is:

(array([0.03874299, 0.32869992, 0.01459788, 0.19265765, 0.01016452]),
 array([1.00000000e-08, 1.00000000e-08, 1.23511804e-01, 2.10614659e-02, 2.24679981e-03, 1.00000000e-08]))

where, the first array (n-1 long) stores the recombination rates, while the second (n long) contains the mutation rates estimated for simulated families and six X-STRs.

Command line tool

The command line interface of recombulator-x takes as input the PED file and returns recombination and mutation rates.

usage: recombulator-x [-h] [--mutation-rates MUT-RATE [MUT-RATE ...]]
                      [--estimate-mutation-rates {no,one,all}]
                      PED

Estimate recombination and mutation rates.

positional arguments:
  PED                   path to ped file

optional arguments:
  -h, --help            show this help message and exit
  --mutation-rates MUT-RATE [MUT-RATE ...]
                        mutation rates used in the estimation, either
                        as fixed or as starting point in the
                        optimization depending on the value of the
                        --estimate-mutation-rates option. If not
                        given the rates are set to 0.001 for all
                        markers
  --estimate-mutation-rates {no,one,all}
                        controls the estimation of the mutation
                        rates. With "no" the mutation rates are not
                        estimated, with "one" the same rate is
                        estimated for all markers, with "all" a
                        separate estimation rate is estimated for
                        each marker. Defaults to "no"

Its basic usage consists in estimating just recombination rate and using the default single value for mutation rate (0.001).

recombulator-x ped_path

Alternatively, one may also decide to estimate mutation rates. In particular, adding --estimate-mutation-rates all, the tool will compute a mutation value for each marker.

recombulator-x ped_path --estimate-mutation-rates all

Output

The output of recombulator-x command line interface is returned in a tabular format according to the options no, one, all for the parameter --estimate-mutation-rates (Tables 1-3). In particular, the recombination rates are computed between markers following the order in which they were provided in the PED file.

TYPE MARKER RATE
RECOMBINATION M1-M2 0.0362
RECOMBINATION M2-M3 0.3309
RECOMBINATION M3-M4 0.0656
RECOMBINATION M4-M5 0.1683
RECOMBINATION M5-M6 0.0138

Table 1: recombulator-x output when --estimate-mutation-rates no is used.

TYPE MARKER RATE
MUTATION * 0.0253
RECOMBINATION M1-M2 0.0323
RECOMBINATION M2-M3 0.3191
RECOMBINATION M3-M4 0.0407
RECOMBINATION M4-M5 0.1634
RECOMBINATION M5-M6 0.0091

Table 2: recombulator-x output when --estimate-mutation-rates one is used.

TYPE MARKER RATE
MUTATION M1 1e-08
MUTATION M2 1e-08
MUTATION M3 0.1420
MUTATION M4 0.0191
MUTATION M5 1e-08
MUTATION M6 1e-08
RECOMBINATION M1-M2 0.0366
RECOMBINATION M2-M3 0.3148
RECOMBINATION M3-M4 0.0214
RECOMBINATION M4-M5 0.1605
RECOMBINATION M5-M6 0.0141

Table 3: recombulator-x output when --estimate-mutation-rates all is used.