This document gives an overview of visual valdidation of genotypes from the
1001 crosses project using the program ibdpainting.
It gives representative examples of different outcomes and a summary of how many cases were found of each.
A comprehensive list of the result for each line is given in the file ibdpainting_results.csv
Random pairs of 219 accessions of Arabidopsis thaliana from Sweden were crossed, and one plant from each cross was then maintained by self-pollination. For most crosses two F2 seeds were propogated to give two replicates per cross. At the F8 generation 429 lines were genotyped with Illumina short-read data at about 6X coverage. We want to confirm that the F8 genotypes really match the parents they ought to come from.
To check this, we calculate genetic distance of each F8 to to the two expected parents in windows of 500kb using the program ibdpainting. This is done for SNPs inside genes only, because SNP calls outside genes are dubious due to widespread structural polymorphism. We also calculate the distance to each of 1163 accessions based on data from an extended set of the 1135 accessions from the 1001 genomes project, from this paper:
Brachi, Benjamin, et al. “Plant genetic effects on microbial hubs impact host fitness in repeated field trials.” Proceedings of the National Academy of Sciences 119.30 (2022): e2201285119
One goal would be to determine whether parents are correct so that we can use the high-quality parental genotypes to improve the low-coverage F8 genotyping data.
ibdpainting is a visual tool to validate ancestry across a genome. For windows across the genome this calculates the genetic distance from a test individual to every genotype in a reference panel. It then plots these distances for the expected parents, and a subsample of the most likely parents. Smaller distances mean a closer match.
In addition, it also generates an IBD score file for each pair of candidate parents. In each window it takes the minimum distances to either parent, and takes the average distance over the whole genome.
F8 individuals have been through seven rounds of self-fertilisation, so we expect them to be homozygous across the genome, with megabase-sized haplotypes matching one of the two parents. Figure 1 shows a good example of how such a genome should look. Each point shows the average genetic distance from the sample 1_A3 in each window to the two expected parents (8423 and 9413 in red and blue). Also shown in grey are distances to the ten next-best-matching accessions. You can see that the F8 genome is a perfect match for one of the two expected parents at each window in the genome (where distance is zero), and that there are megabase-scale blocks of identity. We can also see that distances are not quite zero where ancestry blocks swap over; that’s fine, it just means that there is a recombination site somewhere in the 500kb window, such that neither parent is a perfect match. A bit more than half of the samples look like this.
Figure 2 shows a similar example of an F8 that matches one or both parents in each window. In this case however the distances are not zero as they were in figure 1, and are generally noisy. The notes from the library preparation reveal that this sample had a very low DNA concentration. Nevertheless, it does match the expected parents, so we can say that the seed material is likely not contaminated, although one might wish to repeat the genotyping. There are seven samples that look like this.
Figure 3 shows an example of an F8 individual that matches one of the two expected parents, except for a two regions on chromosome 5.
Nevertheless, the expected parents are still the closest matches.
A simple explanation is that the F8 is heterozygous for two parental haplotypes in these regions.
This can be confirmed by checking the genotypes in the region with bcftools
:
bcftools query -s 1_C12 -r Chr5:12000000-13000000 -f"%CHROM %POS [ %GT] [ %AD]\n" vcf_file.vcf.gz | grep "0/1" | wc -l
In this collection, genuinely heterozygous regions like this one have hundreds of heterozygous SNP calls per megabase. Homozygous regions have tens of heterozygous SNP calls per megabase, which are almost certainly miscalled.
There are 53 samples that show clear signs of residual heterozygosity in at least one place in the genome. This is a lower bound on the true number, because it is difficult to spot heterozygosity in cases where a parent is missing, or genotyping is ambiguous (see cases below).
16 samples are recorded as having zero or very low concentrations of DNA in the NGS libraries. Not surprisingly, it was not possible to call SNPs at almost any window, so the plots look nearly empty (figure 4). This should be easy to fix by sowing more seeds and repeating the genotyping.
Figure 5 shows a case where nothing matches. Lines for the expected parents are not plotted at all. The figure shows only grey lines, none of which match the F8. It turns out that the expected parents (accessions 1137 and 1074) were not included in the reference panel.
Likewise, figure 6 shows an F8 from a cross between accessions 1137 and 8241. You can see long tracts where 8241 matches the F8, and long tracts where nothing matches, which are presumably homozygous for the 1137 haplotypes.
Four crosses are affected by the two missing accessions (including the replicates of each, that gives eight lines in total)
Sample 2_H4 doesn’t match its expected parents 9481 and 8307 (figure 7). However, there are grey lines close to zero across the genome, suggesting that something else might fit.
A quick look at the first rows of the score file for this sample suggests that the combination of 175 and 6913 are likely candidates - the IBD score for this pair is an order of magnitude lower than any other pair.
parent1 parent2 min_IBD
175 6913 0.013578783141498326
175 9743 0.12009568344735648
175 9852 0.12318103871182563
175 9581 0.1238256307101355
175 5210 0.1261425921515468
175 9972 0.12651983569807917
175 4840 0.12652044717407385
175 9821 0.12656418647709405
175 6924 0.12658849831752222
The other replicate of 9481x8307 shows the same pattern, and also matches 175 and 6913. At the same time the two samples which should be derived from 175 and 6913 do not match those parents, but do match 9481 and 8307.
This looks like a simple case of 9481x8307 and 175x6913 being swapped. Surprisingly, there is only one extra case where both parents have clearly been confused with something else (6012x997 F8 rep1 is probably 6021x9407), and one where a single parent has clearly been swapped for something else (9470x6107 F8 rep1 is probably 7002x6107).
Figure 8 shows what should be a cross between accessions 6108 and 6192. However, it is in fact homozygous for the 6108 haplotype across the whole genome. This is likely because the initial cross failed, and the line is in fact descended from a self-pollinated seed of 6108. There is one additional case of this happening (6122x6252 replicate 2).
Figure 9 shows a sample that matches an expected parent at most of the genome, but there are some regions of chromosomes 1, 3, 4 and 5 that don’t match either parent, and don’t really match anything else in the reference panel.
The fact that most of the genome matches indicates that this is not a sample mix up.
There also don’t appear to be any better candidates.
Here is the first ten lines of the ibd_scores.csv
file:
parent1 parent2 min_IBD
6188 9381 0.07048910670002984
6197 9381 0.08793660411929506
6198 9381 0.08796521252504941
6188 9383 0.09783508908165152
6198 9383 0.1126233937330796
6197 9383 0.11262808158979185
6188 9382 0.13429055446514007
6191 9381 0.13632110527914162
6195 9381 0.14557521669338727
The expected parents (6198 and 9381) are there, and there are some other candidates with very similar names (6188 and 9383).
That might suggest that somebody had mixed up seed packets with very similar labels.
However, the plot shows that none of these are good matches.
Furthermore, the min_IBD
scores are very similar for different candidate pairs; compare this table to the one when labels were swapped, where the top candidate pairs had an order-of-magnitude lower IBD score than the rest.
This pattern is also unlikley to be residual heterozygosity left over from the cross; in the cases where there is heterozygosity (see above) there are usually hundreds of heterozygous calls per megabase, but here there are an order of magnitude fewer. You can also see this visually. In genuinely heterozygous regions the lines for the expected parents are non-zero, but usually closer to zero than the point for the other accessions. In this case the grey lines frequently are lower than those of the expected parents.
We think this pattern arises because of heterozygosity in the parents used to make the cross and reference panel, rather than arising from the cross itself. The A. thaliana collection is a set of accessions sampled from natural populations, and although they are mostly self-pollinating, outcrossing does occur. If an accession was sampled from the wild that was the ancestor of a cross some generations back then the heterozygous regions would still be segregating in the seed stock. That would mean that the plant that was sampled for genotyping in the reference panel would not be the sample genotype as that used in the crosses, leading to partial mismatches. This is even though both plants may have been homozygous themselves, and no human error has occured. Consistent with this hypothesis, the replicate cross of the line shown in figure 9 shows mismatches in the same places (not plotted).
There are 45 lines that show pattern like this to a lesser or greater degree. In these cases the material is correct, but the reference panel is not a good representation of the genotypes. To get around this we will repeat the genotyping for these samples at high coverage to ensure SNPs are called accurately.
Sample 1_C4 matches expected parent 8283, but not 9471 (figure 10). The stretches of mismatch are too long to be residual heterozygosity, and no other accessions in the reference panel match instead. It is likely that some other accession was used in the cross, but that this accession was not in the reference panel. Note that this is different from the case where 1137 and 1074 were missing, because those crosses likely involve the correct parents.
There are 18 such cases where one parent appears to be missing, and one case where both parents seem to be missing. Possible options here are to
Sample 1_G10 matches expected parent 8256, but not 1367 (figure 11). You can also see that there is at least one (multiple lines can hide each other) grey line close to zero in the regions where neither expected parent matches. However, the grey lines are not really convincing matches.
Similarly, a look at the IBD score file for this sample suggests that the missing parent might be 1061, but the evidence is not strong:
parent1 parent2 min_IBD
1061 8256 0.09085585092160826
1363 8256 0.13570305630994167
7522 8256 0.1444148987551863
1061 1363 0.1472248350711171
1066 8256 0.14796460020431212
6195 8256 0.1532090569124295
8256 9381 0.1634984095874054
5865 8256 0.16947001251860855
6092 8256 0.17021008221624542
Both the plot and the IBD score file for the other replicate of this cross show the same pattern.
So, there is evidence that the missing parent of this line is another accession in the reference panel, but the evidence is ambiguous. An alternative explanation is that the true parent is really missing as before, perhaps coupled with some residual or ancestral heterozygosity. It is hard to know. There are three additional crosses where both replicates look like one parent is missing and there is another candidate that kinda looks like it could be the parents, and two where only one replicate looks like there is an obvious other candidate (ten lines in total).
Here is a summary of the outcomes of validation of 427 F8 lines: