BAM-matcher: Rapid identification of mislabelled samples using Next Generation Sequencing (NGS) data
Sample mislabelling or mix-up is a common problem in both medical diagnostics and research, with several recent studies placing the error rate at around 1% across hospitals and large research facilities. A mislabelled sample can lead to incorrect data processing and analysis, and result in conflicting results, false conclusions or misdiagnosis. Even if the error is eventually identified, valuable resources and time will already have been wasted.
One of the common methods to ensure sample veracity is to independently generate a small genomic signature (genotype at many different loci) for each sample using SNP panels, which can then be used to confirm the identity of the NGS samples. However, this adds extra cost and can potentially introduce more errors.
How did the facility help?
To help with the problem of sample mislabelling, we took a different approach based on two factors:
NGS data typically already contain high levels of genotype information that can be used to uniquely identify individual samples.
Many of the NGS projects that involve our sequencing facility have multiple samples for each individual, e.g. matching control and tumour samples from the same patient, or familial samples.
Thus, we decided that for many of the projects, we can simply use the NGS data to look for mislabelled samples, without having to perform further SNP panel genotyping at additional cost. We developed an algorithm, BAM-matcher, which can:
Rapidly identify whether two NGS data sets are from the same or closely related individuals with a very high level of accuracy,
Be deployed at a very early stage of the data processing pipeline, which helps prevent delays in data processing when mislabelled samples are identified.
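The general idea of matching two samples by their genotypes can be illustrated with a minimal Python sketch. This is not BAM-matcher's actual implementation: it assumes genotypes have already been called at a set of known variant loci, and the function names and the thresholds (min_fraction, min_sites) are hypothetical values chosen only for illustration.

```python
from typing import Dict, Tuple

Locus = Tuple[str, int]  # (chromosome, position)


def normalise(genotype: str) -> str:
    """Sort alleles so that 'GA' and 'AG' are treated as the same genotype."""
    return "".join(sorted(genotype))


def genotype_concordance(
    sample_a: Dict[Locus, str], sample_b: Dict[Locus, str]
) -> Tuple[float, int]:
    """Return (fraction of identical genotypes, number of loci compared).

    Only loci genotyped in both samples are compared.
    """
    shared = sample_a.keys() & sample_b.keys()
    if not shared:
        return 0.0, 0
    same = sum(
        1 for locus in shared
        if normalise(sample_a[locus]) == normalise(sample_b[locus])
    )
    return same / len(shared), len(shared)


def looks_like_same_individual(
    sample_a: Dict[Locus, str],
    sample_b: Dict[Locus, str],
    min_fraction: float = 0.95,  # hypothetical cut-off, not from the paper
    min_sites: int = 20,         # hypothetical minimum number of shared loci
) -> bool:
    """Call a match only when enough loci were compared and most agree."""
    fraction, n_sites = genotype_concordance(sample_a, sample_b)
    return n_sites >= min_sites and fraction >= min_fraction
```

In this sketch, a pair of samples with too few shared loci is never called a match, since a high concordance over a handful of sites is weak evidence. Closely related individuals would be expected to fall between the unrelated and identical ranges, so a real tool would need more than a single cut-off.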
BAM-matcher is now publicly available (https://bitbucket.org/sacgf/bam-matcher), and an associated manuscript has been published in a major scientific journal:
Wang, Parker, Branford and Schreiber. BAM-matcher: a tool for rapid NGS sample matching. Bioinformatics (2016) 32 (17): 2699-2701.
Within the facility:
We now routinely use BAM-matcher, especially in large projects, to ensure sample veracity. To date, BAM-matcher has successfully identified several cases of mislabelled samples.
Outside the facility:
Following its publication, the software has been adopted by other research institutes and organisations, including:
High-Performance Computing Core at the NIH (https://hpc.nih.gov/apps/bam-matcher.html)
Uppsala Multidisciplinary Center for Advanced Computational Science (http://www.uppmax.uu.se/changelog/?tarContentId=575116&languageId=3)
A number of pharma companies (which have chosen to remain anonymous) that are experimenting with NGS data for matching clinical trials with patient profiles.
“BAM-matcher is a great tool for quickly assessing sample identity in large NGS research projects. My laboratory studies chronic myeloid leukaemia and we often generate both RNA and exome sequencing data on our patient cohorts. BAM-matcher gives us great confidence that our RNAseq datasets have been properly paired with our exome sequencing datasets. Similarly, we have also used it to confirm sample validity in our longitudinal studies.” – A/Prof. Sue Branford, Head of Leukaemia Unit, SA Pathology
The ACRF Cancer Genomics Facility located within the Centre for Cancer Biology and SA Pathology provides microarray, high-throughput qPCR, next-generation sequencing and bioinformatics services to the research community throughout South Australia. The CGF has received ~$900,000 in funding through Therapeutic Innovation Australia's Translating Health into Discovery NCRIS projects, and has previously received $1M of EIF/SSI hard infrastructure funding for genomic analytical equipment.