The second day of the ASHG 2014 meeting once again featured a number of very exciting concurrent sessions. I was torn between attending the rather cute and appropriately named ‘Cloudy with a Chance of Big Data’, ‘Population Structure, Admixture, and Human History’, or ‘Impact of Human Knockout Alleles’. I had already covered a bit about Big Data, so the last two seemed more interesting. In the end, the human knockout alleles won out, in no small part because the first speaker was Daniel MacArthur of Harvard University, who is among the more prolific scientists on social media (@dgmacarthur).
MacArthur spoke on behalf of the Exome Aggregation Consortium (ExAC) about a new efficient and scalable pipeline for analyzing available exome sequencing data. While there are more than 200,000 exome sequences available globally, most of them are siloed, and it is difficult to tell whether variability between data from different groups reflects biological or technological differences. The pipeline developed by the group takes raw BAM files and processes them using the GATK toolkit. It was used to aggregate and call variants in 92k exomes covering a variety of diseases and populations. They validated the final data by concordance with Sanger sequencing data, or with very high depth data from PCR-free 2x250bp sequencing. The lab also developed a very cool 3-D data visualization tool that showed, among other things, a good representation of different ethnicities in the data, and certain rare variants that are distributed all over the world. The effort to build this sort of database is quite incredible, with MacArthur describing it as the ‘largest ever collection of human protein-coding genetic variants’. The best part, of course, is that the group is sticking to the tradition of sharing, and all this data is available through the website: exac.broadinstitute.org (the site crashed a few hours after the announcement, as MacArthur had actually predicted!). Users are encouraged to analyze and publish freely on individual variants. In response to a question later, MacArthur also mentioned that phenotype data would be layered on top of this, but he was unable to provide an exact timeline.
The remaining talks in the session were in a similar vein, with researchers developing tools to analyze knockout alleles in large datasets representing geographical groups or diseases. For example, Patrick Sulem spoke about the deCODE project, which performed whole genome sequencing of over two thousand Icelanders. Iceland has an isolated and bottlenecked population of about 320,000 in total. They found that 1/13th of the individuals had a rare complete human knockout. In the future, the project plans to scrutinize hospital records and re-contact individuals for further clinical exams and biological tests, to obtain more phenotype data on people with the knockouts. They also plan to perform deep RNA sequencing on these individuals.
I also got the opportunity to hear Andrew Su talk on ‘Microtask crowdsourcing for annotating diseases in PubMed abstracts’. Currently, databases for disease gene variants, pharmaceutical effects, signaling pathways, etc. are highly fragmented and incomplete. The ideal database would combine all of these. This is obviously a massive undertaking, but Su ran an experiment to see if we could take the first step by annotating disease information from abstracts in the PubMed database, thereby converting free text into a knowledge network. Given that a new PubMed abstract is published every 30 seconds, such an undertaking would have to be crowdsourced, but could non-scientists do as well as scientists at this task? Su demonstrated that non-scientists recruited via Amazon Mechanical Turk (AMT) could actually do better than PhDs at annotating texts from PubMed, but only when their annotations were aggregated: six lay people could do it better, faster and cheaper than a PhD! While his experiment is not scalable (since they were paying recruits ~7 cents/article), Su hopes to develop an interface that will harness citizen scientists to do the job.
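To make the idea of aggregation concrete, here is a minimal sketch of one common way to combine annotations from several workers: a simple vote threshold. This is only an illustration, not a description of Su's actual method; the function name `aggregate_annotations`, the `min_votes` parameter, and the example annotations are all hypothetical.

```python
from collections import Counter


def aggregate_annotations(worker_annotations, min_votes):
    """Keep each annotation that at least `min_votes` workers marked.

    `worker_annotations` is a list with one collection of disease
    mentions per worker (hypothetical data layout).
    """
    votes = Counter()
    for annotations in worker_annotations:
        # Count each worker at most once per annotation.
        votes.update(set(annotations))
    return {term for term, count in votes.items() if count >= min_votes}


# Hypothetical example: six workers annotate the same abstract.
workers = [
    {"cystic fibrosis", "asthma"},
    {"cystic fibrosis"},
    {"cystic fibrosis", "fibrosis"},
    {"cystic fibrosis", "asthma"},
    {"asthma", "cystic fibrosis"},
    {"cystic fibrosis", "lung"},
]

# Require agreement from at least four of the six workers.
consensus = aggregate_annotations(workers, min_votes=4)
print(consensus)
```

The point of the threshold is that any single worker's stray or partial annotation is dropped unless enough others agree, which is one plausible reason a group can outperform an individual expert.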
The other highlight of the day was Illumina’s session on products for diagnostics. The application of next-generation sequencing (NGS) in the clinical space is still evolving. However, as Matt Posard, Senior VP of New and Emerging Opportunities at Illumina, said in his opening remarks, the question is not ‘if but when’ sequencing will be ubiquitous in the clinical space. It does help that last year Illumina introduced MiSeqDx, the first FDA-cleared NGS system (along with FDA-approved IVD kits for cystic fibrosis, and a universal kit). Dr Jamie Platt from Molecular Pathology Lab Network spoke at the event, highlighting the usefulness of having an FDA-approved system, especially with the FDA indicating recently that it might scrutinize the so-called lab developed tests (LDTs) more closely. LDTs require significant investment in test development and validation, training, and reporting. The availability of an FDA-approved system reduces this burden significantly, which is especially beneficial for smaller laboratories. However, FDA-cleared assays can only be used for the purposes they are approved for, e.g., the MiSeqDx cystic fibrosis 139-variant assay is not indicated for stand-alone diagnostic use or for newborn screening, fetal diagnostic testing, etc. Early results from her laboratory showed good concordance between Illumina’s cystic fibrosis assays and other tests.
Tomorrow I will cover the various genomics companies on the exhibition hall floor, including a focus on some of the smaller local San Diego biotechs such as Cypher Genomics (news release today), Edico Genome, Genection, and Pathway Genomics.