Q&A

1. Can you speak to the cost per genome analyzed? (answered by Joel Sevinsky)

From Heather Carleton’s presentation at the regional PulseNet meeting:

Cost table

This just covers materials. There would be significant savings in labor if all of those tests were replaced with one WGS experiment.

Other informal analyses of the costs of bacterial whole-genome sequencing have produced estimates of about $250 per isolate, including labor and overhead. It may be possible to lower these costs in high-volume settings to perhaps $200 or less with the use of more automation and higher-throughput sequencing technology.

It should be noted that the trend over the past 10 years has been for sequencing costs to decrease and automation to increase. Assuming these trends continue, sequencing costs should continue to decrease.

2. For pathogens like TB, to what degree is the epi (for example, relationships between cases) needed to interpret the WGS? Do you need strong epi data to interpret the WGS results? (answered by Gregory Armstrong)

In public health, molecular data should always be interpreted taking into account the epidemiologic context.  As with any isolated piece of epidemiologic data, molecular data are rarely conclusive by themselves and can sometimes be misleading.  It is always best to look at all of the available data in order to make the most accurate inferences.

3. What are the risks of SNP analysis when there are only a handful of sequenced isolates, i.e., operating on that initial steep downward slope of establishing the core genome? (answered by Martin Wiedmann based on comments by Gregory Armstrong)

One main issue with SNP analyses when only a few sequenced isolates are available is that SNP calling works best when done with a closely related reference genome (see JC Kwong, J Clin Micro 2016, Figure 3). With few available genomes, there is a higher risk that no reference genome closely related to the isolates of interest is available. For cgMLST, a good number of isolates is needed to identify the core genes that should be included in the cgMLST scheme, so design of a cgMLST scheme would be difficult if only a few sequenced isolates of a species or group (e.g., serotype) are available.

4. While WGS and variants therein are certainly promising, the costs of both testing and hiring staff who are trained in bioinformatics need to be considered. Can you comment on the ways that departments of health, or even direct care health-care systems, can approach this to determine the cost-benefit of bringing this type of technology online? (answered by Joel Sevinsky with comments from Gregory Armstrong)

Question #1 contains a figure from a recent PulseNet regional meeting demonstrating the savings in cost of materials. There will also be a cost savings in labor if WGS is used to completely replace the tests listed, especially if you automate library preparation. As for bioinformatics skills, many labs currently use Bionumerics for PFGE analysis, and the plans PulseNet has for WGS analysis in Bionumerics are similar, using wgMLST schemes. There will be some additional training, but there should be minimal new hires required if Bionumerics is used. So for routine analysis the cost should be less and the benefit greater.

CDC’s AMD program is putting out funding to state and local health departments for workforce development. Up to this point, most of those funds have gone towards training laboratory staff, but it is now starting to address needs in the epidemiology staff as well. Also, while the move to genomics will be cost-saving in the long run, there will be transition costs, such as those for workforce development, and certain legacy costs, such as the need to maintain older technologies during an overlap period.

5. Can you describe the timescales required to prep samples prior to sequencing? (answered by Joel Sevinsky)

Starting with a streaked isolate on Monday morning, you would extract DNA and begin library preparation, finishing on Tuesday and loading the instrument in the early afternoon. If necessary, you can squeeze the entire extraction and library preparation into one day, but that would require more than 8 hours.

6. What do the numbers assigned to each isolate mean? (answered by Joel Sevinsky)

Numbers assigned to isolates are usually identifiers. There are identifiers for the original isolate in the local public health department, identifiers for PulseNet WGS, BioSample identifiers at NCBI, and Sequence Read Archive (SRA) identifiers at NCBI. Soon there will also be identifiers indicating where the isolate is located in a composite phylogenetic tree made from wgMLST data. When looking at a phylogenetic tree, the identifier will most often be the state public health lab ID (for trees built at the state public health lab) or the PulseNet WGS ID (for trees built at the CDC).

7. How will the advancement of Culture-independent diagnostic tests impact sequencing? (answered by Joel Sevinsky)

CIDT will reduce the number of isolates that are sent to public health labs. This will require clinics and other submitters to send the sample matrix directly to the public health lab for isolation in order to perform WGS. Public health labs will require additional funding for the materials and labor required for isolation. Although there is active research into sequencing directly from clinical specimens without the need for isolation, this technology is not ready yet and needs significant development.

8. What is the role of WGS related to fungal outbreaks, like mucor? What do we know about thresholds for fungi? How do fungal genomes compare (in size) to bacterial genomes? (answered by Gregory Armstrong)

The genomes of eukaryotes such as fungi are always more complicated than those of prokaryotes (bacteria). Whereas bacterial genomes are typically in the millions of bases, eukaryotic genomes are typically in the billions, although both vary considerably in size.  In addition, the fact that eukaryotic genomes are typically diploid makes the sequencing and assembly more complex.  With those caveats, NGS has proved useful in certain fungal outbreaks.  For example, sequencing of Candida auris is giving us a better picture of its recent emergence throughout the world.  NGS has also been used to look at isolates from coccidioidomycosis (“Valley Fever”) cases that were acquired recently in the Pacific Northwest, outside of the usual endemic zone of the fungus.  So in summary, sequencing of fungi is more complex than sequencing of bacteria, but clearly has a role in public health.

9. You spoke of how WGS can be used as a public health tool in the future. To what extent will storage of this massive information be limiting and a barrier to this scenario? (answered by Joel Sevinsky)

Storage of routine sequencing as part of the PulseNet program includes immediate upload of data to the NCBI Sequence Read Archive (SRA), which therefore provides the data storage. Public health labs will have no reason to store data locally for E. coli, Salmonella, Listeria, and Campylobacter. This would be the bulk of sequencing data for most public health labs. Of note, most of the analysis we do is on the assembled genomes, which are much smaller than the raw data.

While there is no technical reason to keep raw data in-house after it has been uploaded to the SRA, there are certain legal issues that haven’t been sorted out yet. Foodborne disease outbreaks, for example, occasionally result in lawsuits, and in those cases, laboratory and epidemiologic data are usually subject to release. Will it be enough if the original, raw data are not available, but the slightly processed data are available on NCBI? This still needs to be resolved.

A single MiSeq run for 24-36 isolates will take up 10-15 GB of hard drive space. This is about 500 MB per genome in raw data. So a public health lab that wished to sequence non-PulseNet organisms would need about 1 TB of data storage for every 2,000 non-PulseNet organisms. If the labs embrace cloud storage for anonymous sequencing data, the monthly cost for 1 TB is less than $50.
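The arithmetic behind these storage figures can be checked with a short script. This is a minimal sketch, not part of the original answer: the function names are hypothetical, and it assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB).

```python
# Back-of-the-envelope storage estimate using the figures quoted in the answer.
# Assumption: decimal units (1 GB = 1,000 MB; 1 TB = 1,000,000 MB).

RUN_SIZE_GB = (10, 15)        # raw data per MiSeq run, low and high
ISOLATES_PER_RUN = (24, 36)   # isolates per run, low and high
RAW_MB_PER_GENOME = 500       # rounded per-genome figure from the answer


def per_genome_range_mb(run_gb, isolates):
    """Return the smallest and largest raw-data footprint per isolate, in MB."""
    low = min(run_gb) * 1000 / max(isolates)   # small run, many isolates
    high = max(run_gb) * 1000 / min(isolates)  # large run, few isolates
    return low, high


def genomes_per_tb(mb_per_genome=RAW_MB_PER_GENOME):
    """Number of genomes whose raw data fit in 1 TB."""
    return 1_000_000 // mb_per_genome


def storage_tb(n_isolates, mb_per_genome=RAW_MB_PER_GENOME):
    """Raw-data storage, in TB, needed for n_isolates."""
    return n_isolates * mb_per_genome / 1_000_000


if __name__ == "__main__":
    low, high = per_genome_range_mb(RUN_SIZE_GB, ISOLATES_PER_RUN)
    print(f"Per-genome raw data: {low:.0f}-{high:.0f} MB")
    print(f"Genomes per TB at {RAW_MB_PER_GENOME} MB each: {genomes_per_tb()}")
    print(f"Storage for 2,000 isolates: {storage_tb(2000)} TB")
```

The per-genome range works out to roughly 280-625 MB, so the 500 MB figure is a reasonable planning number toward the upper end of that range.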

10. When sequencing, how do you distinguish between integral genes vs mobile elements? (answered by Joel Sevinsky)

This is based on sequence similarity. There are databases of known genes and mobile elements which are used to identify different regions within a genome sequence.

11. You mentioned that all of these new sequencing techniques would "be coming to public health lab near you soon!" Two questions: realistically how soon do you think that will be (e.g. current wait times for E. coli isolates can be months), and do you think commercial labs would ever have those capabilities or would this be solely the work of state health labs/CDC? (answered by Joel Sevinsky and Gregory Armstrong)

As for when public health labs will have access to wgMLST schemes in Bionumerics for E. coli, Salmonella, Listeria, and Campylobacter, the latest estimates I have seen suggest that they will start arriving at a public health lab near you this summer. Ten public health labs are already using wgMLST in real time for Listeria analysis as part of the Listeria WGS pilot project. Applied Maths, the creator of Bionumerics, already has wgMLST schemes available for commercial use, but running the analysis on their computation engine is a fee-for-service offering of about $10 per isolate (http://www.applied-maths.com/applications/wgmlst).

With TB, CDC is already contracting with several state labs to do sequencing on selected isolates for the entire country. Next year, they hope to be moving to universal sequencing. Similarly, the influenza program is funding three state labs to do sequencing for the entire country, although the uses for that data are mainly at the national and international levels, to improve vaccine strain selection. Several other programs, including viral hepatitis, Legionnaires' disease, streptococcal pathogens, meningitis pathogens, viral vaccine-preventable diseases, GC, and HIV, have already started to roll out their protocols to several states.

12. Can you recommend a good primer reference on WGS basics and language? (answered by Joel Sevinsky)

I would recommend viewing the webinars and modules provided by ELC training efforts and the Food Core Center of Excellence.

13. Should we routinely resample patients when we are first trying to establish SNP pipeline/interpretation with a new organism? What are the minimal criteria for establishing an evolutionary time clock for a new species--you mentioned the importance of tight epi correlations, thinking more of this in terms of resampling colonized patients? (answered by Joel Sevinsky with additions by Martin Wiedmann)

There is definitely some value in resampling patients to help establish the rate of change within patients (as well as potentially hypervariable genes). While defining the mutational rates for different taxa is typically not a trivial research undertaking, the type of data generated through resampling (of both patients and environments where an organism persists) will always be helpful for interpretation of SNP or allele differences between isolates.

14. Could you speak about this type of genome sequencing for Zika virus?  Is it at all practical?  Do we know how stable the virus is? (answered by Gregory Armstrong)

The Zika genome is about 10,000 to 11,000 nucleotides long, so whole-genome sequencing is definitely possible, even with Sanger sequencing.  However, while sequencing has been useful in understanding the virus’s emergence in the Americas, it’s not being performed routinely at this point, and may have limited utility in standard public health practice.  In addition, many case-patients present after or towards the end of the relatively short viremic phase, making isolation or sequencing of the virus more challenging. 

15. Since TB is a unique disease where there is recent vs old exposure, especially for prevalent strains spanning transmission over 10-20 years, how can WGS help refute or confirm transmission in absence of epi data? Could you speak to how many SNPs mean relatedness in absence of epi data for TB in comparison to E. coli or other organisms that have point source transmission? (answered by Gregory Armstrong)

One of the main uses of MTB sequencing is helping to identify those cases likely to be due to recent transmission. The older typing technologies (MIRU/VNTR and spoligotyping) have been shown to be very useful in this regard over the past decade. Cases with closely related isolates are much more likely to be due to recent transmission. The older technologies, however, had much more limited resolution compared with whole-genome sequencing. Experience with whole-genome sequencing has already demonstrated the usefulness of MTB sequencing, both in identifying cases likely to be due to recent transmission and, in some cases, further dividing those cases into smaller groups that could help in narrowing possible transmission settings. As with any other pathogen in public health practice, sequencing by itself rarely if ever gives a definitive answer. Solid epidemiologic data and follow-through are needed as much as they were before sequencing. Sequencing provides one more tool for TB programs.

One challenge that is somewhat unique to MTB is its slow rate of mutation. On average, MTB generates a single SNP every two years over its entire 4.4-million-base genome (0.5 SNPs/year, or ~0.1 SNP/Mb/year). But TB tends to spread relatively slowly (relative to many other pathogens, that is), so even with such a low mutation rate, MTB sequencing is proving very useful.
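The rate conversion quoted above can be reproduced in a few lines. This is a minimal sketch using only the numbers given in the answer. The function names are hypothetical, and the linear accumulation model in `expected_snps` is a simplifying assumption for illustration, not a relatedness threshold.

```python
# Reproduce the MTB mutation-rate arithmetic from the answer.
GENOME_SIZE_MB = 4.4   # genome size in megabases
YEARS_PER_SNP = 2.0    # one SNP every two years, on average


def snps_per_year(years_per_snp=YEARS_PER_SNP):
    """Average number of SNPs accumulated per genome per year."""
    return 1.0 / years_per_snp


def snps_per_mb_per_year(years_per_snp=YEARS_PER_SNP, genome_mb=GENOME_SIZE_MB):
    """Per-megabase rate, which allows comparison across genome sizes."""
    return snps_per_year(years_per_snp) / genome_mb


def expected_snps(years, years_per_snp=YEARS_PER_SNP):
    """Expected SNPs along a single lineage after `years`.

    Assumes a constant rate (a simplification for illustration only).
    """
    return snps_per_year(years_per_snp) * years


if __name__ == "__main__":
    print(f"{snps_per_year():.1f} SNPs/year")
    print(f"{snps_per_mb_per_year():.2f} SNP/Mb/year")
    print(f"Expected SNPs over 10 years: {expected_snps(10):.0f}")
```

Under this simple model, a lineage circulating for 10 years would be expected to accumulate about 5 SNPs. This illustrates why small SNP differences in MTB still need epidemiologic context to interpret.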

16. You spoke a lot about this technique's application in bacteriology, but do you see any applications/utility in viral outbreaks? (answered by Gregory Armstrong)

NGS is very useful for a number of viral pathogens. In fact, because viral genomes are relatively small and often mutate very quickly, routine sequencing (i.e., Sanger sequencing), including whole-genome sequencing, of viral pathogens has been done since the 1990s. The Global Polio Eradication Initiative, for example, has been routinely sequencing all wild poliovirus isolates since the early 2000s, and the data have been extremely useful in understanding the spread of the virus, the emergence of vaccine-derived polioviruses, and in monitoring progress towards eradication. Sequencing is similarly being used for other viral vaccine-preventable diseases, such as measles and rubella. CaliciNet is now replacing Sanger sequencing with NGS in some sites. For hepatitis C, the CDC’s “GHOST” (Global Hepatitis Outbreak Sequencing Technology) system has shown how useful sequencing can be in investigating hospital-based outbreaks and is now being applied in the community setting. During the waning phase of the West Africa Ebola outbreak, NGS was used to identify the source of outlier cases. There are many other examples in addition to these.

17. Loading and storing data on NCBI is fine and often required for academic research, but in public health work, what bioethical or privacy issues would doing so impose?  And how (and if) would sequence results be shared with healthcare providers, laboratory, residents (like how AST results are being shared now)? (answered by Joel Sevinsky)

Very limited metadata (meaning data associated with a whole genome sequence, such as patient residence, age, etc.) will be included with the submissions of the genome sequences to the NCBI SRA database. In some cases, such as cases of rare diseases, the public metadata may not even include the state of residence or occurrence. The vast majority (or sometimes all) of the metadata associated with the isolates will stay within the public health network, so bioethical or privacy issues should be minimal. As for sharing results, it won’t differ much from how things are shared now. Serotype, antibiotic resistance, virulence data, etc., which were created using other assays, are already shared. So although the assay that produces the result will change, the sharing won’t (as long as the WGS is CLIA validated). For surveillance activities for outbreaks, it will be very similar to the way PFGE is used right now. Phylogenetic trees will be created to visualize relatedness and clustering of isolates.

18. You indicated that the webinar will be available online but it would be great if the PowerPoint slides could be made available as well.

See the “Webinar” tab on the side menu for access to the slides and the recording.

19. What's the difference between wgSNP vs high quality SNP? (answered by Joel Sevinsky)

hqSNP analysis is just another name for wgSNP analysis. Both look for SNPs within total genomic DNA minus the masked regions.
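To illustrate what "minus the masked regions" means in practice, here is a toy sketch that counts SNP differences between two already-aligned sequences while skipping masked coordinates. It is an assumption-laden illustration, not the PulseNet or Bionumerics method: real pipelines call SNPs from read mappings against a reference and apply quality filters. The function name and the 0-based, half-open region convention are hypothetical.

```python
def count_snps(seq_a, seq_b, masked_regions=()):
    """Count differing positions between two aligned sequences.

    masked_regions is an iterable of (start, end) pairs, 0-based and
    half-open, marking coordinates to ignore (e.g. repeats or mobile
    elements). Positions where either sequence has a non-ACGT character
    (such as N or a gap) are also skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    # Expand the masked regions into a set of excluded positions.
    masked = set()
    for start, end in masked_regions:
        masked.update(range(start, end))

    snps = 0
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if pos in masked:
            continue
        if a in "ACGT" and b in "ACGT" and a != b:
            snps += 1
    return snps


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    iso = "ACCTACGAAC"   # differs from ref at positions 2 and 7
    print(count_snps(ref, iso))                           # no masking
    print(count_snps(ref, iso, masked_regions=[(6, 9)]))  # position 7 masked
```

Masking position 7 removes one of the two differences, showing how the choice of masked regions directly changes the SNP count reported between two isolates.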

20. Could you recommend a good bioinformatics course or training for epidemiologists? (answered by Joel Sevinsky)

Same answer as question #12.