OVERVIEW OF NEXT-GENERATION SEQUENCING
The method for next-generation sequencing (NGS), or massively parallel digital sequencing, is distinct from Sanger sequencing in that the sequencing reactions alternate with cycles of signal detection to provide the data readout at a significantly accelerated scale.8,9 The use of NGS in the years after the completion of the Human Genome Project has greatly expanded the application of genomics and has significantly accelerated the pace of biomedical research.10 Although there are several different NGS platforms offered commercially, they are methodologically quite similar. Unlike Sanger sequencing, NGS does not require subcloning of DNA, propagation in a bacterial host, and isolation of individual templates prior to sequencing. Instead, DNA is randomly fragmented into a pool of small pieces (generally 100 to 500 bp) and then ligated with specific synthetic DNA linkers (or adaptors) at the fragment ends to generate an NGS “library.” The library fragments are subsequently amplified by a process that isolates individual library fragments to a specific location prior to amplification. In general, this in situ amplification occurs on a covalently modified surface (a bead or flat silicon surface) with complementary linkers covalently attached to it, using a specific dilution of library fragments as input. In this step, amplification of individual library fragments provides sufficient signal output for detection during the sequencing steps that follow. Because each sequencing read derived from an amplified library fragment originates from that single unique fragment, NGS data are digital in nature. This fact underlies an important concept for digital sequencing methods: the number of specific sequencing reads generated is directly proportional to the amount of input nucleic acid, accurately reflecting amplified regions of a genome, for example.
However, as the generation of libraries and the amplification of fragments involve polymerase chain reaction (PCR) amplification, inaccuracies can result via amplification biases or from PCR enzyme substitution errors in which the wrong base is incorporated during amplification.
SEQUENCING BY SYNTHESIS: THE ILLUMINA PLATFORM
Currently, there are two commercially available NGS platforms in common use. One uses an approach called sequencing by synthesis that occurs in the microfluidic channels of a silicon-derived “flow cell” device (Fig. 11–1).11 Here, enzymatic amplification of library fragments on the flow cell surface results in hundreds of millions of DNA clusters, and the sequencing of each cluster occurs in parallel with all of the other clusters by a stepwise series of events. Solexa marketed the first commercially available sequencer using this technology in 2006, and was acquired by Illumina in 2007. Illumina offers a variety of different sequencing machines with varying run times (from hours to days), sequencing capacities (from 25 million reads to nearly 3 billion reads per flow cell), and overall output (from approximately 0.5 gigabase (Gb) to greater than 1.5 terabase (Tb) of sequenced bases per run).
Illumina library construction and sequencing process. Panel A represents the library construction process whereby high-molecular-weight genomic DNA is fragmented, ligated with adaptors, and amplified on a solid support prior to annealing of adaptor-complementary primers. Panel B represents the stepwise sequencing process whereby reagents are introduced to extend the primed fragments, the incorporated fluorescent nucleotides are detected, the 3′ end is deblocked, and the fluorescent groups on the incorporated nucleotides are removed prior to the next stepwise sequencing-by-synthesis series. (Reproduced with permission from Mardis, ER: Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif) 2013;6:287-303.)
SEQUENCING BY SYNTHESIS: OVERVIEW OF METHODOLOGY
As in Sanger sequencing, the sequencing-by-synthesis steps begin with annealing sequencing primers complementary to the adaptors to the amplified library fragments on the flow cell surface. Then, a solution containing DNA polymerases and fluorescently labeled, chemically modified dNTPs is added to the flow cell to begin an incorporation step. The DNA polymerases incorporate the complementary dNTP onto the 3′ ends of the primed fragments in each cluster. Each incorporation reaction is terminated after a dNTP is added because of a blocking group at the 3′ position. After a cycle of dNTP incorporation on the flow cell, a laser-based detection system scans the flow cell surfaces to excite the incorporated fluorescent groups and to collect the unique light emission of each of the four fluorescently labeled dNTPs. Chemical deblocking steps follow to (1) remove the fluorophore by cleavage (the fluorescently labeled dNTPs are known as “reversible dye terminators”) and (2) unblock the 3′ hydroxyl group to permit the next cycle of incorporation, detection, and deblocking.
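The stepwise cycle just described (incorporate one blocked base, detect, deblock) can be sketched as a toy simulation in Python. This is illustrative only, not vendor software, and the function names are invented for the example:

```python
# Toy simulation of reversible-terminator sequencing by synthesis
# (illustrative only): each cycle incorporates exactly one blocked,
# labeled dNTP per strand, images it, then deblocks for the next cycle.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sequence_by_synthesis(template, n_cycles):
    """Read out the bases synthesized against a template strand.

    template: library-fragment sequence downstream of the sequencing
    primer; one base is read per cycle.
    """
    read = []
    for cycle in range(min(n_cycles, len(template))):
        # Incorporation: the polymerase adds the complement of the next
        # template base; the 3' blocking group halts further extension.
        base = COMPLEMENT[template[cycle]]
        # Detection: the fluorophore's emission identifies the base.
        read.append(base)
        # Deblocking: fluorophore cleaved, 3' hydroxyl restored.
    return "".join(read)

print(sequence_by_synthesis("ACGTG", 4))  # -> TGCA
```

Note that read length is bounded by the number of cycles, mirroring why step-count (and its accumulating noise) limits Illumina read lengths.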
Unlike Sanger sequencing, Illumina’s sequencing-by-synthesis method generates relatively short read lengths, typically 100 to 300 bp. The limitations on read length are primarily a signal-to-noise issue: increasing numbers of steps in the sequencing-by-synthesis approach produce increasing noise at each step that competes with true signal detection. Hence, the data quality of Illumina reads tends to decrease with increasing step numbers. Illumina error rates are low, in the 0.1 to 0.3 percent range, and the predominant error type is base substitution.12 Ultimately, a complex, repetitive genome such as the human genome cannot be assembled from 300-bp read lengths, so algorithms were developed to align reads to the reference genome as a first step toward data interpretation.13 One approach by which Illumina has improved read mapping is paired-end sequencing, which permits sequence to be read first from one end and then from the other end of each amplified fragment cluster on the flow cell. Paired-end reads of this type are physically linked and defined by the fragment size, permitting their accurate placement onto the reference genome by alignment and effectively permitting more reads to contribute to coverage from a given sequencer run (when compared to single-end reads).14 Furthermore, as described later, the expected read placement onto the reference genome, when not met, is a source of information used to interpret structural variation.
SEQUENCING BY PH CHANGE SENSING: THE ION TORRENT PLATFORM
The second type of NGS platform in common use is the sequencing by pH sensing method that is marketed by Life Technologies (now a part of Thermo Fisher) as their Ion Torrent platform (Fig. 11–2). Life Technologies acquired Ion Torrent in 2010.15 The sequencing by pH-sensing method involves similar steps of library construction as described for sequencing by synthesis. However, the library DNA fragments are diluted and combined with (1) individual micron-scale beads that have covalently attached complementary adaptors on their surface and (2) PCR reagents, including DNA polymerase, into an emulsion PCR reaction. In emulsion PCR, one generates individual aqueous micelles that permit bead-based amplification of library fragments prior to sequencing. The emulsion PCR process generates beads carrying copies of identical DNA fragments suitable for sequencing. The DNA-coated beads are purified from the emulsion, enriched for those beads with amplified DNA on their surfaces, and then deposited into individual wells of a specifically constructed semiconductor plate, known as an Ion Chip. Sequencing primers (complementary to the adaptors) are annealed to the bead-amplified fragments, and then the sequencing process is initiated by the addition of DNA polymerase and flow of a single dNTP-containing buffer solution across the Ion Chip surface. The flow of the four nucleotides occurs in a stepwise fashion, with a detection step and an intervening wash. When a specific dNTP is incorporated into the elongating strands of DNA fragments on a specific bead, hydrogen ions are released, and a highly sensitive pH sensor built into the Ion Chip can read out the subsequent change in pH for the well containing that bead. If no dNTP is incorporated at that cycle, no change in pH is registered for that well. This approach follows for all wells containing beads on the Ion Chip, resulting in massively parallel sequencing. 
As with the Illumina technology, read lengths are short, in the 100 to 400 bp range. Unlike the Illumina platform, which uses paired-end reads, Ion Torrent sequencing reads are single-end reads. Most sequencing errors generated by the Ion Torrent platform are insertion/deletion errors in stretches of identical bases on the template strand, a result of the difficulty of discerning the proportional change in pH when more than four consecutive identical nucleotides are incorporated in a single flow.16 Advantages of the Ion Torrent are that run times are very short (in the 2- to 7-hour range) and the cost per run is relatively inexpensive. The output (up to 2 Gb), read length, run time, and cost vary by the Ion Chip type used.
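The homopolymer limitation can be illustrated with a toy flow-space encoding. The flow order used here is an assumption for the example, not necessarily that of the instrument:

```python
# Toy flow-space readout for pH-sensing sequencing. The flow order
# "TACG" is assumed for illustration. Each flow's signal is roughly
# proportional to the number of identical bases incorporated, which
# is why long homopolymers are the dominant source of indel errors.

FLOW_ORDER = "TACG"

def to_flow_signals(read, n_flows):
    """Return the ideal per-flow incorporation counts for a read."""
    signals, pos = [], 0
    for i in range(n_flows):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        count = 0
        while pos < len(read) and read[pos] == base:
            count += 1
            pos += 1
        signals.append(count)
    return signals

# A four-base homopolymer produces a single flow of magnitude 4;
# distinguishing 4 from 5 incorporations requires resolving a 25
# percent signal difference, which degrades as homopolymers lengthen.
print(to_flow_signals("TTTTACG", 8))  # -> [4, 1, 1, 1, 0, 0, 0, 0]
```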
Ion Torrent library construction and sequencing process. Panel A represents the specifics of the Ion Torrent library amplification process, which requires an emulsion PCR amplification on the surface of a bead with covalently attached adaptor-complementary primers, followed by emulsion breaking and bead addition to the Ion Chip for sequencing. The sequencing process, illustrated in panel B, flows sequential high-purity dNTP solutions across the chip surface for incorporation. Upon incorporation, hydrogen ions are released and are detected by the pH-sensing capability of the chip, as shown in panel C. (Used with permission from Thermo Fisher Scientific.)
NEXT-GENERATION SEQUENCING TECHNOLOGY IN DEVELOPMENT: SINGLE-MOLECULE SEQUENCING
There is one commercially available platform for single-molecule sequencing, the Pacific Biosciences RSII instrument.17 Single-molecule sequencing differs from the platforms discussed previously primarily in that no PCR amplification is required prior to data generation. This has obvious advantages in eliminating some sources of bias that result from the use of PCR, but has a disadvantage in that higher input amounts of DNA are typically required. The other major difference in the Pacific Biosciences approach is the read length obtained, which varies according to the template type but can exceed 50,000 bases when very long molecules are input to library construction.
The Pacific Biosciences approach couples primed DNA library fragments with DNA polymerase molecules that are specifically engineered for the sequencing system (Fig. 11–3). These complexes are introduced to the surface of a SMRTCell, a nanofabricated sequencing device, which consists of 150,000 zero-mode waveguides (ZMWs). In effect, the loading of complexes aims to place one DNA polymerase/DNA template complex into each ZMW in preparation for sequencing. The ZMW is a nanofabricated pore that focuses the laser excitation and detection optics at the bottom of the ZMW where the DNA polymerase complex is bound, isolating the detection area to the active site of the polymerase. The sequencing process initiates with the introduction of fluorescent nucleotides and buffers, and is continuously monitored by the excitation/detection optics during the run time. As fluorescently tagged nucleotides sample into the active site, they can be detected with sufficient dwell time upon their incorporation into the synthesized strand. Because each fluorescent group is specific to the nucleotide identity, the sequence is read out based on the detected emission wavelength. The fluorescent group is attached to the phosphate portion of the nucleotide, so incorporation removes it by cleavage during the phosphodiester bond formation, and it diffuses out of the ZMW focus.
Pacific Biosciences real-time sequencing and detection process. The primed library fragments are complexed to DNA polymerases and applied to the surface of a SMRTCell, where they locate into zero-mode waveguides (ZMWs). After providing fluorescently labeled nucleotides and buffer, the sequencing process is monitored by real time detection, whereby incorporated nucleotides are detected in the active site of each ZMW-isolated polymerase complex by the laser/detection optics of the instrument. Here, the fluorescence events are recorded for each active ZMW throughout a preset duration, resulting in the final sequencing read data for each single DNA molecule. (Used with permission from Pacific Biosciences.)
Single-molecule sequencing has, by definition, an inherently higher error rate as a consequence of the signal-to-noise ratio associated with detecting a single event in real time. The predominant error type in Pacific Biosciences sequencing reads is an insertion/deletion error that may result from inaccuracies in detecting (1) a nucleotide that had a longer than average dwell time but was not incorporated, (2) a single nucleotide that was incorporated but was mistaken for two (or more) nucleotides, or (3) multiple nucleotide incorporations into a homopolymer stretch. In spite of an approximate 15 percent error rate, the errors are essentially random, which means that oversampling (or “coverage”) of the sequence of interest can correct most errors, resulting in a cumulative error rate of around 0.1 percent following read assembly.18 The very long read lengths possible on this platform enable read assembly, rather than the read alignment needed in short-read platform data analysis. Assembly has obvious advantages in its ability to represent novel content in a genome and to provide long-range haplotyping information.
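The idea that random errors correct out with oversampling can be illustrated with a toy per-position majority vote. This is a sketch of the principle only, not the actual consensus algorithm used in read assembly:

```python
from collections import Counter

# Toy per-position majority vote over reads covering the same locus,
# illustrating why random (nonsystematic) errors are corrected by
# coverage: at any position, most reads carry the true base.

def consensus(aligned_reads):
    """Return the majority base at each position of equal-length reads."""
    out = []
    for i in range(len(aligned_reads[0])):
        counts = Counter(read[i] for read in aligned_reads)
        out.append(counts.most_common(1)[0][0])
    return "".join(out)

reads = [
    "ACGTAC",
    "ACGAAC",  # random substitution at position 3
    "ACGTAC",
    "TCGTAC",  # random substitution at position 0
    "ACGTAC",
]
print(consensus(reads))  # -> ACGTAC
```

Because the individual errors fall at different positions in different reads, fivefold coverage already recovers the true sequence here; systematic errors, by contrast, would survive the vote.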
Emerging DNA sequencing technologies are being developed around the central concept of translocating DNA molecules through nanopores, which can be either biologic or nanofabricated pores.19 In nanopore sequencing, detection of the nucleotide sequences occurs during nanopore translocation events that are sensed by changes in electrical current that correlate to sequence, or by laser-based detection of incorporated fluorescent nucleotides. Nanopore sequencing, while still somewhat theoretical, may offer rapid sequencing with very long read lengths.
TARGETED NEXT-GENERATION SEQUENCING: FROM GENE PANELS TO EXOMES AND BEYOND
Initially, NGS platforms were used for whole-genome sequencing of organisms with relatively small genomes, such as bacteria or model organisms (Caenorhabditis elegans, Drosophila melanogaster, etc.), or for combining large numbers of PCR products into a single sequencing run. As the throughput per run improved, larger genomes, including human genomes, were studied, including the first cancer genome.20,21 However, the cost and complexity of analysis for whole human genome studies, along with the difficulty of interpretation of variants identified outside of the known genes, inspired the development of methods to focus sequencing onto these loci. In particular, hybrid capture techniques were developed that provided either a subset of known genes (all kinases, for example), or all the known genes (the “exome”) by a series of selective steps that led directly to NGS data generation (Fig. 11–4).22 At its essence, hybrid capture relies on synthetic DNA probes that are complementary to sequences of the known exons of genes in the genome of interest.23,24 In typical current protocols, the probes have covalently attached biotin moieties that enable downstream selection by streptavidin-coated magnetic particles. By combining a whole-genome library with the hybrid capture probes under conditions that favor hybridization (stoichiometry of probes to targets, temperature, and buffer conditions), probe:library fragment hybrids are formed. Following their selection by streptavidin magnetic bead binding and application of a magnetic force to isolate the beads, the noncaptured library fragments are removed, washes performed and the hybridized fragments eluted by denaturation from the probes. The resulting fragments are PCR amplified, quantitated, and sequenced by NGS. At present, the throughput of genome-scale NGS platforms permits the combination or “multiplexing” of the resulting fragments from several hybrid capture reactions into a sequencing run. 
Multiplexing is enabled by the inclusion of DNA barcodes that are synthesized onto the library adaptors, and demultiplexing of the sequencing reads occurs downstream of the instrument run using the appropriate bioinformatics program. Although exome sequencing costs about one-tenth as much as whole-genome sequencing, it is important to note that typical yields from hybrid capture range from 85 to 90 percent of the targeted regions being covered at sufficient depth to confidently predict variants. Furthermore, the range of variant types detected by hybrid capture is often limited. Single nucleotide variants and short insertion/deletion variants can be detected, but copy number and structural variants are difficult to detect reliably, especially if they are not anticipated by the addition of specially designed probes to capture them and by the specialized analyses required to detect them.
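The demultiplexing step can be sketched as follows. Exact-match barcodes at the 5′ end of each read are assumed for simplicity (production tools also tolerate mismatches), and the sample names and barcodes are hypothetical:

```python
# Sketch of barcode demultiplexing with exact-match barcodes at the
# 5' end of each read; sample names and barcodes are hypothetical.

def demultiplex(reads, barcodes):
    """Bin reads by leading barcode; trim the barcode off each read.

    reads: list of (read_name, sequence) tuples.
    barcodes: dict mapping barcode sequence -> sample name.
    """
    bc_len = len(next(iter(barcodes)))
    bins = {sample: [] for sample in barcodes.values()}
    bins["undetermined"] = []
    for name, seq in reads:
        sample = barcodes.get(seq[:bc_len], "undetermined")
        bins[sample].append(seq[bc_len:])
    return bins

barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
reads = [("r1", "ACGTGGGTT"), ("r2", "TGCAAACCC"), ("r3", "NNNNTTTTT")]
print(demultiplex(reads, barcodes))
```

Reads whose barcode matches no sample land in an "undetermined" bin, mirroring the behavior of standard demultiplexing pipelines.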
Overview of hybrid capture preparation for sequencing. This illustration presents a generalized overview of the process for hybrid capture selection prior to DNA or RNA sequencing. In general, probes are designed for the targeted regions of interest, which can constitute a small number of genes or hotspot loci, up to the full exome (all annotated genes in a genome). Following hybrid capture, the probe:library duplexes are isolated from solution by streptavidin magnetic beads. Release of the library fragments by denaturation is followed by amplification, quantitation, and sequencing. (Used with permission from Illumina, Inc.)
OVERVIEW OF NEXT-GENERATION DNA SEQUENCING ANALYSIS
It can be easily argued that the relative ease of performing biomedical experimentation imparted by NGS-based methods has conversely required more complicated analytical approaches to accurately interpret the resulting data.25 As mentioned earlier, this is partly a result of the complexities of the human genome and the requirement for short reads to be aligned to the reference sequence as a first step for data analysis. It also is a result of computational infrastructure and software pipeline requirements to align and analyze data because of the sheer magnitude of data generated in a single experiment, which is exacerbated by multiple samples, multiple time points, and the need to integrate data of different types for the correlative analyses that are desired.
Most cancer-focused analyses have as a central goal the identification of DNA variants that are unique to the tumor cells (“somatic”) as compared to the inherited (“constitutional” or “germline”) genome. In practice, the desired comparison (whether the sequencing platform is a targeted gene panel, exome, or whole genome) is achieved by first aligning sequencing reads from the tumor library and from the matched normal library against the human genome reference sequence as separate entities. Algorithms that have specialized logic to identify different types of variation (single nucleotide, or “point,” mutations; small insertions or deletions; copy number alterations; or structural alterations) then are used to separately examine each set of read alignments and to identify the specific variation type relative to the human genome reference sequence. Lastly, the identified variants are compared between the tumor and normal datasets to identify those that appear unique to the tumor. As a means of interpreting the impact of all identified somatic variants on the sequence of amino acids in a given gene, for example, one must secondarily apply the annotation of the human genome onto identified single nucleotide and indel (a term for the insertion or deletion of bases) variants that occur within the coding regions and splice sites of known genes. Somatic single nucleotide variants (Chap. 10) can preserve the resulting amino acid (“synonymous”); can encode a different amino acid (“nonsynonymous”); can abolish a splice site and therefore alter the gene reading frame according to the intronic sequences up to the next encoded stop codon (“splice site”); can remove an existing stop codon (“readthrough”); or can introduce a premature stop codon (“nonsense”). Indel mutations typically cause a shift in the open reading frame (“frameshift”) and result in a different amino acid sequence and length of the resulting protein, depending upon the number of added or deleted nucleotides.
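The codon-level classification of single nucleotide variant consequences can be illustrated with a toy classifier using the standard genetic code. This is a sketch only; clinical annotators also handle transcripts, strand, and splice sites:

```python
# Toy classifier for the coding consequence of a single nucleotide
# variant, using the standard genetic code table ('*' denotes a stop).

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_snv(ref_codon, pos_in_codon, alt_base):
    """Compare the reference and mutant codon translations."""
    alt_codon = (ref_codon[:pos_in_codon] + alt_base
                 + ref_codon[pos_in_codon + 1:])
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "readthrough"
    return "nonsynonymous"

print(classify_snv("GAA", 2, "G"))  # Glu -> Glu: synonymous
print(classify_snv("TGG", 1, "A"))  # Trp -> stop: nonsense
```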
If the number added or deleted is a multiple of three nucleotides, the open reading frame is preserved but the protein sequence is altered accordingly.
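The reading-frame rule for indels reduces to simple modular arithmetic, sketched here:

```python
# The reading-frame rule for coding indels: a net length change that
# is a multiple of three preserves the frame (in-frame); anything
# else shifts it (frameshift).

def classify_indel(inserted_bases, deleted_bases):
    """Classify an indel by its net change in coding-sequence length."""
    if inserted_bases == 0 and deleted_bases == 0:
        raise ValueError("not an indel")
    net = inserted_bases - deleted_bases
    return "in-frame" if net % 3 == 0 else "frameshift"

print(classify_indel(3, 0))  # 3-bp insertion -> in-frame
print(classify_indel(0, 2))  # 2-bp deletion  -> frameshift
```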
Copy number gains or losses are defined by statistically significant variation in regional read density and often are described by the genes that lie in the altered region.26,27 Structural variants are broadly defined as chromosomal segments that are inserted, inverted relative to the germline sequence, or translocated relative to the germline sequence. Here, algorithms identify the different types of structural variants based on multiple read-pair alignments that are spaced farther apart than expected, as defined by the insert size of the sequencing library used (“deletions”); or are spaced more closely than expected (“insertions”); or have the incorrect orientation of read direction for the read pairs aligned to the same chromosome (“inversions”); or have the forward and reverse reads of multiple read pairs on different chromosomes (“translocations”). Insertions and inversions may result in a fusion protein by virtue of juxtaposition of exons from two genes on either the same (inversion) or different (insertion) chromosomes. Translocations also can result in gene fusions but involve juxtaposed exons from genes present on different chromosomes in the germline. There are multiple examples of gene fusions that result in proteins with a demonstrated role in oncogenesis.28
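Read-pair evidence for structural variants can be sketched as a toy classifier. Note that mates mapping farther apart than the library insert size signal a deletion in the sample (the deleted reference sequence separates the mates upon alignment), while mates mapping closer than expected signal an insertion; the field names and thresholds here are illustrative, not those of any specific caller:

```python
# Toy classifier for structural variant evidence from one read pair.
# Field names and thresholds are illustrative.

def classify_read_pair(pair, expected_insert, tolerance):
    """pair: dict with chrom1, chrom2, orientation ('FR' is the
    expected forward/reverse layout), and span (distance between
    the mates' mapped positions)."""
    if pair["chrom1"] != pair["chrom2"]:
        return "translocation"   # mates on different chromosomes
    if pair["orientation"] != "FR":
        return "inversion"       # unexpected relative read direction
    if pair["span"] > expected_insert + tolerance:
        return "deletion"        # mates map farther apart than expected
    if pair["span"] < expected_insert - tolerance:
        return "insertion"       # mates map closer than expected
    return "concordant"

pair = {"chrom1": "chr9", "chrom2": "chr22", "orientation": "FR", "span": 0}
print(classify_read_pair(pair, expected_insert=350, tolerance=50))
# -> translocation (e.g., the t(9;22) BCR-ABL1 signature)
```

Real callers require multiple supporting read pairs before reporting a variant, which suppresses the artifacts that single chimeric fragments can produce.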
Genetic susceptibility to hematologic malignancies can occur either by inheritance or by de novo mutations in genes such as BRCA1/2, TP53, and others. Here, variants in the germline can be identified from sequence reads aligned to the human reference sequence, followed by annotation of the known cancer susceptibility genes. The pathogenicity of a given variant can be evaluated relative to databases of previously catalogued variants in these genes, if available. Identification of these variants typically will require, with the patient's consent, a genetic counseling session for the patient and family members to communicate the information about the germline susceptibility and its possible consequences for siblings and children (discussed below in “Next-Generation Sequencing as a Clinical Assay: Implications for the Practicing Hematologist”).
There are a variety of data analyses that integrate NGS data from different starting materials, such as DNA and RNA from the same tumor, or across large groups of tumors (either from the same or different disease sites). One example of data integration is evaluating RNA sequencing data to support a specific variant identified initially from tumor to normal DNA comparisons, such as a predicted fusion gene. In this example, the confirmed detection of the gene fusion in RNA provides confidence that the structural variant algorithm has identified a true positive. Such a result can also confirm cytogenetic results from conventional diagnostic assays. Similarly, the identification of a DNA-level mutation that appears to introduce a protein-truncating variant (frameshift or splice site mutation) can be evaluated by examining the RNA sequencing data for evidence of its transcription. Because these transcripts are often subject to nonsense-mediated decay (a surveillance pathway that reduces errors in gene expression by eliminating mRNA transcripts that contain premature stop codons), RNA data verifying whether the transcript is present (and, if so, whether it carries the mutation) or absent can provide important information.
Hematologic malignancies have very specific considerations in experimental design and data analysis that should be noted. In particular, while high tumor cell content is typically derived from marrow biopsies, and therefore a majority of cells contributing DNA to NGS libraries are tumor cells, the matched normal sample can be problematic in the following regard. In patients with high circulating tumor cell content in the blood, the use of a skin, buccal swab, or mouthwash sample to provide the normal sample may have contaminating tumor cell content that will complicate the identification of somatic variants. Although consent to obtain a second normal sample once the patient achieves remission may be used to address this dilemma, not all patients achieve remission, and some patients will refuse the second biopsy because of discomfort. Flow sorting the blood or marrow to isolate a nonmalignant cell population (often normal T cells) can provide a matched normal if no alternative source is available.
The rapid and uncontrolled growth and cell division inherent to cancer cells often means that not all cancer cells in a patient will have the same somatic alterations. This has been demonstrated for leukemias and myelodysplastic syndromes and is referred to as genomic heterogeneity.29,30,31,32,33,34,35,36 In essence, every cancer cell carries the same set of founder mutations (sometimes referred to as “truncal”), but subclones can exist in the tumor cell population, each of which carries additional mutations unique to that subclone. As yet, the importance of heterogeneity has not been definitively demonstrated in the context of outcome, likelihood to relapse, resistance to therapy, or other possible clinical attributes. Tumor subclones can be defined by their somatic mutational landscape from high depth NGS, where the digital nature of the NGS data is exploited by algorithmic clustering of mutations that share the same variant allele fraction (VAF). In particular, the VAF of any mutation is defined as the fraction of sequencing reads that contain the somatic variant (as compared to the germline or inherited nucleotide at that locus). Changes in the heterogeneity of cancer cell populations can be studied by comparing data from temporal sampling of a patient, such as at diagnosis and disease relapse.
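The VAF computation, and the grouping of mutations into candidate clones, can be sketched as follows. The mutation names and read counts are hypothetical, and real analyses use statistical clustering rather than this naive binning:

```python
# Sketch of variant allele fraction (VAF) computation and a naive
# binning of mutations into candidate clones. Mutation names and read
# counts are hypothetical.

def vaf(alt_reads, total_reads):
    """Fraction of reads at a locus carrying the somatic variant."""
    return alt_reads / total_reads

def group_by_vaf(mutations, bin_width=0.1):
    """mutations: dict of name -> (alt_reads, total_reads).

    Mutations with similar VAFs cluster together, suggesting they
    arose in the same (sub)clone.
    """
    clusters = {}
    for name, (alt, total) in mutations.items():
        key = round(round(vaf(alt, total) / bin_width) * bin_width, 2)
        clusters.setdefault(key, []).append(name)
    return clusters

muts = {
    "DNMT3A_variant": (248, 500),  # VAF ~0.50: founder ("truncal")
    "NPM1_variant": (240, 490),    # VAF ~0.49: founder
    "FLT3_variant": (55, 400),     # VAF ~0.14: subclonal
}
print(group_by_vaf(muts))
```

In this hypothetical, the two mutations near a VAF of 0.5 behave as heterozygous founder events present in every tumor cell, while the low-VAF mutation is consistent with a subclone.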
NEXT-GENERATION SEQUENCING–BASED COMPREHENSIVE GENOMICS: FROM STUDIES OF THE TRANSCRIPTOME TO DNA METHYLATION TO CHROMATIN ACCESSIBILITY AND MODIFICATIONS
The study of modern genomics by NGS methods is not limited to the sequencing of genomic DNA but also can include (1) the characterization of RNA transcripts, (2) the physical structure of genomes including chromatin organization and protein-DNA interactions, and (3) the identification of specific chemical modifications to nucleotides and histones.37
Analysis of the Transcriptome: RNA Sequencing
RNA sequencing (RNA-seq) involves the conversion of RNA into complementary DNA (cDNA) by reverse transcription followed by NGS library construction.38 RNA-seq uses the digital nature of NGS technology to quantify levels of RNA transcripts. Previously, microarrays (designed with a fixed content of gene-specific probes) were used to assay gene expression by hybridization to reverse-transcribed RNA isolates. By contrast, RNA-seq offers the advantages of comprehensive and less-biased data analysis, with a broader dynamic range for detection of high and low abundance transcripts. With the single base resolution provided by RNA-seq, one can determine the expression of specific mutant alleles present in the germline or in cancer samples, which may be highly relevant for implementing a small molecule or immunotherapy-based targeted therapeutic. RNA-seq data can be analyzed to detect the expression of alternatively spliced isoforms of transcribed genes or to detect the transcriptional product(s) of gene fusions in cancer cells. RNA-seq can be produced as either single- or paired-end reads, where the latter are better suited to detect alternative splicing and gene fusions. Additionally, RNA-seq data can identify strand specificity of the DNA template, wherein RNA derived from the antisense strand may play an important role in regulating gene expression. Finally, the insert size of the RNA-seq libraries can be targeted to enrich for different subsets of the transcriptome. Small fragment size libraries (approximately 15 to 70 bp) enrich for microRNA (miRNA), short-interfering RNA (siRNA) and PIWI-interacting RNA (piRNA), intermediate size libraries (approximately 70 to 200 bp) enrich for small nuclear (snRNA) and small nucleolar RNA (snoRNA), and larger fragment libraries (excluding fragments less than 200 bp) enrich for messenger RNA (mRNA) and long noncoding RNA (lncRNA).
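The digital quantification of transcripts described above can be illustrated with a toy counts-per-million normalization. Gene assignments are taken as given here, and the gene names and counts are hypothetical; real pipelines align reads and resolve multi-mapping:

```python
from collections import Counter

# Toy digital expression quantification: reads counted per gene and
# normalized to counts per million (CPM) so that samples sequenced to
# different depths are comparable.

def counts_per_million(read_gene_assignments):
    """read_gene_assignments: one gene name per aligned read."""
    counts = Counter(read_gene_assignments)
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

# Hypothetical assignments for 1,000 aligned reads
reads = ["TP53"] * 300 + ["GAPDH"] * 650 + ["FUSION"] * 50
print(counts_per_million(reads))
# The low-abundance fusion transcript is still detected digitally
```

Because every read is a discrete count, rare transcripts are detected in proportion to sequencing depth, which underlies RNA-seq's broader dynamic range relative to hybridization microarrays.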
There are many protocols for RNA-seq, including different commercially available kits that exploit the aforementioned experimental focus areas. For example, protocols to study the “transcriptome,” which is defined as all the expressed RNA from a given cell or cell population, are often optimized to preferentially target one (or more) types of RNA that are pertinent to a particular area of clinical or research interest. Thus, a researcher interested only in detecting gene expression of annotated mRNA transcripts would choose either an RNA-seq protocol that included ribosomal RNA (rRNA) depletion (rRNA may represent up to 60 percent of transcripts in a cell) or one that used an initial poly-A enrichment step (as rRNAs are not polyadenylated). By comparison, noncoding RNAs play a role in many cellular processes but are not polyadenylated; for these targets, poly-A enrichment is not applicable, and a protocol that preserves strand specificity should be used instead.
RNA is a less-stable molecule than DNA and hence assessing the quality of the isolated RNA prior to creating a sequencing library is of paramount importance. The source for the RNA may be fresh tissue, fresh-frozen tissue, or formalin-fixed, paraffin-embedded (FFPE) tissue, and each of these sources may influence the quality of the resulting RNA. RNA derived from FFPE tissue is often at least partially degraded because of formalin crosslinks with the RNA backbone that result in breakage. Similarly, the amount of RNA available from clinical specimens is often quite limited, making necessary the use of RNA amplification prior to library construction, or the use of hybrid capture probes to enrich the on-gene yield of sequencing data from low input sources.39
As the analysis of RNA-seq data is distinct in many ways from DNA sequencing data analysis, multiple software tools are available to characterize differential gene expression, differential splicing, gene fusion detection, and allele-specific expression.40,41 In regard to cancer-specific analyses of RNA, a paired “normal” comparator from adjacent nonmalignant cells is often not available (or even well defined), which complicates the analysis and interpretation of RNA-seq data. However, efforts are now under way to catalogue expression in normal human tissues and to provide these results in public databases for comparison purposes.
Next-Generation Sequencing–Based Studies of Chromatin Modifications
Chromatin immunoprecipitation followed by NGS-based whole-genome sequencing is known as ChIP-seq.42 When studying chromatin modifications (Chap. 12), the targets are often transcription factors or specific histone modifications (such as methylation or acetylation) that may be important for regulation of gene expression. In brief, ChIP-seq begins with standard chromatin immunoprecipitation: protein and DNA are crosslinked in growing cell culture, the fixed and crosslinked DNA–protein complexes are fragmented, immunoprecipitated with an antibody specific for the protein of interest, and the DNA isolated from the precipitated material. After DNA isolation, a standard NGS library is prepared by adapter ligation and sizing, and the DNA is sequenced by standard NGS methods. Given the digital nature of NGS, the number of reads aligning to a particular area of the genome is directly proportional to the amount of input DNA from that region. Thus, one can determine “peaks” with a statistically significant increased number of aligned reads and infer that the genomic regions underlying the peaks are the specific areas where the protein of interest was bound to the DNA.43,44 Antibody specificity and avidity remain key determinants for the validity of ChIP-seq data, as does identifying the appropriate coverage cutoff value that determines a “peak.”
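The coverage-based notion of a "peak" can be sketched with a toy fixed-window counter. Production callers (e.g., MACS) instead model the local background statistically; the coordinates and cutoff here are illustrative:

```python
from collections import Counter

# Toy peak calling: count aligned reads in fixed genomic windows and
# flag windows whose counts exceed a coverage cutoff, inferring that
# the underlying region was bound by the immunoprecipitated protein.

def call_peaks(read_starts, window=100, cutoff=10):
    """read_starts: start coordinates of aligned reads on one
    chromosome. Returns (window_start, read_count) for enriched bins."""
    bins = Counter(start // window for start in read_starts)
    return [(b * window, n) for b, n in sorted(bins.items()) if n >= cutoff]

# 30 reads piled at ~5 kb (a putative binding site) over background
reads = [5000 + i for i in range(30)] + [100, 900, 2500, 7700, 12000]
print(call_peaks(reads))  # -> [(5000, 30)]
```

The choice of cutoff is exactly the "appropriate coverage cutoff value" noted above: too low and background noise is called as binding, too high and weak but genuine sites are missed.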
Next-Generation Sequencing–Based Studies of Chromatin Accessibility
The interaction of DNA and proteins to form chromatin plays an increasingly recognized role in the study of genomics and epigenomics (Chap. 12). Several NGS-based methods can interrogate the physical structure of chromatin. These methods, which fragment DNA according to chromatin accessibility, allow determination of nucleosome positioning and inference of protein–DNA binding sites. Although they do not directly identify specific protein–DNA binding sites, the sequences at inferred binding sites can serve as an indirect assay of transcription factor binding genome-wide, without the limitations of ChIP-seq described above. NGS-based protocols for determining chromatin accessibility differ mainly in their DNA fragmentation step. Three commonly used protocols are DNase-seq, MNase-seq, and ATAC-seq. DNase-seq uses DNase I to fragment DNA at DNase I–hypersensitive sites, a marker of chromatin accessibility.45 MNase-seq uses micrococcal nuclease (MNase) to cleave the DNA at accessible sites.46 ATAC-seq uses the hyperactive Tn5 transposase to simultaneously fragment accessible DNA (with minimal sequence bias) and add sequencing adaptors.47 Another approach to studying chromatin accessibility, FAIRE-seq, involves formalin crosslinking of DNA to proteins prior to random fragmentation by sonication.48 A related crosslinking-based method, chromosome conformation capture (or “3C”), in which interacting chromatin domains are crosslinked, sequenced, and analyzed to determine higher-order structural associations, can provide insight into the spatial organization of a genome.49
Next-Generation Sequencing–Based Studies of Chemical Modifications to DNA: DNA Methylation and Hydroxymethylation
Unless otherwise specified, DNA methylation is generally synonymous with cytosine methylation. Cytosine can undergo methylation or hydroxymethylation at its C5 position to form 5-methylcytosine (5-mC) or 5-hydroxymethylcytosine (5-hmC), respectively. Both modifications typically occur when a cytosine is positioned directly 5′ of a guanine (known as a CpG dinucleotide). There are approximately 26 million CpGs in the human genome. The first genome-wide platforms to detect DNA methylation changes at base-pair resolution were microarrays designed to hybridize to targeted CpGs across the genome (current methylation microarrays target approximately 500,000 CpGs). However, the CpG representation on a microarray was often biased toward gene promoters or other regions of predetermined interest.
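As a small illustration of the CpG dinucleotide definition above, the following Python snippet (purely illustrative; the example sequence is made up) locates CpG sites on one strand of a sequence:

```python
def cpg_positions(seq):
    """Return 0-based start positions of CpG dinucleotides in a sequence."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]
```

Scanning a whole chromosome this way is how CpG counts such as the genome-wide figure quoted above are tallied.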
Many protocols exist for differential fragmentation of a genome based on DNA methylation prior to array capture. For example, methylated cytosines are protected from cleavage by particular restriction enzymes: HpaII will cleave C-C-G-G but not C-5mC-G-G, whereas MspI will cleave both. By creating separate fragmentation libraries with each enzyme and then hybridizing each library to a separate array, differentially methylated sites can be determined.50 Alternatively, one can perform DNA methylation studies using sodium bisulfite conversion of cytosine to uracil (which is read as thymine). Neither 5-mC nor 5-hmC undergoes bisulfite conversion; both are read as cytosine in a downstream assay. Microarrays designed for bisulfite-treated DNA have distinct paired probe sets that capture specific differentially methylated CpGs. NGS has enabled the direct sequencing of bisulfite-converted DNA for unbiased, genome-wide evaluation of methylation and hydroxymethylation.51 In whole-genome bisulfite sequencing (WGBS), a standard sequencing library is prepared with adaptors containing methylated cytosines, followed by bisulfite conversion of the library. WGBS is complicated by several factors, including (1) the large amount of input DNA required (bisulfite conversion degrades DNA), (2) incomplete conversion of cytosine to uracil, and (3) the analytic challenge of accurately determining which cytosines in the sequencing reads remained unconverted because of methylation or hydroxymethylation.
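The core logic of bisulfite conversion and the resulting methylation calls can be sketched as follows. This hypothetical Python example treats conversion as complete and ignores strand, PCR, and the fact that bisulfite alone cannot distinguish 5-mC from 5-hmC (both appear in the protected set here):

```python
def bisulfite_convert(seq, protected_positions):
    """Simulate sodium bisulfite conversion of one DNA strand:
    unmethylated C -> U (sequenced as T); 5-mC/5-hmC at the given
    0-based positions are protected and remain C."""
    return "".join(
        "T" if base == "C" and i not in protected_positions else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference, converted_read):
    """Compare a converted read to the reference: a reference C that
    still reads C is called methylated; a C read as T is unmethylated."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted_read)):
        if ref == "C":
            calls[i] = "methylated" if obs == "C" else "unmethylated"
    return calls
```

Incomplete conversion, the second complication noted above, would leave some unmethylated cytosines reading as C, which is why real pipelines estimate a conversion efficiency from control DNA.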
To determine whether cytosines are methylated or hydroxymethylated, researchers have designed alternative protocols with an added chemical or enzyme-mediated conversion step, or with antibody-mediated differential capture of 5-mC and 5-hmC prior to sequencing.52,53 Capture-based methods can also be used to target only 5-mC prior to library preparation and sequencing, which may allow genome-wide methylation studies at base-pair resolution using smaller amounts of input DNA than WGBS requires.54 A newer transposase-based tagmentation method, similar to the approach used for ATAC-seq, also allows WGBS with very small amounts of input DNA.55
APPROACHES TO DNA SEQUENCING FOR RESEARCH PURPOSES
The study of genomics for research purposes has also shifted as a result of NGS technology. Before the broad availability of NGS platforms, most genomics research studies were genome-wide association studies (GWASs), which used a microarray platform to assay for significant differences in allele frequency across the panel of single nucleotide polymorphisms (SNPs) included on the array (modern arrays often include probes to genotype more than 1 million SNPs).56 GWASs require large numbers of samples (cases and controls) and are powered to identify SNPs that are in linkage disequilibrium with an associated condition.57 The true pathologic variant is unlikely to be discovered by the GWAS itself; instead, the results of a GWAS can provide the basis for a targeted sequencing study to determine the pathologic alteration(s). In the era of decreasing cost and broad availability of NGS, most genomics studies have shifted to a more inclusive discovery platform, such as whole-genome sequencing or exome sequencing. Using a platform with single-base resolution rather than a fixed-content microarray increases the power to identify a pathologic variant and may reduce the number of samples required. However, for complex genetic diseases, in which multiple genes may play a causative role, the number of samples required remains large and can be cost prohibitive. In these situations, investigators often use GWAS methods (with less-expensive microarrays) for the initial discovery work, followed by region-specific NGS discovery sequencing.
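The association signal at the heart of a GWAS reduces to comparing allele counts between cases and controls at each SNP. A minimal sketch in Python (a 2×2 chi-square statistic on made-up allele counts; real pipelines add quality control, covariates, and genome-wide multiple-testing correction):

```python
def allele_chi_square(case_alt, case_total, control_alt, control_total):
    """Chi-square statistic for a 2x2 table of allele counts
    (alt vs. ref alleles, cases vs. controls) at a single SNP."""
    table = [
        [case_alt, case_total - case_alt],
        [control_alt, control_total - control_alt],
    ]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

Because roughly a million such tests are run per array, genome-wide significance conventionally demands very small p values (on the order of 5 × 10⁻⁸), which is one reason GWASs need large cohorts.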
Several ethical issues complicate the NGS-based study of human genomes. First, sequencing data may be “identifiable,” meaning that a person’s identity could potentially be determined from sequencing results obtained in a genomic study when compared with data from a second genotyping assay (such as one performed for diagnostic or criminal purposes). The Genetic Information Nondiscrimination Act (GINA) of 2008 made it illegal in the United States for employers and health insurance providers to discriminate based on genetic findings. Nevertheless, persons enrolling in genomics research studies must be informed of this theoretical risk of identifiability and provide proper informed consent. Another consequence of genomics studies is that researchers must consider the return of genetic results to patients. Returned results fall into two general categories: incidental findings and findings pertinent to the condition being studied. There is no standard approach to the return of results, as practice varies case by case depending on the sequencing study and the result to be communicated; however, new guidelines are emerging.58 There is general consensus in the genomics research community that these issues, and how they will be handled for a particular study, must be clearly presented in the study protocol and the informed consent documentation.
The sequencing of cancer genomes, whether by whole-genome sequencing, exome sequencing, or multigene panels, is associated with several additional considerations. Proper informed consent is again paramount. Careful sample banking is critical to avoid degradation of nucleic acids prior to their isolation, as high-quality DNA or RNA increases the likelihood of successful sequencing regardless of the NGS platform used. For DNA sequencing studies of a cancer sample, a matched “normal” sample is often also sequenced to discern the somatic versus germline status of any identified alterations.
NEXT-GENERATION SEQUENCING AS A CLINICAL ASSAY: IMPLICATIONS FOR THE PRACTICING HEMATOLOGIST
Using NGS as a clinical assay platform offers many opportunities for new clinical tests and potential therapeutic interventions. Clinical sequencing requires the same high standards for sample banking, nucleic acid isolation, and informed consent, and it must be performed in an appropriately certified clinical laboratory environment. As the depth of coverage increases for NGS-based platforms, the statistical power to detect variation increases until it is outweighed by the intrinsic error rate of the sequencing platform. For clinical NGS-based diagnostic tests, these error metrics must be predetermined for each protocol, whether it be whole-genome sequencing, a specific exome reagent, or a specific gene panel test.
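The trade-off between depth of coverage and platform error can be made concrete with a simple binomial model. This sketch (illustrative thresholds and rates, not a validated clinical algorithm) finds the minimum number of variant-supporting reads that sequencing error alone is unlikely to produce, then computes the power to detect a variant at a given allele fraction:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_supporting_reads(depth, error_rate, fpr=1e-6):
    """Smallest read count k such that seeing >= k variant-supporting
    reads by sequencing error alone is rarer than `fpr`."""
    for k in range(1, depth + 1):
        if binom_sf(k, depth, error_rate) < fpr:
            return k
    return depth

def detection_power(depth, vaf, error_rate, fpr=1e-6):
    """Probability that a true variant at allele fraction `vaf` yields
    at least the error-derived threshold number of reads."""
    k = min_supporting_reads(depth, error_rate, fpr)
    return binom_sf(k, depth, vaf)
```

With an assumed per-base error rate of 0.1% and 100× coverage, roughly 5 supporting reads are needed before error becomes an implausible explanation, illustrating why clinical assays must predetermine error metrics per protocol.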
NGS-based diagnostics have specific potential applications for both hematologic malignancies and nonmalignant hematologic conditions. For leukemias, lymphomas, myelodysplastic syndromes (MDS), myeloproliferative neoplasms, and other hematologic malignancies (or premalignant conditions), clinical sequencing can be employed to determine the spectrum of mutations driving the particular malignancy. For any given tumor, comprehensive sequencing may identify an “actionable” mutation that could lead to the use of a targeted therapy.
In the past, a clinician may have ordered a single-gene test to determine whether a particular molecular abnormality was present in a tumor sample, such as testing for FLT3 internal tandem duplications (FLT3-ITD) in patients with acute myeloid leukemia (AML). The presence of such an alteration has prognostic implications and may have therapeutic significance pending the results of ongoing clinical studies with FLT3 inhibitors. A single-gene or single “hot spot” assay designed to detect a specific alteration has several limitations that are now leading to wider use of NGS-based approaches in clinical diagnostics. To continue with the example of FLT3-ITD in AML, a more comprehensive sequencing platform can be used to discern the subclonal architecture of the cancer tissue. If the FLT3-ITD mutation is present only in a subclone but not in the founding clone, one would predict that a FLT3 inhibitor would eradicate only the subclone. Therefore, clinicians would need to incorporate another therapy to eradicate the founding clone in order to achieve remission or prevent disease relapse. Ideally, the choice of this therapy would be determined by the other mutations identified in the founding clone.
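In a pure tumor sample, a heterozygous somatic mutation in the founding clone is expected at a variant allele fraction (VAF) near 0.5, whereas subclonal mutations appear at lower VAFs. A toy sketch of this reasoning follows; the gene labels, VAFs, and tolerance are all hypothetical, and real subclonal reconstruction must additionally account for tumor purity and copy number:

```python
def classify_clonality(vafs, founding_vaf=0.5, tolerance=0.1):
    """Toy classification of heterozygous somatic mutations in a pure
    diploid tumor: VAFs near 0.5 suggest the founding clone; clearly
    lower VAFs suggest a subclone."""
    calls = {}
    for gene, vaf in vafs.items():
        if vaf >= founding_vaf - tolerance:
            calls[gene] = "founding clone"
        else:
            calls[gene] = "subclone"
    return calls
```

Under this simple model, a FLT3-ITD at a VAF of 0.12 alongside other mutations near 0.5 would be flagged as subclonal, prompting exactly the therapeutic reasoning described above.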
As clinicians better understand the mutational drivers of any particular tumor, they may be able to use targeted therapies directed at particular pathways rather than individual gene mutations. For example, researchers may be able to develop a drug that is effective if a patient with AML or MDS harbors a mutation in any of the genes involved in the spliceosome complex or its associated proteins. A similar scenario could be envisioned targeting mutations in genes of the cohesin complex (the protein complex that regulates the separation of sister chromatids during cell division) or genes that alter the hydroxymethylation of cytosine in DNA. Clinical trials built around these concepts will be necessary to establish their validity. Additionally, comprehensive sequencing can identify somatic mutations that result in the formation of neoantigens expressed on tumor cells. Clinicians could then use tumor-specific immunotherapy to target the malignancy.59,60
Researchers are using NGS-based technology to detect minimal residual disease (MRD) in hematologic malignancies.61,62,63 An advantage of NGS-based methods of MRD detection is that the data not only indicate the presence or absence of MRD but also may reveal the clonal architecture of persistent disease from the mutations detected. Finally, clinicians could use knowledge gained by sequencing an individual’s genome to guide the choice of therapy, whether for a malignant or nonmalignant hematologic disease. In one example, a therapy choice could be optimized based on pharmacogenomic studies in which the response to, or toxicity of, a given drug is associated with underlying inherited genetic variation in the patient.64 In a second example, a genomic assay might identify a somatic alteration corresponding to a targeted therapy that might help a patient with residual MRD achieve remission. These examples illustrate the translational potential of NGS from research tool to clinical care.
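A minimal sketch of NGS-based MRD tracking follows; the mutation labels, read counts, and fixed detection threshold are hypothetical, and a clinical assay would derive its threshold from platform-specific error models:

```python
def mrd_vaf(tracked_mutations, min_reads=5):
    """Estimate residual-disease allele fractions by deep sequencing of
    mutations identified at diagnosis.  `tracked_mutations` maps a
    mutation label to (variant-supporting reads, total depth); a
    mutation with at least `min_reads` supporting reads is detected."""
    detected = {}
    for mut, (alt_reads, depth) in tracked_mutations.items():
        if alt_reads >= min_reads:
            detected[mut] = alt_reads / depth
    return detected
```

Because NGS is digital, the fraction of mutation-supporting reads at very high depth directly estimates the residual disease burden, and tracking several diagnosis mutations at once is what reveals the clonal architecture of persistence noted above.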