
The American Society of Plant Biologists

Article Contents

Early development of the concept of genetic balance, gene expression in aneuploids, balance at the gene level, relationship to quantitative traits, balance in sex chromosome dosage compensation, balance in evolutionary processes, why a balance in regulatory processes, future problems to address, acknowledgments.


The Gene Balance Hypothesis: From Classical Genetics to Modern Genomics

www.plantcell.org/cgi/doi/10.1105/tpc.106.049338


James A. Birchler, Reiner A. Veitia, The Gene Balance Hypothesis: From Classical Genetics to Modern Genomics, The Plant Cell, Volume 19, Issue 2, February 2007, Pages 395–402, https://doi.org/10.1105/tpc.106.049338


The concept of genetic balance traces back to the early days of genetics. Additions or subtractions of single chromosomes to the karyotype (aneuploidy) produced greater impacts on the phenotype than whole-genome changes (ploidy). Studies on changes in gene expression in aneuploid and ploidy series revealed a parallel relationship leading to the concept that regulatory genes exhibited a stoichiometric balance, which, if upset, would modulate target gene expression. The regulatory genes responsible for these types of effects have primarily been found to be members of signal transduction pathways or transcription factors of various types. Recent studies of retention of selected duplicate genes following diploidization of ancient polyploidization events have found that signal transduction and transcription factors have been preferentially maintained in a dosage-sensitive relationship. In this essay, we review the historical progression of ideas about genetic balance and discuss some challenges in this field for the future.

Figure 1. Chromosomal Imbalance.

Diagrammatic representation of chromosome variations with the varied chromosome shown in enlarged form. Full or partial monosomics with only one copy of a chromosome or chromosomal segment are usually more severely affected than haploid plants. Trisomic plants with an extra chromosome often show a greater morphological change from the diploid than observed in polyploid plants, such as tetraploids. Secondary trisomics carrying an isochromosome composed of two copies of a chromosome arm show enhanced phenotypic changes for some characteristics typical of the corresponding primary trisomic. When an isochromosome is recovered in an otherwise haploid plant, the imbalance is most severe and results in highly defective plants ( Satina et al., 1937 ). These relationships led to the concept of genetic balance.

As this work continued, trisomics were found for all of the possible 12 chromosomes of Datura , but additional forms continued to arise, each with different characteristics. Some of these new forms accentuated a portion of the phenotypic characteristics of the original set of primary trisomics but were missing other changes from the normal diploid. The basis of these forms was revealed to be an extra chromosome that possessed two identical chromosome arms derived from misdivision of the centromere, manifested by a break in the middle of a centromere with fusion of two sister chromatids, to produce mirror image chromosomes (isochromosomes) generated from a single chromosome arm ( Belling and Blakeslee, 1924 ). In these cases, the extra chromosome, called a secondary trisome, would result in a total of four copies of the respective arm, thus providing an explanation for the accentuated partial phenotype compared with the respective trisomic from which it was derived ( Figure 1 ). These results bolstered the concept that a balance of genes was important for the normal phenotype. These and other aneuploid types were summarized in an article entitled “New Jimson Weeds from Old Chromosomes” ( Blakeslee, 1934 ), which cryptically conveys the message that the dosage of chromosomes is important in this case as opposed to genetic variation. Indeed, the more extreme the altered relative dosage of chromosomes, the greater the phenotypic change. This concept is aptly illustrated by the recovery of selected isochromosomes in otherwise haploid plants that had an extreme imbalance and severe phenotype ( Satina et al., 1937 ) ( Figure 1 ).

At about the same time as the early studies of Datura were conducted by Blakeslee, Richard Goldschmidt was studying the basis of sex determination in the gypsy moth ( Lymantria dispar ) ( Goldschmidt, 1920 ). Crosses between different geographic isolates produced progeny that had differing levels of intersexuality, which is a mixture of male and female morphologies. To explain the various results, Goldschmidt hypothesized different male and female factors of varying strengths whose balance was critical for proper sex determination. Within a particular geographic isolate, the sex determination processes are normal, but crosses between different strains produced the intersexual individuals because the various factors had varied in dosage potential between the diverged populations and were no longer matched. In retrospect, this result foreshadowed the concept of Muller-Dobzhansky coevolving gene complexes that interact within a species but cause sterility or inviability in hybrids.

In further parallel studies, Calvin Bridges identified a triploid female Drosophila and among her progeny found intersexual flies that possessed a mixture of male and female parts ( Bridges, 1925 ). The karyotype of these intersexual flies was composed of two X chromosomes with an otherwise triploid complement of chromosomes. However, flies with only one X chromosome that were otherwise triploid were male-like in phenotype. Of course, among diploid flies, the males have one X chromosome and the females have two, with all other chromosomes being present in duplicate in the two sexes. By comparing the various chromosomal constitutions of males, intersexes, and diploid and triploid females, Bridges concluded that a balance between the X and the autosomes (non-sex chromosomes) was the chromosomal basis of sex determination in Drosophila . This type of mechanism for sex determination has been postulated for many plant species as well, with the most well-documented case being Rumex acetosa ( Ono, 1935 ). When the X-to-A ratio is 1, a female is produced, but when it is 0.5, a male occurs. In polyploid Rumex , X-to-A intermediate ratios produce intersexes and hermaphrodites ( Dellaporta and Calderon-Urrea, 1993 ).
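The X-to-autosome rule described above can be sketched in a few lines. This is an illustrative simplification, not a model from the literature: the function names are hypothetical, and treating every ratio at or above 1 as female and every ratio at or below 0.5 as male is an assumption that extends the three cases given in the text (1, 0.5, and intermediate values).

```python
from fractions import Fraction


def x_to_a_ratio(x_chromosomes: int, autosome_sets: int) -> Fraction:
    """Return the ratio of X chromosomes to haploid sets of autosomes."""
    if x_chromosomes < 0 or autosome_sets <= 0:
        raise ValueError("need x_chromosomes >= 0 and autosome_sets > 0")
    return Fraction(x_chromosomes, autosome_sets)


def classify_sex(x_chromosomes: int, autosome_sets: int) -> str:
    """Classify a karyotype by its X-to-A ratio.

    Assumption: ratios >= 1 are scored as female and ratios <= 1/2 as male;
    the text only states the outcomes for 1, 0.5, and intermediate ratios.
    """
    ratio = x_to_a_ratio(x_chromosomes, autosome_sets)
    if ratio >= 1:
        return "female"
    if ratio <= Fraction(1, 2):
        return "male"
    return "intersex"


# Karyotypes discussed in the text: diploid female and male, triploid female,
# the 2X:3A intersex, and the male-like 1X:3A fly.
for x, a in [(2, 2), (1, 2), (3, 3), (2, 3), (1, 3)]:
    print(f"{x}X:{a}A ratio={x_to_a_ratio(x, a)} -> {classify_sex(x, a)}")
```

Exact fractions are used so that the 2X:3A case compares cleanly against the 1 and 1/2 thresholds without floating-point rounding.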

Similar results of chromosome additions to produce trisomics and tetrasomics as well as the reciprocal production of monosomic individuals have recapitulated the balance conclusions in scores if not hundreds of species in both the plant and animal kingdoms (e.g., Lee et al., 1996 ; see discussion in Birchler et al., 2001 ). Although great variation occurs, in general, both monosomics and trisomics cause detrimental effects on the phenotype, with the monosomics usually being more severe. Aneuploid studies in polyploids have further supported the balance idea because there is less effect of extra or missing chromosomes when the remainder of the genome is increased in copy number. The extensive studies of aneuploids in hexaploid bread wheat by Ernie Sears exemplify these conclusions ( Sears, 1953 , 1954 ).

Over the subsequent decades, the interpretation of genetic balance split into an interesting dichotomy of enzymatic/metabolic versus gene regulatory balance that lingers today. On the one hand, the involvement with sex determination mechanisms was viewed as a regulatory balance, while on the other hand, aneuploid syndromes were viewed as perturbations of metabolism resulting from altered relationships of enzymes involved with intermediary metabolism. Indeed, the basis of sex determination in Drosophila involves a dosage relationship between transcription factors on the X chromosome relative to those on the autosomes ( Erickson and Cline, 1993 ). While aneuploid syndromes no doubt have a complicated basis, the emerging evidence suggests that it is likely that most of these balance phenomena reflect gene regulatory mechanisms in some form, although other gene products involved with macromolecular complexes are likely to exhibit a balance as well, whether or not they are directly involved with gene expression (see below).

Several decades ago, one of us (J.A.B.) examined the levels of enzyme activities and proteins in aneuploid series of maize. A popular exercise at the time was the localization of the cytological position of genes encoding various enzymes by screening the genome using a set of trisomics or segmental trisomics created by overlapping translocations (e.g., Carlson, 1972; O'Brien and Gethman, 1973). Work on Datura played an inspiring role in this trend (Carlson, 1972). The principle involved the concept that varying the dosage of a gene would produce a directly proportional amount of gene product (Grell, 1962). Thus, reversing the procedure in a screen of trisomics covering the genome should locate the position of the structural loci for various enzymes because the increased dosage of the encoding gene would produce a greater amount of gene product. This approach was successful in some cases; the unsuccessful attempts remain unknown. Thus, to test ideas about the regulation of the Alcohol dehydrogenase-1 gene in maize (Schwartz, 1971), a dosage series of the chromosome arm on which this gene is located was examined for the levels of ADH activity (Birchler, 1979). Interestingly, the total amount of ADH present in a one to four dosage series was nearly equivalent to the diploid level. In other words, no gene dosage effect for Adh was found in the whole arm dosage series. However, a whole genome ploidy series involving monoploids, diploids, and tetraploids indicated that there was a directly proportional amount of ADH per cell through the series. When other enzymes encoded on other chromosomes were examined in the aneuploid series, their levels were modulated, either up or down, but the most common effect was an inverse correlation between the chromosomal dosage and the amount of activity present (Birchler, 1979). These studies were extended to other proteins and other aneuploid and ploidy series with similar results (Birchler and Newton, 1981). A hypothesis was formulated that the stoichiometry of regulatory genes was influential in modulating the levels of expression of the target genes studied (Birchler and Newton, 1981).

The failure to find a dosage effect for ADH was referred to as dosage compensation. The basis of this response was determined to be the result of a structural gene dosage effect of Adh itself being modulated in an inverse manner by a different part of the same chromosome arm involved in the aneuploid series (Birchler, 1981). In monosomics, the single Adh gene was upregulated approximately twofold. In trisomics, the three copies of Adh were each downregulated to about two-thirds of the diploid per-copy level. Of course, dosage compensation of the X chromosome in males of Drosophila had been documented since the realization of its existence by Muller (1932). That is, males with only one X chromosome produce about the same amount of gene product as do the two X chromosomes present in females. However, Devlin et al. (1982) found that compensation would also occur for large trisomics of autosomal chromosome arms consisting of ∼20% of the whole genome, in contrast with small trisomics surrounding a particular structural gene that vary only a very small fraction of the genome, as noted above. Later, they reported the extensive presence of the inverse dosage effect of these large trisomics on genes encoded on other chromosomes (Devlin et al., 1988). The basis of autosomal dosage compensation was also found to involve the combination of a structural gene dosage effect and an inverse effect being produced simultaneously by a large trisomic segment surrounding the Adh gene in flies. In other words, a structural gene dosage effect is cancelled via an inverse effect produced by another part of the varied region (Birchler et al., 1990).
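The cancellation described here is simple arithmetic, and a minimal sketch may make it concrete. It assumes an idealized, strictly inverse effect (per-copy expression equal to 2 divided by the number of copies of the varied arm); the function names are hypothetical and the measured values in the cited studies are only approximately of this form.

```python
from fractions import Fraction

DIPLOID_COPIES = 2


def inverse_effect(arm_copies: int) -> Fraction:
    """Per-copy expression relative to the diploid, assuming a strictly
    inverse response to the dosage of the varied chromosome arm."""
    return Fraction(DIPLOID_COPIES, arm_copies)


def relative_total(gene_copies: int, arm_copies: int) -> Fraction:
    """Total expression relative to the diploid: the structural gene dosage
    effect (gene_copies / 2) multiplied by the inverse effect of the arm."""
    return Fraction(gene_copies, DIPLOID_COPIES) * inverse_effect(arm_copies)


# A gene located on the varied arm: copy number and arm dosage change together,
# so the two effects cancel (dosage compensation).
for copies in (1, 2, 3, 4):
    print(copies, inverse_effect(copies), relative_total(copies, copies))

# A gene on another chromosome keeps two copies, so only the inverse effect
# of the varied arm is seen.
for arm in (1, 3):
    print(arm, relative_total(DIPLOID_COPIES, arm))
```

With the gene on the varied arm, the one to four dosage series gives a constant relative total, matching the near-diploid ADH levels described above. With the gene elsewhere, the same inverse factor appears as a trans-acting modulation.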

In maize, a survey of aneuploids from many regions of the genome for six genes revealed that the RNA levels of any one gene could be modulated similarly by different dosage series ( Guo and Birchler, 1994 ). In other words, changes in dosage of several different segments of the genome could have the same effect on the monitored gene. The magnitude of these modulations depended on whether the tissue examined was diploid or triploid. Adding or subtracting a chromosome arm to the genotype at different ploidy levels produced an effect whose severity was coincident with the degree of dosage imbalance. Changes in gene expression in a whole genome ploidy series were not as prevalent ( Guo et al., 1996 ). Transcriptome studies of mammalian trisomics or highly aneuploid cancer cells indicate extensive transacting modulations of gene expression (e.g., Phillips et al., 2001 ; FitzPatrick et al., 2002 ; Saran et al., 2003 ; Tsafrir et al., 2006 ). Thus, the types and mode of effects parallel the classical phenotypic studies with regard to balance.

A simple interpretation of the cause of these transacting dosage effects is that they are caused by a gene or genes on the varied chromosome that exhibit a dosage effect themselves and that act in a regulatory fashion to modulate many targets. To screen for single gene mutations that would mimic the dosage effect, mutageneses were conducted in Drosophila to find mutations that could up- or downregulate the expression of the white eye color gene using a leaky phenotypic reporter called white-apricot . The eyes of these flies have a low level of pigment, so modulations of expression could be scored easily. An amazingly large number of mutations were recovered from these screens, totaling 47 at last count ( Birchler et al., 2001 ). The first reported example was Inverse regulator-a ( Rabinow et al., 1991 ). Mutations of this gene as a heterozygote upregulate the white gene twofold, mimicking how a monosomic condition would produce a similar aneuploid inverse effect. Trisomics of the region downregulate white . The presence of the mutation in an otherwise triploid fly upregulates white to ∼150% of the control triploids. Thus, Inr-a serves as an example of a single gene that produces a balance effect, in this case with an inverse dosage response.
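The twofold and ∼150% values are what a strictly inverse response would give. As a worked illustration (the symbols are introduced here and are not from the original), let n be the ploidy, f the number of functional copies of the regulator, and E_rel the expression of white relative to the matching control, under the idealizing assumption that expression scales inversely with the fraction of functional regulator copies:

```latex
\[
E_{\text{rel}} = \frac{n}{f}, \qquad
\text{diploid heterozygote: } \frac{2}{1} = 2, \qquad
\text{triploid with one mutant copy: } \frac{3}{2} = 1.5 .
\]
```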

The large number of modifiers of a single target gene must be understood in the context that developmental regulators often operate in a hierarchy. That is, one early developmental regulator might affect a downstream regulator and so on. If each regulator is dosage sensitive, the effect could potentially be passed along through the hierarchy. One must also appreciate that any one regulator will affect many targets, so there would be significant overlap of modifiers for different traits.

A summary of the collection of the various modifiers of white revealed that their molecular basis fell into two major classes: members of signal transduction pathways and transcription/chromatin factors ( Birchler et al., 2001 ). Thus, it was established that the dosage effects were in fact the result of regulatory processes. However, the diversity of regulatory molecules involved was mysterious at the time, and to some extent still is, but clues for why this is the case are emerging from studies of haploinsufficiency in yeast and humans (see below).

Returning to the phenotypic effects of trisomics, it is noted that any one characteristic of an organism can be affected by different trisomics. This realization suggests that multiple dosage-sensitive genes might be capable of modulating a particular phenotypic characteristic. We cannot summarize the field of quantitative genetics, but we discuss some parallels as previously noted ( Guo and Birchler, 1994 ; Birchler et al., 2001 , 2005 ). For any one quantitative trait, there can often be multiple loci that affect its expression ( Tanksley, 1993 ). Crosses between varieties that differ for such traits usually exhibit an intermediate (additive) phenotype to some degree ( Tanksley, 1993 ). Thus, the multiple loci act as if they are dosage sensitive. An example involves the quantitative genetic differences between domesticated and wild sunflower ( Burke et al., 2002 ). Many (78) loci were identified that were mostly of small effect and additive in mode of action. It seems reasonable, therefore, that the effects of aneuploids and of some quantitative trait loci (QTL) have a basis in common. QTL would be expected to be a heterogeneous group, but it is reasonable that variation in regulatory genes of sundry types would be expected to be a major contributor.

Within the past half decade, it became obvious that many clinical human conditions result not necessarily from a gain-of-function mutation or a homozygous recessive, but from haploinsufficiency of particular gene products. In other words, null mutations as a heterozygote would condition syndromes in a type of dosage effect. As the molecular basis of these conditions was revealed, they consisted primarily of transcription factors. To explain these results, one of us (R.A.V.) formulated mechanistic models of transcription factor assembly into molecular complexes, noting the importance of stoichiometry of the subunits for the action of the whole ( Veitia, 2002 ). This concept has been extended to genetic and biochemical networks ( Veitia, 2003 , 2004 , 2005 ). The balance between either the subunits of a complex or between proteins with opposing actions, such as transcriptional activators and repressors, needs to be maintained to some extent to avoid negative fitness consequences.
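Why the stoichiometry of subunits matters can be illustrated with a deliberately crude limiting-subunit calculation. This is not the kinetic model of Veitia (2002): it assumes assembly goes to completion and ignores binding constants and assembly order, and all names and amounts are hypothetical.

```python
from typing import Mapping


def complex_yield(amounts: Mapping[str, float],
                  stoichiometry: Mapping[str, int]) -> float:
    """Number of complete complexes when the scarcest subunit is limiting.

    amounts: available units of each subunit (arbitrary units).
    stoichiometry: copies of each subunit needed per complex.
    Assumption: assembly is complete and irreversible (no kinetics).
    """
    return min(amounts[name] / copies for name, copies in stoichiometry.items())


dimer = {"A": 1, "B": 1}  # hypothetical 1:1 heterodimer

balanced = complex_yield({"A": 2.0, "B": 2.0}, dimer)
heterozygous_null = complex_yield({"A": 1.0, "B": 2.0}, dimer)

# Halving one subunit halves the complex and leaves the partner in excess.
print(heterozygous_null / balanced)
```

In this sketch a heterozygous null for either subunit halves the amount of complex, however much of the partner is present. That is the sense in which a single-copy loss can produce a dosage effect.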

Experimental evidence of this type of relationship has not been systematically sought in plants. However, data have been obtained from the study of haploinsufficiency in yeast ( Papp et al., 2003 ) and humans ( Kondrashov and Koonin, 2004 ). Heterozygous gene knockouts were examined in diploid yeast for growth retardation. Those that produced a haploinsufficiency were overrepresented among classes of genes whose protein products are typically involved with molecular complexes. While these classes of genes extend beyond those involved with regulatory processes to some degree, regulatory gene products typically fall under this umbrella. The human study found that members of signal transduction and transcriptional functions were overrepresented among factors causing haploinsufficiency.

The concept of dosage balance predicts a relationship between the number of interactions (connectivity) of a component and the possibility of dosage effects when that component is under- or overexpressed. Lemos et al. (2004) have shown that the number of interactions a protein has within a network (i.e., connectivity) constrains genetic variation of gene expression in yeast and fruitfly populations. Specifically, they found a negative correlation between the variation of gene expression and the number of protein–protein interactions. As expected, the extent of variation in expression among genes encoding interactors was smaller than that of random pairs of genes, suggesting the existence of a balance relationship. Finally, the expression levels of interactors correlated positively across strains. High coexpression for proteins in the same complex has been reported independently several times (Jansen et al., 2002; Papp et al., 2003). These results suggest the existence of a dosage balance, which could be a force shaping gene expression even at a small evolutionary scale, that is, within populations.

Protein–protein connectivity in yeast complexes (and likely in other organisms) follows approximately a power law distribution ( Hahn et al., 2004 ), which means that most components are poorly connected. Indeed, ∼30% of genes involved in yeast complexes encode separable components (i.e., only one link with the rest of the complex, such as A or C in complex A-B-C). By comparison, only ∼10% of components have 20 or more links (analysis of data from Fraser et al., 2003 ). Mutations in the poorly connected components are more likely to lead to less pleiotropic phenotypes, with higher chances of going undetected. Moreover, these separable components are expected to be less dosage-sensitive (unless they are represented several times per macromolecule as A in A-B-A) ( Veitia, 2002 , 2003 ). This concept might explain why overexpression of subunits of yeast complexes is usually well tolerated.
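The notion of a separable component can be made concrete by counting links per subunit. The sketch below is illustrative only: the helper names and the second, hub-shaped complex are hypothetical, and a link is represented simply as a pair of subunit names.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def degrees(links: Iterable[Link]) -> Dict[str, int]:
    """Count how many links each subunit has within a complex."""
    counts: Counter = Counter()
    for first, second in links:
        counts[first] += 1
        counts[second] += 1
    return dict(counts)


def separable_components(links: Iterable[Link]) -> List[str]:
    """Subunits with exactly one link to the rest of the complex."""
    return sorted(name for name, n in degrees(links).items() if n == 1)


# The linear complex A-B-C from the text: A and C are separable, B is not.
linear = [("A", "B"), ("B", "C")]
print(degrees(linear), separable_components(linear))

# A hypothetical hub: H contacts three peripheral subunits.
hub = [("H", "P1"), ("H", "P2"), ("H", "P3")]
print(degrees(hub), separable_components(hub))
```

Applied to a real interaction data set, the same count would give the connectivity distribution discussed above. Here it only shows which members of a small complex sit at the periphery.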

This is apparent from the study of Sopko et al. (2006) , who analyzed overexpression phenotypes in a vast array of yeast strains, each containing an inducible copy of a different gene. They suggest that overexpression phenotypes in yeast reflect specific regulatory imbalances. Accordingly, they found that overexpression of periodically expressed genes (i.e., during the cell cycle) is more likely to cause cell cycle arrest or abnormal morphology than constitutively expressed cell cycle genes. This is probably so because in many cases these factors participate in network modules involving opposing forces (i.e., a kinase and a phosphatase acting on a common though differentially modified substrate). Dosage effects in these networks are predicted by dosage balance. For instance, using a complex model of the cell cycle, Chen et al. (2004) found that >70% of the parameters can be changed at least 10-fold in either direction (i.e., under- or overexpression) without preventing cycling. However, the rest of the parameters do not exhibit this flexibility, and some are very sensitive to dosage changes. For example, for the synthesis of Cdc14, the boundaries are twofold up, otherwise there will be G1 arrest, and 0.5-fold down (i.e., heterozygous deletion), and the cell faces a telophase arrest. The contrary occurs for the degradation of Cdc14. The need for balance between synthesis (copy number or expression) of Cdc14 and its degradation (i.e., copy number or expression of a protease) must be reached to avoid cell cycle arrest. Note that altering these parameters within the relevant boundaries does not produce cell cycle arrest but does induce either faster or slower cycling (a quantitative character).
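The point about synthesis and degradation can be seen in a one-variable toy calculation. It is an assumption-laden sketch and not the cell cycle model of Chen et al. (2004): it supposes constant synthesis at rate k_syn and first-order degradation at rate k_deg, so that the steady-state level is k_syn / k_deg. The parameter names are hypothetical.

```python
def steady_state(k_syn: float, k_deg: float) -> float:
    """Steady-state level of a protein made at constant rate k_syn and
    removed by first-order degradation with rate constant k_deg.

    Follows from dC/dt = k_syn - k_deg * C = 0, under the toy assumptions
    named above.
    """
    if k_deg <= 0:
        raise ValueError("k_deg must be positive")
    return k_syn / k_deg


reference = steady_state(1.0, 1.0)

# Doubling synthesis alone doubles the level; doubling degradation alone halves it.
print(steady_state(2.0, 1.0) / reference, steady_state(1.0, 2.0) / reference)

# Changing both by the same factor leaves the level unchanged: only the
# balance between the two opposing processes matters in this toy model.
print(steady_state(2.0, 2.0) / reference)
```

In this simplified picture, opposite changes in synthesis and degradation shift the level in opposite directions, in line with the statement that the contrary occurs for the degradation of Cdc14.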

As noted above, the same dosage effects on target gene expression that result from genomic imbalance are responsible for dosage compensation of various genes in maize (Birchler, 1979; Birchler and Newton, 1981; Guo and Birchler, 1994). The magnitude of expression modulation to account for dosage compensation of the various sex chromosome aneuploids in Drosophila is the inverse ratio of the X to autosome imbalance (Birchler et al., 2006). Recent studies of global gene expression between males and females of Drosophila, nematodes, and mammals indicate that in each species the single X chromosome in males is upregulated approximately twofold (Gupta et al., 2006), which would produce a total gene expression equivalent to the two X chromosomes in females (or hermaphrodites in nematodes). In an exhaustive comparison of average gene expression of X chromosomes and autosomes in various mammalian species and tissues (Nguyen and Disteche, 2006), a potential balance relationship was revealed. In female mammals, one of the two X chromosomes is inactivated in any one developmental lineage. Nevertheless, the active X in females is upregulated approximately twofold so that on average the total expression from the single X is more or less equivalent to the average expression of the equivalent length of a pair of autosomes. The single X in males is likewise upregulated. However, in haploid tissues, the average expression of the single X chromosome per unit length is basically the same as a single autosome. Thus, it appears that the X upregulation only occurs when an X-to-autosomal imbalance is present. Sex chromosome dosage compensation has been subjected to natural selection and thus is likely to involve some modifications in mechanism relative to those cases observed in laboratory-constructed aneuploids (Birchler et al., 2006), but some parallels do seem to exist.

With the availability of whole-genome sequences, it has become obvious that repeated cycles of polyploidization followed by diploidization have occurred in the lineages leading to the evolutionary crown of eukaryotic organisms present today ( Wolfe, 2001 ; Simillion et al., 2002 ; Bowers et al., 2003 ). Following the production of an allotetraploid from two related species, there is gene loss that leads back to the diploid level. An analysis of the functional classes of the genes retained in duplicate in Arabidopsis and rice indicates an overrepresentation of members of signal transduction components and transcription factors ( Blanc and Wolfe, 2004 ; Maere et al., 2005 ; Chapman et al., 2006 ; Freeling and Thomas, 2006 ; Thomas et al., 2006 ). These classes of genes are similar to those that exhibit transacting dosage effects, as noted above. These findings led to the hypothesis that the duplicates are retained because they are in balance with each other and selection against deletion of one member of a pair would prevent their rapid loss ( Birchler et al., 2005 ; Freeling and Thomas, 2006 ). In other words, deletion of one member of a balanced duplicate would mimic an aneuploid effect, which would diminish reproductive success. Genes not in an interacting balance relationship would be deleted at random over evolutionary time back to the diploid level.

Indeed, gene classes preferentially found in segmental duplications are the complement of those retained from whole-genome duplications ( Davis and Petrov, 2005 ; Maere et al., 2005 ), suggesting that segmental duplications that upset a regulatory balance would be selected against because they also would mimic an aneuploid effect. Of course, genes retained in segmental duplications could be selected to condition greater increments of a particular gene product ( Sharp et al., 2005 ; Redon et al., 2006 ). Not surprisingly, members of large gene families seldom encode components of macromolecular complexes in yeast and humans. Moreover, according to recent studies, duplicability of different genes decreases as the size of the complexes increases ( Papp et al., 2003 ; Yang et al., 2003 ).

The studies on duplicate retention in Arabidopsis and rice by necessity must deal with genomes that are substantially returned to the diploid state. Thus, the possibility exists that directed elimination of members of the singleton classes occurred shortly after polyploidization to produce the observed genomic arrangement. However, a recent analysis of the genome of Paramecium tetraurelia illuminates the processes of whole-genome duplication and subsequent gene loss ( Aury et al., 2006 ). The genomic content of this meager ciliate, with a predicted number of 39,642 genes, dwarfs that of humans. The reason for this number is that the whole genome has endured three duplications. By comparing the recent, intermediate, and old events, insight into the processes of gene loss could be gained. Because many pseudogenes could be recognized in various stages of deterioration, it seems likely that the gene losses are the result of attrition over evolutionary time rather than a concerted elimination event immediately following polyploidization. Moreover, the retained duplicates tend to be classes of genes that are involved in macromolecular complexes. In addition, the retained genes show evidence of purifying selection, suggesting that mutation of one member of a duplicated pair is selected against because it upsets the correct stoichiometric relationships of subunits of such complexes. The authors thus proposed that the stoichiometry of duplicates is important for their retention.

Freeling and Thomas (2006) have suggested that the retention of duplicate regulatory genes following repeated polyploidization events holds such regulators in the evolutionary lineage for sufficient time to allow eventual divergence that fosters increasing complexity over evolutionary time. The polyploidization event proliferates the copy number of all loci. Deletions over time deteriorate the copy number of most target metabolic genes to the diploid level. However, the stoichiometric constraints on regulatory duplicates will preserve them in the lineage. Eventual divergence of function of different members of a duplicate pair of regulators will create developmental and biochemical complexity. While there is evidence that such divergence occurs, concepts of how a new balance is achieved have been little explored. However, given such changes, Freeling and Thomas argue that the repeated cycles of tetraploidization and diploidization are a contributor to driving increasing complexity during the evolution of eukaryotes.

A regulatory system in which the components are sensitive to stoichiometric relationships provides a means for selection pressures to operate on new mutations in diploids and tetraploids while such mutations are still heterozygous. Also, a system of multiple genes affecting any one characteristic and subject to a dosage interaction will provide a means to modulate the phenotype in subtle ways via mutations in multiple genes. Because the regulatory system is dosage dependent, any new regulatory mutations have the potential to produce subtle effects as a heterozygote. If they are detrimental, they will be selected against rapidly. However, new semidominant mutations would be available in the heterozygous state for rapid adaptive changes as well. The detection of purifying selection on retained duplicates in Paramecium ( Aury et al., 2006 ) suggests that this type of situation is operating. The balance relationship provides for evolutionary changes to be of small magnitude and to result from many possible modifiers of a single phenotype, as noted for the sunflower example above. Thus, a balance relationship of regulatory complexes would optimize purifying and adaptive selection for organisms with diploid and higher levels of ploidy.

We have summarized above the historical and recent evidence suggesting a balance relationship of regulatory genes as a consequence of their membership in macromolecular complexes or networks for which the contributing members produce a stoichiometric effect on the function of the whole. There are several exciting research directions that such a hypothesis might inspire. One possibility is to examine the kinetic and interaction properties of macromolecular complexes in an attempt to understand the basis of how varying the dosage of individual components affects the function of the complex. This area will involve examining the association parameters of multiprotein regulatory complexes, the order in which they associate, and the topological connections between the different subunits. Such studies would test the proposition that highly connected proteins in stable complexes tend to display a higher dosage sensitivity and that this sensitivity can be modulated by the specific parameters of the association process.

Also, one would want eventually to understand the impact of overall gene expression on the phenotype. Such endeavors will no doubt be quite challenging. For example, using gene chip technology, it might be possible to assess correlations between patterns of modulation of thousands of genes and their corresponding phenotypic consequences. The known parallels between multigenic aneuploid effects and quantitative traits are noted above, but the nature of this relationship is quite obscure at present. Exhaustive studies of global patterns of gene expression will need to be conducted, both on aneuploid conditions and on quantitative trait variation, to test such potential connections. Indeed, modulations of morphologies by aneuploid syndromes and QTL might involve changes in cell division patterns as much as overall changes in gene expression, complicating the detection of the responsible genes. The major challenge for molecular quantitative genetics in the future is to tease out the relationship of total genomic expression patterns to phenotypic effects.

Such relationships will be important to gain a better understanding of the role of regulatory balance in evolution. One can imagine how this balance can provide many gradual changes in phenotype in succession, but what is needed for insight into this problem is to determine to what extent multiple regulatory mutations are additive, interactive, or epistatic to each other. The inclusion of regulatory genes among the retained duplicates on the return road from polyploidy to diploidy in diverse taxa has provided an exciting development for the study of evolutionary processes. In mammals, ultraconserved noncoding DNA elements are depleted among segmentally duplicated regions of the genome ( Derti et al., 2006 ). It will be of interest to examine whether they are retained following polyploidization events and whether they are related to the balance characteristics of regulatory mechanisms.

If the stoichiometry of regulatory factors contributes to increasing complexity and speciation, the individual components must be able to escape from a preexisting balance with other factors and establish a new balance. If such divergence is a contributing factor to evolutionary change, then the processes that allow shifts from one balance relationship to another must be explored. Many classical studies of species hybrid incompatibilities have suggested the presence of coevolving gene complexes ( Dobzhansky, 1937 ; Muller, 1942 ; see discussion in Birchler et al., 2005 ). Whether interacting regulatory molecular complexes have a relationship to these mechanisms would be an interesting avenue of exploration.

Funding on this topic in the Birchler laboratory is provided by a grant from the National Science Foundation Plant Genome Program (DBI 0501712) and by the National Institutes of Health (R01 GM068042). R.A.V. is funded by the Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, and the University of Paris 7.

Aury, J.-M., et al . ( 2006 ). Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia . Nature   444   :   171 – 178 .

Belling, J., and Blakeslee, A.F. ( 1924 ). The configurations and sizes of the chromosomes in trivalents of 25-chromosome Daturas. Proc. Natl. Acad. Sci. USA   10   :   116 – 120 .

Birchler, J.A. ( 1979 ). A study of enzyme activities in a dosage series of the long arm of chromosome one in maize. Genetics   92   :   1211 – 1229 .

Birchler, J.A. ( 1981 ). The genetic basis of dosage compensation of Alcohol dehydrogenase-1 in maize. Genetics   97   :   625 – 637 .

Birchler, J.A., Bhadra, U., Pal-Bhadra, M., and Auger, D.L. ( 2001 ). Dosage dependent gene regulation in higher eukaryotes: Implications for dosage compensation, aneuploid syndromes and quantitative traits. Dev. Biol.   234   :   275 – 288 .

Birchler, J.A., Fernandez, H., and Kavi, H.H. ( 2006 ). Commonalities in compensation. Bioessays   28   :   565 – 568 .

Birchler, J.A., Hiebert, J.C., and Paigen, K. ( 1990 ). Analysis of autosomal dosage compensation involving the Alcohol dehydrogenase locus in Drosophila melanogaster . Genetics   124   :   677 – 686 .

Birchler, J.A., and Newton, K.J. ( 1981 ). Modulation of protein levels in chromosomal dosage series of maize: The biochemical basis of aneuploid syndromes. Genetics   99   :   247 – 266 .

Birchler, J.A., Riddle, N.C., Auger, D.L., and Veitia, R.A. ( 2005 ). Dosage balance in gene regulation: Biological implications. Trends Genet.   21   :   219 – 226 .

Blakeslee, A.F. ( 1921 ). Types of mutations and their possible significance in evolution. Am. Nat.   55   :   254 – 267 .

Blakeslee, A.F. ( 1934 ). New Jimson weeds from old chromosomes. J. Hered.   25   :   80 – 108 .

Blakeslee, A.F., and Avery, B.T. ( 1919 ). Mutations in the Jimson weed. J. Hered.   10   :   111 – 120 .

Blakeslee, A.F., Belling, J., and Farnham, M.E. ( 1920 ). Chromosomal duplication and Mendelian phenomena in Datura mutants. Science   52   :   388 – 390 .

Blanc, G., and Wolfe, K.H. ( 2004 ). Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell   16   :   1679 – 1691 .

Bowers, J.E., Chapman, B.A., Rong, J., and Paterson, A.H. ( 2003 ). Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature   422   :   433 – 438 .

Bridges, C.B. ( 1925 ). Sex in relation to chromosomes and genes. Am. Nat.   59   :   127 – 137 .

Burke, J.M., Tang, S., Knapp, S.J., and Rieseberg, L.H. ( 2002 ). Genetic analysis of sunflower domestication. Genetics   161   :   1257 – 1267 .

Carlson, P.S. ( 1972 ). Locating genetic loci with aneuploids. Mol. Gen. Genet.   114   :   273 – 280 .

Chapman, B.A., Bowers, J.E., Feltus, F.A., and Paterson, A.H. ( 2006 ). Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proc. Natl. Acad. Sci. USA   103   :   2730 – 2735 .

Chen, K.C., Calzone, L., Csikasz-Nagy, A., Cross, F.R., Novak, B., and Tyson, J.J. ( 2004 ). Integrative analysis of cell cycle control in budding yeast. Mol. Biol. Cell   15   :   3841 – 3862 .

Crow, J.F. ( 1997 ). Birth defects, Jimson weeds and bell curves. Genetics   147   :   1 – 6 .

Dellaporta, S.L., and Calderon-Urrea, A. ( 1993 ). Sex determination in flowering plants. Plant Cell   5   :   1241 – 1251 .

Davis, J.C., and Petrov, D.A. ( 2005 ). Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet.   21   :   548 – 551 .

Derti, A., Roth, F.P., Church, G.M., and Wu, C.T. ( 2006 ). Mammalian ultraconserved elements are strongly depleted among segmental duplications and copy number variants. Nat. Genet.   38   :   1216 – 1220 .

Devlin, R.H., Holm, D.G., and Grigliatti, T.A. ( 1982 ). Autosomal dosage compensation in Drosophila melanogaster strains trisomic for the left arm of chromosome 2. Proc. Natl. Acad. Sci. USA   79   :   1200 – 1204 .

Devlin, R.H., Holm, D.G., and Grigliatti, T.A. ( 1988 ). The influence of whole-arm trisomy on gene expression in Drosophila. Genetics   118   :   87 – 101 .

deVries, H. ( 1901 ). Die Mutationstheorie. Bd. 1. Die Entstehung der Arten durch Mutation, Vol. 1. (Leipzig, Germany: Veit & Co.).

Dobzhansky, T. ( 1937 ). Genetics and the Origin of Species. (New York: Columbia University Press).

Emerson, S.H., and Sturtevant, A.H. ( 1931 ). Genetic and cytological studies of Oenothera. III. The translocation interpretation. Z. Indukt. Abstammungs-Vererbungl.   59   :   395 – 419 .

Erickson, J.W., and Cline, T.W. ( 1993 ). A bZIP protein, sisterless-a , collaborates with bHLH transcription factors in Drosophila development to determine sex. Genes Dev.   7   :   1688 – 1702 .

FitzPatrick, D.R., Ramsay, J., McGill, N.I., Shade, M., Carothers, A.D., and Hastie, N.D. ( 2002 ). Transcriptome analysis of human autosomal trisomy. Hum. Mol. Genet.   11   :   3249 – 3256 .

Fraser, H.B., Wall, D.P., and Hirsh, A.E. ( 2003 ). A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol.   3   :   11 .

Freeling, M., and Thomas, B.C. ( 2006 ). Gene-balanced duplications, like tetraploid, provide predictable drive to increase morphological complexity. Genome Res.   16   :   805 – 814 .

Goldschmidt, R.B. ( 1920 ). Untersuchungen über Intersexualität. Zeits. i. Abst. u. Vererb.   23   :   1 – 199 .

Grell, E.H. ( 1962 ). The dose effect of ma-l+ and ry+ on xanthine dehydrogenase activity in Drosophila melanogaster . Z. Vererbungsl.   93   :   371 – 377 .

Guo, M., and Birchler, J.A. ( 1994 ). Trans-acting dosage effects on the expression of model gene systems in maize aneuploids. Science   266   :   1999 – 2002 .

Guo, M., Davis, D., and Birchler, J.A. ( 1996 ). Dosage effects on gene expression in a maize ploidy series. Genetics   142   :   1349 – 1355 .

Gupta, V., Parisi, M., Sturgill, D., Nutfall, R., Doctolero, M., Dudko, O.K., Malley, J.D., Eastman, P.S., and Oliver, B. ( 2006 ). Global analysis of X-chromosome dosage compensation. J. Biol.   5   :   3 .

Hahn, M.W., Conant, G.C., and Wagner, A. ( 2004 ). Molecular evolution in large genetic networks: Does connectivity equal constraint? J. Mol. Evol.   58   :   203 – 211 .

Jansen, R., Greenbaum, D., and Gerstein, M. ( 2002 ). Relating whole-genome expression data with protein-protein interactions. Genome Res.   12   :   37 – 46 .

Kondrashov, F.A., and Koonin, E.V. ( 2004 ). A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet.   20   :   287 – 290 .

Lemos, B., Meiklejohn, C.D., and Hartl, D.L. ( 2004 ). Regulatory evolution across the protein interaction network. Nat. Genet.   36   :   1059 – 1060 .

Lee, E.A., Darrah, L.L., and Coe, E.H. ( 1996 ). Dosage effects on morphological and quantitative traits in maize aneuploids. Genome   39   :   898 – 908 .

Maere, S., DeBodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M., and Van de Peer, Y. ( 2005 ). Modeling gene and genome duplications in eukaryotes. Proc. Natl. Acad. Sci. USA   102   :   5454 – 5459 .

Muller, H.J. ( 1932 ). Further studies on the nature and causes of gene mutations. Proc. 6th Int. Congr. Genetics   1   :   213 – 255 .

Muller, H.J. ( 1942 ). Isolating mechanisms, evolution and temperature. Biol Symp   6   :   71 – 125 .

Nguyen, D.K., and Disteche, C.M. ( 2006 ). Dosage compensation of the active X chromosome in mammals. Nat. Genet.   38   :   47 – 53 .

O'Brien, S.J., and Gethman, R.C. ( 1973 ). Segmental aneuploidy as a probe for structural genes in Drosophila: Mitochondrial membrane enzymes. Genetics   75   :   155 – 167 .

Ono, T. ( 1935 ). Chromosomen und Sexualität von Rumex acetosa . Sci. Rep. Tohoku Univ. (Sendai, Japan). Biology   10   :   41 – 210 .

Papp, B., Pal, C., and Hurst, L.D. ( 2003 ). Dosage sensitivity and the evolution of gene families in yeast. Nature   424   :   194 – 197 .

Phillips, J.L., et al . ( 2001 ). The consequences of chromosomal aneuploidy on gene expression profiles in a cell line model for prostate carcinogenesis. Cancer Res.   61   :   8143 – 8149 .

Rabinow, L., Nguyen-Huynh, A.T., and Birchler, J.A. ( 1991 ). A trans-acting regulatory gene that inversely affects the expression of the white , brown and scarlet loci in Drosophila melanogaster . Genetics   129   :   463 – 480 .

Redon, R., et al . ( 2006 ). Global variation in copy number in the human genome. Nature   444   :   444 – 454 .

Saran, N.G., Pletcher, M.T., Natale, J.E., Cheng, Y., and Reeves, R.H. ( 2003 ). Global disruption of the cerebellar transcriptome in a Down syndrome mouse model. Hum. Mol. Genet.   12   :   2013 – 2019 .

Satina, S., Blakeslee, A.F., and Avery, A.G. ( 1937 ). Balanced and unbalanced haploids in Datura. J. Hered.   28   :   192 – 202 .

Schwartz, D. ( 1971 ). Genetic control of alcohol dehydrogenase–A competition model for regulation of gene action. Genetics   67   :   411 – 425 .

Sears, E.R. ( 1953 ). Nullisomic analysis in wheat. Am. Nat.   87   :   245 – 252 .

Sears, E.R. ( 1954 ). The aneuploids of common wheat. Missouri Agric. Exp. Sta. Res. Bull. 572.

Sharp, A.J., et al . ( 2005 ). Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet.   77   :   78 – 88 .

Simillion, C., Vandepoele, K., Montagu, M.C., Zabeau, M., and Van de Peer, Y. ( 2002 ). The hidden duplication past of Arabidopsis thaliana . Proc. Natl. Acad. Sci. USA   99   :   13627 – 13632 .

Sopko, R., Huang, D., Preston, N., Chua, G., Papp, B., Kafadar, K., Snyder, M., Oliver, S.G., Cyert, M., Hughes, T.R., Boone, C., and Andrews, B. ( 2006 ). Mapping pathways and phenotypes by systemic gene over-expression. Mol. Cell   21   :   319 – 330 .

Tanksley, S.D. ( 1993 ). Mapping polygenes. Annu. Rev. Genet.   27   :   205 – 233 .

Thomas, B.C., Pedersen, B., and Freeling, M. ( 2006 ). Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res.   16   :   934 – 946 .

Tsafrir, D., et al . ( 2006 ). Relationship of gene expression and chromosomal abnormalities in colorectal cancer. Cancer Res.   66   :   2129 – 2137 .

Veitia, R.A. ( 2002 ). Exploring the etiology of haploinsufficiency. Bioessays   24   :   175 – 184 .

Veitia, R.A. ( 2003 ). Nonlinear effects in macromolecular assembly and dosage sensitivity. J. Theor. Biol.   220   :   19 – 25 .

Veitia, R.A. ( 2004 ). Gene dosage balance in cellular pathways: Implications for dominance and gene duplicability. Genetics   168   :   569 – 574 .

Veitia, R.A. ( 2005 ). Gene dosage balance: Deletions, duplications and dominance. Trends Genet.   21   :   33 – 35 .

Yang, J., Lusk, R., and Li, W.H. ( 2003 ). Organismal complexity, protein complexity, and gene duplicability. Proc. Natl. Acad. Sci. USA   100   :   15661 – 15665 .

Wolfe, K.H. ( 2001 ). Yesterday's polyploids and the mystery of diploidization. Nat. Rev. Genet.   2   :   333 – 341 .

  • Online ISSN 1532-298X
  • Print ISSN 1040-4651
  • Copyright © 2024 American Society of Plant Biologists

Single-cell ATAC sequencing analysis: From data preprocessing to hypothesis generation

Affiliations.

  • 1 Department of Biotechnology, College of Life Science & Biotechnology, Yonsei University, Seoul 03722, Korea.
  • 2 Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul 03722, Korea.
  • PMID: 32637041
  • PMCID: PMC7327298
  • DOI: 10.1016/j.csbj.2020.06.012

Most genetic variations associated with human complex traits are located in non-coding genomic regions. Therefore, understanding the genotype-to-phenotype axis requires a comprehensive catalog of functional non-coding genomic elements, most of which are involved in epigenetic regulation of gene expression. Genome-wide maps of open chromatin regions can facilitate functional analysis of cis- and trans-regulatory elements via their connections with trait-associated sequence variants. Currently, Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) is considered the most accessible and cost-effective strategy for genome-wide profiling of chromatin accessibility. Single-cell ATAC-seq (scATAC-seq) technology has also been developed to study cell type-specific chromatin accessibility in tissue samples containing a heterogeneous cellular population. However, because scATAC-seq data are intrinsically noisy and sparse, it is difficult to extract biological signals accurately and to devise effective biological hypotheses. To overcome such limitations in scATAC-seq data analysis, new methods and software tools have been developed over the past few years. Nevertheless, there is not yet a consensus on best practice for scATAC-seq data analysis. In this review, we discuss scATAC-seq technology and data analysis methods, ranging from preprocessing to downstream analysis, along with an up-to-date list of published studies that involved the application of this method. We expect this review will provide a guideline for successful data generation and analysis methods using appropriate software tools and databases for the study of chromatin accessibility at single-cell resolution.

Keywords: ATAC sequencing; Chromatin accessibility; Single-cell ATAC sequencing; Single-cell RNA sequencing; Single-cell biology.

© 2020 The Author(s).

Next-generation sequencing for hypothesis-free genomic detection of invasive tropical infections in poly-microbially contaminated, formalin-fixed, paraffin-embedded tissue samples – a proof-of-principle assessment

  • Hagen Frickmann   ORCID: orcid.org/0000-0002-8967-9528 1 , 2 ,
  • Carsten Künne 3 ,
  • Ralf Matthias Hagen 4 ,
  • Andreas Podbielski 2 ,
  • Jana Normann 2 ,
  • Sven Poppert 5 , 6 ,
  • Mario Looso 3 &
  • Bernd Kreikemeyer 2  

BMC Microbiology volume 19, Article number: 75 (2019)

The potential of next-generation sequencing (NGS) for hypothesis-free pathogen diagnosis from (poly-)microbially contaminated, formalin-fixed, paraffin-embedded tissue samples from patients with invasive fungal infections and amebiasis was investigated. Samples from patients with chromoblastomycosis (n = 3), coccidioidomycosis (n = 2), histoplasmosis (n = 4), histoplasmosis or cryptococcosis with poor histological discriminability (n = 1), mucormycosis (n = 2), mycetoma (n = 3), rhinosporidiosis (n = 2), and invasive Entamoeba histolytica infections (n = 6) were analyzed by NGS (one Illumina v3 run per sample). To discriminate contamination from putative infections in NGS analysis, the mean and standard deviation of the number of specific sequence fragments (paired reads) were determined and compared in all samples examined for the pathogens in question.

For a match between NGS results and histological diagnoses, the percentage of species-specific reads had to exceed the mean value of all 23 assessed sample materials by more than 4 standard deviations. Potentially etiologically relevant pathogens could be identified by NGS in 5 out of 17 samples of patients with invasive mycoses and in 1 out of 6 samples of patients with amebiasis.

Conclusions

The use of NGS for hypothesis-free pathogen diagnosis from contamination-prone formalin-fixed, paraffin-embedded tissue requires further standardization.

Reliable results of microbiological diagnostic approaches, in particular of culture-based approaches, require suitable pre-analytical conditions as a prerequisite [ 1 ]. The intentional or unintentional inactivation of infectious agents can complicate diagnostic procedures. This is the case, for example, when the possibility of infection is not taken into account during initial sampling, so that the sample material is fixed in 4% buffered formalin for histopathological work-up to preserve tissue structure and subsequently embedded in paraffin in the pathology laboratory. If histology provides evidence of an infectious cause for an inflammatory reaction, culture-based diagnostic approaches are no longer possible because the microorganisms have been inactivated by formalin.

The sensitivity of molecular diagnostic methods, for example, of polymerase chain reaction (PCR), is significantly reduced by formalin due to nucleic acid and protein cross-linking, deamination of cytosine to uracil, strand breaks, and the difficulty of extracting DNA from paraffin-embedded tissues [ 2 , 3 , 4 , 5 , 6 , 7 ]. If the microscopic detection of pathogens proves inconclusive, the molecular detection of pathogens from formalin-fixed sample material is nevertheless the most promising approach if fresh sample material cannot be obtained or can only be obtained with a significant health risk for the patient [ 7 ].

The sensitivity of molecular pathogen detection from formalin-fixed, paraffin-embedded (FFPE) tissue is influenced by factors such as sample age and pathogen density [ 7 ]. Best results can be expected for PCRs that amplify very short fragments, since the formalin-induced strand breaks, cross-linking of DNA strands, and protein–DNA cross-links prevent the amplification of larger fragments. Such cross-linking events are stochastic and expected, on average, about every 1000 base pairs; they reduce the reliability of PCRs with longer amplicons. This is especially true if samples inherently include only small quantities of pathogen DNA [ 7 ].
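The effect of amplicon length can be illustrated with a rough calculation. The sketch below assumes that blocking lesions occur independently along the template at an average of one per 1000 base pairs, so that their number in a stretch of DNA follows a Poisson distribution. This independence model and the function name `intact_fraction` are assumptions made for illustration; they are not part of the study.

```python
import math


def intact_fraction(amplicon_bp: int, mean_spacing_bp: float = 1000.0) -> float:
    """Expected fraction of templates with no blocking lesion in the amplicon.

    Assumption (not from the study): lesions are independent and occur on
    average once per `mean_spacing_bp`, so the lesion count is Poisson with
    mean amplicon_bp / mean_spacing_bp and P(no lesion) = exp(-mean).
    """
    if amplicon_bp < 0 or mean_spacing_bp <= 0:
        raise ValueError("lengths must be non-negative and spacing positive")
    return math.exp(-amplicon_bp / mean_spacing_bp)


if __name__ == "__main__":
    for length in (100, 300, 1000, 2000):
        share = intact_fraction(length)
        print(f"{length:>5} bp amplicon: {share:.2f} of templates expected intact")
```

Under these assumptions roughly 90% of templates remain amplifiable for a 100 bp target but only about 37% for a 1000 bp target, which is consistent with the preference for very short amplicons described above.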

A limitation of targeted PCR diagnostics is the fact that primer-based nucleic acid amplification detects nucleic acids of defined pathogens or groups of pathogens only. If symptoms of the patient are nonspecific and can be induced by a variety of potential pathogens, rational selection of applicable PCR panels that are both comprehensive and economical can represent a differential diagnostic challenge [ 8 ].

Pan-bacterial or pan-fungal ribosomal RNA gene PCRs with subsequent Sanger sequencing [ 9 ] for the sequence-based identification of bacteria and fungi in the sample material [ 10 ] are potential alternatives to genus- or species-specific PCR. These procedures are poorly standardized and therefore of doubtful diagnostic reliability, especially when the result is negative [ 10 ], although they can provide valuable information when the result is positive. The complementary diagnostic value of this method lies mainly in sterile sample materials obtained from primarily sterile compartments, for example, biopsy material from endocarditis patients [ 11 ]. In mixed cultures or samples with poly-microbial contamination, mixed sequences occur in Sanger sequencing that do not allow reliable pathogen identification [ 12 ]. However, such microbial contamination has to be regularly expected in formalin-fixed, paraffin-embedded sample material due to nonsterile storage of the paraffin blocks or contamination in the paraffin wax itself [ 13 ]. Consequently, the diagnostic value of such procedures is limited for FFPE materials.

The diagnostic application of NGS (next-generation sequencing) from primary material is a potential alternative. Hypothesis-free NGS has been used to successfully attribute etiologically unclear infection events to specific pathogens [ 14 ]. NGS is also suitable for the assessment of primarily nonsterile sample materials. The assignment of etiological relevance with respect to the existing clinical symptoms can be based upon the relative frequency of pathogen-specific nucleic acid sequences [ 15 ] or on the pathogenicity of molecularly proven microorganisms. An example is the diagnosis of ornithosis by NGS-based demonstration of C. psittaci DNA in respiratory secretions of patients with severe respiratory infection of unknown origin [ 16 ].

The application of NGS to FFPE sample materials in general [ 17 ] and to pathogen detection and typing from such materials in particular [ 18 ] is the subject of ongoing evaluation studies. The present study deals with NGS-based detection of invasive, mostly tropical, mycoses and invasive amebiasis from histological specimens. Matching between NGS and specific PCR for E. histolytica or panfungal PCR with subsequent Sanger sequencing as well as potential additional information on relevant etiologic pathogens provided by NGS are assessed.

The hypothesis of the study is that NGS may be more suitable for the hypothesis-free genomic detection of rare invasive infections in potentially poly-microbially contaminated, formalin-fixed, paraffin-embedded tissue samples than PCR with subsequent Sanger sequencing. The advantage of NGS is its suitability for parallel sequencing of virtually all DNA sequences within a biological sample, depending only on the depth of sequencing. If, in contrast, PCR primers with specificity for multiple pathogens, such as pan-fungal primers, lead to amplification of sequences of different pathogens within the same sample, overlays of different sequences can lead to non-interpretable results in Sanger sequencing.

Results of the NGS analyses

The number of evaluable sequence fragments (reads) per sample averaged 9,799,803 ± 6,662,643 (standard deviation) (lowest number 2,717,953 reads; highest number 29,225,435 reads) in the NGS examination. Among these reads, an average of 26% ± 19% (lowest percentage 1%; highest percentage 59%) could not be identified by the Kraken software.

No significant Spearman rank correlation between sample age and number of detected reads could be identified with Spearman r  = 0.2962 (corrected for ties), a 95% confidence interval of − 0.1449 to 0.6391, and a non-significant two-tailed P  = 0.1699 (calculated using the software GraphPad InStat, version 3.06, 32 bit for Windows, GraphPad Software Inc., San Diego, CA, USA).
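For readers who want to reproduce this kind of check, the following is a minimal sketch of a Spearman rank correlation with tie handling, written with the Python standard library only. Tied values receive the average of their ranks and the coefficient is the Pearson correlation of the two rank vectors. The study itself used GraphPad InStat; the function names and the example values below are hypothetical and are not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, corrected for ties via average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical sample ages (years) and read counts, for illustration only.
    ages = [2, 5, 5, 9, 14, 20]
    reads = [4_100_000, 9_800_000, 3_200_000, 12_500_000, 7_700_000, 6_000_000]
    print(f"Spearman r = {spearman(ages, reads):.4f}")
```

The sketch returns only the coefficient; the confidence interval and P value reported above would need an additional step that is not shown here.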

The proportion of sequences of eukaryotic organisms in the sample averaged 39.7% ± 36.7%. The largest share consisted of human reads at 37.6% ± 37.2%. The proportion of fungal sequences was a mere 0.12% ± 0.16%. Bacterial sequences constituted an average of 23.9% ± 22.0%, viral sequences an average of 10.5% ± 7.2%. The identified sequences covered a wide spectrum of different species without clear relation to the histologically defined invasive infections. Among the bacterial sequences, Pseudomonas spp.-specific reads constituted 0.6% ± 0.6% of all reads, and Staphylococcus spp.-specific reads 0.01% ± 0.02% of all reads. Although some of the patients with invasive fungal infections had suffered from AIDS (personal communication with the Department of Pathology of the Bernhard Nocht Institute for Tropical Medicine Hamburg, which initially provided the samples), proviral DNA of HIV was undetectable in any of the samples.

The distribution of detectable reads is visualized in Table 1 .

Focusing on the fungal sequences detected in the samples of the patients with invasive fungal infections, etiologically relevant pathogens matching the histological diagnosis, i.e. Histoplasma capsulatum, Madurella mycetomatis, and Fusarium pseudograminearum, were among the three most frequently detected fungal species in 3 out of 17 samples. Among these, there were two cases of histoplasmosis and mycetoma that were also confirmed by pan-fungal PCR [ 13 ] (see below). Specifically, Histoplasma capsulatum sequences constituted the most frequent fungal reads in the histoplasmosis sample, accounting for 0.02% of total reads in the sample and 34% of fungal reads. Madurella mycetomatis-specific sequences amounted to 0.001% of total reads in the respective sample and 4% of fungal reads, ranking third among the most frequently detected fungal sequences in the mycetoma sample. In another mycetoma sample, a Fusarium species, here Fusarium pseudograminearum, ranked second among the most frequently detected fungi with 0.02% of total reads in the sample and 16% of fungal reads. In all other samples studied, spores of fungi from the environment occupied ranks 1 to 3 of the most frequently detected fungal reads. The frequently detected environmental fungi comprised Auricularia delicata, Botrytis cinerea, Coniosporium apollinis, Debaryomyces hansenii var. hansenii, Eutypa lata, Gaeumannomyces graminis, Malassezia globosa, Marssonina brunnea, Meyerozyma guilliermondii, Neofusicoccum parvum, Parastagonospora nodorum, Penicillium rubens, Pestalotiopsis fici, Pseudozyma hubeiensis, Sordaria macrospora, Thielavia terrestris, Trametes versicolor, Verticillium alfalfae, and Wallemia ichthyophaga. Facultatively pathogenic species like Aspergillus flavus, Candida orthopsilosis, Candida parapsilosis, and Fusarium pseudograminearum without relation to the histologically diagnosed disease were also among the three most frequently detected species.
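The two percentages quoted for each species (share of all reads and share of fungal reads) and the ranking among fungal species can be derived from per-species read counts as sketched below. The function names and all read counts are hypothetical; the counts were chosen only so that the first species lands near the reported order of magnitude, and they are not taken from the study.

```python
def read_shares(species_reads: int, fungal_reads: int, total_reads: int):
    """Return (percent of total reads, percent of fungal reads) for one species."""
    if not 0 <= species_reads <= fungal_reads <= total_reads or total_reads == 0:
        raise ValueError("expected 0 <= species <= fungal <= total and total > 0")
    pct_total = 100 * species_reads / total_reads
    pct_fungal = 100 * species_reads / fungal_reads if fungal_reads else 0.0
    return pct_total, pct_fungal


def top_species(counts: dict, n: int = 3):
    """Return the n species with the most reads, most frequent first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical counts for a single sample (illustration only).
    total = 10_000_000
    fungal_counts = {
        "Histoplasma capsulatum": 2_000,
        "Malassezia globosa": 1_500,
        "Penicillium rubens": 900,
        "Botrytis cinerea": 800,
        "Trametes versicolor": 700,
    }
    fungal_total = sum(fungal_counts.values())
    for rank, (name, count) in enumerate(top_species(fungal_counts), start=1):
        pct_total, pct_fungal = read_shares(count, fungal_total, total)
        print(f"{rank}. {name}: {pct_total:.3f}% of total, {pct_fungal:.0f}% of fungal reads")
```

Because fungal reads are such a small fraction of all reads, a species can dominate the fungal fraction while still contributing only hundredths of a percent of the total, which is why both figures are reported side by side.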

The abundance or absence of sequences of fungi with potential etiological relevance in line with the histological diagnoses of the fungal sample collection was also studied in all samples (see “Materials and Methods” for the selection of the assessed fungi). The species detected, the average percentage of the corresponding reads in all samples (± 1 standard deviation), and the average percentage of respective reads as a proportion of the fungal reads (± 1 standard deviation) are shown in Table 2 . If genera listed in the “Materials and Methods” section are not represented in Table 2 , no corresponding detectable reads were found in any of the assessed samples.

Since mycetoma can also be caused by bacteria, the same approach was adopted for relevant bacterial species. This is illustrated in Table 3 .

The results for Entamoeba spp., E. histolytica and E. dispar , are given in Table 4 .

In a diagnostic total genomic survey such as occurs in NGS analysis, relevant pathogens must be distinguished from random contamination events in the context of sample preparation. It was therefore investigated how the proportions of pathogen-specific reads in cases of etiologic relevance differ from accidental contamination events. For this, it was determined for which samples the detected percentage of reads per pathogenic species exceeded the 1st, 2nd, 3rd, or 4th standard deviation from the mean of all samples and whether the results were consistent with the histological diagnoses. The results of the screenings for pathogenic fungi in the patients with fungal infections are shown with the focus on the percentage of the total number of reads in Table 5 and on the percentage of fungus-specific reads in Table 6 . Table 7 provides a corresponding overview for the amebas.
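The thresholding step can be sketched as follows: for one species, take its read percentage in every sample, compute the mean and standard deviation across samples, and report for each sample the highest number of standard deviations (1 to 4) by which it exceeds the mean. This is a minimal sketch under stated assumptions. The function name `sd_level` is hypothetical, the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption because the text does not specify which was used, and the example percentages are invented.

```python
from statistics import mean, stdev


def sd_level(value: float, all_values: list, max_level: int = 4) -> int:
    """Highest k in 1..max_level with value > mean + k * SD, or 0 if none.

    Assumption: SD is the sample standard deviation over all samples,
    including the sample being scored.
    """
    centre = mean(all_values)
    spread = stdev(all_values)
    level = 0
    for k in range(1, max_level + 1):
        if value > centre + k * spread:
            level = k
    return level


if __name__ == "__main__":
    # Hypothetical read percentages for one species across 23 samples:
    # 22 samples at background level and one with a clear excess.
    percentages = [0.0001] * 22 + [0.02]
    levels = [sd_level(p, percentages) for p in percentages]
    for index, (pct, level) in enumerate(zip(percentages, levels), start=1):
        if level > 0:
            print(f"sample {index}: {pct}% of reads, above {level} SD")
```

One consequence of scoring each sample against a mean and standard deviation that include the sample itself is that a single value among n samples can lie at most (n - 1) / sqrt(n) sample standard deviations above the mean, which is about 4.6 for n = 23. A threshold of 4 standard deviations is therefore close to the maximum attainable in a collection of this size.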

For the assessment based on the total number of reads, potentially relevant fungal species were detected above the 4th standard deviation in 5 samples, above the 3rd standard deviation in 1 sample, above the 2nd standard deviation in 4 samples, and above the 1st standard deviation in 8 samples. No such increased quantities were detected for 5 samples. In all 5 samples with fungus detection above the 4th standard deviation, the findings agreed with the histological result. The single detection above the 3rd standard deviation did not agree with the histological result. For the 4 samples with positive results above the 2nd standard deviation, there was a match in 1 case and a mismatch in the 3 other cases. For the 8 samples with fungal detection above the 1st standard deviation, matching was found in 1 case and mismatching in the other 7 cases (Table 5).
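The agreement with histology at each standard deviation level follows directly from the counts reported in this paragraph. The short sketch below only restates that arithmetic; the names `detections` and `match_rate` are illustrative.

```python
# Counts from the paragraph above (comparison with the total number of reads):
# SD level -> (samples detected at that level, samples matching histology)
detections = {4: (5, 5), 3: (1, 0), 2: (4, 1), 1: (8, 1)}

def match_rate(detected, matched):
    """Percentage of detections at a level that agree with histology."""
    return 100.0 * matched / detected

rates = {level: match_rate(*counts) for level, counts in detections.items()}
for level in sorted(rates, reverse=True):
    print(f"above SD {level}: {rates[level]:.1f}% concordant with histology")
```

Only the 4th standard deviation level shows full concordance; the lower levels agree with histology in a quarter of the detections or fewer.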

Of note, fungal sequences were also found in the 6 biopsies from the gut of the patients with invasive amebiasis. Compared with the total numbers of reads, detections above the 4th standard deviation occurred in 16 instances (0.02% Pythium ultimum , 0.000009% Exophiala pisciphila , 0.0001% Sporothrix schenckii , 0.00002% Mortierella verticillata , 0.00003% Cryptococcus stepposus , 0.002% Setosphaeria turcica , 0.002% Leptosphaeria maculans , 0.00002% Fusarium solani , 0.0001% Cryptococcus victoriae , 0.00002% Cryptococcus tronadorensis , 0.0002% Cryptococcus gattii , 0.0002% Cladosporium cladosporioides , 0.0003% Capronia coronata , 0.0009% Bipolaris zeicola , 0.001% Bipolaris sorokiniana , 0.001% Bipolaris oryzae ). Detections above the 3rd standard deviation succeeded in 4 instances (0.009% Aspergillus spp., 0.0003% Cladophialophora carrionii, 0.0008% Paracoccidioides brasiliensis , 0.00002% Acremonium chrysogenum ), above the 2nd standard deviation in 10 instances (0.0006% Capronia epimyces , 0.0004% Chaetomium thermophilum var. Thermophilum, 0.0002% Cryptococcus neoformans , 0.0002% Exophiala dermatitidis , 0.0005% Paracoccidioides sp. ‘lutzii’, 0.0002% Cladophialophora psammophila , 0.000005% Exophiala pisciphila , 0.0004% Coccidioides immitis , 0.0004% Coccidioides posadasii , 0.00001% Fusarium solani ), and above the 1st standard deviation in 11 instances (0.000008 and 0.00001% Aspergillus spp., respectively, 0.00008% Cladosporium cladosporioides , 0.0003% Coccidioides immitis , 0.0003% Coccidioides posadasii , 0.0002% (in three instances) Cyphellophora europaea , 0.0001% Fusarium graminearum , 0.0007% Leptosphaeria maculans , 0.0004% Paracoccidioides sp. ‘lutzii’).

On a comparison with the fungal reads only, there were 5 detections above the 4th standard deviation, 1 detection above the 3rd standard deviation, 4 detections above the 2nd standard deviation, and 6 detections above the 1st standard deviation. Although all detections above the 4th standard deviation and 2 out of 4 detections above the 2nd standard deviation matched the histological findings, no other results matched the histological diagnoses (Table 6 ).

Again, there were fungal sequences in the 6 biopsies from the gut of the patients with invasive amebiasis. Compared with the total numbers of fungal reads only, detections above the 4th standard deviation occurred in 8 instances (0.2% Sporothrix schenckii, 0.03% Mortierella verticillata, 0.03% Exophiala pisciphila, 0.1% Cryptococcus victoriae, 0.02% Cryptococcus tronadorensis, 0.06% Cryptococcus stepposus, 0.9% Coccidioides posadasii, 0.9% Bipolaris sorokiniana), above the 3rd standard deviation in 14 instances (8% Aspergillus spp., 0.1% Acremonium chrysogenum, 0.9% Bipolaris oryzae, 0.8% Bipolaris zeicola, 0.3% Capronia coronata, 0.4% Cladosporium cladosporioides, 1% Coccidioides immitis, 0.2% Cryptococcus gattii, 0.7% Cyphellophora europaea, 0.3% Fusarium graminearum, 0.3% Fusarium solani, 2% Leptosphaeria maculans, 1% Paracoccidioides sp. ‘lutzii’, 2% Setosphaeria turcica), above the 2nd standard deviation in 13 instances (0.8% Paracoccidioides sp. ‘lutzii’, 0.7% (twice) Paracoccidioides brasiliensis, 0.2% Fusarium solani, 0.2% Fusarium graminearum, 0.2% Exophiala dermatitidis, 0.5% Cyphellophora europaea, 0.3% Cryptococcus neoformans, 0.4% Cladophialophora psammophila, 0.4% Cladophialophora carrionii, 1% Capronia epimyces, 0.6% Bipolaris oryzae, 0.8% Acremonium chrysogenum), and above the 1st standard deviation in 30 instances (4% Aspergillus spp., 0.4% Bipolaris oryzae, 0.3% Bipolaris sorokiniana, 0.4 and 0.5% Bipolaris zeicola, respectively, 0.2% (twice) Capronia coronata, 0.9% Capronia epimyces, 0.4 and 0.3% (three times) Chaetomium thermophilum var. thermophilum, respectively, 0.3% Cladophialophora carrionii, 0.3% (twice) Cladophialophora psammophila, 0.1% Cryptococcus gattii, 0.2% (twice) Cryptococcus neoformans, 0.4% Cyphellophora europaea, 0.009% Exophiala pisciphila, 1 and 0.9% Leptosphaeria maculans, respectively, 0.4% (three times) Paracoccidioides brasiliensis, 0.5% Paracoccidioides sp. ‘lutzii’, 1% (twice) and 0.9% Setosphaeria turcica, respectively).

The partial mismatch between the comparisons with the whole of the reads and the comparisons with the fungal reads only in the fungal samples is due to the considerable differences in the proportions of assignable reads as well as eukaryotic, bacterial, and viral reads (see above). Matching of results above the fourth standard deviation was found for all reads and fungal reads only; only two other cases (samples 9 and 10) showed matches, and those above only the first standard deviation. There is striking concordance of the two positive detections in pan-fungal PCRs, histology, and NGS results (Tables 5 and 6 ). Even the species Madurella mycetomatis , which accounted for only 4% of fungal reads, was amplified preferentially in one of the pan-fungal PCRs. Examples such as Histoplasma capsulatum in sample 14 and Madurella mycetomatis in sample 17 also show that a high percentage of specific reads of a pathogen can give a hint on its potential etiologic relevance. This did not apply, however, to all cases under investigation (for example, samples 4, 5). In addition, the percentage of reads of fungi from the environment was quantitatively dominant in nearly all cases with the exception of sample 14. Thus, no compelling association between etiologic plausibility and quantitative proportion of detected reads was confirmed.

Only results above the first and second standard deviation above the mean value were observed for reads of relevant actinomycetoma-associated bacteria (Nocardia and Streptomyces). Based on the totality of reads, Nocardia brasiliensis (0.06%, histologically rhinosporidiosis) was measured once above the 2nd standard deviation. Above the 1st standard deviation, Nocardia brasiliensis was found in 7 cases (5 × 0.4%, 2 × 0.5%; histologically 1 chromoblastomycosis, 1 coccidioidomycosis, 1 histoplasmosis, 1 histoplasmosis or cryptococcosis, 1 mucormycosis, only 1 mycetoma, and 1 rhinosporidiosis), the same as for Nocardia cyriacigeorgica (7 × 0.2%; histologically 1 coccidioidomycosis, 1 histoplasmosis, 1 histoplasmosis or cryptococcosis, 1 mucormycosis, only 1 mycetoma, and 2 cases of rhinosporidiosis). For Nocardia farcinica, 8 cases (6 × 0.04%, 2 × 0.05%) and for Streptomyces spp., 8 cases (2 × 0.9%, 6 × 1%) were detected (histologically 1 chromoblastomycosis, 1 coccidioidomycosis, 1 histoplasmosis, 1 histoplasmosis or cryptococcosis, 1 mucormycosis, only 1 mycetoma, and 2 cases of rhinosporidiosis). Compared with the total number of reads, there were no detections above any standard deviation in the ameba samples. In relation to the bacteria-specific reads, there were detections above only the 1st standard deviation in the fungal samples. This involved Nocardia brasiliensis (6 × 0.2%) and Streptomyces spp. (6 × 4%) in 6 samples (histologically 2 chromoblastomycosis, 1 mucormycosis, 2 mycetoma, and 1 rhinosporidiosis); Nocardia cyriacigeorgica (5 × 0.07%) in 5 samples (histologically 1 chromoblastomycosis, 1 histoplasmosis, 1 mucormycosis, 1 mycetoma, and 1 rhinosporidiosis); and Nocardia farcinica in 3 samples (3 × 0.2%) (histologically 1 chromoblastomycosis, 1 mucormycosis, 1 mycetoma). In samples from patients with invasive amebiasis, Nocardia cyriacigeorgica (0.1%) was once above the 2nd standard deviation and Nocardia farcinica (0.2%) was once above the 1st standard deviation compared with the bacteria-specific reads.

Among the 6 assessed ameba samples, there were 2 samples with high ameba density microscopically in adjacent histological sections and positive E. histolytica PCR; 1 sample with only few amebas in histology in neighboring sections and only questionable positive PCR (cycle threshold value > 35); 1 sample with a positive microscopic result that was questionable due to a very low parasite density in adjacent histological sections and negative PCR; as well as 2 samples with negative histology in adjacent sections and negative PCR results. Sequences of E. histolytica (0.009%) and Entamoeba spp. (0.01%) were detected by NGS above the 4th standard deviation in comparison with the total number of reads in the samples in one of the strongly positive samples in histology and PCR. In the same sample, sequences were assigned to the phylogenetically closely related E. dispar (0.0004%) above the 2nd standard deviation. Entamoeba spp. sequences above the first standard deviation were also detected in the sample with a few histologically visible amebas and a questionable PCR result (0.0003%) and in one of the two samples with negative PCR and negative histology (0.0004%) (Table 7 ). Furthermore, there were Entamoeba spp.-specific sequences above the 2nd standard deviation (0.0005%) in one chromoblastomycosis sample and above the 1st standard deviation (0.0003%) in a mycetoma sample. E. dispar -specific sequences were detected in the latter two samples above the 1st standard deviation (each 0.0003%) as well.

The NGS technology offers a molecular biological diagnostic tool that allows pathogen detection in complex sample material without prior specific suspicion, provided that an adequate sequencing depth can be guaranteed. The question of adequate sequencing depth for metagenomic analyses is not easily answered, in particular if the proportion of pathogen DNA within a sample is unknown. Most recently, it was suggested by Hillmann et al. ( https://www.biorxiv.org/content/biorxiv/early/2018/05/12/320986.full.pdf , last accessed on 1 August 2018) that shallow metagenomic analysis effectively probes the diversity of species down to a sequencing depth of ~ 500 k reads per sample. A greater sequencing depth than this was achieved for all described samples by our sequencing approach.

The technological approaches of NGS are varied [ 9 , 19 , 20 , 21 , 22 , 23 , 24 , 25 ] and some are still in the stage of development or optimization. A descriptive overview on NGS for the diagnosis of infectious diseases was introduced by Hasman and colleagues [ 26 ]. In a previous study, an association between infectious agents and a disease of unknown origin was confirmed [ 14 ]. Further, NGS-based detection of bacterial pathogens from two-thirds of tested urine samples was demonstrated in a previous “proof-of-principle” investigation [ 26 ]. NGS is also suitable for the detection of poly-microbial infections, as was shown for sample material from brain abscesses [ 27 ]. The most reliable diagnostic information can be provided by NGS from primary sterile sample material, where few reads can be used for pathogen diagnostics. Thus Wilson and colleagues succeeded in demonstration of Leptospira -induced meningoencephalitis with NGS based on only 475 (out of more than 3 million) specific reads [ 28 ]. Pathogen identification with NGS-based analysis of RNA (ribonucleic acid) in the sample material is also possible and succeeded in recognizing RNA viruses such as influenza virus in respiratory samples in the so-called UMERS (“unbiased metagenomic nontargeted RNA sequencing”) approach [ 29 ].

Although the NGS technology is still expensive, sequencing costs have dropped dramatically. For example, the cost of sequencing a human genome was reduced from about 100,000 euros to about 1000 euros within a few years as a result of technological progress [ 9 ]. In particular, the introduction of small automated sequencers (about the size of laser printers) has made NGS technology interesting for diagnostic purposes. An earlier comparative evaluation of these small “workbench” sequencers showed that the MiSeq system (Illumina) that was used in this study is superior to the competitors Ion Torrent PGM (Life Technologies, Carlsbad, CA, USA) and the no-longer available 454 GS Junior (Roche, Basel, Switzerland) with focus on the rarity of sequencing errors [ 30 ].

The hitherto quite complex and non–user-friendly analysis of sequence information is currently one of the major limitations of wide diagnostic application of NGS technology [ 31 ]. Further automation and standardization are essential to overcome these problems for the application of NGS in diagnostic routine. This also applies to the quality and accessibility of underlying databases.

Although the application of NGS with formalin-fixed, paraffin-embedded tissue is not new [ 17 , 18 ], the NGS-based detection of etiologically relevant pathogens from such materials is a diagnostic challenge. In addition to previous experiments, we therefore conducted a real-life assessment with sample materials from patients with rare and tropical invasive infections, for which no similar experience is available. Non-pathogen-specific molecular diagnostic approaches such as NGS are easily affected by contamination due to environmental microorganisms that are, for example, cast along with the sample in wax. As shown for Bartonella spp. DNA some years ago [ 32 ], DNA cross-contamination during tissue processing in a multispecies histopathological laboratory is highly likely. In the current, still unpublished, EORTC (European Organization for Research and Treatment of Cancer) criteria (personal correspondence with Professor Ralf Bialek) for the detection of a fungal infection from paraffin-embedded tissue by means of PCR, it is explicitly pointed out that the detection of specific fungal DNA in paraffin-embedded tissues shall only be used as proof of infection if fungal elements are also seen in histopathological assessments. This is to make sure that possible contamination of paraffin with ubiquitous fungal spores, for example of Aspergillus spp., is not mistakenly used as evidence of invasive mycosis. Although protocols for optimizing the use of FFPEs in molecular epidemiology by reducing the contamination risk have been introduced [ 33 ], initial tissue processing and waxing had been performed in a histopathological standard laboratory, where no special precautions against DNA contamination had been enforced. During the cutting of the sections for the molecular analyses, protective procedures against contamination such as discarding the first cuts of each block had been enforced as detailed elsewhere [ 7 , 13 ]. However, such precautions cannot undo contamination with fungal spores or pathogen DNA that has already occurred during initial processing and waxing of the tissue. This problem was also evident in the present study, in terms of both pan-fungal PCRs and the NGS approach. Traces of DNA even of rare tropical pathogens could be identified within the samples. Species-specific PCRs [ 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 ] are potential alternatives to pan-fungal PCR approaches, but their selection requires a specific diagnostic suspicion.

Traditional histology is likewise not always reliable in cases of invasive fungal infection. Its reliability is influenced by a variety of factors, including the requirement for a critical minimum density of pathogens in the examined tissue and a high level of expertise of the physician. In comparative studies between histology and culture, the latter of which cannot be performed from formalin-fixed tissues, a match of less than 80% was demonstrated [ 42 ], so histological diagnoses of invasive mycoses have to be interpreted with caution [ 36 ]. In this study, the histological evaluation was performed by pathologists with professional experience in tropical infectious diseases [ 13 ]. Particularly considering the large number of genera and species that, as shown in the “Materials and Methods” section, may account for the assessed invasive fungal infections, one has to bear in mind that histologically indistinguishable findings may be caused by different agents. In most cases of invasive mycosis in this study, histology did not allow a species-specific diagnosis but only micro-morphological descriptions such as chromoblastomycosis, mucormycosis, or mycetoma. The lack of cultural and serological results makes the interpretation of such findings challenging, which is an undeniable limitation of this study. Molecular approaches can be very useful here if culture is not possible. Even when sampling conditions allow culture approaches, cultural growth is not possible for all invasive fungi and takes between several days and several weeks depending on the species, as summarized elsewhere [ 13 ]. These factors reduce the diagnostic value of fungal culture.

A first important precondition for the reliability of molecular diagnostic findings is the quality of the nucleic acid extraction, which in this study was unacceptable for several samples that had been stored for long times. In line with this, partial PCR inhibition was observed in some of the assessed samples, as shown elsewhere [ 13 ]. Comparative testing of alternative nucleic acid purification methods [ 43 , 44 ] might have contributed to a further optimization of nucleic acid preparation in this study, but this was impossible due to the small amount of sample material that was available, which is an undeniable limitation of the study. For the samples that could be included in the NGS assessment, no significant Spearman rank correlation between sample age and number of detected reads could be found. However, the heterogeneity of the sample materials used makes an interpretation difficult. Of note, no samples older than 31 years were included.
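For readers unfamiliar with the Spearman rank correlation mentioned above, the coefficient is the Pearson correlation of the rank-transformed values. The following sketch uses only the Python standard library; the function names and the age and read-count values are hypothetical and are not the study data, and no significance test is included.

```python
def ranks(values):
    """Assign 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: sample age in years and number of detected reads
age = [2, 5, 9, 14, 20, 31]
reads = [900_000, 1_200_000, 700_000, 1_500_000, 800_000, 1_100_000]
rho = spearman_rho(age, reads)
print(round(rho, 3))
```

A coefficient close to zero, as in this made-up example, would be in line with the absence of a monotonic relationship between sample age and read count.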

Since the paraffin blocks were stored with the formalin-fixed tissues for years without any special protective measures against the deposition of fungal spores, contamination with environmental fungal spores can be regarded as highly probable. Thus, the high levels of contamination with environmental fungi are not unexpected. Contamination of the paraffin is an alternative explanation.

The high degree of contamination, expected from the previously applied pan-fungal PCRs [ 13 ], was a challenge for the NGS analysis. Since NGS analysis is associated with a completely nonspecific analysis of DNA fragments, the challenge is the discrimination of contaminants and etiologically relevant pathogens. The histological results of the samples from patients with invasive mycosis provided hints but not etiological clarification at the species level.

To overcome this problem, each mean value and standard deviation of the percentages of specific sequence fragments (reads) of etiologically relevant species were determined in the assessed samples. Then, the standard deviation from the average at which matching with the histological results can be expected was investigated.

A high rate of matches between histology and NGS results was found only for percentages above the fourth standard deviation in relation to the total number of reads and the number of fungi-specific reads. In cases with percentages above the fourth standard deviation, clear similarities with histology were found. When the percentages in relation to the totality of the reads in the sample were compared with the percentages in relation to the fungal reads in the sample, there was a considerable deviation, which can be explained by the massive differences in the proportions of assignable reads as well as eukaryotic, bacterial, and viral reads. For samples in which none of the assessed species reached the 4th standard deviation, no reliable assignment of etiological relevance could be performed. In the 6 tested samples from patients with invasive amebiasis, NGS-based detection of E. histolytica succeeded in a single sample only, which had also been positive in histology and was clearly positive by PCR.

The approach of comparing NGS results from nonsterile samples of patients with results from a healthy population to define etiologic relevance is not new. A comparison with negative control samples, which was based on a specific subtraction of reads, has been proposed by other authors as a method for identifying pathogens of potential etiological relevance. In this way, the detection of shiga-toxin-producing Escherichia coli succeeded in 67% of stool samples of patients during an outbreak [ 15 ].

Another approach was chosen for the sample collection assessed in this study. Unlike in the study described above [ 15 ], historical sample materials were used in the real-life assessment presented here. Because the samples had not been collected and stored for study purposes but as part of the diagnostic routine, no matched standardized negative control samples had been prepared. The collection of corresponding materials from completely healthy control subjects would also have posed an ethical problem in instances where the materials were derived from severely invasive sampling procedures, e.g., in the case of samples from lung tissue, spinous process tissue, or tricuspid valve tissue. In any case, it is obviously impossible to retrospectively apply any sort of standardization to samples that were prepared, paraffinated, and stored under unknown and presumably variable conditions in comparatively low-tech laboratory environments, sometimes a considerable time in the past. Although randomly selected blocks from a similar time frame that were negative by histopathology might have helped to establish an expected background, such an approach was not chosen for the above-mentioned reasons.

To overcome the problem of the lack of standardized negative controls, the mean percentages of specific reads from all samples, including histologically positive and negative ones regarding the various assessed species, were considered as proxy-negative control values, representing an averaged background. The repeated summing of the standard deviation values and comparison with the individually measured percentages in each sample allowed an estimation of how many more specific reads were detected in each sample than in the proxy-negative control. Accordingly, a standard deviation-based and not a subtraction-based approach [ 15 ] was chosen.

The rationale of the standard deviation-based approach is the assumption that the likelihood of a real infection increases with the number of standard deviations of a percentage of measured specific reads in a specific sample above the proxy-negative control. With a value high above the mean value plus several standard deviations, the risk is low that this percentage is measured by chance, i.e., due to contamination. If bacteria and fungi were assessed, these comparisons were carried out not only with all reads within the samples but also with bacteria- or fungi-specific reads. This was done to reduce the effects of the slightly different proportions of viral, bacterial, fungal, and other eukaryotic reads specific to the sample materials. As amebae are neither fungi nor bacteria, such an approach was not possible for their assessment. As an indication of potential contamination, the percentages of specific reads for all species of the genus Entamoeba and also of specific reads for non-pathogenic amebae such as E. dispar were assessed.

For the fungi and bacteria that were assessed, comparisons of the species-specific reads with the total number of reads and with fungus-specific reads and bacteria-specific reads, respectively, led to slightly different results. For example, there were matches above the 2nd standard deviation for Cladophialophora psammophila compared with the total number of reads and for both Cladophialophora psammophila and Chaetomium globosum compared with the fungus-specific reads in a sample with the histological diagnosis of chromoblastomycosis. Such differences are mathematical artifacts resulting from slightly different proportions of fungus-specific reads in the different sample materials. Such examples demonstrate the vulnerability of the model, which is a particular problem with low sample numbers when slight variances show large effects.
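The effect of the denominator can be made concrete with a small sketch. The read counts below are invented for illustration, and the helper name `percentages` is hypothetical. The point is only that the same species-specific read counts can rank samples differently depending on whether they are related to all reads or to fungal reads alone.

```python
def percentages(species_reads, denominators):
    """Species-specific reads as a percentage of a per-sample denominator."""
    return [100.0 * s / d for s, d in zip(species_reads, denominators)]

# Hypothetical counts for two samples, A and B
species_reads = [10, 20]               # reads assigned to one species
total_reads = [1_000_000, 1_000_000]   # all reads per sample
fungal_reads = [1_000, 10_000]         # reads assigned to any fungus

pct_of_total = percentages(species_reads, total_reads)
pct_of_fungal = percentages(species_reads, fungal_reads)

# Index of the sample with the highest percentage under each denominator
top_by_total = pct_of_total.index(max(pct_of_total))
top_by_fungal = pct_of_fungal.index(max(pct_of_fungal))
print(top_by_total, top_by_fungal)
```

Sample B stands out relative to all reads, whereas sample A stands out relative to fungal reads, so a threshold based on standard deviations can flag different samples under the two comparisons.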

An undeniable limitation of the standard deviation-based approach is the fact that the reliability of the proxy-negative control will depend on the number of assessed samples. However, subtraction-based approaches [ 15 ] are also susceptible to the problem of sample numbers in excluding major effects of variations by chance.

It is likely that the variety of anatomical source sites might influence the quality of the proxy-negative control. The fact that samples from primarily sterile body compartments were also severely contaminated with DNA of various non-human species suggests that the effects of procedures subsequent to sample acquisition, e.g., during processing, paraffination and storage, were more relevant to the measured contamination than was the anatomical sampling site. Accordingly, the anatomical site was not specifically considered in the definition of the proxy-negative control for the formalin-fixed, paraffin-embedded tissue samples that were assessed. For medical interpretation of the diagnostic NGS results, however, the natural occurrence of environmental microorganisms on primarily non-sterile sampling sites has to be considered. Thus NGS cannot do away with the need for medical validation and interpretation of diagnostic findings.

No target enrichment, e.g., by specific PCR, was attempted or evaluated because the performance of diagnostic NGS without specific suspicion was being assessed. Depletion of human DNA prior to the NGS runs was also not attempted, because the initial DNA quantities in the historical samples were so low that the appropriate technical strategies might also have affected the recovery of the residual target DNA. As an example of this concern over sensitivity, pro-viral DNA of HIV that would be anticipated to be present was never detected in any sample of the patients with invasive and tropical mycoses. The sensitivity concern is of particular importance, because various matches with the histological diagnoses were achieved with just the standard deviation-based approach for the attribution of etiological relevance, while the total numbers of specific reads were very low. In contrast, etiologically irrelevant environmental fungi dominated among the most frequently detected fungal reads in nearly all samples assessed.

Another pointer toward unlikely etiological relevance but increased likelihood of contamination is the frequent detection of very rare pathogens in various samples. An example is the frequent detection of Cladophialophora yegressii , which lives on living cactus plants [ 45 ]. Although Cladophialophora spp. can in rare cases be associated with human disease, i.e. chromoblastomycosis [ 45 ], the frequent occurrence of comparably high DNA concentrations in samples without any histological indications for chromoblastomycosis makes it more likely that there was contamination deriving from cactus plants in the diagnostic institute.

Further, interpretation can be difficult if increased quantities of sequences of a species are detected which has rarely or never been associated with clinical disease so far. Cryptococcus carnescens is such an example. C. carnescens is part of the Cryptococcus laurentii complex [ 46 ]. In a recent review on non-neoformans cryptococcal infections, only 20 cases of infection with C. laurentii complex were reported [ 47 ] and those were without detailed differentiation within the complex. The etiological relevance of the C. carnescens sequences, which were identified by NGS in sample 5 of a patient with the histological diagnosis of histoplasmosis or cryptococcosis, is therefore uncertain.

Although potentially useful diagnostic information for 5 out of 17 samples from patients with invasive fungal infection (29.4%) and for 1 out of 6 samples from patients with invasive amebiasis (16.7%) represents only a modest result, this result must be interpreted in relation to the complexity of the sample materials. The sensitivity of the procedure is, undeniably, still unacceptably poor. In comparison, the molecular gold standard method of pan-fungal PCRs with subsequent Sanger sequencing allowed conclusive detection of pathogens in only 2 out of 17 fungal samples (11.8%) and even that only in 3 out of 10 PCR reactions for those 2 samples [ 13 ]. In contrast, NGS analysis not only allowed confirmation of the pan-fungal PCR detections of Histoplasma capsulatum and Madurella mycetomatis but also gave hints of infections due to Rhizopus spp., Cryptococcus spp., and Fusarium spp. Particularly for assignments at genus and species levels, histology showed limited value for the diagnosis of invasive fungal infections [ 36 , 42 ], as in the study described here. For the detection of Entamoeba histolytica in intestinal biopsies, however, specific PCR proved to be superior to NGS analysis.
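The detection rates quoted in this paragraph can be reproduced from the reported counts. The helper name `percent` is illustrative only.

```python
def percent(hits, total):
    """Proportion of samples with a conclusive result, in percent (1 decimal)."""
    return round(100 * hits / total, 1)

ngs_fungal = percent(5, 17)  # NGS, samples with invasive fungal infection
ngs_ameba = percent(1, 6)    # NGS, samples with invasive amebiasis
pcr_fungal = percent(2, 17)  # pan-fungal PCR with Sanger sequencing
print(ngs_fungal, ngs_ameba, pcr_fungal)
```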

Accordingly, NGS analysis can help to improve the molecular discrimination of fungal pathogens in formalin-fixed, paraffin-embedded tissues in comparison with contamination-sensitive pan-fungal PCR with subsequent Sanger sequencing. However, the sensitivity appears inferior to that of specific PCR approaches, as the experiments with the ameba-containing samples suggest. For the invasive fungi, however, quality-controlled specific PCRs were available only for histoplasmosis and mucormycosis in the laboratories of the study participants. Specific analysis for all fungal pathogens could therefore not be performed—an admitted limitation of the study.

Focusing on samples for which results of specific PCR and Sanger sequencing were available, it is interesting that PCR with subsequent Sanger sequencing suggested Lichtheimia / Absidia corymbifera while NGS gave strong hints for Rhizopus oryzae in sample 4 of a patient with mucormycosis. Preferential amplification of Lichtheimia / Absidia corymbifera DNA by the PCR primers is a likely explanation, while the more abundant Rhizopus oryzae-specific DNA was identified by NGS. Preferential binding of multispecies primers to certain microorganisms is a well-known problem affecting amplification-based diagnostic approaches [ 48 ].

With focus on the hypothesis of the study, it could be shown that hypothesis-free genomic detection of rare invasive infections by NGS in poly-microbially contaminated, formalin-fixed, paraffin-embedded tissue samples is feasible and can provide hints on likely causative agents. Considering the cost of the technique, the demanding technical and bioinformatic procedures, and the uncertainties regarding the interpretation of the results, the technique at present is still subordinate in the diagnostic workflow and should be only considered if other, less demanding procedures do not lead to conclusive results.

It should be noted that assignment of potential etiological relevance based on a percentage of specific NGS reads is far from being standardized and requires further evaluation. Among other factors, the choice of the number of negative control samples in the calculation of the average of the percentage values of reads will necessarily have an impact on the size of the standard deviation and thus on the potential attribution of etiologic relevance in contaminated sample materials. So, standardization prior to diagnostic use is obligatory. From this perspective, the results presented here can only be considered as hypothesis-forming. Further studies are needed to define standards for medical interpretation of NGS-based pathogen identification directly from sample material. This applies even more strongly for contamination-prone sample materials such as formalin-fixed, paraffin-embedded tissue samples.
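One way to see how strongly the number of samples constrains this approach is the upper bound known as Samuelson's inequality. It applies if the sample standard deviation (n - 1 denominator) is computed from the same n samples that are being tested: no single value can then lie more than (n - 1)/√n standard deviations above the mean. The sketch below checks this numerically. Whether the study used the sample or the population standard deviation is not stated, so this choice is an assumption, and the function names are hypothetical.

```python
import math
import statistics

def max_sd_multiple(n):
    """Upper bound on (x - mean) / SD for one value among n values,
    assuming the sample SD (n - 1 denominator)."""
    return (n - 1) / math.sqrt(n)

def observed_multiple(n, outlier=1.0):
    """Extreme case: one sample carries all specific reads, the rest none."""
    values = [0.0] * (n - 1) + [outlier]
    return (outlier - statistics.mean(values)) / statistics.stdev(values)

# 6 ameba samples, 17 fungal samples, 23 samples in total
for n in (6, 17, 23):
    print(n, round(max_sd_multiple(n), 2))
```

Under this assumption, a 4th standard deviation threshold can be exceeded only when the mean and standard deviation are calculated over enough samples: it is reachable with all 23 samples pooled but not with 17 or fewer.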

For such contamination-prone sample materials, there is considerable risk of false-positive spurious results, e.g., in case of contamination events that are restricted to the processing of individual samples. Such events cannot be controlled by the proxy-negative control-based standard deviation approach. Accordingly, the procedure we have introduced can only lead to hypothesis-forming results that will induce the clinician in charge to consider as differential diagnoses clinically matching infectious diseases that had not been considered prior to the non-specific NGS assessment. Without consideration of the clinical findings, the NGS results from such materials are not interpretable. If these limitations are accepted, however, NGS can help to suggest infectious agents as potentially etiologically relevant that were not considered during the initial clinical assessment of a patient. With this aim, the technique can be applied in situations when there are no clear candidates in the potential etiological background of clinical situations in infectious disease patients.

In conclusion, molecular diagnostic approaches from complex and potentially contaminated sample materials such as formalin-fixed, paraffin-embedded tissues remain a challenge. As in previous studies [14], potentially etiologically relevant species that could not be detected by traditional molecular analysis were identified by NGS. The findings suggest that NGS-based diagnostics are suitable for materials taken under sterile precautions from primary sterile compartments of the body, even without a specific etiologic suspicion.

A major disadvantage of pathogen-nonspecific NGS analysis remains the low sensitivity in comparison with specific PCR, which was confirmed by the example of the ameba samples and which was also observed by other authors [49]. Another disadvantage, as addressed in this study, is the high susceptibility to contamination that is frequently observed in formalin-fixed, paraffin-embedded samples. It must further be considered that the procedure described is both expensive and time-consuming. The cost of the reagents employed for this proof-of-principle assessment was about 50,000 euros, apart from payment for the scientists and technical assistants. The diagnostic approaches including the nucleic acid processing, the NGS runs, the programming and application of the required bioinformatics, and the interpretation of the data occupied several weeks. Both the costs and the long time-to-result will impede the use of the procedure for routine-diagnostic purposes in the near future.

Following this “proof-of-principle” study, validations with larger numbers of samples should be performed to define reliable standards for the discrimination of the detection of etiologically relevant pathogens from the detection of nucleic acid contamination, in particular from difficult sample materials.

Sample materials

The materials assessed comprised residual extracted nucleic acids from two previous studies [7, 13], that is, from 17 FFPE tissue samples with histological evidence of invasive mycosis by tropical or rare fungi and 34 samples from patients with invasive amebiasis. Information on the applied nucleic acid extraction procedures is summarized in Table 8.

As detailed below, only 6 of the 34 amebic samples of the original collection [7] could be included in the NGS assessment on the grounds of sufficient quality and quantity of the DNA. The histopathological diagnoses of the patients with invasive mycoses were chromoblastomycosis (n = 3), coccidioidomycosis (n = 2), histoplasmosis (n = 4), histoplasmosis or cryptococcosis with histologically difficult discriminability (n = 1), mucormycosis (n = 2), mycetoma (n = 3), and rhinosporidiosis (n = 2), as detailed elsewhere [13] (Table 9).

Cultural and serological diagnostic results were not available. Further, there were no data on previous molecular diagnostic assessment from the time of sample acquisition or on microscopical assessments from other sample materials, e.g., for ova or parasites from stool samples.

The sample collection included biopsies of the intestinal mucosa (n = 6), lymph node tissues (n = 2), skin biopsies (n = 6), bioptic material from a nasal polyp (n = 1), cells from vaginal discharge (n = 1), lung tissue (n = 2), bone, muscle, and connective tissue from the spinous process of the third thoracic vertebra (n = 1), tissue of a tricuspid valve (n = 1), a lower lip biopsy (n = 1), an ethmoid sinus biopsy (n = 1), and bioptic material from a wound on a foot (n = 1). The sample age at the time of nucleic acid extraction varied between 1 and 31 years with a mean of 11.5 years (± 6.1) in a left-shifted distribution for the fungal samples and with a mean of 25.8 years (± 4.3) in a right-shifted distribution for the ameba-containing samples. All samples had been stored in the Department of Pathology of the Bernhard Nocht Institute for Tropical Medicine in Hamburg, Germany.

Nucleic acid extraction procedures have been described elsewhere [7, 13]. Photometric nucleic acid quantification was done as described [13] with a Pico 100 Picodrop Microliter Spectrophotometer (Picodrop Ltd., Hinxton, UK) and indicated DNA concentrations of 123.6 (± 166.5) ng/μl for the fungal samples and 25.1 (± 30.2) ng/μl for the amebic samples.

In addition to microscopic assessment [7, 13], all amebic samples had been assessed by E. histolytica-specific PCR [7], while the fungal samples had been characterized by five different pan-fungal PCRs with subsequent Sanger sequencing as well as Histoplasma spp.-specific and Mucorales-specific PCR, the latter also with Sanger sequencing [13] (Table 10). If pan-fungal PCR detected sequences of environmental fungi that contradicted the histological diagnosis, or if sequence overlays made the Sanger sequencing results of pan-fungal PCR products uninterpretable, contamination of the samples with environmental fungi was assumed. Table 10 lists all primers and probes used, including those of the inhibition control PCR and the sample quality control PCR. No relevant PCR inhibition was observed in the samples, as detailed elsewhere [7, 13]. Negative controls using PCR-grade water, which also covered the nucleic acid extraction procedure and the master mixes, were included in the PCR procedures. However, only the sample materials were assessed by NGS.

NGS and bioinformatics

Nonspecific NGS of the DNA within the samples was performed by an experienced medical-laboratory assistant using a MiSeq system (Illumina, San Diego, CA, USA) as described by the manufacturer. No target enrichment or human DNA depletion was attempted. In summary, DNA libraries were prepared using TruSeq® Nano DNA Sample Preparation kits (Illumina) employing the low sample (LS) protocol. Briefly, 100 ng of genomic DNA from each sample was fragmented by Adaptive Focused Acoustics™ Technology (Covaris, Inc., Woburn, MA, USA) using a Covaris M220 with settings for fragment sizes in the 350 bp range (duty factor 20%, peak incident power 50 W, cycles per burst 200, duration 65 s, temperature 65 °C). Fragmented chromosomal DNA was cleaned up with bead technology. End repair was performed according to the TruSeq protocols. Further clean-up and size selection were done with bead technology. 3′-Ends were adenylated, Illumina adapters were ligated, and DNA fragments were enriched. An Agilent DNA 7500 kit (Agilent Technologies, Inc., Santa Clara, CA, USA) was used as a quality check and to confirm the intended fragment size after the Covaris M220 fragmentation protocol and after Illumina adapter ligation. Visualization of a clearly defined peak in the expected size range was considered proof of successful DNA fragmentation and adapter ligation. If no peak was visible after the fragmentation protocol, the sample was excluded from downstream processing. Only samples with clearly visible peaks in the expected size range, both after fragmentation and after adapter ligation, were sequenced. No concentration determination by integrating the area under the peak was performed, because this was considered neither reliable nor sensitive enough for sequencing. Actual library DNA concentration measurements were performed using Qubit dsDNA BR assay kits (Thermo Fisher Scientific, Waltham, MA, USA) prior to loading the sequencing cells.

All 17 fungal samples were included in the further analysis, whereas only 6 amebiasis samples showed visible DNA peaks and could thus be included. The remaining 28 amebiasis samples, which lacked visible DNA peaks in the Agilent system, were discarded. Each individual library was adjusted to a 4 nmol/L stock solution, of which 6 pmol was used for each individual sequencing run. Sequencing was performed using MiSeq® Reagent Kit v3 (600 cycle) runs (Illumina), with a complete v3 run used per sample. Between 3 million and 23 million read pairs were sequenced per sample. Considering an average trimmed read length of 250 bp (base pairs) and an assumed target genome of 30 Mb (megabases), this would result in theoretical coverages in the range of 45-fold to 384-fold.
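The coverage estimate above follows from simple arithmetic: total sequenced bases divided by the assumed genome size. The following minimal sketch illustrates it; the function name and the rounded read-pair counts are assumptions made for illustration. With the rounded counts of 3 and 23 million read pairs it gives roughly 50-fold and 383-fold, so the reported range of 45-fold to 384-fold presumably reflects the exact per-sample read counts, which are not listed here.

```python
def theoretical_coverage(read_pairs, read_length_bp=250, genome_size_bp=30_000_000):
    """Fold coverage if all reads originated from one genome of the given size.

    read_pairs      number of sequenced read pairs (two mates per pair)
    read_length_bp  average trimmed read length (250 bp in the text)
    genome_size_bp  assumed target genome size (30 Mb in the text)
    """
    total_bases = read_pairs * 2 * read_length_bp
    return total_bases / genome_size_bp


# Rounded lower and upper bounds of the sequencing depth given in the text.
low = theoretical_coverage(3_000_000)
high = theoretical_coverage(23_000_000)
print(f"approx. {low:.0f}-fold to {high:.0f}-fold")
```

This is an upper bound for any single organism, because in a nonspecific run most reads are expected to come from human DNA rather than from the pathogen.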

The bioinformatic processing of the resulting files was performed at the Max Planck Institute for Heart and Lung Research in Bad Nauheim, Germany. Paired-end reads were trimmed for adapter sequences using Cutadapt 1.41 [50]. All TruSeq adapters were removed using default parameters. The resulting reads were further trimmed and filtered for quality using Trimmomatic 0.33 (LEADING:3, TRAILING:3, SLIDINGWINDOW:5:20, MINLEN:30) [51]. Start and end of reads were always trimmed by a fixed number of 3 nucleotides to remove pervasive low-quality data. Furthermore, reads were trimmed after a drop in average quality below Q20 in a window of 5 nucleotides. A read pair was cleared for further analyses only if both mates still contained more than 30 nucleotides after this step. Kraken version 0.10.6 was employed to classify metagenomic reads based on matching 31-mers with a confidence threshold of 0.1 [52]. The reference database consisted of genomes (*.genomic.fna.gz) of all bacteria, viruses, fungi, and protozoa, as well as Homo sapiens, available from the RefSeq database on September 14, 2015 (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/). The classification was visualized using Krona 2.6 [53].
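The sliding-window quality trimming and the paired length filter described above can be illustrated with a short sketch. It is a simplified illustration of the idea, not Trimmomatic's actual implementation; the function names are hypothetical, and treating the length cut-off as "at least 30 nucleotides" is an assumption about the boundary case.

```python
def sliding_window_trim(qualities, window=5, min_avg=20):
    """Return the number of leading bases to keep.

    Scans the read from its 5' end and cuts at the start of the first
    window whose mean Phred quality is below min_avg. Reads shorter than
    the window are returned unchanged (a simplification).
    """
    n = len(qualities)
    for start in range(n - window + 1):
        if sum(qualities[start:start + window]) / window < min_avg:
            return start
    return n


def pair_passes(len_mate1, len_mate2, min_len=30):
    """Keep a read pair only if both trimmed mates reach min_len.

    Assumption: the cut-off is inclusive (length >= min_len).
    """
    return len_mate1 >= min_len and len_mate2 >= min_len


# Hypothetical read: 40 high-quality bases followed by 10 low-quality bases.
quals = [35] * 40 + [10] * 10
kept = sliding_window_trim(quals)
print(kept, pair_passes(kept, 50))
```

The cut falls slightly before the first low-quality base because the window average already drops below Q20 once most of the window lies in the low-quality tail.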

The analyzed sequence data sets were deposited in the Sequence Read Archive (SRA) under study accession SRP091494. In addition, the data sets can be provided by Bernd Kreikemeyer on request.

The NGS results were visualized by means of the Krona software [53] and initially analyzed according to the following criteria: total number of reads (single sequence fragments); number and percentage of reads that could not be assigned against the NCBI RefSeq database; and finally the percentages of human, protozoan, bacterial, archaeal, viral, and fungal sequences. Among the fungal sequence reads, the three most frequent fungal species were identified in each sample material.
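The per-sample summary described above amounts to converting classified read counts into percentages and ranking fungal species by read count. A minimal sketch follows; the function names and all read counts are hypothetical and serve only to illustrate the calculation, not to reproduce any result of the study.

```python
from collections import Counter


def category_percentages(read_counts):
    """Convert read counts per category into percentages of all reads."""
    total = sum(read_counts.values())
    return {category: 100.0 * n / total for category, n in read_counts.items()}


def top_fungal_species(species_counts, n=3):
    """Return the n fungal species with the most assigned reads."""
    return Counter(species_counts).most_common(n)


# Hypothetical counts for one sample (not data from the study).
counts = {
    "human": 9_000_000, "bacterial": 600_000, "fungal": 50_000,
    "protozoan": 10_000, "viral": 5_000, "archaeal": 1_000,
    "unassigned": 334_000,
}
fungi = {
    "Rhizopus oryzae": 30_000, "Lichtheimia corymbifera": 1_200,
    "Aspergillus fumigatus": 800, "Candida albicans": 300,
}

for category, pct in category_percentages(counts).items():
    print(f"{category}: {pct:.2f}%")
print(top_fungal_species(fungi))
```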

All samples were also searched for specific sequences that could be assigned by the software Kraken to pathogens that were potentially relevant as causal agents for the 23 study materials and their histologically diagnosed pathologies. In this assessment, disease patterns that are defined by their pathogens, such as E. histolytica-associated amebiasis, histoplasmosis (caused by Histoplasma capsulatum), cryptococcosis (caused by Cryptococcus spp.), and coccidioidomycosis (caused by Coccidioides spp.), were relatively easy to assign. Some of the other invasive mycoses studied, in contrast, can be caused by a wide variety of potential pathogens. Because a possible etiologic relevance of E. dispar in invasive amebiasis has been discussed repeatedly [54, 55, 56, 57], this species was also included in the evaluation.

Potentially relevant pathogens for poly-causal infectious diseases [58, 59, 60, 61] are summarized in Table 11.

The percentage of pathogen-specific sequence fragments (paired reads) was calculated in relation to all reads in the sample and, if applicable, also in relation to the fungus/bacteria-associated reads. To distinguish read frequencies of diagnostic significance from DNA contamination, i.e., influx from the environment during sample preparation, the mean of the percentages of pathogen-specific reads was calculated across all samples. It was then established for which individual samples the detected percentage of pathogen-specific reads exceeded the mean of all examined histologically positive and negative samples by more than 1, 2, 3, or 4 standard deviations (as potential cut-off values). In this context, “negative samples” are samples with histological results that do not match the respective NGS-based identified pathogen. For samples with larger-than-average proportions of specific pathogen sequences, the NGS-based diagnosis was compared with the histological diagnosis to examine the diagnostic reliability of such standard deviation–based thresholds.
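The standard deviation–based cut-offs described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the function name and the read percentages are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def sd_exceedance(percentages, max_k=4):
    """For each sample, return the largest k (0..max_k) such that its
    percentage of pathogen-specific reads exceeds mean + k * SD.

    The mean and SD are computed over all samples, histologically
    positive and negative alike, as described in the text.
    """
    values = list(percentages.values())
    mu = mean(values)
    sd = stdev(values)  # sample SD; an assumption, see lead-in
    result = {}
    for sample, pct in percentages.items():
        k = 0
        while k < max_k and pct > mu + (k + 1) * sd:
            k += 1
        result[sample] = k
    return result


# Hypothetical percentages of reads assigned to one pathogen per sample.
pathogen_pct = {
    "s1": 0.01, "s2": 0.02, "s3": 0.01,
    "s4": 0.015, "s5": 0.02, "s6": 0.9,
}
print(sd_exceedance(pathogen_pct))
```

In this toy data set the outlying sample exceeds the mean by more than two but fewer than three standard deviations, because a single extreme value also inflates the standard deviation it is compared against. The achievable threshold therefore depends on how many samples enter the calculation.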

Ethical approval for this retrospective study using residual materials was granted by the Ethics Committee of the Medical Association of Hamburg (document number WF-028/13) in line with national and ICH-GCP guidelines. Consent for the anonymous use of the materials was not demanded by the ethics committee. In detail, because the anonymized samples cannot be assigned to a human being, the project did not constitute a research project on humans according to the definitions of § 9 (2) of the Hamburg Medical Association Act for health professions and was also not restricted by § 15 (1) of the Professional Regulations for physicians in Hamburg, Germany.

References

Mauch H. Einfluss von Lagerung und Transport auf das Überleben von Infektionserregern. Bedeutung für eine schnelle mikrobiologische Diagnostik, die Ökonomie und den Patienten. Mikrobiologe. 2008;18:200–4.

Hagen RM, Gauthier YP, Sprague LD, Vidal DR, Zysk G, Finke EJ, et al. Strategies for PCR based detection of Burkholderia pseudomallei DNA in paraffin wax embedded tissues. Mol Pathol. 2002;55:398–400.

Quach N, Goodman MF, Shibata D. In vitro mutation artifacts after formalin fixation and error prone translesion synthesis during PCR. BMC Clin Pathol. 2004;4:1–5.

Huijsmans CJJ, Damen J, van der Linden JC, Savelkoul PHM, Hermans MHA. Comparative analysis of four methods to extract DNA from paraffin-embedded tissues: effect on downstream molecular applications. BMC Res Notes. 2010;3:239.

Lu K, Ye W, Zhou L, Collins LB, Chen X, Gold A, et al. Structural characterization of formaldehyde-induced cross-links between amino acids and desoxynucleotides and their oligomers. J Am Chem Soc. 2010;132:3388–99.

Dietrich D, Uhl B, Sailer V, Holmes E, Jung M, Meller S, et al. Improved PCR performance using template DNA from formalin-fixed and paraffin-embedded tissues by overcoming PCR inhibition. PLoS One. 2013;8:e77771.

Frickmann H, Tenner-Racz K, Eggert P, Schwarz NG, Poppert S, Tannich E, et al. Influence of parasite density and sample storage time on the reliability of Entamoeba histolytica-specific PCR from formalin-fixed and paraffin-embedded tissues. Diagn Mol Pathol. 2013;22:236–44.

Oosterheert JJ, van Loon AM, Schuurman R, Hoepelman AI, Hak E, Thijsen S, et al. Impact of rapid detection of viral and atypical bacterial pathogens by real-time polymerase chain reaction for patients with lower respiratory tract infection. Clin Infect Dis. 2005;41:1438–44.

Karger BL, Guttman A. DNA sequencing by capillary electrophoresis. Electrophoresis. 2009;30:S196–202.

Reischl U, Drosten C, Geißdörfer W, Göbel U, Hoffmann KS, Mauch H, et al. In: Podbielski A, Herrmann M, Kniehl E, Mauch H, Rüssmann H, editors. MiQ 1. Mikrobiologisch-infektiologische Qualitätsstandards. Nukleinsäure-Amplifikationstechniken (NAT). 3rd ed. München & Jena: Urban & Fischer; 2011. p. 51–3.

Habib G, Lancellotti P, Antunes MJ, Bongiorni MG, Casalta JP, Del Zotti F, et al. 2015 ESC guidelines for the management of infective endocarditis: the task force for the Management of Infective Endocarditis of the European Society of Cardiology (ESC). Endorsed by: European Association for Cardio-Thoracic Surgery (EACTS), the European Association of Nuclear Medicine (EANM). Eur Heart J. 2015;36:3075–128.

Frickmann H, Dekker D, Schwarz NG, Hahn A, Boahen K, Sarpong N, et al. 16S rRNA gene sequence-based identification of bacteria in automatically incubated blood culture materials from tropical sub-Saharan Africa. PLoS One. 2015;10:e0135923.

Frickmann H, Loderstaedt U, Racz P, Tenner-Racz K, Eggert P, Haeupler A, et al. Detection of tropical fungi in formalin-fixed, paraffin-embedded tissue: still an indication for microscopy in times of sequence-based diagnosis? Biomed Res Int. 2015;2015:938721.

Bhatt AS, Freeman SS, Herrera AF, Pedamallu CS, Gevers D, Duke F, et al. Sequence-based discovery of Bradyrhizobium enterica in cord colitis syndrome. N Engl J Med. 2013;369:517–28.

Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, et al. A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. JAMA. 2013;309:1502–10.

Fischer N, Rohde H, Indenbirken D, Günther T, Reumann K, Lütgehetmann M, et al. Rapid metagenomic diagnostics for suspected outbreak of severe pneumonia. Emerg Infect Dis. 2014;20:1072–5.

Carrick DM, Mehaffey MG, Sachs MC, Altekruse S, Camalier C, Chuaqui R. Robustness of next generation sequencing in older formalin-fixed, paraffin-embedded tissue. PLoS One. 2015;10:e0127353.

Ambulos NP Jr, Schumaker LM, Mathias TJ, White R, Troyer J, Wells D, et al. Next-generation sequencing-based HPV genotyping assay validated in formalin-fixed, paraffin-embedded oropharyngeal and cervical cancer specimens. J Biomol Tech. 2016;27:46–52.

Drmanac S, Kita D, Labat I, Hauser B, Schmidt C, Burczak JD, et al. Accurate sequencing by hybridization for DNA diagnostics and individual genomics. Nat Biotechnol. 1998;16:54–8.

Ronaghi M, Pettersson B, Uhlén M, Nyrén P. PCR-introduced loop structure as primer in DNA sequencing. Biotechniques. 1998;25:876–8, 880–2, 884.

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.

Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309:1728–32.

Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol. 2008;26:1146–53.

Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I, et al. Single-molecule DNA sequencing of a viral genome. Science. 2008;320:106–9.

Turcatti G, Romieu A, Fedurco M, Tairi AP. A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Res. 2008;36:e25.

Hasman H, Saputra D, Sicheritz-Ponten T, Lund O, Svendsen CA, Frimodt-Møller N, et al. Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples. J Clin Microbiol. 2014;52:139–46.

Kommedal Ø, Wilhelmsen MT, Skrede S, Meisal R, Jakovljev A, Gaustad P, et al. Massive parallel sequencing provides new perspectives on bacterial brain abscesses. J Clin Microbiol. 2014;52:1990–7.

Wilson MR, Naccache SN, Samayoa E, Biagtan M, Bashir H, Yu G, et al. Actionable diagnosis of neuroleptospirosis by next-generation sequencing. N Engl J Med. 2014;370:2234–44.

Fischer N, Indenbirken D, Meyer T, Lütgehetmann M, Lellek H, Spohn M, et al. Evaluation of unbiased next-generation sequencing of RNA (RNA-seq) as a diagnostic method in influenza virus-positive respiratory samples. J Clin Microbiol. 2015;53:2238–50.

Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30:434–9.

Sherry NL, Porter JL, Seemann T, Watkins A, Stinear TP, Howden BP. Outbreak investigation using high-throughput genome sequencing within a diagnostic microbiology laboratory. J Clin Microbiol. 2013;51:1396–401.

Varanat M, Maggi RG, Linder KE, Horton S, Breitschwerdt EB. Cross-contamination in the molecular detection of Bartonella from paraffin-embedded tissues. Vet Pathol. 2009;46:940–4.

Mena M, Lloveras B, Tous S, Bogers J, Maffini F, Gangane N, et al. Development and validation of a protocol for optimizing the use of paraffin blocks in molecular epidemiological studies: the example from the HPV-AHEAD study. PLoS One. 2017;12:e0184520.

Einsele H, Hebart H, Roller G, Löffler J, Rothenhöfer I, Müller CA, et al. Detection and identification of fungal pathogens in blood by using molecular probes. J Clin Microbiol. 1997;35:1353–60.

White PL, Barton R, Guiver M, Linton CJ, Wilson S, Smith M, et al. A consensus on fungal polymerase chain reaction diagnosis? A United Kingdom-Ireland evaluation of polymerase chain reaction methods for detection of systemic fungal infections. J Mol Diagn. 2006;8:376–84.

Balajee SA, Sigler L, Brandt ME. DNA and the classical way: identification of medically important molds in the 21st century. Med Mycol. 2007;45:475–90.

Hata DJ, Buckwalter SP, Pritt BS, Roberts GD, Wengenack NL. Real-time PCR method for detection of zygomycetes. J Clin Microbiol. 2008;46:2353–8.

Khot PD, Ko DL, Hackman R, Fredricks DN. Development and optimization of quantitative PCR for the diagnosis of invasive aspergillosis with bronchoalveolar lavage fluid. BMC Infect Dis. 2008;8:73.

Hrncirova K, Lengerova M, Kocmanova I, Racil Z, Volfova P, Palousova D, et al. Rapid detection and identification of mucormycetes from culture and tissue samples by use of high-resolution melt analysis. J Clin Microbiol. 2010;48:3392–4.

Hammond SP, Bialek R, Milner DA, Petschnigg EM, Baden LR, Marty FM. Molecular methods to improve diagnosis and identification of mucormycosis. J Clin Microbiol. 2011;49:2151–3.

Fleischhacker M, Schulz S, Jöhrens K, von Lilienfeld-Toal M, Held T, Fietze E, et al. Diagnosis of chronic disseminated candidosis from liver biopsies by a novel PCR in patients with haematological malignancies. Clin Microbiol Infect. 2012;18:1010–6.

Sangoi AR, Rogers WM, Longacre TA, Montoya JG, Baron EJ, Banaei N. Challenges and pitfalls of morphological identification of fungal infections in histologic and cytologic specimens. A ten-year retrospective review at a single institution. Am J Clin Pathol. 2009;131:364–75.

Munoz-Cadavid C, Rudd S, Zaki SR, Patel M, Moser SA, Brandt ME, et al. Improving molecular detection of fungal DNA in formalin-fixed paraffin-embedded tissues: comparison of five tissue DNA extraction methods using panfungal PCR. J Clin Microbiol. 2010;48:2147–53.

Obersteller S, Neubauer H, Hagen RM, Frickmann H. Comparison of five commercial nucleic acid extraction kits for the PCR-based detection of Burkholderia pseudomallei in formalin-fixed, paraffin-embedded tissues. Eur J Microbiol Immunol (Bp). 2016;6:244–52.

de Hoog GS, Nishikaku AS, Fernandez-Zeppenfeldt G, Padín-González C, Burger E, Badali H, et al. Molecular analysis and pathogenicity of the Cladophialophora carrionii complex, with the description of a novel species. Stud Mycol. 2007;58:219–34.

Takashima M, Sugita T, Shinoda T, Nakase T. Three new combinations from the Cryptococcus laurentii complex: Cryptococcus aureus, Cryptococcus carnescens and Cryptococcus peneaus. Int J Syst Evol Microbiol. 2003;53(Pt 4):1187–94.

Khawcharoenporn T, Apisarnthanarak A, Mundy LM. Non-neoformans cryptococcal infections: a systematic review. Infection. 2007;35:51–8.

Junier P, Kim OS, Hadas O, Imhoff JF, Witzel KP. Evaluation of PCR primer selectivity and phylogenetic specificity by using amplification of 16S rRNA genes from betaproteobacterial ammonia-oxidizing bacteria in environmental samples. Appl Environ Microbiol. 2008;74:5231–6.

Köser CU, Fraser LJ, Ioannou A, Becq J, Ellington MJ, Holden MT, et al. Rapid single-colony whole-genome sequencing of bacterial pathogens. J Antimicrob Chemother. 2014;69:1275–81.

Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10–2.

Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.

Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46.

Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a web browser. BMC Bioinf. 2011;12:385.

Shibayama M, Dolabella SS, Silva EF, Tsutsumi V. A Brazilian species of Entamoeba dispar (ADO) produces amoebic liver abscess in hamsters. Ann Hepatol. 2007;6:117–8.

Costa CA, De Brito KN, Gomes MA, Caliari MV. Histopathological and immunohistochemical study of the hepatic lesions experimentally induced by Entamoeba dispar. Eur J Histochem. 2010;54:e39.

Costa CA, Fonseca TH, Oliveira FM, Santos JF, Gomes MA, Caliari MV. Influence of inflammation on parasitism and area of experimental amoebic liver abscess: an immunohistochemical and morphometric study. Parasit Vectors. 2011;4:27.

Dolabella SS, Serrano-Luna J, Navarro-García F, Cerritos R, Ximénez C, Galván-Moroyoqui JM, et al. Amoebic liver abscess production by Entamoeba dispar. Ann Hepatol. 2012;11:107–17.

Sybren De Hoog G, Vitale RG. Bipolaris, Exophiala, Scedosporium, Sporothrix, and other dematiaceous fungi. In: Murray PR, Baron EJ, Jorgensen JH, Landry ML, Pfaller MA, editors. Manual of Clinical Microbiology. 9th ed. Washington, DC: ASM Press; 2007. p. 1899–900.

Richardson MD, Koukila-Kahkola P. Rhizopus, Rhizomucor, Absidia, and other agents of systemic and subcutaneous zygomycoses. In: Murray PR, Baron EJ, Jorgensen JH, Landry ML, Pfaller MA, editors. Manual of Clinical Microbiology. 9th ed. Washington, DC: ASM Press; 2007. p. 1840.

Wieser A, Löscher T, Huber K. Das Mycetom – diagnostische Herausforderung in Zeiten der globalen Migration. Mikrobiologe. 2015;3:107–12.

Mendoza L. Lacazia, Pythium, and Rhinosporidium. In: Murray PR, Baron EJ, Jorgensen JH, Landry ML, Pfaller MA, editors. Manual of Clinical Microbiology. 9th ed. Washington, DC: ASM Press; 2007. p. 1937.

Bialek R, Feucht A, Aepinus C, Just-Nübling G, Robertson VJ, Knobloch J, et al. Evaluation of two nested PCR assays for detection of Histoplasma capsulatum DNA in human tissue. J Clin Microbiol. 2002;40:1644–7.

Bialek R, Konrad F, Kern J, Aepinus C, Cecenas L, Gonzalez GM, et al. PCR based identification and discrimination of agents of mucormycosis and aspergillosis in paraffin wax embedded tissue. J Clin Pathol. 2005;58:1180–4.

Acknowledgments

Annett Michel and Simone Priesnitz are gratefully acknowledged for excellent technical assistance.

The next generation sequencing (NGS) analyses were funded by grant 12 K2-S-451315 “Optimizing of microbiological diagnostic pre-analytics for tropical conditions” of the German Ministry of Defense (MoD) awarded to Hagen Frickmann and Ralf Matthias Hagen. The purchase of the Illumina MiSeq system was kindly supported by the EU-EFRE (European Funds for Regional Development) program and funds from the University Medicine Rostock awarded to Bernd Kreikemeyer. The funding body did not have any role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the Sequence Read Archive (SRA) under study accession SRP091494. In addition, the data sets can be provided by Bernd Kreikemeyer on request.

Author information

Authors and affiliations.

Department of Microbiology and Hospital Hygiene, Bundeswehr Hospital Hamburg, Bernhard-Nocht Str. 74, 20359, Hamburg, Germany

Hagen Frickmann

Institute for Microbiology, Virology and Hygiene, University Medicine Rostock, Schillingallee 70, 18057, Rostock, Germany

Hagen Frickmann, Andreas Podbielski, Jana Normann & Bernd Kreikemeyer

Department of Bioinformatics, Max-Planck Institute for Heart and Lung Research Bad Nauheim, Parkstraße 1, 61231, Bad Nauheim, Germany

Carsten Künne & Mario Looso

Department of Preventive Medicine, Bundeswehr Medical Academy, Neuherbergstraße 11, 80937, Munich, Germany

Ralf Matthias Hagen

Swiss Tropical and Public Health Institute, Socinstrasse 57, 4051, Basel, Switzerland

Sven Poppert

Faculty of Medicine, University Basel, Socinstrasse 57, 4051, Basel, Switzerland

Contributions

HF, RMH and BK were in charge of conception and design of the study. AP, JN and BK were responsible for the acquisition of the data. HF, CK and ML performed analysis and interpretation of the data. All authors were jointly involved in drafting the manuscript and revising it critically for important intellectual content, gave final approval of the version to be published, participated relevantly in the work to take public responsibility for appropriate portions of the content, and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding author

Correspondence to Hagen Frickmann.

Ethics declarations

Ethics approval and consent to participate, consent for publication.

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article.

Frickmann, H., Künne, C., Hagen, R.M. et al. Next-generation sequencing for hypothesis-free genomic detection of invasive tropical infections in poly-microbially contaminated, formalin-fixed, paraffin-embedded tissue samples – a proof-of-principle assessment. BMC Microbiol 19 , 75 (2019). https://doi.org/10.1186/s12866-019-1448-0

Received: 03 January 2018

Accepted: 28 March 2019

Published: 08 April 2019

DOI: https://doi.org/10.1186/s12866-019-1448-0

Keywords

  • Next-generation sequencing
  • Hypothesis-free diagnosis of infection
  • Invasive fungal infections
  • Invasive amebiasis
  • FFPE, formalin-fixed, paraffin-embedded samples
  • Molecular diagnostics
  • Tropical infectious diseases

BMC Microbiology

ISSN: 1471-2180

Genomic Imprinting

What Is Imprinting?

For most genes, we inherit two working copies -- one from mom and one from dad. But with imprinted genes, we inherit only one working copy. Depending on the gene, either the copy from mom or the copy from dad is epigenetically silenced. Silencing usually happens through the addition of methyl groups during egg or sperm formation.

The epigenetic tags on imprinted genes usually stay put for the life of the organism. But they are reset during egg and sperm formation. Regardless of whether they came from mom or dad, certain genes are always silenced in the egg, and others are always silenced in the sperm.

Imprinted Genes Bypass Epigenetic Reprogramming

Soon after egg and sperm meet, most of the epigenetic tags that activate and silence genes are stripped from the DNA. However, in mammals, imprinted genes keep their epigenetic tags. Imprinted genes begin the process of development with epigenetic tags in place.

Imprinted genes are not the only genes that bypass epigenetic reprogramming in the early embryo. Studying imprinting may help researchers understand how other genes make it through reprogramming without losing their epigenetic tags.

Imprinting is unique to mammals and flowering plants. In mammals, about 1% of genes are imprinted.

Imprinting Is Required for Normal Development

An individual normally has one active copy of an imprinted gene. Improper imprinting can result in an individual having two active copies or two inactive copies. This can lead to severe developmental abnormalities, cancer, and other problems.

Prader-Willi syndrome and Angelman syndrome are two very different disorders, but both are linked to the same imprinted region of chromosome 15. Some of the genes in this region are silenced in the egg, and at least one gene is silenced in the sperm. So someone who inherits a defect on chromosome 15 is missing different active genes, depending on whether the chromosome came from mom or dad.

Prader-Willi syndrome

  • Symptoms include learning difficulties, short stature, and compulsive eating.
  • Individuals are missing gene activity that normally comes from dad.
  • Happens when dad's copy is missing, or when there are two maternal copies.

Angelman syndrome

  • Symptoms include learning difficulties, speech problems, seizures, jerky movements, and an unusually happy disposition.
  • Individuals are missing gene activity that normally comes from mom.
  • Happens when mom's copy is defective or missing, or when there are two paternal copies.

The Difficulty of Cloning Mammals

Mammals are notoriously difficult to clone. Researchers often need to go through the cloning procedure dozens or even hundreds of times in order to produce a single healthy clone. The epigenome, including problems with imprinted genes, is likely to be at the root of this difficulty.

The most common method of cloning is called somatic cell nuclear transfer (SCNT). SCNT involves removing a donor nucleus from a non-reproductive cell (often a skin cell or mammary cell) and placing it into an egg cell that has had its nucleus removed.

Clones have abnormal epigenomes, which can lead to a variety of problems. The epigenetic problems with clones likely arise for two reasons. First, the donor nucleus comes from a differentiated cell with epigenetic tags already in place. These tags keep genes switched on or off and allow the cell to perform its responsibilities. After the donor nucleus is transferred, the egg does its best to erase the epigenetic tags. But the process is faulty, delayed and incomplete.

Second, the epigenetic tags in the donor nucleus have been copied several times over. While the machinery that copies the DNA code is faithful (it makes about one error in half a billion), the epigenetic copying machinery is sloppy. In some cases, its error rate can be as high as 1 in 25. Miscopied epigenetic tags on even a very small number of imprinted genes in the donor nucleus could have serious consequences during the development of the resulting embryo.
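A back-of-the-envelope calculation helps put the two error rates quoted above side by side. The sketch below assumes that every copying event is independent and that the worst-case epigenetic error rate (1 in 25) applies at each cell division. Both are simplifying assumptions made only for illustration, and the function name and the numbers of divisions are hypothetical.

```python
from fractions import Fraction

DNA_ERROR_RATE = Fraction(1, 500_000_000)  # "about one error in half a billion"
EPI_ERROR_RATE = Fraction(1, 25)           # worst case quoted in the text


def prob_tag_intact(error_rate, divisions):
    """Chance that one tag is copied correctly through every division,
    assuming independent copying events (an illustrative assumption)."""
    return (1 - error_rate) ** divisions


# How much more error-prone is epigenetic copying at its worst?
ratio = EPI_ERROR_RATE / DNA_ERROR_RATE
print(f"Epigenetic copying is up to {int(ratio):,} times more error-prone")

# Hypothetical numbers of cell divisions, chosen only to show the trend
for n in (1, 10, 50):
    p = float(prob_tag_intact(EPI_ERROR_RATE, n))
    print(f"after {n:2d} divisions: {p:.3f} chance a given tag is still intact")
```

Under these assumptions the chance that a given tag survives unchanged falls quickly with each round of copying, which is consistent with the point that even a few miscopied tags on imprinted genes could matter.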

Dolly the sheep was the first mammal to be cloned by somatic cell nuclear transfer (SCNT).

Why Imprint? The Genetic Conflict Hypothesis

Scientists have come up with a number of hypotheses to explain why imprinting happens in mammals. One of these, the Genetic Conflict hypothesis, supposes that imprinting grew out of a competition between males for maternal resources.

In some species, more than one male can father offspring from the same litter. A house cat, for example, can mate more than once during a heat and have a litter of kittens with two or more fathers. If one father's kittens grow larger than the rest, his offspring will be more likely to survive to adulthood and pass along their genes. So it's in the interest of the father's genes to produce larger offspring. The larger kittens will be able to compete for maternal resources at the expense of the other fathers' kittens.

On the other hand, a better outcome for the mother's genes would be for all of her kittens to survive to adulthood and reproduce. The mother alone will provide nutrients and protection for her kittens throughout pregnancy and after birth. She needs to be able to divide her resources among several kittens, without compromising her own needs.

It turns out that many imprinted genes are involved in growth and metabolism. Paternal imprinting favors the production of larger offspring, and maternal imprinting favors smaller offspring. Often maternally and paternally imprinted genes work in the very same growth pathways. This conflict of interest sets up an epigenetic battle between the parents -- a sort of parental tug-of-war.

Beckwith-Wiedemann Syndrome

The Igf2 gene (but not the Igf2 receptor gene) is also imprinted in humans. The Igf2 gene codes for a hormone that stimulates growth during embryonic and fetal development. Methyl tags normally silence the maternal Igf2 gene. But a DNA mutation or an "epimutation" (missing methyl tags) can activate it, resulting in two active copies of the gene.

Activation of the maternal Igf2 gene during egg formation or very early in development causes Beckwith-Wiedemann Syndrome (BWS). While children with BWS have a variety of symptoms, the most common and obvious feature is overgrowth. Babies with BWS are born larger than 95% of their peers. They also have an increased risk of cancer, especially during childhood.

BWS occurs once in about 15,000 births. However, in babies that were conceived in the laboratory with the help of assisted reproductive technology (ART), the rate of BWS may be as high as 1 in 4,000. This and other evidence of imprinting errors is prompting some to call for further investigation into the safety of common ART laboratory procedures.
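The two incidence figures above can be compared directly. The short calculation below uses only the rates quoted in the text (about 1 in 15,000 overall and, at the upper bound, 1 in 4,000 after ART). The variable names are illustrative, and the result is an upper-bound ratio of the quoted rates, not a measured relative risk.

```python
from fractions import Fraction

bws_general = Fraction(1, 15_000)   # quoted overall incidence
bws_art_upper = Fraction(1, 4_000)  # quoted upper bound for ART births

# Ratio of the two quoted rates
relative_rate = bws_art_upper / bws_general
print(f"Up to {float(relative_rate):.2f}-fold higher incidence after ART")

# The same rates expressed as cases per 100,000 births
per_100k_general = float(bws_general * 100_000)
per_100k_art = float(bws_art_upper * 100_000)
print(f"{per_100k_general:.1f} vs {per_100k_art:.1f} cases per 100,000 births")
```

Even at the upper bound, BWS remains rare in absolute terms, which is why the text frames the difference as a reason for further investigation, not as a conclusion about safety.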

Ligers and Tigons

Imprinted genes are under greater selective pressure than normal genes. This is because only one copy is active at a time. Any variations in that copy will be expressed. There is no "back-up copy" to mask its effects. As a result, imprinted genes evolve more rapidly than other genes. And imprinting patterns -- which genes are silenced in the eggs and sperm -- also evolve quickly. They can be quite different in closely related species.

Lions and tigers don't normally meet in nature. But they can get along very well in captivity, where they sometimes produce hybrid offspring. The offspring look different, depending on who the mother is. A male lion and a female tiger produce a liger, the biggest of the big cats. A male tiger and a female lion produce a tigon, a cat that is about the same size as its parents.

The difference in size and appearance between ligers and tigons is due in part to the parents' differently imprinted genes. Other animals can also hybridize, with similar results. For example, a horse and a donkey can produce a mule or a hinny.

Imprinting patterns often differ even in closely related animals such as tigers and lions.

Imprinted Genes are Sensitive to Environmental Signals

Imprinted genes are especially sensitive to environmental signals. Because imprinted genes have only a single active copy and no back-up, any epigenetic changes or "epimutations" will have a greater impact on gene expression.

Environmental signals can also affect the imprinting process itself. Imprinting happens during egg and sperm formation, when epigenetic tags are added to silence specific genes. Diet, hormones and toxins can all affect this process, impacting the expression of genes in the next generation.

Jaenisch, R. (1997). DNA methylation and imprinting: why bother? Trends in Genetics, 13 (8): 323-329 (subscription required).

Murrell, A. et al. (2008). Distinct methylation changes at the IGF2-H19 locus in congenital growth disorders and cancer . PLoS One, 3 (3): e1849)

geneimprint - Information about imprinting and a running list of imprinted genes in mouse and humans

Imprinting disorders and ART: world's largest study results are reassuring . Bio-Medicine Online article summarizing a research study of Beckwith-Wiedemann Syndrome and Artificial Reproductive Technology.

Bowdin, S. et al. (2007). A survey of assisted reproductive technology births and imprinting disorders . Human Reproduction, 22 (12): 3237-3240.

  • Review Article
  • Published: 11 June 2013

Genomic and epigenetic insights into the molecular bases of heterosis

  • Z. Jeffrey Chen

Nature Reviews Genetics volume 14, pages 471–482 (2013)

  • Developmental biology
  • Epigenetics

Century-old genetic models are limited in their ability to explain the molecular bases of heterosis.

Transcriptomic, proteomic, metabolic and epigenomic studies provide new insights into parental genomic interactions, leading to regulatory and network changes and heterosis.

Genetic and epigenetic reprogramming of individual genes, regulatory factors and their associated networks in hybrids promotes growth, stress tolerance and fitness.

Key regulators can be manipulated using biochemical and transgenic approaches to alter biological networks and heterosis.

Although heterosis is most extensively studied in plants, the principles uncovered in plants are likely to apply more broadly across organisms.

Heterosis, also known as hybrid vigour, is widespread in plants and animals, but the molecular bases for this phenomenon remain elusive. Recent studies in hybrids and allopolyploids using transcriptomic, proteomic, metabolomic, epigenomic and systems biology approaches have provided new insights. Emerging genomic and epigenetic perspectives suggest that heterosis arises from allelic interactions between parental genomes, leading to altered programming of genes that promote the growth, stress tolerance and fitness of hybrids. For example, epigenetic modifications of key regulatory genes in hybrids and allopolyploids can alter complex regulatory networks of physiology and metabolism, thus modulating biomass and leading to heterosis. The conceptual advances could help to improve plant and animal productivity through the manipulation of heterosis.


Ma, Q., Hedden, P. & Zhang, Q. Heterosis in rice seedlings: its relationship to gibberellin content and expression of gibberellin metabolism and signaling genes. Plant Physiol. 156 , 1905–1920 (2011).

Zhang, Y., Ni, Z., Yao, Y., Nie, X. & Sun, Q. Gibberellins and heterosis of plant height in wheat ( Triticum aestivum L.). BMC Genet. 8 , 40 (2007).

Lee, H. S. & Chen, Z. J. Protein-coding genes are epigenetically regulated in Arabidopsis polyploids. Proc. Natl Acad. Sci. USA 98 , 6753–6758 (2001).

Chen, M., Ha, M., Lackey, E., Wang, J. & Chen, Z. J. RNAi of met1 reduces DNA methylation and induces genome-specific changes in gene expression and centromeric small RNA accumulation in Arabidopsis allopolyploids. Genetics 178 , 1845–1858 (2008).

Kashkush, K., Feldman, M. & Levy, A. A. Transcriptional activation of retrotransposons alters the expression of adjacent genes in wheat. Nature Genet. 33 , 102–106 (2003).

Chandler, V. L. Paramutation's properties and puzzles. Science 330 , 628–629 (2010).

Malapeira, J., Khaitova, L. C. & Mas, P. Ordered changes in histone modifications at the core of the Arabidopsis circadian clock. Proc. Natl Acad. Sci. USA 109 , 21540–21545 (2012).

Perales, M. & Mas, P. A functional link between rhythmic changes in chromatin structure and the Arabidopsis biological clock. Plant Cell 19 , 2111–2123 (2007).

Asher, G. et al. SIRT1 regulates circadian clock gene expression through PER2 deacetylation. Cell 134 , 317–328 (2008).

Nakahata, Y. et al. The NAD + -dependent deacetylase SIRT1 modulates CLOCK-mediated chromatin remodeling and circadian control. Cell 134 , 329–340 (2008).

Nakahata, Y., Sahar, S., Astarita, G., Kaluzova, M. & Sassone-Corsi, P. Circadian control of the NAD + salvage pathway by CLOCK-SIRT1. Science 324 , 654–657 (2009).

James, A. B. et al. Alternative splicing mediates responses of the Arabidopsis circadian clock to temperature changes. Plant Cell 24 , 961–981 (2012).

Filichkin, S. A. et al. Genome-wide mapping of alternative splicing in Arabidopsis thaliana . Genome Res. 20 , 45–58 (2010).

Kim, E. D. & Chen, Z. J. Unstable transcripts in Arabidopsis allotetraploids are associated with nonadditive gene expression in response to abiotic and biotic stresses. PLoS ONE 6 , e24251 (2011).

Wang, J., Tian, L., Lee, H. S. & Chen, Z. J. Nonadditive regulation of FRI and FLC loci mediates flowering-time variation in Arabidopsis allopolyploids. Genetics 173 , 965–974 (2006).

Chapman, E. J. & Carrington, J. C. Specialization and evolution of endogenous small RNA pathways. Nature Rev. Genet. 8 , 884–896 (2007).

Chen, X. Small RNAs and their roles in plant development. Annu. Rev. Cell Dev. Biol. 25 , 21–44 (2009).

Molnar, A., Melnyk, C. & Baulcombe, D. C. Silencing signals in plants: a long journey for small RNAs. Genome Biol. 12 , 215 (2011).

Herr, A. J., Jensen, M. B., Dalmay, T. & Baulcombe, D. C. RNA polymerase IV directs silencing of endogenous DNA. Science 308 , 118–120 (2005).

Onodera, Y. et al. Plant nuclear RNA polymerase IV mediates siRNA and DNA methylation-dependent heterochromatin formation. Cell 120 , 613–622 (2005).

Law, J. A. & Jacobsen, S. E. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Rev. Genet. 11 , 204–220 (2010).

Haag, J. R. & Pikaard, C. S. Multisubunit RNA polymerases IV and V: purveyors of non-coding RNA for plant gene silencing. Nature Rev. Mol. Cell Biol. 12 , 483–492 (2011).

Wassenegger, M., Heimes, S., Riedel, L. & Sanger, H. L. RNA-directed de novo methylation of genomic sequences in plants. Cell 76 , 567–576 (1994).

Ha, M. et al. Small RNAs serve as a genetic buffer against genomic shock in Arabidopsis interspecific hybrids and allopolyploids. Proc. Natl Acad. Sci. USA 106 , 17835–17840 (2009).

Barber, W. T. et al. Repeat associated small RNAs vary among parents and following hybridization in maize. Proc. Natl Acad. Sci. USA 109 , 10444–10449 (2012).

Kenan-Eichler, M. et al. Wheat hybridization and polyploidization results in deregulation of small RNAs. Genetics 188 , 263–272 (2011).

Alleman, M. et al. An RNA-dependent RNA polymerase is required for paramutation in maize. Nature 442 , 295–298 (2006).

Nobuta, K. et al. Distinct size distribution of endogeneous siRNAs in maize: Evidence from deep sequencing in the mop1 - 1 mutant. Proc. Natl Acad. Sci. USA 105 , 14958–14963 (2008).

Lee, Y. et al. MicroRNA genes are transcribed by RNA polymerase II. EMBO J. 23 , 4051–4060 (2004).

Park, W., Li, J., Song, R., Messing, J. & Chen, X. CARPEL FACTORY, a Dicer homolog, and HEN1, a novel protein, act in microRNA metabolism in Arabidopsis thaliana . Curr. Biol. 12 , 1484 (2002).

Voinnet, O. Origin, biogenesis, and activity of plant microRNAs. Cell 136 , 669–687 (2009).

Peragine, A., Yoshikawa, M., Wu, G., Albrecht, H. L. & Poethig, R. S. SGS3 and SGS2/SDE1/RDR6 are required for juvenile development and the production of trans -acting siRNAs in Arabidopsis . Genes Dev. 18 , 2368–2379 (2004).

Ng, D. W. et al. cis - and trans -Regulation of miR163 and target genes confers natural variation of secondary metabolites in two Arabidopsis species and their allopolyploids. Plant Cell 23 , 1729–1740 (2011).

Shivaprasad, P. V., Dunn, R. M., Santos, B. A., Bassett, A. & Baulcombe, D. C. Extraordinary transgressive phenotypes of hybrid tomato are influenced by epigenetics and small silencing RNAs. EMBO J. 31 , 257–266 (2012).

Greaves, I. K. et al. Trans chromosomal methylation in Arabidopsis hybrids. Proc. Natl Acad. Sci. USA 109 , 3570–3575 (2012).

Chodavarapu, R. K. et al. Transcriptome and methylome interactions in rice hybrids. Proc. Natl Acad. Sci. USA 109 , 12040–12045 (2012).

Gore, M. A. et al. A first-generation haplotype map of maize. Science 326 , 1115–1117 (2009).

Zilberman, D., Gehring, M., Tran, R. K., Ballinger, T. & Henikoff, S. Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nature Genet. 39 , 61–69 (2007).

Zhang, X. et al. Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis . Cell 126 , 1189–1201 (2006).

Kurihara, Y. et al. Identification of the candidate genes regulated by RNA-directed DNA methylation in Arabidopsis . Biochem. Biophys. Res. Commun. 376 , 553–557 (2008).

Madlung, A. et al. Remodeling of DNA methylation and phenotypic and transcriptional changes in synthetic Arabidopsis allotetraploids. Plant Physiol. 129 , 733–746 (2002).

Jenuwein, T. & Allis, C. D. Translating the histone code. Science 293 , 1074–1080 (2001).

Li, B., Carey, M. & Workman, J. L. The role of chromatin during transcription. Cell 128 , 707–719 (2007).

Banaei Moghaddam, A. M. et al. Additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids. Plant J. 67 , 691–700 (2011).

Ha, M., Ng, D. W., Li, W. H. & Chen, Z. J. Coordinated histone modifications are associated with gene expression variation within and between species. Genome Res. 21 , 590–598 (2011).

Ferguson-Smith, A. C. Genomic imprinting: the emergence of an epigenetic paradigm. Nature Rev. Genet. 12 , 565–575 (2011).

Mosher, R. A. et al. Uniparental expression of PolIV-dependent siRNAs in developing endosperm of Arabidopsis . Nature 460 , 283–286 (2009).

Lu, J., Zhang, C., Baulcombe, D. C. & Chen, Z. J. Maternal siRNAs as regulators of parental genome imbalance and gene expression in endosperm of Arabidopsis seeds. Proc. Natl Acad. Sci. USA 109 , 5529–5534 (2012).

Somers, D. E., Devlin, P. F. & Kay, S. A. Phytochromes and cryptochromes in the entrainment of the Arabidopsis circadian clock. Science 282 , 1488–1490 (1998).

Huang, W. et al. Mapping the core of the Arabidopsis circadian clock defines the network structure of the oscillator. Science 336 , 75–79 (2012).

Gendron, J. M. et al. Arabidopsis circadian clock protein, TOC1, is a DNA-binding transcription factor. Proc. Natl Acad. Sci. USA 109 , 3167–3172 (2012).

Kim, W. Y. et al. ZEITLUPE is a circadian photoreceptor stabilized by GIGANTEA in blue light. Nature 449 , 356–360 (2007).

Jones, M. A. et al. Jumonji domain protein JMJD5 functions in both the plant and human circadian systems. Proc. Natl Acad. Sci. USA 107 , 21623–21628 (2010).

Lu, S. X. et al. The Jumonji C domain-containing protein JMJ30 regulates period length in the Arabidopsis circadian clock. Plant Physiol. 155 , 906–915 (2011).

Portoles, S. & Mas, P. The Functional Interplay between Protein Kinase CK2 and CCA1 transcriptional activity is essential for clock temperature compensation in Arabidopsis . PLoS Genet. 6 , e1001201 (2010).

Daniel, X., Sugano, S. & Tobin, E. M. CK2 phosphorylation of CCA1 is necessary for its circadian oscillator function in Arabidopsis . Proc. Natl Acad. Sci. USA 101 , 3292–3297 (2004).

Nusinow, D. A. et al. The ELF4-ELF3-LUX complex links the circadian clock to diurnal control of hypocotyl growth. Nature 475 , 398–402 (2011).

Pruneda-Paz, J. L., Breton, G., Para, A. & Kay, S. A. A functional genomics approach reveals CHE as a component of the Arabidopsis circadian clock. Science 323 , 1481–1485 (2009).

Download references

Acknowledgements

I am grateful to former and current members of the Laboratory of Polyploidy, Heterosis and Epigenetics for their contributions to this work. I apologize for omitting or glossing over some relevant studies owing to space limitations. Funding for the research is provided by the US National Science Foundation (grants IOS1238048, IOS1025947 and MCB1110857), the US National Institutes of Health (grant GM067015), Cotton Incorporated (grant 07161) and the National Natural Science Foundation of China (grant No. 31290213).

Author information

Authors and affiliations.

Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, 78712, Texas, USA

Z. Jeffrey Chen

National Laboratory of Plant Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, 210095, China


Corresponding author

Correspondence to Z. Jeffrey Chen .

Ethics declarations

Competing interests.

The author declares no competing financial interests.


An organism or individual that contains two or more sets of genetically distinct chromosomes, usually through hybridization between different species. A disomic allopolyploid (also known as amphidiploid) is a type of allopolyploid in which bivalents form within each chromosome set.

Heterosis (also known as hybrid vigour). When hybrids display increased levels of growth, survival or fitness relative to their parents.

A phenomenon that transmits genes and genomes from only one parent (usually the female) to the offspring.

A scenario in which the phenotype of alleles displays fully when they are present in the heterozygous or heterokaryotic state.

(Also known as monohybrid heterosis). The phenomenon of heterozygotes having a more extreme phenotype than either homozygote.

When the heterozygote genotype has a higher relative fitness than either the homozygote dominant or homozygote recessive genotype.

Non-reciprocal interactions between non-allelic genes, which cannot be easily explained by quantitative genetic models.

A phenomenon of overdominance that is associated with the complementation of two or more linked dominant and recessive alleles in repulsion, in which the dominant and recessive alleles are located on opposite homologues of the two genes, acting as overdominance.

Quantitative trait locus (QTL). A genetic locus that contributes to variation in quantitative phenotypes. The effects may also vary under certain environmental conditions.

Polyploids created by the multiplication of one basic set of chromosomes (usually within the same species).

Phenomena whereby the expression of a gene is dependent on the parental origin. This is usually synonymous to imprinting but could be different from imprinting in cases in which the parent-of-origin effect can be caused by cytoplasmic–nuclear gene interactions (known as maternal effects) in plants, whereas imprinting occurs between two alleles in the nucleus with the same maternal parent.

Chromosomes or genes in the related species that are derived from the same ancestor and coexist in an allopolyploid.

The time for the completion of an oscillation cycle from one peak to the next or from one trough to the next, which is usually 24 hours.

The difference between the level of a peak (or trough) and the mean value of a wave. For symmetrical waves, the amplitude is half the value of the range of oscillation.

A light–dark cycle in a given day. Long-day plants, such as Arabidopsis thaliana and wheat, respond to lengthening days and they flower in spring. Short-day plants, such as rice and maize, respond to shortening days and flower in late summer or autumn.

An epigenetic phenomenon discovered in maize in which one allele influences the expression of another allele at the same locus when the two alleles are combined in a heterozygote.

Small interfering RNAs (siRNAs). A class of 20–25 nucleotide-long small RNAs that repress gene expression or induce epigenetic processes. They are normally derived from transposable elements and repetitive DNA.

MicroRNAs (miRNAs). A class of 21–23 nucleotide-long small RNAs that have functions in transcriptional and post-transcriptional regulation of gene expression, usually through mRNA degradation or translational repression through complementarity with the target transcripts.

Trans-acting siRNAs (ta-siRNAs). A class of small RNAs that are generated from target mRNAs, in a process triggered by specific microRNAs (miRNAs), thus leading to a series of consecutive 21-nucleotide small interfering RNAs (siRNAs), called 'phasing'. These secondary siRNAs can act in trans to regulate their target transcripts through mRNA degradation.

RNA-directed DNA methylation (RdDM). An epigenetic process to establish DNA methylation through the biogenesis of siRNAs that guide the methylation of homologous loci. The process is known as de novo DNA methylation and is predominantly found in plants and fungi.

The release of genome-wide chromatin constraints of gene expression, including the activation of transposons in response to environmental changes and genomic hybridization. The term was first used by Barbara McClintock in 1984.

Expression of only the maternal or paternal allele of a gene in the offspring; it is an epigenetic phenomenon involving DNA methylation, chromatin modifications and non-coding RNAs.

Genome-wide association studies (GWASs). Examinations of many common genetic variants (usually SNPs in linkage disequilibrium) of different individuals to test if any variant is associated with a phenotypic trait.


About this article

Cite this article.

Chen, Z. Genomic and epigenetic insights into the molecular bases of heterosis. Nat Rev Genet 14 , 471–482 (2013). https://doi.org/10.1038/nrg3503


Published : 11 June 2013

Issue Date : July 2013

DOI : https://doi.org/10.1038/nrg3503



Hum Genomics, v.8(1); 2014

Changing genetic paradigms: creating next-generation genetic databases as tools to understand the emerging complexities of genotype/phenotype relationships

Bruce Gottlieb

1 Lady Davis Institute for Medical Research, 3755 Côte Ste Catherine Road, Montreal, QC H3T 1E2, Canada

2 Segal Cancer Centre, Jewish General Hospital, 3755 Côte Ste Catherine Road, Montreal, QC H3T 1E2, Canada

3 Department of Human Genetics, McGill University, Montreal, QC, Canada

Lenore K Beitel

4 Department of Medicine, McGill University, Montreal, QC, Canada

Mark Trifiro

Understanding genotype/phenotype relationships has become more complicated as next-generation sequencing has revealed increasing amounts of inter- and intra-tissue genetic heterogeneity, and as evidence has accumulated that factors such as epigenetic modifications, non-coding RNAs and RNA editing can play an important role in determining phenotype. Such findings have challenged a number of classic genetic assumptions, including (i) that analysis of genomic sequence obtained from blood accurately reflects the genotype responsible for phenotype expression in an individual; (ii) that significant genetic alterations will be found only in diseased individuals, in germline tissues in inherited diseases, or in specific diseased tissues in somatic diseases such as cancer; and (iii) that mutation rates in putative disease-associated genes solely determine disease phenotypes. With the breakdown of our traditional understanding of genotype-to-phenotype relationships, it is becoming increasingly apparent that new analytical tools will be required to determine the relationship between genotype and phenotypic expression. To this end, we propose that next-generation genetic database (NGDB) platforms be created that include new bioinformatics tools based on algorithms that can evaluate genetic heterogeneity, as well as powerful systems biology analysis tools to actively process and evaluate the vast amounts of both genomic and genomic-modifying information required to reveal the true relationships between genotype and phenotype.

Introduction

The problem of understanding the relationships between genotype and phenotype has become much more complicated with the explosion of genetic information produced by next-generation sequencing (NGS). This information has complicated not only our ability to understand complex traits but also our understanding of monogenic traits, which is no longer so straightforward. Indeed, recent articles have suggested the need to develop new approaches to come to grips with the ever-expanding complexity of genotype/phenotype relationships, such as ‘systems genetics’ [1] and ‘particle genetics’ [2].

However, perhaps the most confusing development from a ‘traditional’ genetics standpoint has been the revelation of unexpected amounts of genetic variation in normal individuals, e.g., through the 1000 Genomes Project Consortium [3, 4] (http://www.1000genomes.org) and The Cancer Genome Atlas (http://www.cancergenome.nih.gov) projects. Further, multiple sequence comparisons both between and within an individual's tissues have revealed extensive inter- and intra-tissue genetic heterogeneity [5-7]. These discoveries have raised some fundamental questions about our most basic genetic assumptions, among which are the following: (i) Can genetic studies still rely on a unique DNA or RNA sequence derived from blood or diseased tissue to determine phenotype? (ii) Does a definitive and practical human genome reference sequence really exist, or at least can the reference sequence adopted by the NCBI (RefSeqGene) be practically useful in determining genotype/phenotype relationships? (iii) Does genetic heterogeneity in normal and diseased tissues imply that in certain tissues an individual's genome will naturally undergo somatic changes from conception to death, as suggested in Figure 1? In particular, newly revealed genetic heterogeneity data could help explain the long-observed but poorly understood concepts of variable expressivity and reduced penetrance. Traditionally, their effects on phenotypic differences have been considered relatively insignificant, particularly so for variable expressivity. To further complicate matters, phenotypic variations have been found in which identical gene alterations have been associated with (i) considerably different disease phenotypes, e.g., in phenylalanine hydroxylase (PAH) deficiency [8], or (ii), in a more extreme manner, with different diseases altogether, e.g., in the androgen receptor (AR) gene, with both androgen insensitivity syndrome (AIS) and prostate cancer [9].

Figure 1. Factors that can affect an individual's genome from conception to death. (1) We have postulated that somatic mutations may occur during embryogenesis, and are then selected for later in life, to emphasize both the importance of identifying mutations early in development and the role of selection in determining phenotype.

In addition, there has been an increase in the discovery of significant phenotype-modifying events, including epigenetic modifications, RNA editing, and protein interactions, that can clearly influence the transcriptional and non-transcriptional events involved in determining the phenotype. These complex influences are therefore also likely to render our traditional understanding of the relationship between genotype and phenotype problematic. Further, a recent review of genotype/phenotype dissociation, which discussed the possible molecular basis of reduced penetrance in human inherited disease, highlighted 12 molecular events that can influence reduced penetrance [10], some of which are also likely to be involved in situations of variable expressivity. In Figure 2, we have suggested a model that incorporates some of these processes and how they might influence phenotype, with special emphasis on the influence of intra-organismal and intra-tissue genetic heterogeneity. Traditionally, genetic databases have been the tools of choice in determining genotype/phenotype relationships; however, in their present form, they are inadequate to deal with these issues. Therefore, we suggest that it is time to create next-generation genetic databases (NGDBs) that will be able to incorporate and analyze all of the factors that can contribute to the dissociation of genotype from phenotype, including those that may contribute to reduced penetrance and variable expressivity.
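To make the idea of a record that holds tissue-level genotypes alongside phenotype-modifying factors more concrete, the following is a minimal sketch in Python. It is not a schema proposed in this article or used by any existing database: the class names, field names, and the variant label `GENE1:c.100A>G` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical record layout: every name below is an illustrative assumption.
@dataclass
class TissueSample:
    tissue: str                  # e.g. "blood" or "prostate"
    age_at_sampling: float       # age of the individual in years
    variants: Dict[str, float]   # variant label -> fraction of reads carrying it


@dataclass
class ModifyingFactor:
    kind: str                    # e.g. "epigenetic", "RNA editing", "ncRNA"
    description: str


@dataclass
class IndividualRecord:
    individual_id: str
    phenotype: str
    samples: List[TissueSample] = field(default_factory=list)
    modifiers: List[ModifyingFactor] = field(default_factory=list)

    def tissues_with_variant(self, variant: str, min_fraction: float = 0.0) -> List[str]:
        """Names of sampled tissues in which the variant exceeds min_fraction."""
        return [
            s.tissue
            for s in self.samples
            if s.variants.get(variant, 0.0) > min_fraction
        ]

    def is_mosaic_for(self, variant: str) -> bool:
        """True when the variant is seen in some sampled tissues but not all."""
        carriers = self.tissues_with_variant(variant)
        return 0 < len(carriers) < len(self.samples)


# Usage: one individual sampled in two tissues, with the variant absent from blood.
record = IndividualRecord(
    individual_id="IND-001",
    phenotype="unaffected",
    samples=[
        TissueSample("blood", 52.0, {}),
        TissueSample("prostate", 52.0, {"GENE1:c.100A>G": 0.12}),
    ],
    modifiers=[ModifyingFactor("epigenetic", "promoter methylation (illustrative)")],
)
print(record.tissues_with_variant("GENE1:c.100A>G"))
print(record.is_mosaic_for("GENE1:c.100A>G"))
```

Storing variants per tissue sample rather than per individual is the design choice that matters here: a blood-only genotype would miss the variant in this example entirely.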

Figure 2. Phenotypic modifying factors. (1) Somatic mutations can include both single nucleotide variants and structural alterations such as copy number variations that can then result in somatic and clonal mosaicism. (2) Cellular microenvironment selection pressure can work at the (i) DNA level, i.e., due to somatic mutations or (A) DNA editing; (ii) RNA level, i.e., due to (B) RNA editing, (C) interacting RNAs, or (D) epigenetic factors, etc.; or (iii) protein level, i.e., due to (E) protein-protein interactions. (3) Tissue microenvironment selection pressure can select a different protein product. Crossing arrows reflect the fact that selection can go in either direction.

Factors that have been shown to influence phenotype

Somatic mutations that result in intra-organismal and intra-tissue genetic heterogeneity

Until recently, it has been assumed that somatic mutations are almost exclusively associated with cancers and are uniform within an individual neoplasm. However, different sets of somatic mutations have been found within a single individual's cancer tissues, as in a recent study of primary high-grade serous ovarian cancers that revealed a considerable amount of intra-tumor genetic heterogeneity [ 11 ].

Somatic sequence variants in normal tissues have also been examined in relation to oncogenesis. One study concluded that somatic sequence variants in normal cell populations could be the earliest stage of oncogenesis [12]. Evidence that altered mammary gland development and predisposition to breast cancer are due to in utero exposure to endocrine disruptors has suggested that selection of cells with different phenotypic properties, presumably as a result of very early somatic mutations, may take place at the very earliest stages of breast tissue development [13]. Thus, we may need to reconsider whether accumulation of a critical number of oncogenic mutations, e.g., the buildup of driver somatic mutations, is the reason that many cancers occur later in life. Rather, it has been proposed that while the genetic origins of cancer may occur early in fetal development, later selection pressure could explain the relationship between aging and cancer [14]. Interestingly, a possible mechanism for producing very early somatic mutations has been identified: the temporary deferral of the repair of DNA lesions encountered during tissue replication, termed damage bypass, which is responsible for somatic hypermutation of the immunoglobulin gene [15]. Regardless of which oncogenesis hypothesis is eventually proven, the implications for the construction of NGDBs for cancers are likely to be profound, as NGDBs will need to consider incorporating sequence data from much earlier stages in a tissue's development, particularly from tissues that have the potential to become cancerous. Doing so is not practical at the moment, but it is possible to envision that new micro-sampling techniques, together with the continued dramatic decline in the cost of NGS, will make such an approach much more realistic in the future.

In addition, as specific tissues are being sequenced routinely, the number of other diseased tissues in which somatic mutations have been found has increased considerably [16]. More detailed studies have also reported somatic mosaicism in a number of other conditions, including Proteus syndrome [17] and hemimegalencephaly [18].

Further, a study of copy-number variants (CNVs) in somatic human tissues revealed a significant number of intra-individual genomic changes between tissues [19]. Other studies of chromosomal abnormalities, including CNVs, have revealed clonal mosaicism associated with aging and cancer [14] and have related it to a higher risk of hematological cancer [20].

DNA editing

At the present state of our knowledge, this process is still considered to be extremely rare and of little phenotypic significance [ 21 ].

RNA editing

Recent, though controversial, evidence has suggested that RNA editing occurs more frequently than previously thought [ 22 , 23 ], although questions of how common it actually is in normal tissues and the validity of the original report have arisen [ 24 - 26 ]. However, there do appear to be cases where modifications of disease phenotypes are related to RNA editing [ 27 , 28 ].

Coregulators: non-coding RNAs

In recent years, non-coding RNAs (ncRNA) have been found to play an important role in the phenotypic expression of the transcribed genomic output. This family of untranslated RNAs includes small nucleolar RNAs (snoRNAs), which facilitate mRNA splicing, regulate transcription factors, and repress gene expression [via microRNAs (miRNAs)]. Small nuclear RNAs (snRNAs) that alter cellular proliferation and apoptosis by means of small interfering RNAs (siRNAs) have also been identified [ 29 ]. Long non-coding RNAs (lncRNAs) have also been identified as possible regulators of gene transcription and expression. Thus, the use of NGS to infer transcript expression levels in general, specifically via ncRNAs, is becoming increasingly common in molecular and clinical laboratories [ 30 ]. Therefore, it is not surprising that ncRNAs have been implicated as being responsible for a number of disease phenotypes [ 31 ].

Epigenetic factors

Epigenetics describes chromatin-based events that regulate DNA-templated processes and result in stable reprogramming of gene expression in response to transient external stimuli. Primary epigenetic factors include modifications to DNA and histones that are dynamically added and removed by chromatin-modifying enzymes in a highly regulated manner. Epigenetic mechanisms identified include DNA methylation, phosphorylation, ubiquitylation, sumoylation, RNA interference, and histone variants. Further, such epigenetic modifications play a critical role in the regulation of DNA-based processes such as transcription, DNA repair and replication, which can affect phenotype expression. Thus, abnormal expression patterns or genomic changes in chromatin regulators can have profound effects on human disease processes [32]. Indeed, epigenetics is considered a unifying factor in the etiology of some complex traits [33].

Regulators and other types of interacting proteins

Over the past few years, phenotypic expression has also been found to be influenced by interacting proteins. Alterations in the interacting surfaces of a specific molecule [ 34 ] or in the interacting proteins themselves can result in faulty protein-protein interactions and contribute to a disease phenotype [ 35 ].

Selection pressure by cellular and tissue microenvironments

It has been proposed that tumor morphology and phenotype are driven by selective pressure from the tissue microenvironment [ 36 , 37 ]. This hypothesis has been expanded to include other genetically determined diseased and non-diseased phenotypes [ 38 ]. The ability to perform ultra-deep sequencing using next-generation sequencers has revealed many more variants of a gene within tissues and thus the possibility that evolution at the tissue level contributes to disease phenotypes such as cancer [ 37 , 38 ].

Genotype/phenotype disconnects and possible mechanisms

In light of all the potential phenotype-modifying factors (Figure 2), which are generally not documented in traditional genetic databases, it is easy to understand why such databases, in their attempt to link a defined genotype with a specific phenotype, tend to avoid commenting on genotype/phenotype disconnects: information on the mechanisms that could produce such effects is lacking. However, a recent review highlighted the importance of understanding these disconnects, with over 650 references cited in proposing 12 molecular mechanisms to explain reduced penetrance [ 10 ]. Similarly, a number of possible mechanisms have been suggested to explain variable expressivity, e.g., somatic mosaicism [ 39 ], modifier genes [ 40 ], microRNA [ 41 ], epigenetic processes [ 42 ], and allelic heterogeneity [ 43 ]. Originally, the concept of reduced penetrance was based on studies of well-known genetic conditions in which a family tree predicted a disease phenotype, but this phenotype was not observed. While the likelihood of reduced penetrance was small in most cases, the concept did serve a useful purpose in calculating the possibility of an individual having a diseased phenotype. The concept was further expanded when large-scale studies started to record the presence of mutations in specific genes associated with multifactorial diseases, such as cancer, a prime example being the breast cancer BRCA genes. In these cases, predicting penetrance was considered important in assessing the risk of disease. What has further complicated the issue, as we have noted, has been recent data from the 1000 Genomes Project and other large-scale sequencing projects, which have reported that normal individuals can contain tens of potentially severe disease-associated alleles [ 10 ]. Thus, rather than talk about reduced penetrance of a pathogenic variant in a cohort that is known to express the disease phenotype, we now have to consider why these pathogenic variants are non-penetrant in a significant number of normal healthy individuals.
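The shift described above can be illustrated with simple arithmetic. The following is a minimal sketch, assuming penetrance is estimated as the fraction of variant carriers who show the phenotype; the function name and all counts are hypothetical and are not taken from any of the cited studies.

```python
def penetrance(affected_carriers, total_carriers):
    """Fraction of carriers of a variant who show the associated phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= affected_carriers <= total_carriers:
        raise ValueError("affected_carriers must lie between 0 and total_carriers")
    return affected_carriers / total_carriers


# Hypothetical counts: carriers ascertained through affected families only.
family_estimate = penetrance(affected_carriers=18, total_carriers=20)

# Same variant after adding 40 hypothetical healthy carriers found by
# sequencing unselected individuals.
population_estimate = penetrance(affected_carriers=18, total_carriers=60)

print(f"family-based estimate:     {family_estimate:.2f}")
print(f"population-based estimate: {population_estimate:.2f}")
```

With these invented numbers the estimate falls from 0.90 to 0.30 once unaffected carriers are counted, which is why ascertainment through diseased cohorts alone can overstate the risk attached to a variant.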

Redefining the human genome reference sequence

Clearly, the arrival of relatively inexpensive whole genome sequencing, and the subsequent sequencing of large numbers of non-diseased individuals, has revealed the increasing presence of known disease-associated gene variants within non-diseased individuals. This was initially shown when the first Korean genome sequence was compared to other Asian genomes [ 44 ]. More detailed studies found sequence variants in genes associated with specific genetic disorders, in individuals with normal phenotypes. Such examples were recently discovered in a genomic analysis of 10 healthy individuals, where each individual had what was said to be ‘healthy variance’ in 19 to 31 OMIM genes, as they did not exhibit any of the signs, symptoms, or phenotypes of the associated genetic disorders [ 45 ]. However, it should be noted that not all sequence variants in OMIM genes are always pathogenic, as has recently been comprehensively reported [ 10 ]. Nevertheless, a systematic survey of loss-of-function (LoF) variants identified 26 known and 21 predicted severe disease-causing variants in analysis of 2,951 putative LoF variants obtained from 185 human genomes [ 46 ]. What is even more problematic is that our own work has identified specific pathogenic sequence variants in the AR gene in individuals with completely normal phenotypes, i.e., exactly the same AR variants as found in diseased individuals [ 9 ].

We believe these data call into question the validity of our present methods of defining the so-called normal human genome. In particular, normal tissue genotype/phenotype disconnects have clearly created questions regarding the practicality of relying on a single unique reference sequence as the definitive predictor of phenotype. The Human Genome Variation Society (HGVS) nomenclature committee has studied this issue ( http://www.hgvs.org/mutnomen/refseq.html ) and recommended that the NCBI RefSeqGen be used and that the reference sequence guidelines should follow the Locus Reference Genomic (LRG) sequence format [ 47 ], which suggests using a single-file record containing a unique stable reference sequence. These recommendations were appropriate at the start of NGS, when the extent of variance in normal individuals was relatively unknown. Naturally, we understand that a definitive reference sequence is important in defining exonic, intronic, and other structural parameters of genes. However, the issue of correlating phenotype with a specific sequence has clearly become much more complex.

To deal with this issue, the increasing amount of sequence variability in normal individuals has been incorporated into the latest version of the NCBI RefSeqGen (GRCh37.p13) ( http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human ), with the idea that these variants could be used as a contextual filter to determine the relationship between genotype and phenotype. Furthermore, additional tools have been set up to deal with the issue of normal variance, such as considering population-specific references where the major alleles are included at every location, or generating a reference sequence where all the alleles have been identified as part of the common ancestral lineage of modern humans. However, we would argue that just integrating normal human variance, however nuanced, into an overall version of the RefSeqGen fails to deal with the increasing problem of the association of the same gene variant with both normal and diseased phenotypes. Thus, relying solely on a DNA-based reference sequence, however sophisticated, will make it very difficult to distinguish between benign and disease-causing gene alterations, at least in traditional genetic databases, where the phenotypic classification of specific gene variants is based on having a unique reference sequence that is exclusively associated with a normal phenotype.
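The idea of a population-specific reference that carries the major allele at every location can be sketched in a few lines of Python. This is only an illustration: the function name, the input layout (one allele-count dictionary per position), and the toy counts are assumptions and do not correspond to any NCBI or GRC tool.

```python
def major_allele_reference(allele_counts):
    """Return the most frequent allele at each position as a string.

    allele_counts is a list with one dict per position, mapping each
    allele to its observed count in one population (hypothetical layout).
    Ties are broken alphabetically so the result is deterministic.
    """
    reference = []
    for counts in allele_counts:
        # Highest count first; alphabetical order decides ties.
        best_allele = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        reference.append(best_allele)
    return "".join(reference)


# Toy allele counts at three positions in two hypothetical populations.
pop_a = [{"A": 90, "G": 10}, {"C": 55, "T": 45}, {"G": 100}]
pop_b = [{"A": 20, "G": 80}, {"C": 30, "T": 70}, {"G": 100}]

print(major_allele_reference(pop_a))  # ACG
print(major_allele_reference(pop_b))  # GTG
```

The two toy populations yield different reference sequences at the first two positions, yet neither sequence records which allele accompanies a normal or a diseased phenotype, which is the limitation argued above.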

Possible organization of next-generation genetic databases

As an overarching principle, NGDBs need to be organized to take into consideration, particularly for multifactorial diseases, the overall genetic context of any identified mutation. However, context involves both intra-organismal genetic heterogeneity and other phenotype-modifying factors (Figure 2). These modifying factors also need to be considered in the context of ‘pathway analysis’ [ 48 ]. In light of the many contextual factors that can affect genotype/phenotype expression, it seems reasonable that future locus-specific databases (LSDBs) should be organized to take into account as much specific phenotype information as possible, including genotype-modifying factors, as opposed to most present LSDBs, which are primarily genotype centered.

The issue of how to deal with the increasing identification of somatic mutations and intra-organismal genetic heterogeneity also needs to be investigated. Traditionally, somatic mutations have not been associated with databases unless a cancer phenotype was involved. At present, most disease-based databases associated with common multifactorial diseases such as cancer, diabetes and cardiovascular diseases often lack tissue and individual specific data. Indeed, only the COSMIC database [ 49 ] lists a comprehensive spectrum of somatic mutations associated with specific tissues and individual samples. Furthermore, currently, there is no description of the germline susceptibility variants found in matching control tissues, therefore making it difficult to draw definitive conclusions as to the significance of many somatic mutations. The situation will become even more complex when inter- and intra-tumor genetic heterogeneity data is added. Clearly, traditional flat-file databases will be unable to deal with such data and what are needed are radically different database structures that include much more powerful analysis tools. In particular, it will be necessary to incorporate complex ‘system analysis tools’ that can analyze the intricate relationships between genotypic and phenotypic ontology [ 50 ]. Such analysis tools will need to incorporate extremely powerful knowledge analysis engines, possibly similar in design and organization to those developed by Google and other search engine companies.

These knowledge engines, for ‘systems genetics analysis’, will require the creation of powerful new bioinformatics tools and tremendously expanded database resources, particularly for disease-based databases. In particular, they will be required to analyze integrated genetic and non-genetic variation across many datasets, from different ethnic sub-groups or geographic populations, with the ultimate goal of integrating all genetic and non-genetic databases for a particular condition, especially if an initial population-based analysis fails to generate any significant insights into genotype/phenotype relationships. At the moment, such a task is clearly far beyond our capabilities; however, initial studies using mice have started to generate the bioinformatics tools and database resources required to create such NGDBs [ 51 ]. As NGDBs will include inter- and intra-tissue genetic heterogeneity, one factor that needs to be considered is the importance of quantifying variants that result in genetic heterogeneity, particularly if they are present within individual genes, rather than simply recording their presence. Indeed, we recently analyzed intra-tissue genetic heterogeneity in the AR gene in both cancer and non-cancer tissues taken from breast tumors and quantified AR variants in individual tissue samples using a new NGS technique [ 52 ]. Another approach has been to consider what has been termed ‘particle genetics’, where every cell is considered to be genetically unique, using probabilistic trait loci (PTL) to link genomic regions to probabilities of cellular characteristics [ 2 ].
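The difference between recording that a variant is present and quantifying it can be shown with a short sketch. The read counts, sample names, and function name below are hypothetical, and the calculation is a plain variant allele fraction, not the NGS technique used in [ 52 ].

```python
def variant_allele_fraction(alt_reads, total_reads):
    """Fraction of reads at a position that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    return alt_reads / total_reads


# Hypothetical read counts for one variant in three samples from one person.
samples = {
    "tumor": {"alt": 120, "depth": 400},
    "adjacent_normal": {"alt": 12, "depth": 400},
    "blood": {"alt": 0, "depth": 400},
}

for name, counts in samples.items():
    vaf = variant_allele_fraction(counts["alt"], counts["depth"])
    present = vaf > 0  # what a presence/absence record would store
    print(f"{name}: present={present}, VAF={vaf:.3f}")
```

A presence/absence record treats the tumor and adjacent normal samples identically, whereas the quantified values (0.300 versus 0.030 in this invented example) differ tenfold.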

Taking all of these factors into consideration, we propose an NGDB model that integrates separate databases for each of the potential genome-modifying factors with a genotype database that incorporates genetic heterogeneity. All of the individual databases are linked to an associated phenotype database, and the data are then processed and analyzed through a sophisticated knowledge engine (Figure 3).

Figure 3. A model for next-generation genetic databases. (1) Genotype Database: (A) genetic heterogeneity within blood tissues and (B, C, and D) within other tissues in an organism. Each of the following databases contains specific information associated with phenotype differences: (2) DNA editing database, (3) RNA editing database, (4) Coregulators database, (5) Epigenetic database, and (6) Interacting proteins database. (7) Microenvironment selective pressure for different phenotypes.
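One way to picture the linked structure of this model is as a small set of record types. The sketch below is an assumption-laden illustration in Python: the class names, field names, category labels, and example values are all hypothetical, and it makes no claim about how an actual NGDB or its knowledge engine would be implemented.

```python
from dataclasses import dataclass, field


@dataclass
class VariantObservation:
    """One variant seen in one tissue (an entry in the genotype database)."""
    gene: str
    variant: str            # free-text label; hypothetical naming
    tissue: str
    allele_fraction: float  # quantified rather than present/absent


@dataclass
class ModifierRecord:
    """One entry from a phenotype-modifying database (items 2 to 7 of the model)."""
    category: str           # e.g. "rna_editing", "epigenetic"
    description: str


@dataclass
class IndividualRecord:
    """Links genotype and modifier entries to a single phenotype entry."""
    individual_id: str
    phenotype: str
    genotype: list = field(default_factory=list)
    modifiers: list = field(default_factory=list)

    def tissues_with(self, gene, variant):
        """Tissues in which the given variant was observed, sorted by name."""
        return sorted(
            {obs.tissue for obs in self.genotype
             if obs.gene == gene and obs.variant == variant}
        )


# Hypothetical individual: the same AR variant at different levels in two tissues.
record = IndividualRecord(
    individual_id="IND-001",
    phenotype="normal",
    genotype=[
        VariantObservation("AR", "variant_X", "blood", 0.02),
        VariantObservation("AR", "variant_X", "breast", 0.35),
        VariantObservation("AR", "variant_Y", "breast", 0.10),
    ],
    modifiers=[ModifierRecord("epigenetic", "promoter methylation noted")],
)

print(record.tissues_with("AR", "variant_X"))  # ['blood', 'breast']
```

Keeping tissue and allele fraction on every genotype entry is what lets a query distinguish intra-individual heterogeneity from a simple carrier status, while the modifier list stands in for the separate linked databases.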

Summary of possible actions required to create NGDBs

The following are some of the most significant actions that need to be undertaken in creating NGDBs:

1. Work in conjunction with the 1000 Genomes Project consortium and the Human Variome Project (HVP) to define the limits and significance of normal genome variation.

2. Incorporate inter- and intra-individual genetic heterogeneity into NGDBs.

3. Establish guidelines as to the significance of the number of reads needed to confirm a particular variant. Note that initial NGS sequencing depth started at 4× to 10× coverage and rapidly rose to the point where 30× to 50× coverage is considered normal. However, recent studies show that increased coverage is likely to result in increased detection of variants [ 53 , 54 ], and in tumor diagnostics coverage has now reached up to 20,000 reads.

4. Determine how the differing frequencies of occurrence of multiple gene variants within individuals should be incorporated into NGDBs. It should be noted that, at the moment, such frequencies are generally not incorporated into databases, particularly not into LSDBs. It would also clearly help to integrate structural variant data such as CNVs into LSDBs.

5. Incorporate expression data effectively into phenotype data parameters in NGDBs. Note that examples of tissue-specific variations in gene expression have now been reported [ 55 ]. In addition, data from the Genotype-Tissue Expression project [ 56 ] could be invaluable in determining relationships between tissue gene expression and disease phenotype.

6. Finally, research the bioinformatics and data parameters required to construct NGDBs that can incorporate and analyze all of the above data. To be truly effective, we believe that this effort should involve experts in genetics, bioinformatics, and systems biology-based search and knowledge engines, as well as a worldwide effort to collect genetic variation, as proposed, for instance, by the HVP.
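The link between read depth and variant detection raised in action 3 can be made concrete with a simple binomial model. The sketch below assumes independent reads, no sequencing error, a variant present in 1% of reads, and a rule that at least five variant reads are needed to confirm it; the threshold, the allele fraction, and the function name are illustrative assumptions, not published guidelines.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads):
    """P(at least min_alt_reads variant reads) under a binomial model.

    Assumes each of the `depth` reads independently carries the variant
    with probability `allele_fraction`; sequencing error is ignored.
    """
    p_fewer = sum(
        comb(depth, k)
        * allele_fraction ** k
        * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# Depths echoing the coverage figures quoted in action 3.
for depth in (10, 50, 1000, 20000):
    p = detection_probability(depth, allele_fraction=0.01, min_alt_reads=5)
    print(f"{depth:>6}x coverage: P(detect) = {p:.4f}")
```

Under these assumptions a low-level variant is almost never confirmed at 10× or 50× coverage and almost always confirmed at 20,000 reads, which is one reason a guideline on read numbers needs to state the allele fraction it is meant to detect.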

Suggestions for future actions to be taken by the HVP

We believe that HVP is an organization that could play a leading role in developing NGDBs first by creating a special committee to look into future genetic database designs to deal with some of the issues raised in this article. Such a committee might include not only nomenclature experts, but also experts in creating both the algorithms required to design the databases, as well as the search and analytical engines. Based on the recommendations of this committee, the HVP could then set up an Institute for Genetic Database Research, which in addition to being responsible for NGDB design, could create a working model of the infrastructure required to run such databases on a worldwide scale. In particular, it will be important to establish a universal design structure so that all NGDBs will have a high degree of compatibility, and we believe that if such a design is coordinated through HVP, which already plays such a role in genetic nomenclature, it is much more likely to be accepted. Finally, in the age of data clouds and sophisticated communication platforms, such an institution need not have a physical structure, but rather could be a virtual institute, that would then allow experts from all over the world to participate.

For many years, genetics and related medical research have been based on the concept that genetic diseases are the result of alterations to a basically stable human genome that has limited natural variation within individuals, so that a single alteration or, in the case of multifactorial diseases, a number of very rare alterations to the human genome are directly responsible for specific diseases. Our initial response to the discovery of increased genetic complexity, particularly in multifactorial diseases, has been to use statistically based approaches, such as GWAS, to try to identify significant rare variants. However, most of these studies have yet to produce the breakthroughs initially predicted, perhaps because they are still analyzing ‘silos of genetic information’ and ignoring the fact that the genomic makeup and phenotypic modifications of every individual are both complex and dynamic. Indeed, the increasing use of NGS, together with more accurate expression and pathway analysis tools, is further broadening our understanding of genotype/phenotype relationships by revealing that the new genetic landscape is far more complex, not only between individuals, but also within individuals. In such a genetic scenario, multifaceted worldwide NGDBs are likely to be essential tools in our fight to treat genetic-based disease.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BG conceived and drafted the article; LKB and MT contributed to the discussion of the concepts and ideas presented and helped edit the text. All authors read and approved the final manuscript.

Acknowledgements

The authors acknowledge the support to BG of an operating grant from the Weekend to End Breast Cancer Fund of the Segal Cancer Centre of the Jewish General Hospital, Montreal, Quebec, Canada.

  • Civelek M, Lusis A. Systems genetics approaches to understand complex traits. Nat Rev Genet. 2014; 15 :34–48. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yvert G. ‘Particle genetics’: treating every cell as unique. Trends Genet. 2014; 30 :49–56. doi: 10.1016/j.tig.2013.11.002. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • The Thousand Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010; 467 :1061–1073. doi: 10.1038/nature09534. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • The Thousand Genomes Project Consortium. An integrated map of genetic variation from 1092 human genomes. Nature. 2012; 491 :56–65. doi: 10.1038/nature11632. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Bertos NR, Park M. Breast cancer – one term, many entities. J Clin Invest. 2011; 121 :3789–3796. doi: 10.1172/JCI57100. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Russnes HG, Navin N, Hicks J, Borresen-Dale A-L. Insight into the heterogeneity of breast cancer through next-generation sequencing. J Clin Invest. 2011; 121 :3810–3818. doi: 10.1172/JCI57088. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, Martinez P, Matthews N, Stewart A, Tarpey P, Varela I, Phillimore B, Begum S, McDonald NQ, Butler A, Jones D, Raine K, Latimer C, Santos CR, Nohadani M, Eklund AC, Spencer-Dene B, Clark G, Pickering L, Stamp G, Gore M, Szallasi Z, Downward J, Futreal PA, Swanton C. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012; 366 :883–892. doi: 10.1056/NEJMoa1113205. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zhu T, Ye J, Han L, Qiu W, Zhang H, Liang L, Gu X. Variations in genotype-phenotype correlations in phenylalanine hydroxylase deficiency in Chinese Han population. Gene. 2013; 529 :80–87. doi: 10.1016/j.gene.2013.07.079. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Gottlieb B, Beitel LK, Nadarajah A, Paliouras M, Trifiro M. The androgen receptor gene mutations database: 2012 update. Hum Mutat. 2012; 33 :887–894. doi: 10.1002/humu.22046. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Cooper DN, Krawczak M, Polychronakos C, Tyler-Smith C, Kehrer-Sawatzki H. Where genotype is not predictive of phenotype: towards an understanding of the molecular basis of reduced penetrance in human inherited disease. Hum Genet. 2013; 132 :1077–1130. doi: 10.1007/s00439-013-1331-2. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Bashashati A, Ha G, Tone A, Ding J, Prentice LM, Roth A, Rosner J, Shumansky K, Kalloger S, Senz J, Yang W, McConechy M, Melnyk N, Anglesio M, Luk MT, Tse K, Zeng T, Moore R, Zhao Y, Marra MA, Gilks B, Yip S, Huntsman DG, McAlpine JN, Shah SP. Distinct evolutionary trajectories of primary high-grade serous ovarian cancers revealed through spatial mutational profiling. J Pathol. 2013; 231 :21–34. doi: 10.1002/path.4230. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Howk CL, Voller Z, Beck BB, Dai D. Genetic diversity in normal cell populations is the earliest stage of oncogenesis leading to intra-tumor heterogeneity. Front Oncol. 2013; 3 :61. doi:10.3389/fonc.2013.00061. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Soto AM, Brisken C, Schaeberle C, Sonnenschein C. Does cancer start in the womb? Altered mammary gland development and predisposition to breast cancer due to in utero exposure to endocrine disruptors. J Mammary Gland Biol Neoplasia. 2013; 18 :199–208. doi: 10.1007/s10911-013-9293-5. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Jacobs KB, Yeager M, Zhou W, Wacholder S, Wang Z, Rodriguez-Santiago B, Hutchinson A, Deng X, Liu C, Horner M-J, Cullen M, Epstein CG, Burdett L, Dean MC, Chatterjee N, Sampson J, Chung CC, Kovaks J, Gapstur SM, Stevens VL, Teras LT, Gaudet MM, Albanes D, Weinstein SJ, Virtamo J, Taylor PR, Freedman ND, Abnet CC, Goldstein AM, Hu N. et al. Detectable clonal mosaicism and its relationship to aging and cancer. Nat Genet. 2012; 44 :651–658. doi: 10.1038/ng.2270. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sale JE, Batters C, Edmunds CE, Philips LG, Simpson LJ, Szuts D. Timing matters: error–prone gap filling and translation synthesis in immunoglobin gene hypermutation. Philos Trans R Soc Lond B Biol Sci. 2009; 364 :595–603. doi: 10.1098/rstb.2008.0197. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Erickson RP. Somatic gene mutation and human disease other than cancer: An update. Mutat Res. 2010; 705 :96–106. doi: 10.1016/j.mrrev.2010.04.002. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lindhurst MJ, Sapp JC, Teer JK, Johnston JJ, Finn EM, Peters K, Turner J, Cannons JL, Bick B, Blackemore L, Blumhorst C, Brockman K, Calder P, Cherman N, Deardorff MA, Everman DB, Golas G, Greenstein RM, Kato BM, Keppler-Noreuil KM, Kuznetsov SA, Miyamoto RT, Newman K, Ng D, O’Brien K, Rothenberg S, Schwartzentruber DJ, Singhal V, Tirabosco R, Upton J. et al. A mosaic activating mutation is associated with the Proteus syndrome. N Engl J Med. 2011; 365 :611–619. doi: 10.1056/NEJMoa1104017. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Evrony GD, Cai X, Lee E, Hills LB, Elhosary PC, Lehmann HS, Parker JJ, Atabay KD, Gilmore EC, Poduri A, Park PJ, Walsh CA. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell. 2012; 151 :483–496. doi: 10.1016/j.cell.2012.09.035. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • O'Huallachain M, Karczewski KJ, Weissman SM, Urban AE, Snyder MP. Extensive genetic variation in somatic human tissues. Proc Natl Acad Sci USA. 2012; 109 :18018–18023. doi: 10.1073/pnas.1213736109. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Laurie CC, Laurie CA, Rice K, Dohey KF, Zelnick LR, McHugh CP, Ling H, Hetrick KN, Pugh EW, Amos C, Wei Q, Wang L-E, Lee JE, Barnes KC, Hansel NN, Mathias R, Daley D, Beaty TH, Scott AF, Ruczinski I, Scharpf RB, Bierut LJ, Hartz SM, Landi MT, Freedman ND, Goldin LR, Ginsburg D, Li J, Desch KC, Strom SS. et al. Detectable clonal mosaicism from birth to old age and its relationship to cancer. Nat Genet. 2012; 44 :642–650. doi: 10.1038/ng.2271. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zaranek AW, Levanon EY, Zecharia T, Clegg T, Church GM. A survey of genomic traces reveals a common sequencing error, RNA editing, and DNA editing. PLoS Genet. 2010; 6 :e1000954. doi: 10.1371/journal.pgen.1000954. doi:10.1371/journal.pgen.1000954. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011; 333 :53–58. doi: 10.1126/science.1207018. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Li M, Cheung VG. Response to comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012; 335 :335–1302F. doi: 10.1126/science.1212443. [ CrossRef ] [ Google Scholar ]
  • Kleinman CL, Majewski J. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012; 335 :335–302c. doi: 10.1126/science.1212443. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Pickrell JK, Gilad Y, Pritchard JK. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012; 335 :335–1302e. doi: 10.1126/science.1212443. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lin W, Piskol R, Tan MH, Li JB. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome” Science. 2012; 335 :335–1302e. doi: 10.1126/science.1212443. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Gottlieb B, Chalifour LE, Mitmaker B, Sheiner N, Obrand D, Abraham C, Meilleur M, Sugahara T, Bkaily G, Schweitzer M. BAK1 gene variation and abdominal aortic aneurysms. Hum Mutat. 2009; 30 :1043–1047. doi: 10.1002/humu.21046. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Costa V, Aprile M, Esposito R, Ciccodicola A. RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur J Hum Genet. 2013; 12 :134–142. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hauptman N, Glavac D. MicroRNAs and long non-coding RNAs: prospects in diagnostics and therapy of cancer. Radiol Oncol. 2013; 47 :311–318. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Isakov O, Roy Ronen R, Kovarsky J, Gabay A, Gan I, Modai S, Shomron N. Novel insight into the non-coding repertoire through deep sequencing analysis. Nucleic Acids Res. 2012; 40 :e86. doi: 10.1093/nar/gks228. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Taft RJ, Pang KC, Mercer TR, Dinger M, Mattick JS. Non-coding RNAs: regulators of disease. J Pathol. 2010; 220 :126–139. doi: 10.1002/path.2638. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Relton C, Smith DG. Epigenetic epidemiology of common complex disease: prospects for prediction, prevention, and treatment. PLoS Med. 2010; 7 :e1000356. doi: 10.1371/journal.pmed.1000356. doi:10.1371/journal.pmed.1000356. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Petronis A. Epigenetics as a unifying principle in the aetiology of complex traits. Nature. 2010; 465 :712–727. [ PubMed ] [ Google Scholar ]
  • Schuster-Böckler B, Bateman A. Protein interactions in human genetic diseases. Genome Biol. 2008; 9 :R9. doi: 10.1186/gb-2008-9-1-r9. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Al-Khoury R, Coulombe B. Defining protein interactions that regulate disease progression. Expert Opin Ther Targets. 2009; 13 :13–17. doi: 10.1517/14728220802631876. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Anderson AR, Weaver AM, Cummings PT, Quaranta V. Tumor morphology and phenotypic evolution driven by selective pressure from the microenvironment. Cell. 2006; 127 :905–915. doi: 10.1016/j.cell.2006.09.042. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012; 481 :306–313. doi: 10.1038/nature10762. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Gottlieb B, Beitel LK, Trifiro MA. Selection and mutation in the “new” genetics: an emerging hypothesis. Hum Genet. 2010; 31 :491–501. [ PubMed ] [ Google Scholar ]
  • Gottlieb B, Beitel LK, Trifiro MA. Post-zygotic mutations and somatic mosaicism in androgen insensitivity syndrome. Trends Genet. 2001; 17 :628–632. doi: 10.1016/S0168-9525(01)02518-5. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nadeau JH. Modifier genes in mice and humans. Nat Rev Genet. 2001; 2 :165–174. [ PubMed ] [ Google Scholar ]
  • Ahluwalia JK, Hariharan M, Bargaje R, Pillai B, Brahmachan V. Incomplete penetrance and variable expressivity: is there a microRNA connection? Bioessays. 2009; 31 :981–992. doi: 10.1002/bies.200900066. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Blewitt M, Whitelaw E. The use of mouse models to study epigenetics. Cold Spring Harb Perspect Biol. 2013; 5 :a017939. doi: 10.1101/cshperspect.a017939. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kuo DS, Labelle-Dumais C, Mao M, Jeanne M, Kauffman WB, Allen J, Favor J, Gould DB. Allelic heterogeneity contributes to variability in ocular dysgenesis, myopathy and brain malformations caused by Col4a1 and Col4a2 . Hum Mol Genet. 2014; 23 :1709–1722. doi: 10.1093/hmg/ddt560. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, Kim BC, Kim SY, Kim WY, Kim C, Park D, Lee YS, Kim S, Reja R, Jho S, Kim CG, Cha JY, Kim KH, Lee B, Bhak J, Kim SJ. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009; 19 :1622–1629. doi: 10.1101/gr.092197.109. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Moore B, Hu H, Singleton M, De La Vega FM, Reese MG, Yandell M. Global analysis of disease-related DNA sequence variation in 10 healthy individuals: implications for whole genome-based clinical diagnostics. Genet Med. 2011; 13 :210–217. doi: 10.1097/GIM.0b013e31820ed321. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012; 335 :823–828. doi: 10.1126/science.1215040. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Daglish R, Fileck P, Cunningham F, Atashyn A, Tully RE, Proctor G, Chen Y, McLaren WM, Larsson P, Vaughan BW, Beroud C, Dobson G, Lehväslaiho H, Taschner PE, den Dunnen JT, Devereau A, Birney E, Brookes AJ, Maglott DR. Locus Reference Genomic sequences: an improved basis for describing human DNA variants. Genome Med. 2010; 2 :24. doi: 10.1186/gm145. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Khatri P, Sirota M, Butte AT. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012; 8 :e1002375. doi: 10.1371/journal.pcbi.1002375. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, Futreal PA. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucl Acids Res. 2011; 39 :D945–D950. doi: 10.1093/nar/gkq929. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet. 2010; 77 :525–534. doi: 10.1111/j.1399-0004.2010.01436.x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Durrant C, Swertz Alberts R, Arends D, Moller S, Mott R, Primns P, van der Velde KJ, Jansen RC, Schughart K. Bioinformatics tools and database resources for systems genetics analysis in mice – a short review and an evaluation of future needs. Brief Bioinform. 2012; 13 :135–142. doi: 10.1093/bib/bbr026. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Gottlieb B, Alvarado C, Wang C, Gharizadeh B, Babrzadeh F, Richards B, Batist G, Basik M, Beitel LK, Trifiro M. Making sense of intra-tumor genetic heterogeneity: altered frequency of androgen receptor CAG repeat length variants in breast cancer tissues. Hum Mutat. 2013; 34:610–618.
  • Li M, Stoneking M. A new approach for detecting low-level mutations in next-generation sequencing data. Genome Biol. 2012; 13:R34. doi: 10.1186/gb-2012-13-5-r34.
  • Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A. 2012; 109:14508–14513. doi: 10.1073/pnas.1208715109.
  • Fu J, Wolfs MGM, Deelen P, Westra H-J, Fehrmann RSN, te Meerman GJ, Buurman WA, Rensen SSM, Groen HJM, Weersma RK, van den Berg LH, Veldink J, Ophoff RA, Snieder H, van Heel D, Jansen RC, Hofker MH, Wijmenga C, Franke L. Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genet. 2012; 8:e1002431. doi: 10.1371/journal.pgen.1002431.
  • The GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013; 45:580–585. doi: 10.1038/ng.2653.
