
The NIH Almanac: National Human Genome Research Institute (NHGRI)



As a leading authority in the field of genomics, the National Human Genome Research Institute (NHGRI) strives to accelerate scientific and medical breakthroughs that improve human health. NHGRI drives cutting-edge research, develops new technologies, and studies the impact of genomics on society. The Institute collaborates with the scientific and medical communities to enhance genomic technologies that accelerate breakthroughs and improve lives.

NHGRI was established originally as the National Center for Human Genome Research in 1989 to lead the International Human Genome Project. NHGRI is part of the National Institutes of Health (NIH), the nation’s medical research agency. The Human Genome Project, which had as its primary goal the sequencing of the 3 billion DNA letters that make up the human genetic instruction book, was successfully completed in April 2003.

Since completion of the Human Genome Project, NHGRI has funded and conducted research to uncover the role that the genome plays in human health and disease. (A genome is an organism's complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism.) This research occurs across a spectrum: basic research to shed light on the structure and function of the genome; translational research to decipher the molecular bases of human diseases; and clinical research to establish how to use genomic information to advance medical care.

NHGRI also supports exploration of the complex ethical, legal, and social implications of genomics, and is committed to ensuring that the knowledge and benefits generated from genomics research are disseminated widely, both to fuel current and future researchers and to benefit the general public and promote genomic literacy.

External research guidance and advice related to NHGRI grants comes from the National Advisory Council for Human Genome Research, which meets three times a year in Rockville, Maryland. Members include representatives from health and science disciplines, public health, social sciences, and the general public. Portions of the council meetings are open to the public and webcast on GenomeTVLive. In addition, the Division of Intramural Research Board of Scientific Counselors reviews and evaluates NHGRI’s intramural program and the work of individual investigators within the Division.

Important Events in NHGRI History

1988 — Program advisory committee on the human genome is established to advise NIH on all aspects of research in the area of genomic analysis.

1988 — The Office for Human Genome Research is created within the NIH Office of the Director. Also, NIH and the Department of Energy (DOE) sign a memorandum of understanding, outlining plans for cooperation on genome research.

1988 — NIH Director James Wyngaarden, M.D., assembles scientists, administrators, and science policy experts in Reston, Virginia, to lay out an NIH plan for the Human Genome Project.

1989 — The program advisory committee on the human genome holds its first meeting in Bethesda, Maryland.

1989 — The NIH-DOE Ethical, Legal and Social Implications (ELSI) working group is created to explore and propose options for the development of the ELSI component of the Human Genome Project.

1989 — The National Center for Human Genome Research (NCHGR) is established to carry out the NIH's component of the Human Genome Project. James Watson, Ph.D., co-discoverer of the structure of DNA, is appointed as NCHGR’s first director.

1990 — The first five-year plan with specific goals for the Human Genome Project is published.

1990 — The National Advisory Council for Human Genome Research (NACHGR) is established.

1990 — The Human Genome Project officially begins.

1991 — NACHGR meets for the first time in Bethesda, Maryland.

1992 — James Watson resigns as first director of NCHGR. Michael Gottesman, M.D., is appointed acting center director.

1993 — The center's Division of Intramural Research is established.

1993 — Francis S. Collins, M.D., Ph.D., is appointed NCHGR director.

1993 — The Human Genome Project revises its five-year goals and extends them to September 1998.

1994 — The first genetic linkage map of the human genome is achieved one year ahead of schedule. Such maps consist of DNA patterns, called markers, positioned on chromosomes, and help researchers search for disease-related genes.

1995 — Task Force on Genetic Testing is established as a subgroup of the NIH-DOE Ethical, Legal, and Social Implications (ELSI) working group.

1996 — Human DNA sequencing begins with pilot studies at six U.S. universities.

1996 — An international team completes the DNA sequence of the first eukaryotic genome, Saccharomyces cerevisiae, or common brewer's yeast. (A eukaryote is any organism whose cells contain a nucleus and other organelles enclosed within membranes.)

1996 — The Center for Inherited Disease Research, a project co-funded by eight NIH institutes and centers to study the genetic components of complex disorders, is established on the Johns Hopkins Bayview Medical Center campus in Baltimore, Maryland.

1996 — Scientists from government, university, and commercial laboratories around the world reveal a map that pinpoints the locations of more than 16,000 genes in human DNA.

1996 — NCHGR and other researchers identify the location of the first gene associated with Parkinson's disease.

1996 — NCHGR and other researchers identify the location of the first major gene that predisposes men to prostate cancer.

1997 — Department of Health and Human Services Secretary Donna E. Shalala signs documents elevating NCHGR to an NIH institute, the National Human Genome Research Institute.

1997 — A federal government-citizen group – the NIH-DOE ELSI Working Group and the National Action Plan on Breast Cancer (NAPBC) – suggests policies to limit genetic discrimination in the workplace.

1997 — NHGRI and other scientists show that three specific alterations in the breast cancer genes BRCA1 and BRCA2 are associated with an increased risk of breast, ovarian and prostate cancers.

1997 — A map of human chromosome 7 is completed. Changes in the number or structure of chromosome 7 occur frequently in human cancers.

1997 — NHGRI and other researchers identify an altered gene that causes Pendred syndrome, a genetic disorder that causes early hearing loss in children.

1998 — Vice President Al Gore announces that the Clinton administration is calling for legislation to bar employers from discriminating against workers in hiring or promotion because of their genetic makeup.

1998 — At a meeting of the Human Genome Project’s main advisory body, project planners present a new five-year plan to produce a “finished” version of the DNA sequence of the human genome by the end of year 2003, two years ahead of its original schedule. The Human Genome Project plans to generate a “working draft” that, together with the finished sequence, will cover at least 90 percent of the genome in 2001. The “working draft” will be immediately valuable to researchers and form the basis for a high-quality, “finished” genome sequence.

1998 — A major international collaborative research study finds the site of a gene for susceptibility to prostate cancer on the X chromosome. This is the first time a gene for a common type of cancer is mapped to the X chromosome.

1998 — NHGRI and other Human Genome Project-funded scientists sequence the genome of the tiny roundworm Caenorhabditis elegans. It marks the first time scientists have spelled out the instructions for a complete animal that, like humans, has a nervous system, digests food and has sex.

1999 — The pilot phase of the Human Genome Project is completed. A large-scale effort to sequence the human genome begins.

1999 — NHGRI, DOE, and the Wellcome Trust, a global charity based in London, hold a celebration of the completion and deposition of 1 billion base pairs of the human genome DNA sequence into GenBank (http://www.ncbi.nlm.nih.gov/genbank/). GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

1999 — For the first time, NHGRI and other Human Genome Project-funded scientists unravel the genetic code of an entire human chromosome (chromosome 22). The findings are reported in Nature.

2000 — President Clinton signs an Executive Order to prevent genetic discrimination in the federal workplace. NHGRI programs on the ethical, legal and social implications of the Human Genome Project played a role in the development of policy principles on this issue.

2000 — A public consortium of scientists and a private company release a substantially complete genome sequence of the fruit fly, Drosophila melanogaster. Science publishes the findings.

2000 — Scientists in Japan and Germany report that they have unraveled the genetic code of human chromosome 21, known to be involved with Down syndrome, Alzheimer's disease, Usher syndrome, and amyotrophic lateral sclerosis, also known as Lou Gehrig's disease. Nature publishes these findings.

2000 — President Bill Clinton, NHGRI Director Francis Collins, British Prime Minister Tony Blair (via satellite), and Craig Venter, president, Celera Genomics Corp., announce the completion of the first survey of the human genome in a White House ceremony.

2000 — An international team led by NHGRI scientists discovers a genetic “signature” that may help explain how malignant melanoma, a deadly form of skin cancer, can spread to other parts of the body. The findings are reported in Nature.

2000 — The NIH, the Wellcome Trust, and three private companies collaborate to form the Mouse Sequencing Consortium to accelerate the sequencing of the mouse genome.

2001 — The ELSI Research Programs of NHGRI and DOE cosponsor a conference to celebrate a decade of research and consider the impact of the new science on genetic research, health and policy.

2001 — The Human Genome Project publishes the first analysis of the human genome sequence, describing how it is organized and how it evolved. The analysis, published in the journal Nature, reveals that the human genome contains only 30,000 to 40,000 genes, far fewer than the 100,000 previously estimated.

2001 — NHGRI and Human Genome Project-funded scientists find a new tumor suppressor gene on human chromosome 7 that is involved in breast, prostate and other cancers. A single post-doctoral researcher, using the “working draft” data, pins down the gene in weeks. In the past, the same work would have taken several years and contributions from many scientists.

2001 — Researchers from NHGRI and Sweden's Lund University develop a method of accurately diagnosing four complex, hard-to-distinguish childhood cancers using DNA microarray technology and artificial neural networks. Nature Medicine publishes the results.

2001 — NHGRI creates the Centers for Excellence in Genomic Sciences (CEGS) program, which supports interdisciplinary research teams that use data sets and technologies developed by the Human Genome Project. The initial CEGS grants for innovative genomic research projects are awarded to the University of Washington and Yale University.

2002 — NHGRI scientists and collaborators at Johns Hopkins Medical Institution in Baltimore and The Cleveland Clinic identify a gene on chromosome 1 that is associated with an inherited form of prostate cancer in some families. Nature Genetics publishes the findings.

2002 — NHGRI and the NIH Office of Rare Diseases launch a new information center, the Genetic and Rare Diseases Information Center (GARD), to provide accurate, reliable information about genetic and rare diseases to patients and their families.

2002 — NHGRI launches a redesigned Web site, www.genome.gov, which provides improved usability and easy access to new content for a wide range of users.

2002 — NHGRI launches the International HapMap Project, a $100 million, public-private effort to create a new type of genome map that will chart genetic variation among human populations. The HapMap serves as a tool to speed the search for the genes involved in common disorders such as asthma, diabetes, heart disease and cancer. The SNP Consortium, a collaborative effort among industry, academic centers and the Wellcome Trust, helps provide an instrumental public catalog of genetic variation.

2002 — NHGRI names Alan E. Guttmacher, M.D., as its new deputy director. It selects Eric D. Green, M.D., Ph.D., as its new scientific director, and William A. Gahl, M.D., Ph.D., as its new intramural clinical director.

2003 — NHGRI launches the ENCyclopedia of DNA Elements (ENCODE) pilot project to identify all functional elements in human DNA.

2003 — NHGRI celebrates the successful completion of the Human Genome Project — two years ahead of schedule and under budget. The event coincides with the 50th anniversary of the description of DNA’s double helix and the 2003 publication of the vision document for the future of genomics research.

2003 — NHGRI researchers identify the gene that causes the premature aging disorder progeria. Nature publishes the findings.

2003 — A detailed analysis of the sequence of the human Y chromosome is published in the journal Nature.

2003 — A detailed analysis of the sequence of chromosome 7 uncovers structural features that appear to promote genetic changes that can cause disease. The findings by a multinational team of scientists are reported in the journal Nature.

2003 — A team of researchers, led by NHGRI, compares the genomes of 13 vertebrate animals. The results, published in Nature, suggest that comparing a wide variety of species' genomes will illuminate genomic evolution and help identify functional elements in the human genome.

2003 — NHGRI establishes the Education and Community Involvement Branch to engage the public in understanding genomics and accompanying ethical, legal and social issues.

2003 — NHGRI announces the first grants in a three-year, $36 million scientific program called ENCyclopedia Of DNA Elements (ENCODE), aimed at discovering all parts of the human genome that are crucial to biological function.

2003 — NHGRI selects five centers to carry out a new generation of large-scale genome sequencing projects to realize the promise of the Human Genome Project and expand understanding of human health and disease.

2003 — NHGRI announces formation of the Social and Behavioral Research Branch within its Division of Intramural Research.

2003 — NHGRI announces the first draft version of the chimpanzee genome sequence and its alignment with the human genome.

2004 — NHGRI announces that the first draft version of the honey bee genome sequence has been deposited into free public databases.

2004 — The Genetic and Rare Disease Information Center announces efforts to enable healthcare workers, patients and families who speak Spanish to take advantage of its free services.

2004 — NHGRI's Large-Scale Sequencing Research Network announces it will begin genome sequencing of the first marsupial, the gray short-tailed South American opossum, and more than a dozen other model organisms to further understanding of the human genome.

2004 — NHGRI announces that the first draft version of the chicken genome sequence has been deposited into free public databases.

2004 — The International Rat Genome Sequencing Project Consortium announces the publication of a high-quality draft sequence of the rat genome. The publication is important because of the rat’s ubiquitous use as a disease research model.

2004 — NHGRI announces that the first draft version of the dog genome sequence has been deposited into free public databases.

2004 — NHGRI launches the NHGRI Policy and Legislative Database, an online resource to enable researchers, health professionals, and the public to locate information on laws and policies related to genetic discrimination and other genomic issues.

2004 — NHGRI's Large-Scale Sequencing Research Network announces a comprehensive strategic plan to sequence 18 additional organisms, including the African savannah elephant, the domestic cat, and the orangutan to help interpret the human genome.

2004 — NHGRI launches four interdisciplinary Centers for Excellence in Ethical, Legal and Social Implications Research to address some of the most pressing societal questions raised by recent advances in genetic and genomic research.

2004 — NHGRI announces that the first draft version of the cow genome sequence has been deposited into free public databases.

2004 — NHGRI awards more than $38 million in grants to develop new genome sequencing technologies to accomplish the near-term goal of sequencing a mammalian-sized genome for $100,000, and the longer-term challenge of sequencing an individual human genome for $1,000 or less. These are the first grants from the Advanced Sequencing Technology Program.

2004 — The International Human Genome Sequencing Consortium, led in the United States by NHGRI and the Department of Energy, publishes its scientific description of the finished human genome sequence. The analysis, published in Nature, reduces the estimated number of human protein-coding genes from 35,000 to only 20,000-25,000, a surprisingly low number for our species.

2004 — The ENCODE Consortium publishes a paper in Science that sets forth the scientific rationale and strategy behind its quest to produce a comprehensive catalog of all parts of the human genome crucial to biological function.

2005 — NIH hails the first comprehensive analysis of the sequence of the human X chromosome. The work, some of which was carried out as part of the Human Genome Project, is published in Nature. It provides sweeping new insights into the evolution of sex chromosomes and the biological differences between males and females.

2005 — The International HapMap Consortium publishes a comprehensive catalog of human genetic variation. This landmark achievement, published in Nature, will serve to accelerate the search for genes involved in common diseases, such as asthma, diabetes, cancer, and heart disease.

2005 — NHGRI and the National Cancer Institute (NCI) launch The Cancer Genome Atlas (TCGA), a comprehensive effort to accelerate understanding of the molecular basis of cancer through the application of genome analysis technologies.

2006 — The Genetic Association Information Network (GAIN), a public-private partnership led by NHGRI, is established to help find the genetic causes of common diseases by conducting large-scale genomic studies and making their results broadly available to researchers worldwide.

2006 — NIH launches the Genes, Environment and Health Initiative (GEI) to understand the interactions of genetics and environment in common conditions and disease. It is managed by NHGRI and the National Institute of Environmental Health Sciences.

2007 — The Electronic Medical Records and Genomics (eMERGE) Network is announced in September 2007. Researchers use DNA biorepositories and electronic medical records in large-scale studies to better understand the underlying genomics of disease.

2007 — In a White House Ceremony, NHGRI Director Francis S. Collins is awarded the Presidential Medal of Freedom by President George W. Bush for his leadership of and contributions to the Human Genome Project.

2007 — To better understand the role that bacteria, fungi, and other microbes play in human health, NIH launches the Human Microbiome Project. The human microbiome is all microorganisms present in or on the human body. NHGRI, the National Institute of Allergy and Infectious Diseases, and the National Institute of Dental and Craniofacial Research lead the project on behalf of NIH.

2008 — The NIH Genome-Wide Association Studies (GWAS) data sharing policy goes into effect to promote access to genomics research data while ensuring research participant protections.

2008 — An international research consortium announces the establishment of the 1000 Genomes Project. This effort will involve sequencing the genomes of at least 1000 people from around the world to create the most detailed and medically useful picture to date of human genetic variation. NHGRI is a major funder of the 1000 Genomes Project.

2008 — NHGRI and the National Institute of Environmental Health Sciences collaborate with the U.S. Environmental Protection Agency to begin testing the safety of chemicals, ranging from pesticides to household cleaners. The initiative uses the NIH Chemical Genomics Center's high-speed, automated screening robots to test suspected toxic compounds using cells and isolated molecular targets instead of laboratory animals.

2008 — President George W. Bush signs into law the Genetic Information Nondiscrimination Act (GINA) that will protect Americans against discrimination based on their genetic information when it comes to health insurance and employment. The bill passed the Senate unanimously and the House by a vote of 414 to 1.

2008 — Francis S. Collins steps down as NHGRI director. Alan E. Guttmacher is named acting director of NHGRI.

2008 — The TCGA Research Network reports the first results of its large-scale, comprehensive study of the most common form of brain cancer, glioblastoma. In a paper published in Nature, the TCGA team describes the discovery of new genetic mutations and other types of DNA alterations with potential implications for the diagnosis and treatment of glioblastoma.

2008 — The NIH Human Microbiome Project, collaborating with scientists around the globe, announces they will form the International Human Microbiome Consortium, an effort that will enable researchers to characterize the relationship of the human microbiome in the maintenance of health and in disease.

2008 — An international consortium including NHGRI researchers, in search of the genetic risk factors for obesity, identifies six new genetic variants associated with BMI, or body mass index, a measurement that compares height to weight. The results, funded in part by NIH, are published online in the journal Nature Genetics.

2009 — A team led by NHGRI scientists identifies a gene that suppresses tumor growth in melanoma, the deadliest form of skin cancer. The finding is reported in the journal Nature Genetics as part of a systematic genetic analysis of a group of enzymes implicated in skin cancer and many other types of cancer.

2009 — NHGRI announces the release of the first version of PhenX, a free online toolkit aimed at standardizing measurements of research subjects' physical characteristics and environmental exposures. The tools give researchers more power to compare data from multiple studies, accelerating efforts to understand the complex genetic and environmental factors that cause cancer, heart disease, depression and other common diseases.

2009 — The U.S. Department of Agriculture and NIH announce that an international consortium of researchers has completed an analysis of the genome of domestic cattle, the first livestock mammal to have its genetic blueprint sequenced and analyzed. The landmark research, which received major support from NHGRI, bolsters efforts to produce better beef and dairy products and will lead to a better understanding of the human genome.

2009 — NIH launches the first integrated drug development pipeline to produce new treatments for rare and neglected diseases. The $24 million program, whose laboratory operations are managed by NHGRI at the NIH Chemical Genomics Center, jumpstarts a trans-NIH initiative called the Therapeutics for Rare and Neglected Diseases program.

2009 — NHGRI researchers studying the skin's microbiome publish an analysis in Science revealing that our skin is home to a much wider array of bacteria than previously thought. The study, done in collaboration with other NIH researchers, also shows the bacteria that live under your arms are likely to be more similar to those under another person's arm than they are to the bacteria that live on your forearm.

2009 — An NIH research team led by NHGRI researchers finds that a single evolutionary event appears to explain the short, curved legs that characterize all of today's dachshunds, corgis, basset hounds and at least 16 other breeds of dogs. The unexpected discovery provides new clues about how physical differences may arise within species and suggests new approaches to understanding a form of human dwarfism. The results are reported in Science.

2009 — NIH researchers report in the online issue of PLoS Genetics the discovery of five genetic variants related to blood pressure in African Americans, findings that may provide new clues to treating and preventing hypertension. This effort, which includes NHGRI researchers, marks the first time that a relatively new research approach, called a genome-wide association study, has focused on blood pressure and hypertension in an African-American population.

2009 — Researchers, supported in part by NHGRI, generate massive amounts of DNA sequencing data of the complete set of exons, or “exomes,” from the genomes of 12 people. The findings, which demonstrate the feasibility of this strategy to find rare genetic variants that may cause or contribute to disease, are published online in Nature.

2009 — NHGRI researchers lead a study that identifies a new group of genetic mutations involved in melanoma, the deadliest form of skin cancer. This discovery, published in Nature Genetics, is particularly encouraging because some of the mutations, which were found in nearly one-fifth of melanoma cases, reside in a gene already targeted by a drug approved for certain types of breast cancer.

2009 — NHGRI launches the next generation of its online Talking Glossary of Genetic Terms. The glossary contains several new features, including more than 100 colorful illustrations and more than two dozen 3-D animations that allow the user to dive in and see genetic concepts in action at the cellular level.

2009 — An NHGRI-led research team finds that carriers of a rare genetic condition called Gaucher disease face a risk of developing Parkinson's disease more than five times greater than that of the general public. The findings are published in the New England Journal of Medicine.

2009 — NIH director Francis S. Collins, M.D., Ph.D., announces the appointment of Eric D. Green, M.D., Ph.D., to be director of NHGRI. It is the first time an institute director has risen to lead the entire NIH and subsequently picked his own successor.

2010 — NHGRI launches the Genetics/Genomics Competency Center (G2C2), an online tool to help educators teach the next generation of health professionals about genetics and genomics.

2010 — An international research team, including researchers from NHGRI, produces the first whole genome sequence of the 3 billion letters in the Neanderthal genome.

2010 — NIH and the Wellcome Trust, a global charity based in London, announce a partnership called the Human Heredity and Health in Africa project (H3Africa) to support population-based genetic studies in Africa by Africa. NHGRI helps administer H3Africa.

2010 — Daniel L. Kastner, M.D., Ph.D., is appointed scientific director of the NHGRI.

2011 — NHGRI's new strategic plan for the future of human genome research, Charting a course for genomic medicine from base pairs to bedside, is published in the February 10, 2011, issue of Nature.

2011 — A research team from the NIH Undiagnosed Diseases Program, which is co-led by NHGRI, reports in the New England Journal of Medicine the first genetic finding of a rare, adult-onset vascular disorder associated with progressive and painful arterial calcification.

2011 — The Partnership for Public Service selects NHGRI Clinical Director William A. Gahl, M.D., Ph.D., to receive its Science and Environmental Medal (one of nine annual Service to America Awards, or Sammies).

2011 — P. Paul Liu, M.D., Ph.D., a world expert in the onset, development and progression of leukemia, is named NHGRI's deputy scientific director.

2011 — Mark S. Guyer, Ph.D., is named NHGRI deputy director.

2011 — NHGRI announces funding for its five Clinical Sequencing Exploratory Research projects aimed at studying ways that healthcare professionals can use genome sequencing information in the clinic.

2012 — For the first time, researchers in the NIH Human Microbiome Project (HMP) Consortium, including NHGRI investigators, map the normal microbial make-up of healthy humans. They report their findings in a series of coordinated papers in Nature and other journals.

2012 — ENCODE researchers produce a more dynamic picture of the human genome that gives the first holistic view of how the human genome actually does its job. The findings are reported in two papers appearing in Nature.

2012 — NHGRI reorganizes the institute's Extramural Research Program into four new divisions and promotes to division status the office overseeing policy, communications, and education, and the office overseeing administration and management. The divisions and their inaugural directors are: Division of Genome Sciences, Jeffery Schloss, Ph.D.; Division of Genomic Medicine, Teri Manolio, M.D., Ph.D.; Division of Extramural Operations, Bettie Graham, Ph.D.; Division of Genomics and Society, Mark Guyer, Ph.D. (acting director); Division of Policy, Communications, and Education, Laura Lyman Rodriguez, Ph.D.; and Division of Management, Janis Mullaney, M.B.A.

2012 — NHGRI Director Eric Green, M.D., Ph.D., creates the History of Genomics Program within the Office of the Director.

2013 — A special symposium, The Genomics Landscape: A Decade After the Human Genome Project, marks the 10th anniversary of the completion of the Human Genome Project.

2013 — The Smithsonian Institution in Washington, D.C. opens a high-tech, high-intensity exhibition Genome: Unlocking Life's Code to celebrate the 10th anniversary of researchers producing the first complete human genome sequence. The exhibition is a collaboration between the Smithsonian Institution’s National Museum of Natural History and NHGRI. The exhibition will travel across North America following its time at the Smithsonian.

2013 — NHGRI and the Eunice Kennedy Shriver National Institute of Child Health and Human Development announce awards for pilot projects to explore the use of genomic sequencing in newborn healthcare.

2013 — NHGRI selects Lawrence C. Brody, Ph.D., to be the first director of the Division of Genomics and Society, established through the October 2012 reorganization.

2014 — NHGRI Scientific Director Daniel Kastner, M.D., Ph.D., implements a reorganization of NHGRI's 45 intramural investigators and associated research programs into nine branches.

2014 — NHGRI Deputy Director Mark Guyer, who played a critical role in the Human Genome Project and countless other genomics programs, retires from federal service.

2014 — NIH issues the NIH Genomic Data Sharing policy to promote data sharing as a way to speed the translation of data into knowledge, products and procedures that improve health while protecting the privacy of research participants. The final policy will be effective for all NIH-supported research beginning in January 2015.

2014 — Scientists looking across human, fly, and worm genomes find that these species have shared biology. The findings, appearing in the journal Nature, offer insights into embryonic development, gene regulation and other biological processes vital to understanding human biology and disease.

2014 — An international team including researchers from NIH completes the first comprehensive characterization of genomic diversity across sub-Saharan Africa. The study provides clues to medical conditions in people of sub-Saharan African ancestry, and indicates that the migration from Africa in the early days of the human race was followed by a migration back into the continent.

2014 — Investigators with The Cancer Genome Atlas (TCGA) Research Network identify new potential therapeutic targets for a major form of bladder cancer.

2014 — Ellen Rolfes, M.A., is appointed the NHGRI executive officer and director of the NHGRI Division of Management.

2015 — NHGRI celebrates the 25th anniversary of the launch of the Human Genome Project (HGP). To commemorate this anniversary, NHGRI’s History of Genomics Program hosts a seminar series titled, “A Quarter Century after the Human Genome Project: Lessons Beyond Base Pairs,” featuring HGP participants sharing their perspectives about the project and its impact on their careers.

2015 — The Undiagnosed Diseases Network (UDN) opens an online patient application, the UDN Gateway, to streamline the patient application process across its individual clinical sites.

2015 — An international team of scientists from the 1000 Genomes Project Consortium creates the world’s largest catalog of genomic differences among humans, providing researchers with powerful clues to help them establish why some people are susceptible to various diseases.

2015 — NHGRI awards grants of more than $28 million aimed at deciphering the language of how and when genes are turned on and off. The awards emanate from NHGRI’s Genomics of Gene Regulation (GGR) program.

2015 — Shawn Burgess, Ph.D., and colleagues develop transgenic zebrafish as a live animal model of metastasis, offering cancer researchers a new, potentially more accurate way to screen for drugs and to identify new targets against disease.

2015 — Experts from academic and non-profit institutions across the United States join NHGRI and NIH staff at a roundtable meeting to discuss opportunities and challenges associated with the inclusion and engagement of underrepresented populations in genomics research.

2015 — Research funded by NHGRI’s Centers for Excellence in Genome Sciences and published in Nature Genetics provides new insights into the effects and roles of genetic variation and parental influence on gene activity in mice and humans.

2015 — NIH researchers discover that the genomic switches of a blood cell are key to regulating the human immune system. The findings, published in Nature, open the door to new research and development in drugs and personalized medicine to help those with autoimmune disorders.

2016 — NHGRI launches the Centers for Common Disease Genomics, which will use genome sequencing to explore the genomic contributions to common diseases such as heart disease, diabetes, stroke and autism.

2016 — NHGRI awards approximately $11.1 million to support research aimed at identifying differences - called genetic variants - in the less-studied regions of the genome that are responsible for regulating gene activity.

2016 — NHGRI funds researchers at its Centers of Excellence in Ethical, Legal and Social Implications Research program to examine the use of genomic information in the prevention and treatment of infectious diseases; genomic information privacy; communication about prenatal and newborn genomic testing results; and the impact of genomics in American Indian and Alaskan Native communities.

2016 — NIH scientists identify a genetic mutation responsible for a rare form of inherited hives induced by vibration, also known as vibratory urticaria.

2016 — NHGRI Senior Investigator Dr. Francis Collins and an international team of more than 300 scientists conduct a comprehensive investigation of the underlying genetic architecture of type 2 diabetes. Their findings suggest that most of the genetic risk for type 2 diabetes can be attributed to common shared genomic variants.

2016 — The Policy and Program Analysis Branch holds a public workshop, “Investigational Device Exemptions and Genomics,” to help investigators and institutional review board members learn more about Food and Drug Administration regulations and their application to genomics research.

2017 — NHGRI celebrates 20 years as an NIH Institute. The milestone highlights the transition from the center known as the National Center for Human Genome Research, to our current status as a full-fledged NIH institute. Those 20 years encompassed a host of research accomplishments, from the completion of The Human Genome Project, to DNA sequencing technology development, to bringing genomic medicine to the clinic.

2017 — NHGRI releases a collection of oral history videos featuring candid conversations with pioneering genomics researchers and an interactive discussion with the institute's three directors to date. NHGRI plans to release approximately 25 videos over the next year and additional videos in the future.

2017 — Laura Koehly, Ph.D., is named chief of NHGRI's Social and Behavioral Research Branch (SBRB), which conducts research that will potentially transform healthcare through the integration of genomic medicine into the clinic.

2018 — NHGRI launches a new round of strategic planning that will establish a 2020 vision for genomics research aimed at accelerating scientific and medical breakthroughs.

2018 — NIH and INOVA Health System launch The Genomic Ascertainment Cohort (TGAC), a two-year pilot project that will allow them to recall genotyped people and examine the influence of genes and gene variants on their phenotypes (an individual's observable traits, such as height, eye color or blood type).

2018 — Rep. Louise M. Slaughter (D-N.Y.), lead author of the Genetic Information Nondiscrimination Act of 2008 (GINA), passes away at the age of 88.

2018 — The Cancer Genome Atlas publishes the PanCancer Atlas, a detailed genomic analysis of a data set of molecular and clinical information from over 10,000 tumors representing 33 types of cancer.

2019 — NHGRI researchers discover a new autoinflammatory disease called CRIA syndrome.

2019 — NHGRI appoints Dr. Benjamin Solomon as clinical director.

2020 — NHGRI appoints Chris Gunter, Ph.D., as a senior advisor to the director for genomics engagement.

2020 — NHGRI establishes a new intramural precision health research program.

2020 — NHGRI commemorates the 20th anniversary of the White House event announcing the draft human genome sequence.

2020 — NIH announces the provision of $75 million in funding over five years for the Electronic Medical Records and Genomics (eMERGE) Genomic Risk Assessment and Management Network.

2020 — NHGRI researchers reframe dog-to-human aging comparisons.

2020 — NHGRI researchers generate the complete human X chromosome sequence.

2020 — Scientists use genomics to discover an ancient dog species that may teach us about human vocalization.

2020 — NHGRI celebrates the 30th anniversary of the commencement of the Human Genome Project.

2020 — NHGRI researchers work with patients, families and the scientific community to improve the informed consent process.

2021 — NHGRI proposes an action agenda for building a diverse genomics workforce.

2021 — Dr. Neil Hanchard joins NHGRI as a clinical investigator.

2021 — NHGRI appoints Oleg Shchelochkov as intramural training program director.

2021 — NIH researchers develop guidelines for reporting polygenic risk scores.

2021 — NIH scientists develop a breath test for methylmalonic acidemia.

2021 — NHGRI director appoints Vence Bonham as acting deputy director.

2021 — NIH expands existing gene expression resources to include developmental tissues.

2021 — Charles Rotimi is selected as the next scientific director.

2021 — NHGRI creates the Office of Training, Diversity and Health Equity.

2021 — NHGRI researchers narrow down the number of genomic variants that are strongly associated with blood lipid levels and generate a polygenic risk score to predict elevated low-density lipoprotein cholesterol levels, a major risk factor for heart disease.

2021 — NHGRI selects Valentina Di Francesco as chief data science strategist.

2021 — NHGRI creates the Office of Genomic Data Science.

2021 — NIH researchers find thousands of new microorganisms living on human skin.

2022 — NIH-funded small businesses contributed to the completion of the human genome sequence.

2022 — Researchers generate the first complete, gapless sequence of a human genome.

2022 — NHGRI History of Genomics Program celebrates its 10th anniversary.

2022 — NHGRI selects Charles P. Venditti as the new chief of the Metabolic Medicine Branch.

2023 — NHGRI hosts a roundtable on potential concerns of social and behavioral genomics.

Biographical Sketch of NHGRI Director, Eric D. Green, M.D., Ph.D.

Eric D. Green, M.D., Ph.D., is the director of the National Human Genome Research Institute (NHGRI) at the National Institutes of Health (NIH), a position he has held since late 2009. Previously, he served as the NHGRI scientific director (2002-2009), chief of the NHGRI Genome Technology Branch (1996-2009), and director of the NIH Intramural Sequencing Center (1997-2009).

Dr. Green received his B.S. degree in bacteriology from the University of Wisconsin-Madison in 1981, and his M.D. and Ph.D. from Washington University, St. Louis, in 1987. During residency training in clinical pathology (laboratory medicine), he worked in the laboratory of Dr. Maynard Olson. In 1992, he was appointed assistant professor of pathology and genetics and co-investigator in the Human Genome Center at Washington University. In 1994, he joined the newly established Intramural Research Program of the National Center for Human Genome Research, later renamed the National Human Genome Research Institute.

Honors given to Dr. Green include a Helen Hay Whitney Postdoctoral Research Fellowship (1989-1990), a Lucille P. Markey Scholar Award in Biomedical Science (1990-1994), induction into the American Society for Clinical Investigation (2002), an Alumni Achievement Award from Washington University School of Medicine (2005), induction into the Association of American Physicians (2007), a Distinguished Alumni Award from Washington University (2010), the Cotlove Lectureship Award from the Academy of Clinical Laboratory Physicians and Scientists (2011), a Ladue Horton Watkins High School Distinguished Alumni Award (2012), and the Wallace H. Coulter Lectureship Award from the American Association for Clinical Chemistry (2012). He is a founding editor of the journal Genome Research (1995-present) and a series editor for Genome Analysis: A Laboratory Manual (1994-1998), both published by Cold Spring Harbor Laboratory Press. He is also co-editor of the Annual Review of Genomics and Human Genetics (since 2005). Dr. Green has authored or co-authored over 340 scientific publications.

While directing an independent research program for almost two decades, Dr. Green was at the forefront of efforts to map, sequence, and understand eukaryotic genomes. (A eukaryote is any organism whose cells contain a nucleus and other organelles enclosed within membranes.) His work included significant involvement in the Human Genome Project. These efforts eventually blossomed into a highly productive program in comparative genomics that provided important insights about genome structure, function and evolution. His laboratory also identified and characterized several human disease genes, including those implicated in certain forms of hereditary deafness, vascular disease and inherited peripheral neuropathy.

As NHGRI director, Dr. Green leads the Institute's research programs and other initiatives. Under his guidance, the Institute has completed two major cycles of strategic planning to ensure that its research investments in genomics effectively advance human health. The first effort yielded the highly cited 2011 NHGRI strategic vision, “Charting a course for genomic medicine from base pairs to bedside” (Nature 470:204-213, 2011); the second yielded the 2020 paper “Strategic vision for improving human health at The Forefront of Genomics” (Nature 586:683-692, 2020).

These two strategic planning processes have guided a major expansion of NHGRI’s research portfolio, highlights of which include the design and launch of major new programs to unravel the functional complexities of the human genome, to catalyze the growth of genomic data science, to accelerate the application of genomics to medical care and to enhance the building of a robust and diverse genomics workforce of the future.

Dr. Green has also played an instrumental leadership role in developing many high-profile efforts relevant to genomics. These efforts include multiple NIH Common Fund Programs — such as the Undiagnosed Diseases Network, Human Heredity and Health in Africa (H3Africa), and the Human Microbiome Project — the Smithsonian-NHGRI exhibition Genome: Unlocking Life's Code, several trans-NIH data science initiatives, the NIH Genomic Data Sharing Policy and the NIH All of Us Research Program.

Beyond NHGRI-specific programs, Dr. Green has also played an instrumental leadership role in the development of a number of high-profile efforts relevant to genomics, including the Smithsonian-NHGRI exhibition Genome: Unlocking Life's Code, the NIH Big Data to Knowledge (BD2K) program, the NIH Genomic Data Sharing Policy, and the U.S. Precision Medicine Initiative.

NHGRI Directors

Office of the Director

The Office of the Director oversees general operations, administration and communications for the National Human Genome Research Institute (NHGRI). It provides overall leadership; sets policies; develops scientific, fiscal and management strategies; assists in governing the ethical behavior of its employees, and coordinates genomic research for the National Institutes of Health with other federal, private and international programs.

Three offices are housed within the Office of the Director. The Office of Communications (OC) leads corporate communications about the research and programs supported by NHGRI. The Office of Genomic Data Science (OGDS) provides leadership, strategic guidance and coordination for NHGRI activities, programs and policies in genomic data science. The Training, Diversity and Health Equity Office (TiDHE) develops and supports initiatives that expand opportunities for genomics education and careers; cultivates genomics training programs and workforce development initiatives for individuals underrepresented in biomedical research; and promotes genomics research to improve minority health, reduce health disparities and foster health equity.

Extramural Research Program

NHGRI's Extramural Research Program (ERP) helps provide intellectual vision to the field of genomics. It also manages the meetings of NHGRI's National Advisory Council for Human Genome Research. In consultation with the broader genomics community, the ERP supports grants for research and training and career development at sites across the country.

The ERP is composed of four divisions:

  • The Division of Genome Sciences oversees basic genomic research and technology development, as well as major activities such as large-scale genome sequencing. It plans, directs, and facilitates multi-disciplinary research to understand the structure and function of genomes in health and disease. The division develops and funds research projects, and supports research training grants, research center grants, and contracts.
  • The Division of Genomic Medicine leads the institute's efforts to move genomic technologies and approaches into clinical applications and care. It develops and supports research to identify and advance approaches for the use of genomic data to improve diagnosis, treatment, and prevention of disease through grants, training, and contracts.
  • The Division of Genomics and Society carries out research related to the many societal issues relevant to genomics research, and includes the institute's Ethical, Legal and Social Implications (ELSI) program.
  • The Division of Extramural Operations manages ERP’s operational aspects, including conducting the review of grant applications and grants management.

Division of Intramural Research

The National Human Genome Research Institute's (NHGRI) Division of Intramural Research (DIR) plans and conducts laboratory and clinical research to enable greater understanding of human disease and develop better methods for detection, prevention and treatment of heritable and genetic disorders.

The DIR is one of the premier research programs working to unravel the genetic basis of human disease. In its short existence, the division has made many seminal contributions to the fields of genetics and genomics.

Highlights of NHGRI investigators' accomplishments in recent years include the identification of the genes responsible for numerous human genetic diseases; development of new paradigms for mapping, sequencing, and interpreting the human and other vertebrate genomes; development and application of DNA microarray technologies for large-scale analyses of gene expression; creation of innovative computational tools for analyzing large quantities of genomic data; generation of animal models critical to the study of human inherited disorders; and design of novel approaches for diagnosing and treating genetic disease.

NHGRI investigators, along with their collaborators at other NIH Institutes and various research institutions worldwide, have embarked on a number of high-risk efforts to unearth clues about the complex genetic pathways involved in human diseases. These efforts have used genomic sequence data from humans and other species to pinpoint hundreds of potential disease genes, including those implicated in cancer, diabetes, premature aging, hereditary deafness, various neurological, developmental, metabolic, and immunological disorders, and others. These studies have brought together NHGRI basic scientists and clinicians in collaborations aimed at developing better approaches for detecting, diagnosing, and managing these often-debilitating genetic disorders.

Division of Management

The Division of Management plans and directs administrative management functions at the National Human Genome Research Institute, including administrative management, management analysis and evaluation, financial management, information technology, ethics and human resources. It advises senior leadership on developments in administrative management and their implications and effects on program management, and coordinates administrative management activities in support of their programs.

This page last reviewed on December 19, 2023

Scientists Finish the Human Genome at Last

The complete genome uncovered more than 100 new genes that are probably functional, and many new variants that may be linked to diseases.

By Carl Zimmer

Two decades after the draft sequence of the human genome was unveiled to great fanfare, a team of 99 scientists has finally deciphered the entire thing. They have filled in vast gaps and corrected a long list of errors in previous versions, giving us a new view of our DNA.

The consortium has posted six papers online in recent weeks in which they describe the full genome. These hard-sought data, now under review by scientific journals, will give scientists a deeper understanding of how DNA influences risks of disease, the scientists say, and how cells keep it in neatly organized chromosomes instead of molecular tangles.

For example, the researchers have uncovered more than 100 new genes that may be functional, and have identified millions of genetic variations between people. Some of those differences probably play a role in diseases.

For Nicolas Altemose, a postdoctoral researcher at the University of California, Berkeley, who worked on the team, the view of the complete human genome feels something like the close-up pictures of Pluto from the New Horizons space probe.

“You could see every crater, you could see every color, from something that we only had the blurriest understanding of before,” he said. “This has just been an absolute dream come true.”

Experts who were not involved in the project said it will enable scientists to explore the human genome in much greater detail. Large chunks of the genome that had been simply blank are now deciphered so clearly that scientists can start studying them in earnest.

“The fruit of this sequencing effort is amazing,” said Yukiko Yamashita, a developmental biologist at the Whitehead Institute for Biomedical Research at the Massachusetts Institute of Technology.

While scientists have known for decades that genes were spread across 23 pairs of chromosomes, these strange, wormlike microscopic structures remained largely a mystery.

By the late 1970s, scientists had gained the ability to pinpoint a few individual human genes and decode their sequence. But their tools were so crude that hunting down a single gene could take up an entire career.

Toward the end of the 20th Century, an international network of geneticists decided to try to sequence all the DNA in our chromosomes. The Human Genome Project was an audacious undertaking, given how much there was to sequence. Scientists knew that the twin strands of DNA in our cells contained roughly three billion pairs of letters — a text long enough to fill hundreds of books.

When that team began its work, the best technology the scientists could use sequenced bits of DNA just a few dozen letters, or bases, long. Researchers were left to put them together like the pieces of a vast jigsaw puzzle. To assemble the puzzle, they looked for fragments with identical ends, meaning that they came from overlapping portions of the genome. It took years for them to gradually assemble the sequenced fragments into larger swaths.
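The overlap idea described above can be illustrated with a toy sketch. It is not the Human Genome Project's actual software; the function names, the minimum overlap of three letters and the example fragments are all assumptions made for illustration. The sketch repeatedly merges the two fragments whose ends share the longest identical stretch.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no shared stretch of at least `min_len` letters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Toy greedy assembly: merge the best-overlapping pair until none remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining pair shares an end; leave the pieces apart
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three hypothetical reads taken from the made-up sequence ATGCGTACGTTAGC
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
assembled = greedy_assemble(reads)
```

With these three reads the sketch recovers the single sequence they were cut from. When the same stretch of letters occurs in several places, as it does with near-identical gene copies, a fragment can fit equally well in more than one spot, and a greedy merge of this kind may join pieces that do not belong together.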

The White House announced in 2000 that scientists had finished the first draft of the human genome, and details of the project were published the following year. But long stretches of the genome remained unknown, while scientists struggled to figure out where millions of other bases belonged.

It turned out that the genome was a very hard puzzle to put together from small pieces. Many of our genes exist as multiple copies that are nearly identical to each other. Sometimes the different copies carry out different jobs. Other copies — known as pseudogenes — are disabled by mutations. A short fragment of DNA from one gene might fit just as well into the others.

And genes only make up a small percentage of the genome. The rest of it can be even more baffling. Much of the genome is made up of virus-like stretches of DNA that exist largely just to make new copies of themselves that get inserted back into the genome.

In the early 2000s, scientists got a little better at putting together the genome puzzle from its tiny pieces. They made more fragments, read them more accurately, and developed new computer programs to assemble them into bigger chunks of the genome.

Periodically, researchers would unveil the latest, best draft of the human genome — known as the reference genome. Scientists used the reference genome as a guide for their own sequencing efforts. For example, clinical geneticists would catalog disease-causing mutations by comparing genes from patients to the reference genome.

The newest reference genome came out in 2013. It was a lot better than the first draft, but it was a long way from complete. Eight percent of it was simply blank.

“There’s basically an entire human chromosome that had gone missing,” said Michael Schatz, a computational biologist at Johns Hopkins University.

In 2019, two scientists — Adam Phillippy, a computational biologist at the National Human Genome Research Institute, and Karen Miga, a geneticist at the University of California, Santa Cruz — founded the Telomere-to-Telomere Consortium to complete the genome.

Dr. Phillippy admitted that part of his motivation for such an audacious project was that the missing gaps annoyed him. “They were just really bugging me,” he said. “You take a beautiful landscape puzzle, pull out a hundred pieces, and look at it — that’s very bothersome to a perfectionist.”

Dr. Phillippy and Dr. Miga put out a call for scientists to join them to finish the puzzle. They ended up with 99 scientists working directly on sequencing the human genome, and dozens more pitching in to make sense of the data. The researchers worked remotely through the pandemic, coordinating their efforts over Slack, a messaging app.

“It was a surprisingly nice ant colony,” Dr. Miga said.

The consortium took advantage of new machines that can read stretches of DNA reaching tens of thousands of bases long. The researchers also invented techniques to figure out where particularly mysterious repeating sequences belonged in a genome.

All told, the scientists added or fixed more than 200 million base pairs in the reference genome. They can now say with confidence that the human genome measures 3.05 billion base pairs long.

Within those new sequences of DNA, the scientists discovered more than 2,000 new genes. Most appear to be disabled by mutations, but 115 of them look as if they can produce proteins — the function of which scientists may need years to figure out. The consortium now estimates that the human genome contains 19,969 protein-coding genes.

With a complete genome finally assembled, the researchers could take a better look at the variation in DNA from one person to the next. They discovered more than two million new spots in the genome where people differ. Using the new genome also helped them to avoid identifying disease-linked mutations where none actually exist.

“It’s a great advance for the field,” said Dr. Midhat Farooqi, the director of molecular oncology at Children’s Mercy, a hospital in Kansas City, Mo., who was not involved in the project.

Dr. Farooqi has started using the genome for his research into rare childhood diseases, aligning DNA from his patients against the newly filled gaps to search for mutations.

Switching to the new genome may be a challenge for many clinical labs, however. They’ll have to shift all of their information about the links between genes and diseases to a new map of the genome. “There will be a big effort, but it will take a couple years,” said Dr. Sharon Plon, a medical geneticist at Baylor College of Medicine in Houston.

Dr. Altemose plans on using the complete genome to explore a particularly mysterious region in each chromosome known as the centromere. Instead of storing genes, centromeres anchor proteins that move chromosomes around a cell as it divides. The centromere region contains thousands of repeated segments of DNA.

In their first look, Dr. Altemose and his colleagues were struck by how different centromere regions can be from one person to another. That observation suggests that centromeres have been evolving rapidly, as mutations insert new pieces of repeating DNA into the regions or cut other pieces out.

While some of this repeating DNA may play a role in pulling chromosomes apart, the researchers have also found new segments — some of them millions of bases long — that don’t appear to be involved. “We don’t know what they’re doing,” Dr. Altemose said.

But now that the empty zones of the genome are filled in, Dr. Altemose and his colleagues can study them up close. “I’m really excited moving forward to see all the things we can discover,” he said.

An earlier version of this article misstated when scientists first arrived at the correct number of human chromosomes. It was in the 1960s, not a century ago.

Carl Zimmer writes the “Matter” column. He is the author of fourteen books, including “Life's Edge: The Search For What It Means To Be Alive.”

World J Methodol. 2021 Sep 20; 11(5).

Genomics in medicine: A new era in medicine

Vishwanath Pattan

Division of Endocrinology, Wyoming Medical Center, Casper, WY 82601, United States

Rahul Kashyap

Department of Anesthesiology and Peri-operative Medicine, Mayo Clinic, Rochester, MN 55905, United States

Vikas Bansal

Narsimha Candula

Hospital Medicine, University Florida Health, Jacksonville, FL 32209, United States

Thoyaja Koritala

Hospital Medicine, Mayo Clinic Health System, Mankato, MN 56001, United States

Salim Surani

Department of Internal Medicine, Texas A&M University, Corpus Christi, TX 78405, United States. srsurani@gmail.com

Corresponding author: Salim Surani, FACC, FACP, FCCP, MD, Professor, Department of Internal Medicine, Texas A&M University, 701 Ayers Street, Corpus Christi, TX 78405, United States. srsurani@gmail.com

The sequencing of the complete human genome revolutionized genomic medicine. However, the complex interplay of genes, environment and lifestyle, and the influence of non-coding genomic regions on human health, remain largely unexplored. Genomic medicine has great potential for diagnosis or disease prediction, disease prevention, and targeted treatment. However, many of the promising tools of genomic medicine are still in their infancy, and limited knowledge currently precludes their use in many clinical settings. In this review article, we review the evolution of genomic methodologies and tools, their limitations, and their scope for current and future clinical application.

Core Tip: The field of genomics is the future of medicine, as evidenced by the unprecedented research and clinical application that pushed the time boundaries for the coronavirus disease 2019 mRNA vaccines. However, the path to unleashing the potential of genomic tools is far from perfect. Thorough research with international collaboration and cooperation is the need of the hour.

INTRODUCTION

Understanding of the human genome has come a long way since the initial discovery of the DNA structure by Watson and Crick in 1953[1]. Genome study used to be a very specialized area, but the recent advent of messenger RNA (mRNA) vaccines has brought the concept of genetics even to the lay public. In the 1970s, the ability to manipulate DNA with recombinant DNA technology broadened the horizon. Our understanding of medical genetics began with inheritance patterns of single-gene diseases. The database of Mendelian Inheritance in Man (MIM) was initiated in the early 1960s by McKusick[2]. As of January 5, 2021, 4368 genes were mapped to phenotype-causing mutations[3]. However, only a small portion of diseases have a monogenic cause. The majority of common diseases are polygenic, and their mechanisms have remained elusive.

The human genome project, which was completed in 2003, revolutionized the understanding of the human genome and served as a turning point to fast forward the genomic methodologies. However, the clinical application of findings from these genomic studies is still in its infancy. This is largely because we still have not understood or made complete sense of the available information. That is, the sequence data have been difficult to correlate to functional outcomes, making it difficult to understand the genetic basis of diseases and the complex gene-lifestyle-environment influences or their interaction. Moreover, most of the initial focus of the research had been on coding regions of DNA which comprises approximately 2% of the DNA and the knowledge about specific implications of non-coding DNA regions (98% of DNA) are largely unknown[ 4 , 5 ].

Remarkably, the human genome and that of the chimpanzee, our closest related species, differ by a mere 1.23% in single-nucleotide alterations and by 3% in deletions, insertions, and copy number variations[6]. In humans, the genomes of any two individuals are about 99.9% identical. However, even a 0.1% variation amounts to a massive number of nucleotide differences, because the human genome has approximately 3 billion base pairs (3.3 × 10^9)[7].
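To make the scale of that 0.1% concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration using the genome size and identity figures quoted above; the function name is hypothetical.

```python
# Back-of-the-envelope estimate of how many base pairs differ between two people.
# Both constants are the approximate figures quoted in the text.
GENOME_SIZE_BP = 3.3e9   # approximate size of the human genome in base pairs
IDENTITY = 0.999         # two individuals are about 99.9% identical


def differing_bases(genome_size_bp, identity):
    """Return the expected number of base pairs that differ between two genomes."""
    return genome_size_bp * (1.0 - identity)


print(f"{differing_bases(GENOME_SIZE_BP, IDENTITY):,.0f} differing base pairs")
```

Under these assumptions the difference comes to roughly 3.3 million base pairs.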

In this review, we discuss the evolution of genomic methodologies, their limitations, and their scope for current and future clinical application.

GENOMIC TOOLS AND THEIR EVOLUTION

DNA sequencing

The first DNA sequencing method was described by Maxam and Gilbert[8] in 1977. The chain-termination method developed by Sanger et al[9] in the same year was then used for the next few decades. It relied on a template DNA strand and had limited capacity for sequencing gene panels. Subsequently, the commercial production of high-throughput technologies, known as next-generation sequencing (NGS), revolutionized DNA sequencing by 2007[10]. Also called massively parallel sequencing, NGS sequences millions of small DNA fragments in parallel, with each fragment fixed at a unique location on a solid support. In NGS, the patient's DNA sample that serves as the template is amplified and fragmented. Third-generation sequencing instead uses single DNA molecules as the template, which eliminates errors introduced by DNA amplification. NGS can be used for whole-genome sequencing, exome sequencing, or targeted gene panels comprising tens to hundreds of genes.

Single nucleotide polymorphism

Single nucleotide polymorphism (SNP) is a variation in genetic sequence at a single nucleotide. It is the most common type of genetic variation in humans[11]. SNPs were first detected in the 1980s using restriction enzymes[12]. With the application of microarray technology to SNPs, their scope in clinical practice has widened, especially in oncology. The first SNP array analysis was performed in 1998, and its first application in cancer followed in 2000[13]. SNP array analysis is used to determine loss of heterozygosity, allelic imbalance, genomic copy number changes, frequency of homozygous chromosome regions, uniparental disomy, DNA methylation alterations, and linkage analysis of DNA polymorphisms in cancer cells[13,14].

DNA amplification

Kary Banks Mullis successfully demonstrated the polymerase chain reaction (PCR) in 1983[15]. PCR is a cost-effective method that can amplify a single DNA molecule exponentially[16]. It is rapid, highly specific, and extremely sensitive. PCR is used for SNP genotyping and for the detection of rare sequences, insertion-deletion variants, and structural variants such as copy-number variants.
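The exponential character of PCR can be illustrated with a short calculation. The sketch below assumes the idealized textbook model in which each cycle multiplies the template count by (1 + efficiency); the function name and the efficiency parameter are illustrative and not taken from the source.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Estimate copy number after a number of PCR cycles.

    efficiency = 1.0 models perfect doubling per cycle; real reactions
    are usually somewhat less efficient and eventually plateau.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return initial_copies * (1.0 + efficiency) ** cycles


# One template molecule after 30 ideal cycles: 2**30 copies.
print(f"{pcr_copies(1, 30):,.0f}")
# The same reaction at 90% per-cycle efficiency yields far fewer copies.
print(f"{pcr_copies(1, 30, efficiency=0.9):,.0f}")
```

Even a modest drop in per-cycle efficiency compounds over many cycles, which is one reason quantitative applications calibrate efficiency rather than assume perfect doubling.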

Linkage and association analysis

Linkage studies have been used to map genes for heritable traits to their chromosomal locations. The first genetic linkage map was constructed in 1911 by Sturtevant[17]. Parametric linkage analysis is used to map the disease-causing gene in monogenic diseases; here, logarithm of the odds (LOD) scores and recombination fractions are used to locate the gene. Model-free (non-parametric) linkage analysis is used for complex or polygenic diseases, or when the model of inheritance is not known[18]. Linkage analysis of the whole genome can identify large chromosomal regions likely to contain the disease gene[19,20], but such regions can hold hundreds of candidate genes.
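As a rough illustration of how a LOD score relates to the recombination fraction, the sketch below handles the simplest textbook case: phase-known, fully informative meioses, where the likelihood at recombination fraction θ is θ^k (1 − θ)^(n − k) for k recombinants among n meioses, compared against free recombination (θ = 0.5). The function name and the example counts are hypothetical; real pedigree analysis uses dedicated software and more general likelihoods.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score for phase-known meioses at recombination fraction theta.

    LOD = log10( L(theta) / L(0.5) ), with
    L(theta) = theta**k * (1 - theta)**(n - k).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical data: 1 recombinant observed in 10 informative meioses.
for theta in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(f"theta={theta:.2f}  LOD={lod_score(1, 10, theta):.2f}")
```

By convention, a LOD score of 3 or more is usually taken as evidence of linkage; with only 10 meioses the hypothetical data above cannot reach that threshold, which shows why larger or multiple families are typically needed.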

Linkage studies have been used to map Mendelian traits with high penetrance in families and relatives[20]. They are especially useful for identifying rare alleles that are present in a small number of families[21]. For disease genes with weak effects and for polygenic diseases, linkage disequilibrium association mapping has proved more useful. In genome-wide association studies (GWAS), hundreds of thousands of SNPs are genotyped in case and control populations, and their association with the disease or trait is analyzed. A combination of linkage and association methodologies helps to identify and characterize a wider range of disease-susceptibility variants[22].

Fluorescence in situ hybridization

Fluorescence in situ hybridization (FISH) was developed in 1987. It is a cytogenetic technique that uses fluorescent DNA probes designed to label precise chromosomal locations. The advantage of FISH over conventional metaphase karyotype analysis is that it does not require cell culture. It can rapidly evaluate interphase nuclei in fresh or paraffin-embedded samples[23]. However, the resolution of this technique is only as good as that of karyotype bands. Cloned DNA FISH probes of about 100 kb, called bacterial artificial chromosomes, are now available. FISH is increasingly used for clinical diagnosis in oncology because of its simplicity and its reliability in evaluating key biomarkers in various malignancies.
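A single-SNP case-control comparison of the kind used in association studies can be sketched as an allelic chi-square test on a 2 × 2 table of allele counts. This is a simplified illustration with made-up counts, not the analysis pipeline of any particular study; genome-wide analyses repeat such a test across many SNPs and must correct for multiple testing and population structure.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square (1 degree of freedom) and p-value for a 2x2 allele table.

    All row and column totals must be non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for one SNP: cases 60 alt / 40 ref, controls 40 alt / 60 ref.
chi2, p = allelic_chi_square(60, 40, 40, 60)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Because so many SNPs are tested at once, a per-SNP p-value has to clear a far stricter threshold than the usual 0.05 before an association is reported.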

Comparative genomic hybridization

Comparative genomic hybridization (CGH) was developed in 1992. CGH can detect DNA copy number changes across the entire genome of a patient sample in a single experiment. It compares the hybridization signal intensity of a test sample (for example, a tumor sample) against that of a reference sample along the chromosomes[13].
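The comparison of test and reference signal intensities is commonly expressed as a log2 ratio per probe or region. The sketch below is a minimal illustration with invented intensities; the ±0.3 calling threshold is an arbitrary assumption for the example, not a value from the source.

```python
import math


def log2_ratios(test_intensities, reference_intensities):
    """Return the log2(test / reference) ratio for each position."""
    if len(test_intensities) != len(reference_intensities):
        raise ValueError("test and reference must have the same length")
    return [math.log2(t / r)
            for t, r in zip(test_intensities, reference_intensities)]


def call_copy_change(ratio, threshold=0.3):
    """Classify one log2 ratio; the threshold is a hypothetical cut-off."""
    if ratio > threshold:
        return "gain"
    if ratio < -threshold:
        return "loss"
    return "neutral"


# Invented intensities for three chromosomal regions.
tumor = [200.0, 100.0, 50.0]
normal = [100.0, 100.0, 100.0]
for ratio in log2_ratios(tumor, normal):
    print(f"log2 ratio = {ratio:+.1f} -> {call_copy_change(ratio)}")
```

A ratio near zero indicates equal copy number in the two samples, while positive and negative values point to gains and losses in the test sample.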

HAPMAP AND 1000 GENOME PROJECTS HAVE CREATED A CATALOG OF SNPS

The HapMap project was started in 2002 to develop a haplotype map of the human genome and to describe the common patterns of human genetic variation[24]. The 1000 Genomes Project performed whole-genome sequencing in a set of 26 diverse populations. It also used deep exome sequencing and dense microarray genotyping to give a comprehensive description of common human genetic variation[25].

TARGETED GENOME EDITING OR GENOME ENGINEERING

Targeted genome editing involves modification of the genome at a precise, prespecified locus using programmable nucleases. Examples of programmable nucleases include zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and the clustered regularly interspaced short palindromic repeat (CRISPR)-Cas (CRISPR-associated) system. These nucleases are designed to introduce site-specific double-strand breaks (dsBs) in chromosomal DNA. The cell is therefore forced to use one of its endogenous DNA repair mechanisms: homologous recombination, also called homology-directed repair (HDR), or nonhomologous end-joining (NHEJ). This enables targeted genetic modifications during the repair process in living cells (in vivo) (Table 1)[26]. ZFNs and TALENs recognize the target sequence through protein-DNA interaction, whereas CRISPR-Cas nucleases recognize target sequences through RNA-DNA base pairing[26].
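Target recognition by base pairing can be illustrated with a simple string search. The sketch below makes several assumptions that go beyond the text: it models the commonly used Streptococcus pyogenes Cas9, which additionally requires an NGG protospacer-adjacent motif (PAM) immediately 3' of the target; it searches the forward strand only; and it requires a perfect match. The sequences and the function name are invented for the example.

```python
import re


def find_target_sites(dna, guide_rna):
    """Return 0-based start positions where the guide matches and an NGG PAM follows.

    Assumptions (not from the source): SpCas9-style NGG PAM, forward strand only,
    exact matching with no mismatches tolerated.
    """
    # The guide RNA spacer has the same sequence as the DNA protospacer, with U for T.
    protospacer = guide_rna.upper().replace("U", "T")
    if not protospacer or set(protospacer) - set("ACGT"):
        raise ValueError("guide must contain only A, C, G, U/T")
    # A lookahead lets overlapping candidate sites be reported as well.
    pattern = re.compile("(?=" + protospacer + "[ACGT]GG)")
    return [match.start() for match in pattern.finditer(dna.upper())]


guide = "GACGUUACCGGAUCAAGCUA"            # hypothetical 20-nt guide
site = "GACGTTACCGGATCAAGCTA"             # matching DNA protospacer
dna = "TTT" + site + "AGG" + "CCC" + site + "ATA"
print(find_target_sites(dna, guide))      # only the first copy is followed by NGG
```

In this toy sequence the second copy of the target is skipped because it lacks the PAM, which shows that sequence complementarity alone is not sufficient under the stated Cas9 assumption.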

Table 1 Characteristics of commonly used genomic tools

SNP: Single nucleotide polymorphism; NGS: Next-generation sequencing; PCR: Polymerase chain reaction; RT-PCR: Real-time reverse transcription polymerase chain reaction; ssDNA: Single stranded DNA.

In 2013, Cong et al[27] and Mali et al[28] showed successful genome editing in mammalian cells using the CRISPR system. In the last 5 years, research interest in CRISPR genome editing, in both animals and humans, has grown rapidly.

While genome editing holds promise for correcting a defective genome in vivo, therapies can also be designed to alter gene expression without altering the genomic code. For example, antisense oligonucleotides can be used to alter the splice sites of pre-mRNA, either to correct for a defective gene or to suppress its expression. Drugs that use splice modulation and are approved by the Food and Drug Administration (FDA) include eteplirsen (exon skipping, approved for Duchenne muscular dystrophy) and nusinersen (exon inclusion, approved for spinal muscular atrophy)[29].

Table 1 summarizes the commonly used genomic tools, their working principles, advantages/applications, and limitations. Table 2 summarizes the major genome/gene-editing tools, their working principles, advantages/applications, and limitations. Table 3 summarizes gene-based therapies that are either FDA approved or investigational and showing promise.

Table 2 Characteristics of genome-editing technologies using programmable nucleases

HDR: Homology-directed repair; NHEJ: Nonhomologous end-joining; SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2; TALENs: Transcription activator-like effector nucleases; dsBs: Double stranded breaks; ssDNA: Single stranded DNA; TAL: Transcription activator-like; SHERLOCK: Specific High Sensitivity Enzymatic Reporter UnLOCKing.

Table 3 Gene-based therapies: List of Food and Drug Administration approved therapies and investigational therapies showing promise

AAV: Adeno-associated virus; ALAS1: Aminolevulinate synthase 1; BCL11A: B cell lymphoma/leukemia 11A; HAO1: Hydroxyacid oxidase (glycolate oxidase) 1; modRNA: Nucleoside-modified messenger RNA; SMN: Survival motor neuron 1; FDA: Food and Drug Administration.

Newer genomic technologies and tools have broadened the scope and shortened the timelines for developing new diagnostic kits, preventive strategies such as vaccines, and therapeutic strategies such as gene modulation and gene therapy. Much remains to be studied about the complex interactions among genes, environment, lifestyle, and disease. Knowing the impact of genomics on disease pathophysiology and on response to medications[30] expands the scope of research and clinical application. While genome editing holds promise for correcting a defective genome in vivo, therapies can also be designed to alter gene expression without altering the genomic code (for example, the exon skipping or inclusion discussed above).

The newer genome-editing tools have shown great potential and promise, but they need to be studied extensively before clinical application. In addition, uniform international ethical guidelines and guiding principles need to be established so that these genomic technologies are not misused.

It is very important to include diverse populations, and to represent minority populations, in genomic studies so that results can be generalized and more accurate diagnostic, predictive, and therapeutic tools can be developed.

Genomics in medicine is indeed a new era in medicine. At the time of writing, control of the coronavirus disease 2019 pandemic[31] has only just begun, with gene-based therapies eliciting an immune response against the severe acute respiratory syndrome coronavirus 2 spike protein. Unified international collaboration[32,33] is needed to continue expanding the use of gene therapy and to open new frontiers in the fight against novel infections and diseases.

Genomic medicine holds great promise for providing insight into disease pathophysiology, better diagnostic and disease-predictive tools, preventive therapies, and, finally, targeted treatment of diseases. Although some of the newer tools (such as the CRISPR system) have great potential, more research is needed before they can be released for clinical use. Hence there is a great need for studies that unravel how coding and noncoding genomic regions interact with environmental and lifestyle influences to shape disease occurrence and management.

Conflict-of-interest statement: None of the authors have any conflict of interest.

Manuscript source: Invited manuscript

Peer-review started: January 12, 2021

First decision: June 17, 2021

Article in press: July 19, 2021

Specialty type: Medical laboratory technology

Country/Territory of origin: United States

Peer-review report’s scientific quality classification

Grade A (Excellent): 0

Grade B (Very good): 0

Grade C (Good): C

Grade D (Fair): 0

Grade E (Poor): 0

P-Reviewer: Taheri S S-Editor: Gao CC L-Editor: A P-Editor: Guo X

Contributor Information

Vishwanath Pattan, Division of Endocrinology, Wyoming Medical Center, Casper, WY 82601, United States.

Rahul Kashyap, Department of Anesthesiology and Peri-operative Medicine, Mayo Clinic, Rochester, MN 55905, United States.

Vikas Bansal, Department of Anesthesiology and Peri-operative Medicine, Mayo Clinic, Rochester, MN 55905, United States.

Narsimha Candula, Hospital Medicine, University Florida Health, Jacksonville, FL 32209, United States.

Thoyaja Koritala, Hospital Medicine, Mayo Clinic Health System, Mankato, MN 56001, United States.

Salim Surani, Department of Internal Medicine, Texas A&M University, Corpus Christi, TX 78405, United States. srsurani@gmail.com

Genome Biology

Featured article: Liquid–liquid phase separation of H3K27me3 reader BP1 regulates transcriptional repression

Featured article: Multi-omics provide insights into the regulation of DNA methylation in pear fruit metabolism

Call for papers: We are calling for submissions to our Collection on insights from genomic and related research across diverse ancestries, encompassing trans-ethnic, multi-ancestry, and cross-population investigations.

Open calls for papers and latest collections

A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline

Authors: Ze-Zhen Du, Jia-Bao He and Wen-Biao Jiao

scifi-ATAC-seq: massive-scale single-cell chromatin accessibility sequencing using combinatorial fluidic indexing

Authors: Xuan Zhang, Alexandre P. Marand, Haidong Yan and Robert J. Schmitz

SuperCellCyto: enabling efficient analysis of large scale cytometry datasets

Authors: Givanna H. Putri, George Howitt, Felix Marsh-Wakefield, Thomas M. Ashhurst and Belinda Phipson

PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization

Authors: Claudia Skok Gibbs, Omar Mahmood, Richard Bonneau and Kyunghyun Cho

Increased DNA methylation contributes to the early ripening of pear fruits during domestication and improvement

Authors: Bobo Song, Jinshan Yu, Xiaolong Li, Jiaming Li, Jing Fan, Hainan Liu, Weilin Wei, Lingchao Zhang, Kaidi Gu, Dongliang Liu, Kejiao Zhao and Jun Wu

Most recent articles

A survey of best practices for RNA-seq data analysis

Authors: Ana Conesa, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J. Gaffney, Laura L. Elo, Xuegong Zhang and Ali Mortazavi

The Erratum to this article has been published in Genome Biology 2016 17:181

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Authors: Michael I Love, Wolfgang Huber and Simon Anders

The Kardashian index: a measure of discrepant social media profile for scientists

Authors: Neil Hall

Differential expression analysis for sequence count data

Authors: Simon Anders and Wolfgang Huber

Most accessed articles

Latest articles from Genome Medicine

Smoking-associated gene expression alterations in nasal epithelium reveal immune impairment linked to lung cancer risk.

08 April 2024

Burden of Mendelian disorders in a large Middle Eastern biobank

Nodal variants are associated with a continuum of laterality defects from simple d-transposition of the great arteries to heterotaxy.

03 April 2024

Focus: Fertility

Read the latest papers on fertility across BMC flagship journals.

Cisplatin and carboplatin result in similar gonadotoxicity in immature human testis with implications for fertility preservation in childhood cancer. Melissa D. Tharmalingam et al. Published in BMC Medicine, 04 December 2020

Large-scale discovery of male reproductive tract-specific genes through analysis of RNA-seq datasets. Matthew Robertson et al. Published in BMC Biology, 19 August 2020

DNA methylation and body mass index from birth to adolescence: meta-analyses of epigenome-wide association studies. Florianne Vehmeijer et al. Published in Genome Medicine, 25 November 2020

The Tug1 lncRNA locus is essential for male fertility. Jordan Lewandowski et al. Published in Genome Biology, 07 September 2020

Reviews, Research highlights and commentaries

Read authoritative Reviews, thought-provoking Opinions and other content commissioned by the Genome Biology  Editors from leading researchers:

Reviews Research highlights Commentaries   (including Editorials, Comments, Opinions, Q&As and Meeting reports)

Latest article collections

Article Collection Towards an atlas of variant effects

Special Issue Interpretable Deep Learning

Special Issue Plant RNA Biology

Article Collection SEQC2 Articles

Special Issue Long Read Sequencing

Article Collection Climate Change Genomics

See all article collections here .

From the blog

Exploring the history of smallpox vaccination with 19th Century American vaccination kits

16 November 2020

Transposon Day Quiz 2020

16 June 2020

DNA Day quiz 2020

24 April 2020

Reviewer Acknowledgements 2020

The Editors and staff of Genome Biology would like to warmly thank the Reviewers whose comments helped to shape the journal, for their invaluable assistance with review of manuscripts in 2020.

Annual Journal Metrics

2022 Citation Impact
  • 12.3 - 2-year Impact Factor
  • 17.4 - 5-year Impact Factor
  • 3.476 - SNIP (Source Normalized Impact per Paper)
  • 9.249 - SJR (SCImago Journal Rank)

2023 Speed
  • 21 days submission to first editorial decision for all manuscripts (Median)
  • 277 days submission to accept (Median)

2023 Usage
  • 6,688,476 downloads
  • 12,515 Altmetric mentions

Peer Review Taxonomy

This journal is participating in a pilot of NISO/STM's Working Group on Peer Review Taxonomy, to identify and standardize definitions and terminology in peer review practices in order to make the peer review process for articles and journals more transparent. Further information on the pilot is available here .

The following summary describes the peer review process for this journal:

  • Identity transparency: Single anonymized
  • Reviewer interacts with: Editor
  • Review information published: Review reports. Reviewer Identities reviewer opt in. Author/reviewer communication

ISSN: 1474-760X

ScienceDaily

The genomic architecture of inherited DNA variants

You have your mother's eyes and your father's smile, but genetics is much more than just what's on the surface. In a study that spans more than a decade, researchers at Baylor College of Medicine have looked at generations of families in a specific population to reveal the role newly inherited DNA variants play in recessive disease traits. In the process, they have created a population-specific database revealing unique DNA information unseen in larger cohorts.

The findings, now published in Genetics in Medicine Open, revealed a correlation between the occurrence of complex genetic disorders and increased levels of consanguinity in those families, when compared with unaffected populations. Consanguinity occurs when both parents contribute similar genetic markers to an offspring, for example because they share a common ancestor, so that the genetic information inherited from the father and that inherited from the mother are identical.

"We observed that the areas on the chromosome known as ROH, regions of homozygosity, were longer in those individuals in which there was a higher degree of parental consanguinity when compared to those with less," said Dr. Zeynep Coban-Akdemir, postdoctoral associate in molecular and human genetics at Baylor and currently assistant professor at UTHealth School of Public Health as well as co-lead author on the study. "We can see what is happening when consanguinity is at play and also when new genetic variations are introduced into the family unit of the clan or tribe representing more distant ancestors."

Dr. Xiaofei Song, a former Baylor graduate student now working as an assistant professor at Moffitt Cancer Center, said, "We further applied a statistical method to systematically assess the impact of these genetic variations on disease. Our results indicate that the newly introduced genetic variations can better explain the clinical features observed in our patients." Song also is co-lead author on the study.
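The regions of homozygosity (ROH) described above can be pictured as uninterrupted stretches of markers at which both inherited alleles are identical. The sketch below is a toy illustration only: the genotypes, the function name, and the minimum run length are invented, and it is not the statistical method used in the study, which works on exome data with dedicated tools.

```python
def runs_of_homozygosity(genotypes, min_length=3):
    """Find stretches of consecutive homozygous markers.

    genotypes: list of (allele_from_father, allele_from_mother) tuples,
               ordered along a chromosome.
    Returns a list of (start_index, end_index) pairs, inclusive.
    """
    runs = []
    start = None
    for index, (paternal, maternal) in enumerate(genotypes):
        if paternal == maternal:
            if start is None:
                start = index          # a new homozygous stretch begins
        else:
            if start is not None and index - start >= min_length:
                runs.append((start, index - 1))
            start = None               # a heterozygous marker ends the stretch
    if start is not None and len(genotypes) - start >= min_length:
        runs.append((start, len(genotypes) - 1))
    return runs


# Invented genotypes for eight markers; marker 3 is heterozygous.
markers = [("A", "A"), ("C", "C"), ("G", "G"), ("A", "T"),
           ("C", "C"), ("T", "T"), ("G", "G"), ("G", "G")]
print(runs_of_homozygosity(markers))
```

Longer runs of this kind are what one would expect to see more often when the parents share recent common ancestors.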

"The published study contributes to the field of both rare disease and population genomics. From a trainee perspective, the article provides a valuable resource for comprehending fundamental concepts of human genetics and applying diverse computational methods to elucidate these concepts," said Ph.D candidate Tugce Bozkurt-Yozgatli, with the Acibadem University in Istanbul, Turkey.

Coban-Akdemir, who worked in the Lupski Lab at Baylor where the research was conducted, says this is an important part of the findings because it reveals how genes act within different populations and clans to contribute to different recessive genetic disorders.

The population studied was a cohort of individuals originating from Turkey, a population known to differ in its genetic markers from other populations of greater Europe. Researchers created and analyzed a database of variants derived from exome sequencing, a genomics assay that provides a glimpse into genetic variation genome-wide, of 773 unrelated volunteers affected with various suspected rare Mendelian disease traits. These are diseases caused by a mutation in a single gene and clearly passed down from one generation to the next in accordance with Gregor Mendel's expectations. They were compared to another database, created by the same researchers, of 643 unaffected relatives.

Roughly half of the genetic variants in this Turkish group are not present in greater European control populations that are found in shared databases commonly used by genetic researchers.

"This group of Turkish individuals and families gives us insight into genetics that the average population doesn't provide. What we found in this Turkish population is very unique. Not only is this group underrepresented in larger databases, but it shows us that they have an enriched genetic variation that is only seen within this population when compared to European populations," Coban-Akdemir said.

Dr. Davut Pehlivan, assistant professor of pediatrics -- neurology at Baylor, said that a single individual carries around 40 million Watson-Crick base pair variations in their DNA.

"The Human Genome Project opened the doors for researchers to investigate entire genomic DNA complement using next-generation sequencing technology. However, more struggles appeared with these advancements. For example, it is hard to pinpoint which variant is causing disease among 40 million variations of our DNA. Studying healthy populations helps us to eliminate many of these common variations from consideration. Thus, we studied both patients and their healthy relatives in the Turkish population." Pehlivan said. "There are a lot of changes in the genome, and we don't fully understand the meaning of all of those details, but the data from this population study will help all investigators around the world who are trying to interpret the results of other variants in the human genome DNA."

Pehlivan described how, beginning in 2010, the team gathered information from families who wanted to participate in genomics research, traveling long distances to the rural areas where most patients lived, a human interest story in itself, to make sure the database and clinical information would accurately represent these families.

"We discovered more than 200 genes that contributed to the existing body of disease gene associations. This will help us get closer to understanding, in this population and in others, what is causing these diseases and the human biological perturbation underlying a broad scope of diseases. Our studies will open new avenues of research in human biology and genome biology and eventually help to potentially bring nucleic acid treatments, something used to develop the COVID vaccine, to the patients and families" Pehlivan said.

"This team of researchers is not just helping the population that they studied, but their findings also can be applied to many populations. We all are very different individuals on this planet, yet our genes act very similarly, and we all share a common humanity. So, understanding how genetic disorders work helps us to support affected families across the globe," said Dr. James R. Lupski, the Cullen Foundation Endowed Chair in Genetics and Genomics at Baylor.

In the past, Coban-Akdemir and Dr. Claudia M.B. Carvalho, who also contributed to this study and was previously with Baylor but now runs her own laboratory at the Pacific Northwest Research Institute (PNRI) in Seattle, studied gene variants to identify causes of disease through the production of truncated or altered proteins that take on a new or different function. Their work also focused on databases of populations with and without genetic disease. Their current work reflects the importance of diversity and inclusion as research continues to reveal the causes of genetic diseases.

This work was supported in part by the U.S. National Human Genome Research Institute/National Heart, Lung, and Blood Institute grant number UM1HG006542 to the Baylor Hopkins Center for Mendelian Genomics (BHCMG), the U.S. National Human Genome Research Institute U01HG011758 to the Baylor College of Medicine for the Genomics Research to Elucidate the Genetics of Rare Disease consortium (BCM-GREGoR), the National Institute of Neurological Disorders and Stroke (NINDS) R35NS105078, and the National Human Genome Research Institute U54-HG003273. J.E.P. was supported by NHGRI K08 HG008986.

Story Source:

Materials provided by Baylor College of Medicine. Note: Content may be edited for style and length.

Journal Reference:

  • Zeynep Coban-Akdemir, Xiaofei Song, Francisco C. Ceballos, Davut Pehlivan, Ender Karaca, Yavuz Bayram, Tadahiro Mitani, Tomasz Gambin, Tugce Bozkurt-Yozgatli, Shalini N. Jhangiani, Donna M. Muzny, Richard A. Lewis, Pengfei Liu, Eric Boerwinkle, Ada Hamosh, Richard A. Gibbs, V. Reid Sutton, Nara Sobreira, Claudia M.B. Carvalho, Chad A. Shaw, Jennifer E. Posey, David Valle, James R. Lupski. The impact of the Turkish population variome on the genomic architecture of rare disease traits. Genetics in Medicine Open, 2024; 2: 101830. DOI: 10.1016/j.gimo.2024.101830

He Jiankui

Scientist who gene-edited babies is back in lab and ‘proud’ of past work despite jailing

China’s He Jiankui, who used Crispr to edit genome, says he is working on genetic diseases and suggests human embryo gene editing will one day be accepted

A Chinese scientist who was imprisoned for his role in creating the world’s first genetically edited babies says he has returned to his laboratory to work on the treatment of Alzheimer’s and other genetic diseases.

In an interview with a Japanese newspaper, He Jiankui said he had resumed research on human embryo genome editing, despite the controversy over the ethics of artificially rewriting genes, which some critics predicted would lead to demand for “designer babies”.

“We will use discarded human embryos and comply with both domestic and international rules,” He told the Mainichi Shimbun, adding that he had no plans to produce more genome-edited babies. Previously, He had used a tool known as Crispr-Cas9 to rewrite DNA in embryos.

In 2019 a court in China sentenced He to three years in prison for violating medical regulations after he claimed the previous year that he had created genetically modified twin sisters, Lulu and Nana, before birth.

His experiments sent shockwaves through the medical and scientific world. He was widely condemned for having gone ahead with the risky, ethically contentious and medically unjustified procedure with inadequate consent from the families involved.

The court found that He had forged documents from an ethics review panel that were used to recruit couples for his research.

He said he had used a gene-editing procedure known as Crispr-Cas9 to rewrite the DNA in the sisters’ embryos – modifications he claimed would make the children immune to HIV.

He has continued to defend his work, despite widespread criticism, saying he was “proud” of having created Lulu and Nana. A third girl was born in 2019 as a result of similar experiments.

He told the Mainichi that he hoped to use genome editing in human embryos to develop treatments for rare genetic diseases such as Duchenne muscular dystrophy and familial Alzheimer’s disease, at three laboratories he has opened since his release from prison in 2022.

He said the three genome-edited children were “perfectly healthy and have no problems with their growth”, according to the newspaper, adding that the twins, now aged 5, were attending kindergarten.

“The results of analysing [the children’s] entire gene sequences show that there were no modifications to the genes other than for the medical objective, providing evidence that genome editing was safe,” he told the Mainichi. “I’m proud to have helped families who wanted healthy children.”

He told the Guardian in 2023 that he had acted “too quickly” by pressing ahead with the procedure, but stopped short of voicing regret or apologising.

In his interview with the Mainichi, he said society would “eventually accept” human embryo gene editing in the quest to find treatments for genetic diseases.

The Human Genome Project

The Human Genome Project (HGP) is one of the greatest scientific feats in history. The project was a voyage of biological discovery led by an international group of researchers looking to comprehensively study all of the DNA (known as a genome) of a select set of organisms. Launched in October 1990 and completed in April 2003, the Human Genome Project’s signature accomplishment – generating the first sequence of the human genome – provided fundamental information about the human blueprint, which has since accelerated the study of human biology and improved the practice of medicine.

Learn more about the Human Genome Project below.

G5 Reunion

A virtual discussion with the leaders of the five genome-sequencing centers that provides the untold story on how they got the HGP across the finish line in 2003.

DNA sequencing by gel electrophoresis

A fact sheet detailing how the project began and how it shaped the future of research and technology.

Human Genome Project Timeline of Events | NHGRI

An interactive timeline listing key moments from the history of the project.

HGP Timeline

A downloadable poster containing major scientific landmarks before and throughout the project.

Francis Collins

Prominent scientists involved in the project reflect on the lessons learned.

HGP Banbury Meeting

Commentary in the journal Nature written by NHGRI leaders discussing the legacies of the project.

Science and Nature Covers

Lecture-oriented slides telling the story of the project by a front-line participant.


Last updated: September 7, 2023

Princeton University

Can language models read the genome? This one decoded mRNA to make better vaccines.

By Scott Lyon

April 8, 2024

Single strand ribonucleic acid.

Princeton researchers led by Mengdi Wang have developed a language model to home in on partial genome sequences and optimize those sequences to improve function for the development of mRNA vaccines and other therapies. Illustration from Adobe Stock.

The same class of artificial intelligence that made headlines coding software and passing the bar exam has learned to read a different kind of text — the genetic code.

That code contains instructions for all of life’s functions and follows rules not unlike those that govern human languages. Each sequence in a genome adheres to an intricate grammar and syntax, the structures that give rise to meaning. Just as changing a few words can radically alter the impact of a sentence, small variations in a biological sequence can make a huge difference in the forms that sequence encodes.

Now Princeton University researchers led by machine learning expert Mengdi Wang are using language models to home in on partial genome sequences and optimize those sequences to study biology and improve medicine. And the work is already underway.

In a paper published April 5 in the journal Nature Machine Intelligence, the authors detail a language model that used its powers of semantic representation to design a more effective mRNA vaccine such as those used to protect against COVID-19.

Found in Translation

Mengdi Wang in her Princeton office.

Scientists have a simple way to summarize the flow of genetic information. They call it the central dogma of biology. Information moves from DNA to RNA to proteins. Proteins create the structures and functions of living cells.

Messenger RNA, or mRNA, converts the information into proteins in that final step, called translation. But mRNA is interesting. Only part of it holds the code for the protein. The rest is not translated but controls vital aspects of the translation process.

Governing the efficiency of protein production is a key mechanism by which mRNA vaccines work. The researchers focused their language model there, on the untranslated region, to see how they could optimize efficiency and improve vaccines.
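To make the split between translated and untranslated regions concrete, here is a minimal Python sketch. It assumes a deliberately simplified model in which translation starts at the first AUG and ends at the first in-frame stop codon; real initiation is more nuanced. The function name `split_mrna` and the example sequence are hypothetical and are not taken from the study.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_mrna(seq):
    """Split an mRNA string into (5' UTR, coding sequence, 3' UTR).

    Simplifying assumption: translation starts at the first AUG and
    stops at the first in-frame stop codon.
    """
    seq = seq.upper()
    start = seq.find("AUG")
    if start == -1:
        raise ValueError("no start codon found")
    # Walk codon by codon from the start codon until a stop codon appears.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            end = i + 3
            return seq[:start], seq[start:end], seq[end:]
    raise ValueError("no in-frame stop codon found")

# Hypothetical toy sequence: 6-base 5' UTR, 3-codon coding region, 5-base 3' UTR.
five_utr, cds, three_utr = split_mrna("GGCACCAUGGCUUAAGCAUU")
```

In this toy case only the middle segment would be translated; the leading segment corresponds to the 5′ untranslated region that the researchers' model focuses on.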

After training the model on a small variety of species, the researchers generated hundreds of new optimized sequences and validated those results through lab experiments. The best sequences outperformed several leading benchmarks for vaccine development, including a 33% increase in the overall efficiency of protein production.

Increasing protein production efficiency by even a small amount provides a major boost for emerging therapeutics, according to the researchers. Beyond COVID-19, mRNA vaccines promise to protect against many infectious diseases and cancers.

Wang, a professor of electrical and computer engineering and the principal investigator in this study, said the model’s success also pointed to a more fundamental possibility. Trained on mRNA from a handful of species, it was able to decode nucleotide sequences and reveal something new about gene regulation. Scientists believe gene regulation, one of life’s most basic functions, holds the key to unlocking the origins of disease and disorder. Language models like this one could provide a new way to probe it.

Wang’s collaborators include researchers from the biotech firm RVAC Medicines as well as the Stanford University School of Medicine.

The Language of Disease

The new model differs in degree, not kind, from the large language models that power today’s AI chat bots. Instead of being trained on billions of pages of text from the internet, their model was trained on a few hundred thousand sequences. The model also was trained to incorporate additional knowledge about the production of proteins, including structural and energy-related information.

The research team used the trained model to create a library of 211 new sequences. Each was optimized for a desired function, primarily an increase in the efficiency of translation, the step that produces proteins. Those proteins, like the spike protein targeted by COVID-19 vaccines, drive the immune response to infectious disease.

Previous studies have created language models to decode various biological sequences, including proteins and DNA, but this was the first language model to focus on the untranslated region of mRNA. In addition to a boost in overall efficiency, it was also able to predict how well a sequence would perform at a variety of related tasks.

Wang said the real challenge in creating this language model was in understanding the full context of the available data. Training a model requires not only the raw data with all its features but also the downstream consequences of those features. If a program is designed to filter spam from email, each email it trains on would be labeled “spam” or “not spam.” Along the way, the model develops semantic representations that allow it to determine what sequences of words indicate a “spam” label. Therein lies the meaning.
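The spam example can be sketched as a toy supervised classifier. This is only an illustration of learning from labeled examples, far simpler than a language model: it counts which words appear under each label and picks the label with the larger overlap. The training messages and function names are hypothetical.

```python
from collections import Counter

# Hypothetical labeled training data: (text, label) pairs.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday with the team", "not spam"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train(TRAINING)
first = classify(model, "free prize inside")
second = classify(model, "team meeting monday")
```

The labels are what let the counts carry meaning: without "spam" and "not spam" attached to each message, the word counts alone would say nothing about which words signal which outcome.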

Wang said looking at one narrow dataset and developing a model around it was not enough to be useful for life scientists. She needed to do something new. Because this model was working at the leading edge of biological understanding, the data she found was all over the place.

“Part of my dataset comes from a study where there are measures for efficiency,” Wang said. “Another part of my dataset comes from another study [that] measured expression levels. We also collected unannotated data from multiple resources.” Organizing those parts into one coherent and robust whole — a multifaceted dataset that she could use to train a sophisticated language model — was a massive challenge.

“Training a model is not only about putting together all those sequences, but also putting together sequences with the labels that have been collected so far. This had never been done before.”
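One way to picture the data problem Wang describes is a single record type whose labels are optional, so that sequences from studies measuring different things, and unannotated sequences, can sit in one dataset. The sketch below is an assumption for illustration, not the authors' pipeline; the class, field names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UtrRecord:
    """One sequence with whatever labels its source study measured."""
    sequence: str
    translation_efficiency: Optional[float] = None  # measured in some studies
    expression_level: Optional[float] = None        # measured in others

def merge(*sources):
    """Merge records from several sources, combining labels per sequence."""
    merged = {}
    for source in sources:
        for rec in source:
            cur = merged.setdefault(rec.sequence, UtrRecord(rec.sequence))
            if rec.translation_efficiency is not None:
                cur.translation_efficiency = rec.translation_efficiency
            if rec.expression_level is not None:
                cur.expression_level = rec.expression_level
    return list(merged.values())

# Hypothetical values for illustration only.
study_a = [UtrRecord("GGCACC", translation_efficiency=1.8)]
study_b = [
    UtrRecord("GGCACC", expression_level=0.6),
    UtrRecord("AUCGGA", expression_level=0.2),
]
unlabeled = [UtrRecord("CCGGUA")]
dataset = merge(study_a, study_b, unlabeled)
```

Here the first sequence ends up with both labels, the second with one, and the third with none, which is the kind of partially labeled whole a training procedure would then have to handle.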

The paper, “A 5′ UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions,” was published in Nature Machine Intelligence. Additional authors include Dan Yu, Yupeng Li, Yue Shen and Jason Zhang, from RVAC Medicines; Le Cong from Stanford; and Yanyi Chu and Kaixuan Huang from Princeton.


  • Published: 24 August 2020

The road ahead in genetics and genomics

  • Amy L. McGuire 1,
  • Stacey Gabriel 2,
  • Sarah A. Tishkoff (ORCID: orcid.org/0000-0002-1339-5959) 3, 4,
  • Ambroise Wonkam (ORCID: orcid.org/0000-0003-1420-9051) 5, 6,
  • Aravinda Chakravarti (ORCID: orcid.org/0000-0002-4264-2285) 7,
  • Eileen E. M. Furlong (ORCID: orcid.org/0000-0002-9544-8339) 8,
  • Barbara Treutlein (ORCID: orcid.org/0000-0002-3299-5597) 9,
  • Alexander Meissner (ORCID: orcid.org/0000-0001-8646-7469) 2, 10, 11, 12,
  • Howard Y. Chang (ORCID: orcid.org/0000-0002-9459-4393) 13,
  • Núria López-Bigas (ORCID: orcid.org/0000-0003-4925-8988) 14, 15, 16,
  • Eran Segal (ORCID: orcid.org/0000-0002-6859-1164) 17 &
  • Jin-Soo Kim (ORCID: orcid.org/0000-0003-4847-1306) 18

Nature Reviews Genetics volume 21, pages 581–596 (2020)


In celebration of the 20th anniversary of Nature Reviews Genetics , we asked 12 leading researchers to reflect on the key challenges and opportunities faced by the field of genetics and genomics. Keeping their particular research area in mind, they take stock of the current state of play and emphasize the work that remains to be done over the next few years so that, ultimately, the benefits of genetic and genomic research can be felt by everyone.

The contributors

Amy L. McGuire is the Leon Jaworski Professor of Biomedical Ethics and Director of the Center for Medical Ethics and Health Policy at Baylor College of Medicine. She has received numerous teaching awards at Baylor College of Medicine, was recognized by the Texas Executive Women as a Woman on the Move in 2016 and was invited to give a TedMed talk titled “There is No Genome for the Human Spirit” in 2014. In 2020, she was elected as a Hastings Center Fellow. Her research focuses on ethical and policy issues related to emerging technologies, with a particular focus on genomic research, personalized medicine and the clinical integration of novel neurotechnologies.

Stacey Gabriel has been the Senior Director of the Genomics Platform at the Broad Institute since 2012 and has led platform development, execution and operation since its founding. She is Chair of Institute Scientists and serves on the institute’s executive leadership team. She is widely recognized as a leader in genomic technology and project execution. She has led the Broad’s contributions to numerous flagship projects in human genetics, including the International HapMap Project, the 1000 Genomes Project, The Cancer Genome Atlas, the National Heart, Lung, and Blood Institute’s Exome Sequencing Project and the TOPMed programme. She is Principal Investigator of the Broad’s All of Us (AoU) Genomics Center and serves on the AoU Program Steering Committee.

Sarah A. Tishkoff is the David and Lyn Silfen University Associate Professor in Genetics and Biology at the University of Pennsylvania, Philadelphia, USA, and holds appointments in the School of Medicine and the School of Arts and Sciences. She is a member of the US National Academy of Sciences and a recipient of an NIH Pioneer Award, a David and Lucile Packard Career Award, a Burroughs/Wellcome Fund Career Award and an American Society of Human Genetics Curt Stern Award. Her work focuses on genomic variation in Africa, human evolutionary history, the genetic basis of adaptation and phenotypic variation in Africa, and the genetic basis of susceptibility to infectious disease in Africa.

Ambroise Wonkam is Professor of Medical Genetics, Director of GeneMAP (Genetic Medicine of African Populations Research Centre) and Deputy Dean Research in the Faculty of Health Sciences, University of Cape Town, South Africa. He has successfully led numerous NIH- and Wellcome Trust-funded projects over the past decade to investigate clinical variability in sickle cell disease, hearing impairment genetics and the return of individual findings in genetic research in Africa. He won the competitive Clinical Genetics Society International Award for 2014 from the British Society of Genetic Medicine. He is president of the African Society of Human Genetics.

Aravinda Chakravarti is Director of the Center for Human Genetics and Genomics, the Muriel G. and George W. Singer Professor of Neuroscience and Physiology, and Professor of Medicine at New York University School of Medicine. He is an elected member of the US National Academy of Sciences, the US National Academy of Medicine and the Indian National Science Academy. He has been a key participant in the Human Genome Project, the International HapMap Project and the 1000 Genomes Project. His research attempts to understand the molecular basis of multifactorial disease. He was awarded the 2013 William Allan Award by the American Society of Human Genetics and the 2018 Chen Award by the Human Genome Organization.

Eileen E. M. Furlong is Head of the Genome Biology Department at the European Molecular Biology Laboratory (EMBL) and a member of the EMBL Directorate. She is an elected member of the European Molecular Biology Organization (EMBO) and the Academia Europaea, and a European Research Council (ERC) advanced investigator. Her group dissects fundamental principles of how the genome is regulated and how it drives cell fate decisions during embryonic development, including how developmental enhancers are organized and function within the 3D nucleus. Her work combines genetics, (single-cell) genomics, imaging and computational approaches to understand these processes. Her research has advanced the development of genomic methods for use in complex multicellular organisms.

Barbara Treutlein is Associate Professor of Quantitative Developmental Biology in the Department of Biosystems Science and Engineering of ETH Zurich in Basel, Switzerland. Her group uses and develops single-cell genomics approaches in combination with stem cell-based 2D and 3D culture systems to study how human organs develop and regenerate and how cell fate is regulated. For her work, Barbara has received multiple awards, including the Friedmund Neumann Prize of the Schering Foundation, the Dr. Susan Lim Award for Outstanding Young Investigator of the International Society of Stem Cell Research and the EMBO Young Investigator Award.

Alexander Meissner is a scientific member of the Max Planck Society and currently Managing Director of the Max Planck Institute (MPI) for Molecular Genetics in Berlin, Germany. He heads the Department of Genome Regulation and is a visiting scientist in the Department of Stem Cell and Regenerative Biology at Harvard University. Before his move to the MPI, he was a tenured professor at Harvard University and a senior associate member of the Broad Institute, where he co-directed the epigenomics programme. In 2018, he was elected as an EMBO member. His laboratory uses genomic tools to study developmental and disease biology with a particular focus on epigenetic regulation.

Howard Y. Chang is the Virginia and D. K. Ludwig Professor of Cancer Genomics at Stanford University and an investigator at the Howard Hughes Medical Institute. He is a physician–scientist who has focused on deciphering the hidden information in the non-coding genome. His laboratory is best known for studies of long non-coding RNAs in gene regulation and development of new epigenomic technologies. He is an elected member of the US National Academy of Sciences, the US National Academy of Medicine, and the American Academy of Arts and Sciences.

Núria López-Bigas is ICREA research Professor at the Institute for Research in Biomedicine and Associate Professor at the University Pompeu Fabra. She obtained an ERC Consolidator Grant in 2015 and was elected as an EMBO member in 2016. Her work has been recognized with the prestigious Banc de Sabadell Award for Research in Biomedicine, the Catalan National Award for Young Research Talent and the Career Development Award from the Human Frontier Science Program. Her research focuses on the identification of cancer driver mutations, genes and pathways across tumour types and in understanding the mutational processes that lead to the accumulation of mutations in cancer cells.

Eran Segal is Professor in the Department of Computer Science and Applied Mathematics at the Weizmann Institute of Science, heading a multidisciplinary laboratory with extensive experience in machine learning, computational biology and analysis of heterogeneous high-throughput genomic data. His research focuses on the microbiome, nutrition and genetics, and their effect on health and disease and aims to develop personalized medicine based on big data from human cohorts. He has published more than 150 publications and received several awards and honours for his work, including the Overton and the Michael Bruno awards. He was recently elected as an EMBO member and as a member of the Israel Young Academy.

Jin-Soo Kim is Director of the Center for Genome Engineering in the Institute for Basic Science in Daejeon, South Korea. He has received numerous awards, including the 2017 Asan Award in Medicine, the 2017 Yumin Award in Science and the 2019 Research Excellence Award (Federation of Asian and Oceanian Biochemists and Molecular Biologists). He was featured as one of ten Science Stars of East Asia in Nature (558, 502–510 (2018)) and has been recognized as a highly cited researcher by Clarivate Analytics since 2018. His work focuses on developing tools for genome editing in biomedical research.


Making genomics truly equitable

Amy McGuire. For the field of genetics and genomics, the first decade of the twenty-first century was a time of rapid discovery, transformative technological development and plummeting costs. We moved from mapping the human genome, an international endeavour that took more than a decade and cost billions of dollars, to sequencing individual genomes for a mere fraction of the cost in a relatively short time.

During the subsequent decade, the field turned towards making sense of the vast amount of genomic information being generated and situating it in the context of one’s environment, lifestyle and other non-genetic factors. Much of the hype that characterized the previous decade was tempered as we were reminded of the exquisite complexity of human biology. A vision of medicine driven by genetically determined risk predictions was replaced with a vision of precision in which genetics, environment and lifestyle all converge to deliver the right treatment to the right patient at the right time 1 .

As we embark on the third decade of this century, we are now faced with the prospect of being able not only to more accurately predict disease risk and tailor existing treatments on the basis of genetic and non-genetic factors but also to potentially cure or even eliminate some diseases entirely with gene-editing technologies.

These advancements raise many ethical and policy issues, including concerns about privacy and discrimination, the right of access to research findings and direct-to-consumer genetic testing, and informed consent. Significant investment has been made to better understand the risks and benefits of clinical genomic testing, and there has been vigorous debate about the ethics of human gene editing, with many prominent scientists and bioethicists calling for a moratorium on human germline editing until it is proven to be safe and effective and there is broad societal consensus on its appropriate application 2 .

These are all important issues that we need to continue to explore, but as the technologies that have been developed and tested at warp speed over the past two decades begin to be integrated into routine clinical care, it is imperative that we also confront one of the most difficult and fundamental challenges in genomics, in medicine and in society — rectifying structural inequities and addressing factors that privilege some while disadvantaging others. The genomics of the future must be a genomics for all, regardless of ethnicity, geography or ability to pay.

This audacious goal of making genomics truly equitable requires multifaceted solutions. The disproportionate burden of illness and death among racial and ethnic minorities associated with the global COVID-19 pandemic 3 and recent protests against police brutality towards African American citizens 4 have strengthened the antiracism movement and amplified demands for racial equity.

To be part of this movement and effect change will require humility. We must actively listen and learn from each other, especially when it is uncomfortable and our own complicity may be implicated. It will require solidarity and a recognition that we are all connected through our common humanity. And it will require courage. It may seem like a platitude, but it is true that nothing will change unless actual change is made. If we continue to do things as they have always been done, we will end up where we have always been. It is time to step into the discomfort and dare to do something different.

So what can we do differently to make genomics more equitable? I propose three areas where we should focus attention to address this important question. First, we must ensure equitable representation in genomic research. Examining 2,511 studies involving nearly 35 million samples from the GWAS Catalog in 2016, Popejoy and Fullerton found that the vast majority (81%) come from individuals of European descent, with only 5% coming from non-Asian minority populations 5 . This has created an ‘information disparity’ that has an impact on the reliability of clinical genomic interpretation for under-represented minorities 6 . The US National Institutes of Health (NIH) has invested in efforts to increase diversity in genomic research, but to be successful these efforts must be accompanied by serious attention to earning the trust of disadvantaged and historically mistreated populations. This will require, at a minimum, more meaningful engagement, improved transparency, robust systems of accountability, and a commitment to creating opportunities that promote and support a genomics workforce that includes scientists and clinicians from under-represented populations.

It is insufficient, however, to achieve diverse representation in genomic research; there must also be equitable access to the fruits of that research. An analysis of the US Centers for Disease Control and Prevention’s 2018 Behavioural Risk Factor Surveillance System found that non-elderly adults from self-identified racial or ethnic minority groups are significantly less likely to see a doctor because of cost than non-elderly white adults 7 . This finding reflects how the structure and financing of health care in the United States perpetuates inequities and contributes to the larger web of social injustice that is at the heart of the problem. Even when socio-economic factors are controlled for, racial disparities in access to genetic services persist 8 . Large-scale, sustained research is needed to better understand and actively address the multitude of factors that contribute to this, including issues related to structural racism, mistrust, implicit and explicit bias, a lack of knowledge of genetic testing, and concerns about misuse of genetic information.

Finally, and perhaps most daunting, we must strive to achieve more equitable outcomes from genomic medicine. Many racial and ethnic minorities disproportionately experience chronic disease and premature death compared with white individuals. Disparities also exist by gender, sexual orientation, age, disability status, socio-economic status and geographical location. Health outcomes are heavily influenced by social, economic and environmental factors. Thus, although providing more equitable access to genomic services and ensuring more equitable representation in genomic research are necessary first steps, they are not enough 9 . Genomics can only be part of the solution if it is integrated with broader social, economic and political efforts aimed at addressing disparities in health outcomes. For genomics to be truly equitable, it must operate within a just health-care system and a just society.

Genome sequencing at population scale

Stacey Gabriel. Twenty years ago, I finished a PhD project that involved laboriously sequencing one gene — a rather complicated one, RET — in a couple of hundred people to catalogue pathogenic variants for Hirschsprung disease. This work required designing primers on the basis of genome sequence data as they were gradually released, amplifying the gene exon by exon (all 20!), running sequencing gels and manually scoring sequence changes. The notion of sequencing the whole genome to catalogue sequence changes was something to wish for in our wildest dreams.

Thanks to great strides in technology and the hard work of geneticists, engineers, epidemiologists and clinicians, much progress has been made; huge numbers of genomes (and exomes) have been sequenced across the world. Disease gene-finding projects such as my graduate work are now done routinely, rather than one gene at a time, using whole-exome or whole-genome sequencing (WGS) in families and affected individuals, enabling the identification of genes and causative mutations in thousands of Mendelian diseases and some complex diseases.

But the real promise of genome sequencing lies in true population-scale sequencing, ultimately at the scale of tens of millions of individuals, whereby genome sequencing of unselected people enables the unbiased, comprehensive study of our genome and the variation therein. It provides a ‘lookup table’ to catalogue disease-causing and benign variants (our ‘allelic series’). The genome sequence should become part of the electronic health record; it is a stable, persistent source of information about a person akin to physical measurements such as weight or blood pressure, exposures such as smoking or alcohol use, and (in many ways better than) self-reported family history.

What can we learn? What needs to be solved? Even fairly small numbers of genomes, aggregated in a consistent and searchable form, have in just the past couple of years enabled a new way to use and interpret genomic data, providing a glimpse of the future. Efforts such as gnomAD 10 are a start; this database contains data from more than 15,000 genomes and 125,000 exomes. With this resource, the frequency of genetic variants within populations is readily available. A clinician interpreting the genome of a patient can ask whether a variant has been observed before. The data provide a starting point for assessing the functional impact of classes of genetic variation and the ability to ask questions about ‘missing’ genetic variation where there is constraint.
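The 'lookup table' idea can be illustrated with a minimal Python sketch: aggregated allele counts keyed by variant, from which a population frequency is derived. This is not the gnomAD data model or API; the table, coordinates, counts and function name below are all hypothetical.

```python
# Hypothetical aggregated counts keyed by (chromosome, position, ref, alt).
# "ac" is the number of alternate alleles observed; "an" is the number of
# alleles genotyped at that site across the whole cohort.
VARIANT_COUNTS = {
    ("1", 1000100, "C", "T"): {"ac": 3, "an": 250000},
    ("1", 1000250, "G", "A"): {"ac": 41250, "an": 250000},
}

def allele_frequency(chrom, pos, ref, alt):
    """Return the population allele frequency, or None if never observed."""
    entry = VARIANT_COUNTS.get((chrom, pos, ref, alt))
    if entry is None or entry["an"] == 0:
        return None
    return entry["ac"] / entry["an"]

rare = allele_frequency("1", 1000100, "C", "T")
common = allele_frequency("1", 1000250, "G", "A")
unseen = allele_frequency("1", 1000300, "A", "G")
```

A variant that is common in the population is unlikely to cause a rare severe disease, whereas a very rare or never-observed variant remains a candidate, which is roughly the question a clinician would put to such a table.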

Building up population-scale databases that couple genomic and clinical information will fuel better risk interpretation using polygenic risk scores (PRSs) 11 . More routine WGS will shorten the ‘diagnostic odyssey’, in which patients suffer through rounds of testing and parents are left uncertain about future reproductive planning. More efficient clinical trials might be built using genomic information. With existing genomic information on all individuals in a health system, trials could be designed to select the individuals most likely to have an event. This enrichment could yield more promising, shorter, smaller and cheaper trial designs.
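In its simplest common form, a polygenic risk score is a weighted sum: each variant's effect size multiplied by the number of copies of the effect allele a person carries. The sketch below assumes that basic additive form; the variant names, effect sizes and genotype are hypothetical.

```python
def polygenic_risk_score(effect_sizes, dosages):
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant)."""
    return sum(
        beta * dosages.get(variant, 0)
        for variant, beta in effect_sizes.items()
    )

# Hypothetical per-variant effect sizes (for example, log odds ratios).
EFFECT_SIZES = {"variant_a": 0.25, "variant_b": -0.25, "variant_c": 0.125}

# Hypothetical genotype: copies of the effect allele carried by one person.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(EFFECT_SIZES, person)
```

Ranking the individuals in a health system by such a score is one way the trial enrichment described above could select those most likely to have an event.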

These databases must also be built rapidly and in a way that is representative of the population, reflecting its actual racial and ethnic diversity, not just what was available as banked sample collections. Banked collections are well known to consist predominantly of European-descent samples; this precludes the application of risk prediction tools in non-white individuals and has limited the ability to find population-specific genetic associations, such as those that have been demonstrated in type 2 diabetes mellitus (T2DM) 12 .

We have to solve important issues — data sharing, privacy and getting the data to scale. Sharing genomic and clinical data is of key importance to drive forward discovery and our understanding of how to use these data in the health-care setting. To do this well and responsibly, trust must be built and maintained through adherence to the rights of privacy, protection and non-discrimination. Progress is being made through the creation of data platforms and the development of frameworks for data protection and sharing; for example, by the work of the Global Alliance for Genomics and Health (GA4GH).

Several large biobanks are already being established to launch population-scale efforts. The UK Biobank is a vanguard programme that contains genotype data, questionnaire-based health and physical measurements on 500,000 individuals and some linkage to their medical records. Other efforts such as the All of Us research programme have been launched with goals directed at true population-based representation, and biobanks that link genomic data to comprehensive medical records in specific health-care systems (for example, Geisinger) or in specific countries or regions (for example, Estonia and Iceland) are also under way.

A big piece of this puzzle is generating comprehensive genome sequence data in these programmes and far beyond. For this aim, large-scale, affordable sequencing is key. No problem, right? Is sequencing not always getting cheaper? The problem is that this assumption is no longer true. We have got to where we are today because for a long time, from 2008 to 2013, sequencing costs dropped exponentially. However, in recent years, the sequencing cost curve has flattened, as is apparent in publicly reported cost estimates provided by the US National Human Genome Research Institute 13 . The cost per megabase of sequence data has remained largely unchanged since around 2016, hovering around a list price of US$0.01 per megabase, which translates to a US$1,000 genome. Gone are the days of our field touting the impressive decrease of cost in comparison with Moore’s law, and this development is worrying.
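
As a rough check on the conversion from a per-megabase price to a per-genome price, the sketch below multiplies the quoted price by the number of megabases sequenced. The genome size (about 3,100 Mb) and the 30× coverage depth are assumptions that are not given in the text; they are common round figures for short-read WGS, and the function name is purely illustrative.

```python
def genome_cost_usd(price_per_mb, genome_size_mb=3100, coverage=30):
    """List-price cost of sequencing one genome.

    genome_size_mb and coverage are assumptions (not from the text):
    roughly 3,100 Mb per haploid human genome, read about 30 times over.
    """
    megabases_sequenced = genome_size_mb * coverage
    return price_per_mb * megabases_sequenced


# US$0.01 per megabase is the list price quoted in the text.
cost = genome_cost_usd(0.01)
print(f"about US${cost:,.0f} per genome")
```

Under these assumptions the result lands just under US$1,000, consistent with the figure quoted above; a different coverage depth would scale the estimate proportionally.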

Some discounting does happen at considerable volume, and whole genomes can be priced in the range of US$500 to US$700. However, large projects (more than 500,000 samples) sequenced at these prices are few and far between, and are generally dependent on pharmaceutical or biotech funding, which can bring with it restrictions on data sharing. It is my belief that a fivefold to sevenfold reduction in total costs is needed to unlock more sequencing at the population scale and, ultimately, for genome sequencing to be more widely applied in the health-care setting. At US$100 per genome, the cost represents less than 1% of the annual average health-care expenditure per person in the United States, and a genome sequence is a one-time investment that can be referenced again and again over the entire lifespan of a person. Getting that cost curve down will be important to inspire health-care systems to adopt genome sequencing routinely.
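
The arithmetic behind these figures can be sketched as follows. The prices (US$500 to US$700 today, US$100 as the target) and the 1% share come from the text; the helper names are illustrative, and the second function simply inverts the 1% statement to show the level of per-person spending it implies, without assuming any particular expenditure figure.

```python
def fold_reduction(current_price, target_price):
    """How many times cheaper the target price is than the current price."""
    return current_price / target_price


def min_expenditure_for_share(cost, max_share):
    """Annual spending at which `cost` equals `max_share` of the total.

    Any spending above this value keeps the cost below that share.
    """
    return cost / max_share


TARGET_USD = 100  # target price per genome, from the text

low = fold_reduction(500, TARGET_USD)    # discounted price, lower bound
high = fold_reduction(700, TARGET_USD)   # discounted price, upper bound
threshold = min_expenditure_for_share(TARGET_USD, 0.01)

print(f"required reduction: {low:.0f}-fold to {high:.0f}-fold")
print(f"US$100 is under 1% of spending above US${threshold:,.0f} per year")
```

The fivefold to sevenfold range follows directly from the discounted prices, and the 1% claim holds whenever annual per-person health-care spending exceeds US$10,000.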

I see three main drivers that will get us to US$100 per genome: innovation, scale and competition.

Innovation. Generating sequence data requires multiple components, and several of them are ripe for innovation. Sample preparation can be improved through more efficient methods that decrease the labour required, and miniaturization can decrease the cost of the reagents used in library preparation. Data processing is another area where costs can be lowered. Recently, we showed that processing using optimized computing power lowered the time and cost of creating a sequence file by ~50% (S.G., unpublished observations). While decreases in the costs of sample preparation and data processing are important, they represent a small component of the total cost. Roughly 70% of the cost of sequencing a human genome is the sequencing reagent (flow cell) and the instrument. Appreciable cost decrease is made possible only by decreasing these marginal costs, as was demonstrated in the period from 2010 to 2014, when flow-cell densities doubled and sequencing cost dropped by an order of magnitude (US$100 per gigabase to US$10 per gigabase).
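
A small sketch makes the point about cost components concrete. The 70% share for sequencing reagent and instrument comes from the text; the even split of the remaining 30% between sample preparation and data processing is a hypothetical assumption made only for illustration, as are the names used in the code.

```python
def remaining_cost(shares, cuts):
    """Fraction of today's per-genome cost left after applying cuts.

    shares: component -> fraction of the current total cost
    cuts:   component -> fractional reduction for that component (0 to 1)
    """
    return sum(share * (1.0 - cuts.get(name, 0.0))
               for name, share in shares.items())


shares = {
    "sequencing": 0.70,       # reagent (flow cell) and instrument, from the text
    "sample_prep": 0.15,      # hypothetical split of the remaining 30%
    "data_processing": 0.15,  # hypothetical split of the remaining 30%
}

# Halving data-processing cost, in the spirit of the ~50% saving cited above.
halved_processing = remaining_cost(shares, {"data_processing": 0.5})

# Even if everything except sequencing were free, most of the cost remains.
only_sequencing_left = remaining_cost(
    shares, {"sample_prep": 1.0, "data_processing": 1.0})

print(f"after halving data processing: {halved_processing:.1%} of today's cost")
print(f"with all other costs at zero:  {only_sequencing_left:.1%} of today's cost")
```

However the remaining 30% is divided, removing it entirely still leaves 70% of the current cost, which falls well short of a fivefold reduction. Under this decomposition, the sequencing reagent and instrument cost has to fall for the target to be reachable.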

Scale. One component of cost is the fixed cost borne by the sequencing centre or the sequencing vendor. With high scale, centres can become more efficient and offset costs such as the costs of personnel, equipment and facilities. Scale can also result in volume discounting of the reagents, although this process is tightly controlled and approached cautiously depending on overall market dynamics.

Competition. Innovation and scale can only achieve so much. The cost of generating the data (the cost per gigabase) dominates and thus must come down considerably. The current market requires alternative options to drive this advance. Presently, the market for short-read sequencing is lacking viable, proven competition that would force flow-cell densities and machine yield to be increased and put pressure on volume discounting. While options for long-read sequencing exist and play a role in particular applications, such as de novo sequencing and structural variant resolution, they are at present far from cost competitive and, therefore, do not apply pressure to bring down the cost of routine WGS.

We need innovation, great economies of scale and/or real competition to come into play in the marketplace. When it comes to sequencing technology, particularly at a large scale, we cannot be complacent and work around the current barriers to realize small gains and one-off wins. This might involve specific types of investment beyond just financial ones; adopting and vetting new technology requires time, creativity, commitment and patience. It is a challenge for our community to take on now. In 5 years’ time, I hope we can look back at the era of the US$100 genome and progress towards real population-scale databases that fuel discovery, enriching our knowledge of the human allelic series and, importantly, the routine use of genomic data in the health-care setting.

A global view of human evolution

Sarah Tishkoff. The past 10 years saw an exponential increase in SNP array and high-coverage WGS data owing to innovations in genomic technologies. It is now possible to generate WGS data from tens of thousands of individuals (for example, GenomeAsia 100K 14 and NIH TOPMed 15 ). An increase in medical biobanks with access to electronic health records (for example, the UK Biobank 16 , the Million Veteran Project 17 and BioBank Japan 18 ) is enabling the mapping of hundreds of genetic associations with complex traits and diseases, as well as phenome-wide association studies 19 to map pleiotropic associations of phenotypes with genes. The genetic associations identified in these and other studies have been used to calculate PRSs for predicting complex phenotypes and risk of diseases.

Yet despite these advances, as of 2019, nearly 80% of individuals in genome-wide association studies (GWAS) were of European ancestries, ~10% were of East Asian ancestries, ~2% were of African ancestries, ~1.5% were of Hispanic ancestries and less than 1% were of other ancestries 20 . There is also a strong European bias in genomic reference databases, such as gnomAD and GTEx . These biases limit our knowledge of genetic risk factors for disease in ethnically diverse populations and could exacerbate health inequities 20 . Furthermore, PRSs that were estimated using European data do not accurately predict phenotypes and disease risk in non-European populations, performing worst in individuals with African ancestry 21 . The lack of transportability of PRSs across ethnic groups is likely due to differences in patterns of linkage disequilibrium and haplotype structure (resulting in different SNPs tagging causal variants), differences in allele frequencies, gene × gene effects and gene × environment effects. It is also possible that the genetic architecture of complex traits and diseases may differ across ethnic groups owing to different demographic histories and adaptation to diverse environments.
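
To make the transportability problem concrete, the sketch below uses the usual additive form of a PRS: a weighted sum of effect-allele counts, with weights taken from GWAS effect sizes. All weights, genotypes and allele frequencies here are hypothetical values chosen for illustration, and the function names are not from any particular software package.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("one dosage is needed per weight")
    return sum(d * w for d, w in zip(dosages, weights))


def expected_score(allele_freqs, weights):
    """Population mean score; the mean dosage at a variant is 2 * frequency."""
    return sum(2 * p * w for p, w in zip(allele_freqs, weights))


# Hypothetical per-variant weights, as if estimated in one discovery cohort.
weights = [0.10, -0.05, 0.20]

# One individual's genotype at the three variants.
individual = [2, 1, 0]
score = polygenic_risk_score(individual, weights)

# Hypothetical effect-allele frequencies in two populations.
mean_a = expected_score([0.5, 0.3, 0.1], weights)
mean_b = expected_score([0.2, 0.3, 0.4], weights)

print(f"individual score: {score:.2f}")
print(f"mean score, population A: {mean_a:.2f}; population B: {mean_b:.2f}")
```

With identical weights, the two populations have different average scores purely because their allele frequencies differ, so a threshold calibrated in one is not directly meaningful in the other. This sketch covers only the allele-frequency component; differences in linkage disequilibrium and in true effect sizes, also noted above, would additionally make the weights themselves inaccurate.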

Although there have been initiatives to increase inclusion of ethnically diverse populations in human genomics research (for example, the NIH TOPMed 15 and H3Africa consortia), Indigenous populations remain under-represented. Great care must be taken to ensure that genomic research of minority and Indigenous populations is conducted in an ethical manner. This involves establishing partnerships with local research scientists, being sensitive to local customs and cultural concerns, obtaining both community and individual consent, and returning results to communities that participated when possible. In addition, there should be training and capacity building so that genomic research can be conducted locally, where feasible.

A particular area of focus in the future should be developing tools and resources that make genomic data and analyses accessible in low- and middle-income countries. We have to ensure that all people benefit from the genomics revolution and advances in precision medicine and gene editing. Thus, several of the biggest challenges in the next decade will be (1) increasing the inclusion of ethnically diverse populations in human genomics research; (2) generating more diverse reference genomes using methods that produce long sequencing reads and haplotype phasing, to account for the large amount of structural variation that likely exists within and between populations; (3) training a more diverse community of genomic research scientists; and (4) developing better methods for accurately predicting phenotypes and genetic risk across ethnically diverse populations and for distinguishing gene × environment effects.

The inclusion of ethnically diverse populations, including Indigenous populations, is also critical for reconstructing human evolutionary history and understanding the genetic basis of adaptation to diverse environments and diets. While there have been a number of success stories for identifying genes of large effect that play a role in local adaptation (for example, lactose tolerance and sickle cell disease (SCD) associated with malaria resistance), identifying signatures of polygenic selection has been considerably more challenging 22 . Genomic signatures of polygenic adaptation are based on the ability to detect subtle shifts in allele frequencies at hundreds or thousands of loci with minor effect on the phenotype of a complex trait and to determine whether that shift is a result of demography or natural selection. A more daunting challenge arises from the same issues of portability of PRSs described earlier: variants associated with a complex trait may not tag well across ethnic groups and/or the genetic architecture of a trait may differ in different populations. Furthermore, it has recently been shown that uncorrected population stratification can result in a false signal of polygenic selection 23 . For example, several studies have identified signatures of polygenic adaptation for height across European populations (selection for increased height in northern Europeans and for decreased height in southern Europeans). However, it was recently shown that these results were influenced by population structure that could not be easily corrected using standard approaches, particularly for SNPs below genome-wide levels of significance 23 . When this analysis was repeated with variants identified in a more homogeneous set of individuals of European ancestry from the UK Biobank, these signatures of polygenic adaptation were erased 23 .
Thus, methods for detecting polygenic adaptation that are less biased by population structure and by population ascertainment bias will need to be developed in the future. These studies will also benefit from inclusion of more ethnically diverse populations in GWAS and identification of better tag SNPs as described earlier. A challenge of inclusion of minority populations in GWAS is that sample sizes are often small relative to majority populations. However, the high levels of genetic diversity and extremes of phenotypic diversity observed in some populations, particularly those from Africa, make them particularly informative for GWAS. For example, a GWAS of skin pigmentation in fewer than 1,600 Africans was informative for identifying novel genetic variants that affect skin colour, including a previously uncharacterized gene, MFSD12 (ref. 24 ). Thus, genomic studies in the future must make inclusion of minority populations a priority.

A challenge in both GWAS and selection scans has been the identification of causal genetic variants that directly have an impact on variable traits. Most of these variants are in non-coding regions of the genome. The development of high-throughput approaches, such as massively parallel luciferase expression assays to identify gene regulatory regions and high-throughput CRISPR screens in vitro and in vivo to identify functional variants influencing the trait of interest, will be useful 25 . There is also a need to better understand cell type-specific variation and gene regulation at the single-cell level, including response to stimuli such as immune, pharmacological and nutrient challenges, in ethnically diverse populations. However, these approaches are still limited by the need to have informative cell lines. These can be particularly challenging to obtain for Indigenous populations living in remote regions. Improvements in the differentiation of induced pluripotent stem cells (iPS cells) into assorted cell types and into organoids will be important for facilitating functional genomic studies. Establishment of iPS cells and organoids from diverse non-human primate species will also be informative for comparative genomic studies to identify the evolution of human-specific traits such as brain development and cognition. However, iPS cell-derived cells may not accurately reflect the impact of mutations acting on developmental phenotypes, which will require development of more efficient in vivo approaches in model organisms.

Perhaps the biggest revolution in the study of recent human evolutionary history has been the development of methods that make it feasible to sequence and/or obtain targeted genotypes from ancient DNA samples. The generation of high-coverage reference genomes for archaic hominid species such as Neanderthals and Denisovans, located in Eurasia, has made it feasible to identify archaic introgressed segments within the genomes of non-Africans. Some of these regions have been shown to play a role in adaptive traits such as adaptation to high altitude and immune response 26 . Furthermore, there has been an explosion of studies of ancient genetic variation in Europeans within the past 30,000 years that has demonstrated a much more complex model of the peopling of Europe, and the recent evolution of adaptive traits, than previously known from the archaeological record or from studies of modern populations 27 . The biggest challenge has been the inability to get high-quality ancient DNA from regions with a tropical climate, such as Africa and Asia. While there has been success in analysing DNA samples as old as 15,000 years in Africa, which has been informative for tracing recent migration and admixture events 28 , the lack of a more ancient African reference genome makes it very challenging to detect archaic introgression, which currently relies on statistical modelling approaches. Thus, the biggest challenge in the next 10 years will be the successful sequencing of ancient DNA more than 20,000 years old from all regions of the world, so that we may have a better understanding of the complex web of population histories from across the globe.

African genomics — the next frontier

Ambroise Wonkam. To fully meet the potential of global genetic medicine, research into African genomic variation is a scientific imperative, with equitable access being a major challenge to be addressed. Studying African genomic variation represents the next frontier of genetic medicine for three major reasons: ancestry, ecology and equity.

On the basis of a ‘pan-genome’ generated from 910 individuals of African descent, at least 300 million DNA variants (10%) are not found in the current human reference genome 29 , and 2–19% of the genome of ancestral Africans derives from poorly investigated archaic populations that diverged before the split of Neanderthals and modern humans 30 . Neanderthal genome contributions make up ~2% of the genome in present-day Europeans and are enriched for variations in genes involved in dermatological phenotypes, neuropsychiatric disorders and immunological functions 31 . Once technical challenges in sequencing poor-quality DNA have been overcome and approaches to investigate the genomic contribution of African archaic populations have been refined, it is likely that associations between variants in ancient African DNA and human traits or diseases will be found, providing insights that can benefit modern-day humans.

As a consequence of the 300,000–500,000 years of genomic history of modern humans in Africa, ancestral African populations are the most genetically diverse in the world. By contrast, there is an extreme genetic bottleneck, resulting in much less variation, in all non-African populations who evolved from the thousands of humans who migrated out of Africa approximately 70,000 years ago. Current PRSs, which aim to predict the risk for an individual of a specific disease on the basis of the genetic variants that individual harbours, exhibit a bias regarding usability and transferability across populations, as most PRSs do not account for multiple alleles that are either limited or of high frequency among Africans. A GWAS on the genetic susceptibility to T2DM identified a previously unreported African-specific significant locus, while showing transferability of 32 established T2DM loci 32 . In addition, nonsense mutations found commonly among Africans in PCSK9 , which are rare in Europeans 33 , are associated with a 40% reduction in plasma levels of low-density lipoprotein, supporting PCSK9 as a target for dyslipidaemia therapeutics. In the largest GWAS meta-analysis for 34 complex traits, conducted in 14,345 Africans, several loci had limited transferability among cohorts 34 , further illustrating that genomic variation is highest among Africans compared with other populations. As a consequence, linkage disequilibrium is lower in Africans, which improves fine mapping and identification of causative variants. Indeed, while only 2.4% of participants in large GWAS are African individuals, they account for 7% of all associations 35 . Moreover, whole-exome sequencing of nearly 1,000 African study participants of Xhosa ancestry with schizophrenia found very rare damaging mutations in multiple genes 36 , a finding that could be replicated in a Swedish cohort of 5,000 individuals. 
In comparison, results for the Xhosa cohort yielded larger effect sizes, which shows that for the same number of cases and controls, the greater genetic variation in African populations provides more power to detect genotype–phenotype relationships. Therefore, millions of African genomes must be sequenced, with genotyping and analysis tools optimized for their interrogation.

Greater availability of African genomes will improve our understanding of genomic variation and complex trait associations in all populations but will also support research into common monogenic diseases. The discovery of a single African origin of the SCD mutation, about 5,000–7,000 years ago, not only suggested recent migration and admixture events between Africans and Mediterranean and/or Middle Eastern populations but also enhanced our understanding of genetic variation in general as well as its potential impact on haemoglobinopathies 37 . For example, variants in the HBB -like gene cluster linked with high levels of fetal haemoglobin have been associated with less severe SCD; because the level of fetal haemoglobin is under genetic control, it is amenable to therapeutic manipulation by gene editing 38 . Moreover, knowledge of an individual’s genetic variants can have an impact on secondary prevention of and treatment strategies for SCD. For example, variants in APOL1 and HMOX1 and co-inheritance of α-thalassaemia are associated with kidney dysfunctions 39 ; stroke in SCD is associated with targeted genetic variants used in a Bayesian model; and overall SCD mortality has been associated with circulating transcriptomic profiles. It is estimated that 75% of the 305,800 babies with SCD born each year are born in Africa; SCD in Africa will serve as a model for understanding the impact of genetic variation on common monogenic traits and help to illustrate the multiple layers of genomic medicine implementation.

Exploring African genomic diversity will also increase discovery of novel variants and genes for rare monogenic conditions. Indeed, allelic and locus heterogeneity display important differences in African individuals compared with other populations; for example, mutations in GJB2 account for nearly 50% of cases of congenital non-syndromic hearing impairment among Eurasians but are nearly non-existent in Africans, and there is evidence that novel variants in hearing impairment-associated genes are more likely to be found in Africans than in populations of European or Asian ancestries 40 . Higher fertility rate, consanguinity practices and regional genetic bottlenecks will improve novel gene discovery for monogenic diseases in Africa, as well as disease–gene pair curation, and will address existing challenges surrounding database biases and inference of variant deleteriousness, which have led to the misclassification of variants.

Differential population genomic variant frequencies are shaped by natural evolutionary selection as an adaptation to environmental pressures. The African continent follows a North–South axis, which is associated with variable climates and biodiversity, both motors of natural selection. This specific African ecology has shaped genetic variation accordingly, which can have a detrimental or positive impact on health. Obvious examples are variants that cause SCD but confer resistance to malaria 37 , APOL1 variants that are protective against trypanosomes (the parasites that cause sleeping sickness) 41 and variants of OSBPL10 and RXRA that protect against dengue fever 42 . Unfortunately, APOL1 variants also increase susceptibility to chronic kidney disease in populations of African ancestry 39 , 41 . A better understanding of the functional impact of genetic variants specific to African populations, particularly those that have been selected under environmental pressure, and the way they interact with each other is needed and will have a positive impact on genetic medicine practice. Moreover, immunogenetic studies among Africans will further our understanding of natural selection and responses to emerging infectious diseases, such as COVID-19.

The scientific imperative of genomic research of African populations is expected to enhance genetic medicine knowledge and practice in Africa but will face the challenges of overburdened and under-resourced public health-care systems, and often absent ethical, legal and social implication frameworks 43 , requiring international collaboration to be managed. Developing an African genomics workforce will be necessary to meet the major need for research across the lifespan for cohorts of millions of individuals with complex or monogenic diseases. Such endeavours can thrive on the foundation of recently established initiatives such as H3Africa. Indeed, equitable access for Africans is essential if African genomics is to reach its full potential as the next frontier of global genetic medicine.

Decoding multifactorial phenotypes

Aravinda Chakravarti. We live in a time of great technological progress in genomics and computing. And we live in a time when ‘genetics’ is a household word, with a public increasingly adept at understanding its relevance to their own lives. Not surprisingly, the study of genetics is being reinvented, rediscovered and reshaped, and we are beginning to understand the science of human heredity at a resolution that was impossible before.

The most significant genetics puzzle today, in my view, is the dissection of ‘family resemblance’ of complex phenotypes, both for intellectual (raison d'être of genetics) and practical (disease diagnosis and therapy) reasons. We have long known that family resemblance arises from shared alleles, declining as genetic relationship wanes, but the precise molecular components and composition of this resemblance are still poorly understood. At the turn of the twentieth century, the components were a matter of bitter and acrimonious debate 44 between the ‘Mendelians’ and the ‘Biometricians’, until the opposing views were reconciled by Ronald Fisher’s 1918 analysis 45 that complex inheritance could be explained through segregation of many genes, each individually Mendelian. In 1920, its publication delayed by World War I, this notion was elegantly demonstrated by the experimental studies of Altenburg and Muller using truncate wing , an “inconstant and modifiable character” 46 in Drosophila .

Fisher’s model assumed an infinite number of genes additively contributing to a trait, with common genetic variation at each component locus comprising two alleles that differ only slightly in their genetic effects 45 ; these genetic assumptions were quite contrary to what was then known 44 . Throughout the past century, this view matured, as segregation analyses of human phenotypes taught us that, beyond the effects of some major genes, most trait variation was polygenic, modulated by family-specific and random environmental factors 47 . Today, we have empirical evidence from GWAS, which use dense maps of genetic variants on hundreds of thousands of individuals measured for many traits and diseases, that the genetic architecture of most multifactorial traits consists of common sequence variants with small allelic differences at thousands of sites across the genome 48 . This replacement of a pan-Mendelian view with a pan-polygenic view of traits is one of the most important contributions of genomics to genetics. Unfortunately, this mapping success has not clarified the number of genes involved, the identity of those genes or how those genes specify the phenotype. Indeed, some have concluded that many of the mapped GWAS loci are unrelated to the core biology of each phenotype 49 . Thus, for a deeper understanding, we need radically different approaches to understand complex trait biology rather than merely expanding GWAS to larger and larger samples.
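
The additive model described above can be illustrated with a short calculation. This is a minimal sketch under explicitly stated assumptions that go beyond the text: each locus has two alleles in Hardy–Weinberg proportions, loci are independent, and effects are purely additive, so the genetic variance is the sum of 2p(1 − p)a² over loci, where p is the allele frequency and a the effect per allele copy. The function names and the chosen numbers are illustrative only.

```python
import math


def additive_trait_moments(freqs, effects):
    """Mean and variance of an additive trait summed over independent loci.

    Assumes biallelic loci in Hardy-Weinberg proportions: the allele count
    at a locus is Binomial(2, p), with mean 2p and variance 2p(1 - p).
    """
    mean = sum(2 * p * a for p, a in zip(freqs, effects))
    variance = sum(2 * p * (1 - p) * a * a for p, a in zip(freqs, effects))
    return mean, variance


def per_locus_share(n_loci, freq=0.5):
    """Share of variance from one locus when n equal loci sum to variance 1."""
    # Effect size scaled so that the total genetic variance equals 1.
    effect = math.sqrt(1.0 / (2 * freq * (1 - freq) * n_loci))
    _, total = additive_trait_moments([freq] * n_loci, [effect] * n_loci)
    one_locus = 2 * freq * (1 - freq) * effect ** 2
    return one_locus / total


for n in (10, 1000, 100000):
    print(f"{n:>6} loci: each explains {per_locus_share(n):.4%} of the variance")
```

Holding the total genetic variance fixed, the contribution of any single locus shrinks in proportion to the number of loci. This is one way to see why GWAS need very large samples to detect individual variants even when a trait is strongly heritable overall.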

Yet, the most significant biology to emerge from GWAS is that most of the likely trait-causing variants fall outside coding sequences, in regulatory elements, most frequently enhancers 50 , 51 . This important finding has uncovered four new genetic puzzles. First, the non-coding regulatory machinery is vast; how much of this regulation is compromised, and how does it affect phenotypes? Second, regulatory changes affect RNA expression at many genes and protein expression at others; how does a cell ‘read’ these numerous changes as specific signals? Third, how is this coordinated expression response translated into cellular responses affecting phenotypes? Fourth, if specific environmental factors affect the same phenotype, which components do they dysregulate? In my opinion, we need to answer these questions for specific traits and diseases to truly understand their polygenic biology. Finally, these explanations must also answer the question of why some traits are decidedly Mendelian whereas others are not.

The questions of tomorrow will need to focus on four areas: the biology of enhancers and the transcription factors that bind them 51 ; the effect of genetic variation in enhancers 50 ; gene regulatory networks (GRNs) that regulate expression of multiple genes 52 ; and how GRN changes lead to specific cellular responses 53 . Despite many advances, the number of enhancers regulating expression of a specific gene remains unknown. How many enhancers are cell type specific versus ubiquitous? How many are constitutive rather than stage specific? And do they act additively or synergistically in gene expression? Additionally, which cognate transcription factors bind these enhancers, with what dynamics and how are they regulated 54 ? These details of a gene’s ‘enhancer code’ are critical for assessing its relative effect on a trait. Next, how does enhancer sequence variation affect a gene’s activity? Does such variation affect transcription factor binding only or its interaction with the promoter? Is the enhancer variant’s effect evident in all cellular states or only some? Is variation in only one enhancer sufficient to alter gene expression, or are multiple changes in multiple elements necessary?

Additional critical questions include which genes are involved in the core pathway underlying a trait, and how do we identify them 49 ? Elegant work has shown how genes are regulated within integrated modular GRNs, whereby one gene’s product is required in a subsequent step by another gene, with feedback interactions 52 . These GRNs comprise elements from the genome, transcriptome and proteome, with rate-limiting steps that require regulation. As our work on Hirschsprung disease has shown 50 , 53 , a GRN is composed of core genes, is the logic diagram of regulation of a major rate-limiting cellular step and is enriched in coding and enhancer disease variants; disease susceptibility scales with an increasing number of variants, and disease results from effects on the GRN’s rate-limiting gene product 53 . That is, the GRN integrates the expression of multiple genes. Finally, we need to understand how GRN changes alter cell properties and behaviour. I speculate that rate-limiting steps in GRNs are major regulators of broad cell properties, be they differentiation, migration, proliferation or apoptosis, the cellular integrator of GRN variation. Thus, genetic variation across the genome affects enhancers dysregulating many genes, but only when they dysregulate GRNs through rate-limiting steps do they affect cell and tissue biology 55 . This offers the promise of a mechanistic understanding of human polygenic disease.

The way forward for complex trait biology, including disease, is to shift our approach from reverse to forward genetics, using genome-wide approaches to cell type-specific gene perturbation. I believe we can construct cell-type GRNs en masse, inclusive of their enhancers, transcription factors and feedback or feedforward interactions, to then assay functionally defined variation in phenotypes. But, even this approach will be insufficient. We need to test our success by solving at least a few complex traits completely and demonstrating their veracity using a synthetic biology approach to recapitulate the phenotype in a model system; similarly to the field of chemistry, analysis has to be followed by de novo synthesis. Our genomic technologies are getting up to the task to enable this advance; as geneticists, are we?

Enhancers and embryonic development

Eileen Furlong. The work of my group sits at the interface of genome regulation and animal development, and there have been many exciting advances in both during the past decade. Developmental biology studies fundamental processes such as tissue and organ development and how complexity emerges through the combined action of cell communication, movement and mechanical forces. After the discovery that differentiated cells could be reprogrammed to a naive embryonic stem cell-like state, the past decade has witnessed an explosion in in vitro cellular reprogramming and differentiation studies. Organoids are a very exciting extension of this. The extent to which these fairly simple systems can self-organize and generate complexity 56 is one of the unexpected surprises of the past 5–10 years. The buzz around stem cells has also renewed interest in cellular plasticity in vivo and has uncovered an unexpected degree of transdifferentiation and dedifferentiation 57 . In the mouse heart, for example, cardiomyocytes dedifferentiate and proliferate to regenerate heart tissue when damaged within the first week after birth 58 .

Our understanding of the molecular changes that accompany differentiation has hugely advanced owing to the jump in scale, resolution and sensitivity of next-generation sequencing technologies over the past decade. This has led to a flood of studies in embryonic stem cells, iPS cells and embryos that revealed new concepts underlying genome regulation by measuring transcript diversity, transcription factor occupancy, chromatin accessibility and conformation, and chromatin, DNA and RNA modifications. The future challenge will be to connect this information to the physical characteristics of cells and how they form complex tissues. New technologies that solve many challenges of working with embryos will help, including CRISPR to engineer genomes, optogenetics to perturb proteins, lattice light-sheet and selective plane illumination microscopy to image processes in vivo, and low-input methods to overcome issues with scarce material. Particularly exciting to me are recent advances in single-cell genomics, which, although they are in their early days, will dramatically change the way we study embryogenesis. Many new insights have already emerged, including the discovery of unknown cell types and new developmental trajectories for well-established cell types. Even the concept of ‘cell identity’ has come into question.

Cell identities are largely driven by transcription factors, which act through cis-regulatory elements called ‘enhancers’. One of the most exciting unsolved mysteries, in my opinion, is how enhancers relay information to their target genes. The textbook view of enhancers is of elements with exclusive function that regulate a specific target gene through direct promoter interactions, which occur sequentially if multiple enhancers are involved. However, emerging concepts in the past decade question many of these ‘dogmas’. Some enhancers have dual functions, whereas others may even regulate two genes. Enhancer–promoter communication is now viewed in the light of spatial genome organization, including topologically associating domains (TADs) and membraneless nuclear microcompartments (that is, hubs or condensates) 59 . Being present within the same TAD likely increases the frequency of enhancer–promoter interactions, but how a specific enhancer finds its correct promoter within a TAD, or when TADs are rearranged 60 , 61 , remains a mystery. Hubs or condensates are dynamic microcompartments 62 that contain high local concentrations of proteins, including transcription factors and the transcriptional machinery. One potential implication of condensates is that enhancers may not need to ‘directly’ touch a gene’s promoter to regulate transcription; rather, it may be sufficient to come in close proximity within the same condensate. Presumably, once proteins reach a critical concentration, transcription will be initiated. While this model fits a lot of emerging data, there are still many open questions. What is the required distance between an enhancer and a promoter to trigger transcription? Does this distance differ for different enhancers 63 depending on their transcription factor–DNA affinities? Do different chromatin environments 64 influence the process?
At some loci, mutation of a single transcription factor-binding site in a single enhancer can have dramatic effects on gene expression and development. It is difficult to reconcile such cases with a shared condensate model, as other proteins bound to the enhancers and promoter should still phase separate. By contrast, there are many examples where mutation of a single transcription factor-binding site, or even an entire enhancer, has minimal impact on the expression of a gene. These observations suggest that there may be different types of loci, with requirements for different types of chromatin topologies and local nuclear environments, which will be important to tease apart in the coming years.

The genetic dissection of model loci in the 1990s and the first decade of the twenty-first century led to much of our understanding of how genes are regulated. The power of genomics in the past few decades has captured regulatory information for all genes genome-wide, providing more unbiased views of regulatory signatures, leading to new models of gene regulation. What is missing is empirical testing at a large scale. A major challenge is to move to more systematic in vivo functional dissection in organisms. CRISPR-based pooled screens have advanced the interrogation of genomic regions in cell culture systems. However, scaling functional assays in embryos remains a huge challenge. The task is enormous — even long-standing model organisms, such as Drosophila and mice, lack knockout strains for all protein-coding genes, and the number of regulatory elements is at least an order of magnitude higher. There has been little progress in developing scalable methods to quantify the contribution of a transcription factor’s input to an enhancer’s activity, and gene expression, in embryos. More systematic unbiased data will uncover more generalizable regulatory principles, increase our predictive abilities of gene regulation and developmental programmes, and enhance our understanding of the impact of genetic variation.

Perhaps the most promising and exciting prospects in the coming years are to use single-cell genomics, imaging and the integration of the two to dissect the amazing complexity of embryonic development. Single-cell genomics can reveal information about developmental transitions in a way that was unfeasible before. When combined with temporal information, such data can reconstruct developmental trajectories 65 , 66 and identify the regulatory regions and transcription factors likely responsible for each transition 67 . The scale and unbiased nature of the data, profiling tens to hundreds of thousands of cells, provides much richer information than anyone envisaged just 5 years ago, bringing a new level of inference and causal modelling. The ability to measure single-cell parameters in situ (called ‘spatial omics’) will be transformative in the context of developing embryos to reveal the functional impact of spatial gradients, inductive signals and cell–cell interactions, and to move to digital 4D embryos. Combining these approaches with genetic perturbations holds promise to decode developmental programmes as they unfold. Will this bring us to a predictive understanding of the regulatory networks driving embryonic development during the next decade? ‘Simple’ model organisms are a fantastic test case to determine the types and scale of data required and to develop the computational framework to build predictive networks. The systematic functional dissection of gene regulation and true integration of single-cell genomics with single-cell imaging will bring many exciting advances in our understanding of the programmes driving embryonic development in the coming years.

Spatial multi-omics in single cells

Barbara Treutlein. Incredibly, the first single-cell transcriptome was sequenced just over a decade ago 68 ! Since this milestone, transcriptomes of millions of cells have been sequenced and analysed from diverse organisms, tissues and other cellular biosystems, and these maps of cell states are revolutionizing the life sciences. The technologies and associated computational methods have matured and been democratized to such an extent that nearly all laboratories can apply the approach to their particular system or question.

Of course, the transcriptome is not enough, and protocols have already been developed to measure chromatin accessibility, histone modifications, protein abundances, cell lineages and other features linked to genome activity in single cells 69 . Currently, many studies use dissociation-based single-cell genomics methods, where the spatial context is disrupted to facilitate the capture of single cells for downstream processing. Methods are improving to measure genomic features in situ 70 , as well as to computationally map features to spatial contexts 71 , 72 . The stage is set for the next phase of single-cell genomics, where spatial registration of multimodal genome activity across molecular, cellular and tissue or ecosystem scales will enable virtual reconstructions with extraordinary resolution and predictive capacity. These virtual maps will rely on multi-omic profiling of healthy and perturbed tissues and organisms, which presents major challenges and opportunities for innovation.

Cell throughput remains a challenge, and it is unclear what role dissociation-based single-cell sequencing protocols will play in the future. These protocols are fairly easy to implement, and laboratories around the world can execute projects with tens of thousands of cells analysed per experiment. However, there are scenarios in which measuring millions of cells per experiment would be desired, such as in perturbation screens. Combinatorial barcoding methods push cell-throughput boundaries 73 ; however, it is unclear how to scale full transcriptome sequencing economically to millions of cells using current sequencing technologies. ‘Compressed sensing’ modalities (whereby a limited, selected and/or random number of features are measured per cell, and high-dimensional feature levels are recovered through inference or similarity to a known reference) offer an interesting route to increasing cell throughput 74 .
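
The ‘similarity to a known reference’ variant of this idea can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a method from the cited work: the gene names, expression values, reference atlas and the function `infer_full_profile` are all hypothetical, and a realistic scheme would measure far more features and use a proper statistical model rather than a nearest-neighbour lookup.

```python
import math

def infer_full_profile(measured, reference_profiles, measured_genes):
    """Pick the reference cell state closest to a partial measurement.

    `measured` holds expression values for the few genes actually assayed;
    `reference_profiles` maps a (hypothetical) cell-state label to its full
    expression profile. The full profile of the nearest reference state is
    returned as the inferred high-dimensional readout.
    """
    def distance(profile):
        # Euclidean distance computed over the measured genes only.
        return math.sqrt(
            sum((profile[g] - measured[g]) ** 2 for g in measured_genes)
        )

    best_label = min(
        reference_profiles,
        key=lambda label: distance(reference_profiles[label]),
    )
    return best_label, reference_profiles[best_label]

# Hypothetical reference atlas with four genes per cell state.
reference = {
    "neuron": {"GENE_A": 9.0, "GENE_B": 0.5, "GENE_C": 7.0, "GENE_D": 0.2},
    "hepatocyte": {"GENE_A": 0.3, "GENE_B": 8.0, "GENE_C": 0.4, "GENE_D": 6.5},
}

# Only two of the four genes are measured in the new cell.
measured_genes = ["GENE_A", "GENE_B"]
cell = {"GENE_A": 8.2, "GENE_B": 1.1}

label, inferred = infer_full_profile(cell, reference, measured_genes)
print(label, inferred["GENE_C"], inferred["GENE_D"])
```

The unmeasured genes (`GENE_C` and `GENE_D` here) are filled in from the matched reference state, which is why the quality of the inference depends entirely on how well the reference covers the states present in the experiment.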

Most single-cell transcriptome protocols are currently limited to priming the polyadenylation track present on all cellular mRNAs; however, this approach leads to biased sampling of highly expressed mRNAs. Clever innovations for random or targeted RNA enrichment could be a way to build up composite representations of cell states. Image-based in situ sequencing methods provide a means for increasing the number of cells measured per experiment, as millions of cells can be imaged without a substantial increase in financial cost, although imaging time is a limiting factor. There remains a lot of room for experimental and computational optimizations to measure the transcriptome, random barcodes, DNA conformations and protein abundances from the micrometre scale to the centimetre scale spatially, and it will be interesting to see how methods for spatial registration advance over the next 5 years.

Currently, most high-throughput measurements are performed on cell suspensions or on intact tissues using one modality. That said, studies are emerging that measure several features from the same cell; for example, mRNA and chromatin accessibility 75 or mRNA and lineage 76 . To build virtual maps, independent measurements from different cells can be integrated using data integration tools 77 , although it can be difficult to align cell states across modalities, particularly in developing systems. Therefore, the ultimate goal is to directly measure as many features as possible (for example, RNA, lineage, chromatin, proteins and DNA methylation) in the same cell 78 , ideally with spatial resolution. Furthermore, combining genetic and pharmacological perturbation screens with single-cell multi-omic measures will be informative to understand cell state landscapes and underlying regulatory networks for each cell type. The CRISPR–Cas field continues to develop creative tools for precise single-locus editing and other manipulations 79 , and incorporation of these toolkits with single-cell sequencing readouts will certainly bring new mechanistic insight.

Life forms are inherently dynamic, and each cell has a story to tell. Static measurements do not provide sufficient insight into the mechanisms that give rise to each cell state observed in a tissue. Computational approaches to stitch together independent measurements across time can be used to reconstruct potential histories; however, these are indirect inferences. Long-term live imaging in 2D cultures using confocal microscopy and in 3D tissues using light-sheet microscopy provides morphology, behaviour, location and, in some cases, molecular information on the history of a cell. Indeed, such long-term imaging experiments revealed that cell fates or states can be predicted from cell behaviour across many generations 80 . Cell tracking combined with end point single-cell genomics experiments can help to understand how cell states came to be; however, these experiments lack molecular resolution of the intermediates. There are strategies using CRISPR–Cas systems to capture highly prevalent RNAs inside cells at given times and insert these RNAs into DNA for storage and subsequent readout 81 . Together with live tracking and end-point single-cell genomics, such methods could provide unprecedented insight into cell histories.

My vision is that the emerging technologies described above can be applied to human 2D cell culture and 3D organoid biosystems to understand human development and disease mechanisms. My team and others are working to build virtual human organs that are based on high-throughput, multimodal single-cell genomics data. Organoid counterparts provide opportunities to perturb the system and understand lineage histories. Together, the next generation of single-cell genomics methods and human organoid technologies will provide unprecedented opportunities to develop new therapies for human disease.

Unravelling the layers of the epigenome

Alexander Meissner. Around 1975, the idea that 5-methylcytosine could provide a mechanism to control gene expression gained traction, despite little knowledge of its genomic distribution or the associated enzymes 82 . With similarly limited genomic information or knowledge of the players involved, the histone code hypothesis was put forward in 2000 to explain how multiple different covalent modifications of chromatin may be coordinated to direct specific regulatory functions 83 . Tremendous progress has been made since, and the list of core epigenetic regulators that have been discovered and characterized seems largely complete 84 .

DNA sequencing has continued to dominate the past decade and contributed to an exponential growth of genome-wide maps of all layers of regulation. In the early days, individual CpG sites could be measured by restriction enzymes, whereas now we have generated probably well over a trillion cytosine methylation measurements. An equally astonishing number of genome-wide data sets have been collected for transcriptomes, histone modifications, transcription factor occupancy and DNA accessibility. Furthermore, the number of single-cell transcriptome and epigenome data sets continues to grow at an unprecedented pace.

On the basis of this overabundance of data across many normal and diseased cell states, for instance, we now clearly understand the non-random distribution of cytosine methylation across many different organisms. These maps have helped to refine our understanding of its relationship to gene expression, including the realization that only a few promoters are normally controlled via this modification, whereas gene bodies are actively targeted, and most dynamic changes occur at distal regulatory sites. Similar insights exist for many core histone modifications, and, in general, we have an improved appreciation of the epigenetic writers, readers and erasers involved. Over the past decade, we have seen substantially integrated and multilayered epigenomic analyses that provide a fairly comprehensive picture of epigenomic landscapes, including their dynamics across development and disease.

Additional innovation is now needed around data access and sharing. As noted, there is certainly no shortage of data, but to enable individual researchers to generate and verify hypotheses quickly, improved tools are required to access and browse these data. Over the past decade, large coordinated projects such as ENCODE, the Roadmap Epigenomics Project and Blueprint Epigenome have initiated such efforts, but it remains a reality that data are not at everyone’s fingertips quite yet.

Moreover, despite decades of steady and recently accelerated progress, many important questions remain regarding the molecular coordination and developmental functions of these epigenetic modifications. For instance, cytosine methylation at gene bodies has been preserved for more than a billion years of evolution and yet its precise function is still under investigation. How and why did genomic methylation switch to a global mechanism in vertebrates compared with the selected methylation observed in invertebrates? What is the precise function of this modification in each of its regulatory contexts, and how are its ubiquitously acting enzymes recruited to specific sites in the genome? The latter is particularly timely given recent observations that enhancers, but also some repetitive elements, show ongoing recruitment of both de novo methylation and demethylation activity. Moreover, extraembryonic tissues show redirected activity that shares notable similarities with the long-observed altered DNA methylation landscape found across most cancer types 85 . Lastly, it is abundantly clear that DNA methylation is essential for mammalian development; but although we have known this for nearly three decades, it is not clear how and why developing knockout embryos die. Specific developmental requirements also largely hold for many histone-modifying enzymes; however, it remains incompletely understood how exactly these modifications interact to support gene regulation.

A decade ago it seemed likely that we would answer questions such as these using newly gained sequencing power as a potent tool for generating hypotheses. However, for the most part, epigenomic analyses have expanded a highly valuable, but still largely descriptive, understanding of numerous epigenetic layers. So one may ask, what is different now and why should we expect to answer these questions in the coming years?

Technological innovation has always played a key role in biology, and some broadly applicable, recent breakthroughs will enable us to drive progress in the coming years. These include the transfer of the bacterial innate immunity CRISPR–Cas system as a universal genome-targeting tool 86 as well as for base editing, epigenome editing and various genome manipulations. Similarly, new fast-acting endogenous protein degradation systems have been developed that further enhance our ability to probe for precise function 87 . The past decade also saw major improvements in imaging technologies as well as cell and molecular biology, moving from the 2D space into the 3D space with both organoid cell culture models 88 and chromosome conformation capture approaches for exploring nuclear organization 89 .

Another major shift included the reappreciation that membraneless organelles are a widespread mechanism of cellular organization 90 . In particular, there have been many advances in our understanding of how condensates form and function, including for transcriptional regulation. Together with known properties of modified histones on DNA and the fact that many epigenetic regulators also contain intrinsically disordered regions, it is reasonable to assume that these physical properties will have a major impact on our understanding of chromatin. Importantly, changes in topology have been linked to disease 91 , and similar connections have been reported recently for condensates 92 . This will likely be an exciting area to follow in the coming years.

Lastly, our research continues to be more and more reliant on multidisciplinary skills, with mathematics, physics, chemistry and computer science playing an ever-more central role in biology, which will require some rethinking in training and institutional organization to accomplish our goals. Going forward, we will need more functional integration, which, owing in part to the selected discoveries mentioned above, is now very tractable. In particular, more refined perturbation of gene activity, which for many chromatin regulators should be separated into catalytic and regulatory functions, together with readouts at multiple levels of resolution will bring us closer to the insights needed. We recently exemplified this with a pipeline that explores epigenetic regulator mutant phenotypes at single-cell resolution 93 . From these studies, we may be able to understand how epigenetic regulators interact with the environment to influence or protect the organismal phenotype, connecting detailed molecular genetics to classical theories of epigenetic phenomena.

As we approach the 100-year anniversary of the detection of 5-methylcytosine in DNA 94 , it seems we can hope to declare at least for some layers of the epigenome that we fully understand the rules under which they operate. This may enable the exploration of more precise therapeutic interventions, for instance by redirecting chromatin modifiers rather than blocking their universal catalytic activities, which are shared between normal and diseased states. Of course, looking back at predictions made just 10 years ago 95 , one should expect many additional unforeseen advances that are just as difficult to predict now as they were back then.

Long non-coding RNAs: a time to build

Howard Chang. Long non-coding RNAs (lncRNAs) are the dominant transcriptional output of many eukaryotic genomes. Although studies over the past decade have revealed diverse mechanisms and disease implications for many lncRNAs, the vast majority of lncRNAs remain mysterious. The fundamental challenge is that we lack the knowledge to systematically transform lncRNA sequence into function. Progress in the next decade may come from a paradigm shift from ‘reading’ to ‘writing’ lncRNAs.

Gene regulation was once thought to be the exclusive province of proteins. Intense efforts for disease diagnosis and treatment focused almost entirely on protein-coding genes and their products, ignoring the vast majority of the genome. Even at the time of the completion of the Human Genome Project, only a handful of functional lncRNAs were known that silenced the expression of neighbouring genes. Thus, it was widely believed that the genome contained mostly ‘junk’, which sometimes made RNA as transcriptional noise.

The human genome is currently estimated to encode nearly 60,000 lncRNAs, ranging from several hundred to tens of thousands of bases, that apparently do not function by encoding proteins 96 . Studies over the past decade discovered that many lncRNAs act at the interface between chromatin modification machinery and the genome. Specific lncRNAs can act as guides, scaffolds or decoys to control the recruitment of specific chromatin modification enzymes or transcription factors to DNA or their dismissal from DNA 97 . lncRNAs can activate as well as silence genes, and these RNAs can target neighbouring genes as a function of local chromosomal folding (in cis) or at a distance throughout the genome (in trans). Detailed dissections of individual lncRNAs have revealed that lncRNAs are composed of modular RNA motifs that enable one lncRNA to connect proteins that read, write or erase specific chromatin marks. These findings have galvanized substantial excitement about lncRNAs; laboratories around the world are now investigating the roles of lncRNAs in diverse systems, ranging from control of flowering time in plants to mutations in human genetic disorders.

Nonetheless, the notable progress to date can be viewed as anecdotal — each lncRNA is its own story. When a new lncRNA sequence is recognized in a genome database or RNA profiling experiment, we are still in the dark about what may happen to the cell or organism (if anything) when the lncRNA is removed. Indeed, efforts to ‘read’ lncRNAs have been the dominant experimental strategy over the past two decades. Systematic efforts in the ENCODE, FANTOM and emerging cell atlas consortia have mapped the transcriptional landscape, transcript isoforms and, more recently, single-cell expression profiles of lncRNAs. These powerful data are now combined with genome-scale CRISPR-based methods to inactivate tens of thousands of lncRNAs, one at a time, to observe possible cell defects 98 , 99 . However, many challenges remain. Positive hits require further exploratory studies to define possible mechanisms of action, and we lack a principled strategy to combine lncRNA knockouts to address genetic redundancy and compensation.

A potentially fruitful and complementary direction is the pivot from ‘reading’ to ‘writing’ long RNA scripts. On the basis of the systematic dissection of RNA sequences and secondary structures in lncRNAs, we and others believe that the information in lncRNAs resembles that on a billboard (in which keywords and catchphrases are repeated) rather than a finely honed legal document (where every comma counts). Small units of RNA shapes are repeated within lncRNAs to build up the meaning in the lncRNA billboard, but these RNA shapes can be rearranged in different orders or locations without affecting meaning. These insights have allowed scientists to recognize lncRNA genes from different species that perform the same function even though the primary sequences bear little similarity 100 . Moreover, investigators were able to strip down lncRNAs to their essential ‘words’, composed of these key repeating shapes and one-tenth the size of the original lncRNA, which still functioned in vivo to control chromatin state over a whole chromosome 100 , 101 . Finally, it is now possible to successfully create synthetic lncRNAs. By adding RNA shapes to carefully chosen RNA templates, investigators are starting to create designer lncRNAs that can regulate chromatin in vivo 100 , are sufficient to partly rescue the physiological lncRNA gene knockout 102 , or can target RNAs to specific cytotopic locations within the cell 103 , 104 .
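
One simple way to make the billboard idea concrete is to compare sequences by their content of short words (k-mers) rather than by position-by-position alignment. The Python sketch below is an illustration under stated assumptions and is not presented as the method of the cited studies: the two toy RNA sequences, the choice of k = 3 and the helper names `kmer_counts`, `cosine_similarity` and `positional_identity` are all hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two k-mer count profiles."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[kmer] * counts_b[kmer] for kmer in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def positional_identity(seq_a, seq_b):
    """Fraction of aligned positions (no gaps) with the same base."""
    n = min(len(seq_a), len(seq_b))
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b) / n

# Two toy RNAs built from the same two 'words' in opposite order.
rna_1 = "AAAAAA" + "CUCUCU"
rna_2 = "CUCUCU" + "AAAAAA"

content_similarity = cosine_similarity(kmer_counts(rna_1), kmer_counts(rna_2))
identity = positional_identity(rna_1, rna_2)
print(f"k-mer similarity: {content_similarity:.3f}")
print(f"positional identity: {identity:.3f}")
```

In this toy case the two sequences share no aligned bases, yet their k-mer profiles are nearly identical, which mirrors the observation that rearranged motifs can carry the same meaning while primary sequences bear little similarity.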

The shift from reading to writing lncRNAs will challenge us on the technical front, leading to potential transformative technologies. Current technologies for massively parallel reporter gene assays are built on short sequence inserts. A plan to build tens of thousands of synthetic lncRNAs will require accurate long DNA or RNA synthesis. These designer sequences will need to be placed into the appropriate locations in the genome and controlled to have proper developmental expression, splicing pattern and RNA chemical modifications. Landmark studies using the XIST lncRNA, which normally silences the second X chromosome in female cells, to silence the ectopic chromosome 21 in Down syndrome cells highlight the biomedical promise of such an approach 105 .

As the field develops technologies for large-scale creation and testing of synthetic lncRNAs, we can rigorously test our understanding of the information content in the language of RNA sequences and shapes. The next decade promises to be an exciting time for building non-coding RNAs and to create entirely new tools to manipulate gene function for biology and medicine.

FAIR genomics to track tumorigenesis

Núria López-Bigas. Cancer research is one of the fields that has probably benefited the most from the technological and methodological advances of genomics. In the span of less than two decades, the field has witnessed an incredible boost in the generation of cancer genomic, epigenomic and transcriptomic data of patients’ tumours, both in bulk and more recently at the single-cell level. My dream as a cancer researcher is to have a full understanding of the path that cells follow towards tumorigenesis. Which events in the life of an individual, a tissue and a particular cell lead to the malignant transformation of some cells? Of course I do not expect to have a deterministic answer, as this is not a deterministic process. Instead we should aim for a quantitative or probabilistic understanding of the key events that drive tumorigenesis. We have solid epidemiological evidence showing that smoking increases the probability of lung cancer, exposure to the Sun raises the probability of developing melanoma and some anticancer treatments increase the probability of secondary neoplasms. But which specific mechanisms at the molecular and cellular levels influence these increases?

A clear first goal of cancer genomics is to catalogue all genes involved in tumorigenesis across different tissues. Although this is a daunting task, it is actually feasible 106 . By analysing the mutational patterns of genes across tumours, one can identify those with significant deviations from what is expected under neutrality, which indicates that these mutations provide a selective advantage in tumorigenesis and are thus driver mutations. We can imagine a future in which, through the systematic analysis of millions of sequenced tumour genomes, this catalogue or compendium moves closer and closer to completion. For this to happen, not only do we need genome sequencing to expand (a process already in motion in research, clinical settings and the pharmaceutical industry), but more importantly the resulting data must be made FAIR (findable, accessible, interoperable and reusable) 107 . To this end, consortia and initiatives that promote, catalyse and facilitate the sharing of genomic data, such as the Beyond 1 Million Genomes consortium, the GA4GH or the cBioPortal for Cancer Genomics, are necessary.
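
The logic of testing a gene against neutral expectation can be sketched with a deliberately simplified calculation. The Python example below assumes a single uniform background mutation rate and a Poisson model, both of which are illustrative assumptions rather than the approach of any particular driver-discovery method; real analyses account for sequence context, gene-specific covariates and the type of mutation. The gene length, rate, cohort size and the function name `poisson_upper_tail` are hypothetical.

```python
import math

def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)

# Hypothetical inputs: a 3,000 bp coding sequence, a neutral background
# rate of 1e-6 mutations per bp per tumour, and a cohort of 1,000 tumours.
gene_length_bp = 3000
background_rate = 1e-6
n_tumours = 1000
expected_mutations = gene_length_bp * background_rate * n_tumours  # about 3

# A gene mutated about as often as expected versus one mutated far more often.
p_neutral_like = poisson_upper_tail(3, expected_mutations)
p_driver_like = poisson_upper_tail(15, expected_mutations)

print(f"expected under neutrality: {expected_mutations:.1f}")
print(f"p-value with 3 observed mutations: {p_neutral_like:.3f}")
print(f"p-value with 15 observed mutations: {p_driver_like:.2e}")
```

A small p-value flags an excess of mutations over the neutral expectation, which is the kind of signal interpreted as positive selection. In a genome-wide scan, such p-values would also need correction for testing many genes.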

Of note, cataloguing genes and mutations involved in cancer development, albeit a very important first step, is still far from the final goal of understanding how and under which conditions they drive tumorigenesis. Framing cancer development as a Darwinian evolutionary process helps me to navigate the path towards this final objective. As is true of any Darwinian process, its two key features are variation and selection. Thanks to the past 15 years of cancer genomics, we now have a much better grasp of the origin of somatic genetic variation between cells across different tissues. The study of the variability in the number, type and genomic distribution of mutations across tumours provides a window into the life history of cells across the somatic tissues of an individual 108 , 109 . In addition, recent studies sequencing the genome of healthy cells in different tissues 110 , 111 , 112 have shown that mutations accumulate in hundreds and thousands in our cells in normal conditions over time. These studies have also detected positive selection in some genes across healthy tissues. Hence, positive selection is a pervasive process that operates not only in tumorigenesis but also in healthy tissues, where it is a hallmark of somatic development of skin, oesophagus, blood and other tissues. Take, for example, clonal haematopoiesis: it results from a continuous Darwinian evolutionary process in which over time (with age) some haematopoietic cells harbouring mutations in certain blood development genes, such as DNMT3A and TET2 , outcompete other cells in the compartment 113 , 114 . This process is part of normal haematopoietic development. Problems arise only when this process gets out of control, leading to leukaemia in the case of blood, or a malignant tumour in solid tissues. Why is it only in rare cases that this ubiquitous interplay between variation and selection becomes uncontrollable and results in full-blown tumorigenesis? 
Which events, besides known tumorigenic mutations, drive this process?

If we have learnt something in recent years, it is that virtually all tumours harbour driver mutations 115 , 116 , 117 , implying that driver genomic events are necessary. However, they are clearly not sufficient for tumorigenesis to occur. So, what are these other triggers of the tumorigenic process? What happens in the lung cells of a smoker or in the haematopoietic cells of a patient treated with chemotherapy that increases their chances to become malignant? Epigenetic modifications and changes in selective constraints, such as evolutionary bottlenecks, for example, at the time of chemotherapy, may be part of the answer.

For the near future, my dream is to see a further increase in FAIR cancer genomics data to help us disentangle the step-by-step game of variation and selection in our tissues that leads to tumorigenesis and likely other ageing-related diseases.

Integrating genomics into medicine

Eran Segal. The past 20 years in genomics have been extraordinary. We developed high-throughput sequencing and learned how to use it to efficiently sequence full genomes and measure gene expression and epigenetic marks at the genome-wide scale and even at the single-cell level 118 . Using these capabilities, we created unprecedented catalogues of novel genomes, functional DNA elements and non-coding RNAs from all kingdoms of life 119 . But — perhaps with the exception of cancer 120 and gene therapy for some monogenic diseases 121 — genomics has yet to deliver on its promise to have an impact on our everyday life. For example, drugs and diagnostics are still being developed in the traditional way, with screening assays to find lead compounds for targets typically arising from animal studies, without involving genomics in any of the steps. Moreover, when the global COVID-19 pandemic hit, the genome of the spreading severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was rapidly sequenced, but why some infected individuals exhibit severe disease and others do not remains unknown.

Indeed, our next challenge is to translate the incredible resources and technologies developed in genomics into an improved understanding of health and disease. This improved understanding should transform the field of medicine to use genomics in its transition to personalized medicine, which promises individualized treatment by targeting the right medication to the right person at the right time on the basis of that person’s unique profile. By continuing to focus on more and more measurements and the creation of more atlases and catalogues, we run the danger of drowning in ever-growing amounts of data and correlative findings. Walking down this path can lead to an endless endeavour, as bulk measurements can always be replaced with single-cell ones, or measures at higher temporal and spatial resolution, across more conditions and wider biological contexts.

Instead, we should use genomics to tackle big unanswered questions such as what causes the variation that we see across people in phenotypes, disease susceptibility and drug responses? What is the relative contribution of genetic, epigenetic, microbiome and environmental factors? How are their effects mediated, and what would be the effect of different interventions? Ultimately, we should strive to use genomics to generate actionable and personalized insights that lead to better health. We are now at an inflexion point in genomics that allows us for the first time to apply it to study human biology and realize these ambitious aims 122 .

At the cellular level, we can use iPS cells from patients to derive cellular models of multiple diseases and prioritize treatments based on measuring both their cellular and molecular response (for example, gene expression and epigenetics) to existing drugs and drug combinations. We can even use massively parallel assays to separately measure the effect of each of tens of thousands of rationally designed mutations, including patient-specific mutations, as we have done, for example, in testing the effect of all clinically identified mutations in TP53 on cellular function 123 . Measuring the molecular effects of directed mutations in genes encoding transcription factors and signalling molecules and in other genes can reveal the underlying pathways and regulatory networks of the disease studied and identify putative therapeutic targets. The application of such approaches to fields that are still poorly understood, such as neurodegenerative diseases, can be particularly impactful.

But we can be much more ambitious and directly profile large cohorts of human individuals using diverse ‘omics’ assays. As molecular changes typically precede clinical disease manifestations, longitudinal measurements coupled with clinical phenotyping have the potential of identifying novel disease diagnostics and therapeutic targets. Indeed, biobanks that track large samples of hundreds of thousands of individuals have recently emerged and are proving highly informative 124 . However, at the molecular level their focus has thus far been on genetics. Technological advances and cost reductions now allow us to obtain much deeper person-specific multi-omic profiles that include transcriptome, proteome, methylome, microbiome, immune system and metabolome measurements. Having these data on the same individual and at multiple time points can reveal which omic layer is more perturbed and informative for each disease and identify associations between molecular markers and disease.

The challenge in using such observational data from human cohorts is to identify which of the associations are causal. One way to address this is to wisely select the nature and type of the associations studied. For example, in working with microbiome data, we can move from analyses at the level of species composition to analyses at the level of SNPs in bacterial genes. Such associations are more specific and more likely to be causal, as in the case of a SNP in the bacterial gene dadH, which correlated with metabolism of the primary medication used to treat Parkinson disease by gut microbiota samples from patients 125 . Another approach is to use longitudinal measurements and separation of time to emulate target trials from observational data 126 . For example, we can select distinct subsets from the cohort that match on several known risk factors (for example, age or body mass index) but differ on a marker of interest (for example, expression of a gene or presence of an epigenetic mark), and compare future disease onset or progression in these two populations. Similarly, retrospective analysis of baseline multi-omic measurements from participants in randomized clinical trials may identify markers that distinguish responders from non-responders, which can then be used for patient stratification or for identifying additional putative targets.
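The matched-subset comparison described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not a full target trial emulation: participants are grouped into coarse age and body mass index strata, and disease onset is compared between marker-high and marker-low participants within each stratum. The function name, the field names (`age`, `bmi`, `marker_high`, `disease_onset`) and the bin widths are all hypothetical.

```python
from collections import defaultdict


def matched_risk_difference(cohort, age_bin=10, bmi_bin=5):
    """Stratum-weighted difference in disease onset rate (marker-high minus
    marker-low) among participants matched on coarse age and BMI bins.

    `cohort` is a list of dicts with the hypothetical keys
    'age', 'bmi', 'marker_high' and 'disease_onset'.
    Returns None when no stratum contains both groups.
    """
    # Group outcomes by (age bin, BMI bin), split by marker status.
    strata = defaultdict(lambda: {True: [], False: []})
    for person in cohort:
        stratum = (int(person["age"] // age_bin), int(person["bmi"] // bmi_bin))
        strata[stratum][bool(person["marker_high"])].append(int(person["disease_onset"]))

    weighted_diff = 0.0
    total_weight = 0
    for groups in strata.values():
        high, low = groups[True], groups[False]
        if not high or not low:
            continue  # no comparable participants in this stratum
        weight = len(high) + len(low)
        diff = sum(high) / len(high) - sum(low) / len(low)
        weighted_diff += weight * diff
        total_weight += weight

    return weighted_diff / total_weight if total_weight else None


# Toy data, invented purely for illustration.
example_cohort = [
    {"age": 52, "bmi": 27.1, "marker_high": True, "disease_onset": True},
    {"age": 55, "bmi": 26.0, "marker_high": False, "disease_onset": False},
    {"age": 58, "bmi": 28.3, "marker_high": True, "disease_onset": True},
    {"age": 51, "bmi": 25.5, "marker_high": False, "disease_onset": True},
    {"age": 34, "bmi": 22.0, "marker_high": True, "disease_onset": False},  # unmatched, ignored
]
print(matched_risk_difference(example_cohort))
```

Exact matching on coarse bins is the simplest choice; a real analysis would more likely use finer matching or weighting on many more risk factors, and would need to define eligibility, time zero and follow-up explicitly.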

Ultimately, biomarkers identified from observational cohorts need to be tested in randomized clinical trials to establish causality and assess efficacy. In the case of microbial strains extracted from humans, we may be able to skip animal testing and go directly to human trials. In other cases, such as when human genes are being manipulated, we will need to start with cell culture assays and animal testing before performing clinical trials in humans. However, in all cases, tested omic targets should have already shown associations in human individuals, thus making them more likely to be relevant and succeed in trials, as is the case with drug targets for which genetic evidence links them to the disease 127 .

Beyond these scientific challenges, there is the challenge of engaging the public and diverse ethnic and socio-economic groups to participate in such large-scale multi-omic profiling endeavours even before we can present them with immediate benefits. We can start with incentives in the form of informational summary reports of the data measured and gradually move towards carefully and responsibly conveyed actionable insights as we learn more.

Overcoming these challenges is not an easy task, but with the breathtaking advances that genomics has undergone in the past two decades, the time may be right to tackle them. Success can transform genomics from a field applied mostly in research settings into an integral and inseparable part of medicine.

CRISPR genome editing enters the clinic

Jin-Soo Kim. In the past several years, genome editing has come of age 128 , in particular because of the repurposing of CRISPR systems. Genomic DNA can be modified in a targeted manner in vivo or in vitro with high efficiency and precision, potentially enabling therapeutic genome editing for the treatment of both genetic and non-genetic diseases. All three types of programmable nucleases developed for genome editing, namely zinc-finger nucleases, transcription activator-like effector nucleases and CRISPR nucleases, are now under clinical investigation. In the next several years, we will be able to learn whether these genome-editing tools will be effective and safe enough to treat patients with an array of diseases, including HIV infection, leukaemia, blood disorders and hereditary blindness, heralding a new era in medicine.

If the history of the development of novel drugs or treatments such as gene therapy and monoclonal antibodies is any guide, the road to therapeutic genome editing is likely to be bumpy but ultimately worth travelling. Key questions related to medical applications of programmable nucleases concern their mode of delivery, specificity, on-target activity and immunogenicity. First, in vivo delivery (or direct delivery into patients) of genes or mRNAs encoding programmable nucleases or preassembled Cas9 ribonucleoproteins can be a challenge, given the large size of these nucleases. Ex vivo (or indirect) delivery is, in general, more efficient than in vivo delivery but is limited to cells from blood or bone marrow, which can be collected with ease, edited in vitro and transfused back into patients. Ongoing developments of nanoparticles and viral vectors are expected to enhance and expand in vivo genome editing in tissues or organs not readily accessible with current delivery systems, such as the brain.

Second, programmable nucleases, including CRISPR nucleases, can cause unwanted on-target and off-target mutations, which may contribute to oncogenesis. Several cell-based and cell-free methods have been developed to identify genome-wide CRISPR off-target sites in an unbiased manner 129 , 130 , 131 . But it remains a challenge to validate off-target activity at sites with low mutation frequencies (less than 0.1%) in a population of cells, owing to the intrinsic error rates of current sequencing technologies. Even at on-target sites, CRISPR–Cas9 can induce unexpected outcomes such as large deletions of chromosomal segments 132 . It will be important to understand the mechanisms behind the unusual on-target activity and to measure and reduce the frequencies of such events.
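To see why low-frequency off-target sites are hard to validate, it helps to ask how likely sequencing errors alone are to produce the observed number of mutant reads. The sketch below makes strong simplifying assumptions that are not from the text: errors at a site are independent, occur at a single known per-read rate, and the mutant read count follows a binomial distribution. The function names and the example numbers (a 0.1% error rate and the read depths) are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log probability of exactly k successes in n Bernoulli(p) trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed over the upper tail."""
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(binom_log_pmf(i, n, p))
        total += term
        # Past the mode the terms shrink monotonically, so stop once negligible.
        if i > n * p and term <= total * 1e-16:
            break
    return min(total, 1.0)


# Assumed per-read error rate at the site (illustrative value only).
error_rate = 0.001

# Shallow sequencing: 2 mutant reads out of 1,000 is unremarkable under errors alone.
print(prob_at_least(2, 1_000, error_rate))

# Deep sequencing: 150 mutant reads out of 100,000 is very unlikely under errors alone.
print(prob_at_least(150, 100_000, error_rate))
```

Under these assumptions, a true mutation frequency below the error rate only becomes distinguishable from background at very high read depth, and in practice the per-site error rate is itself uncertain, which is part of the difficulty the text describes.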

Last but not least, Cas9 and other programmable nucleases can be immunogenic, potentially causing undesired innate and adaptive immune responses. In this regard, it makes sense that initial clinical trials have focused on ex vivo delivery of Cas9 ribonucleoproteins into T cells or in vivo gene editing in the eye, an immunologically privileged organ. Cas9 epitope engineering or novel Cas9 orthologues derived from non-pathogenic bacteria may avoid some of the immune responses, offering therapeutic modalities for in vivo genome editing in tissues or organs with little or no immune privilege.

Base editing 133 , 134 and prime editing 135 are promising new approaches that may overcome some of the limitations of nuclease-mediated genome editing. Both base editors and prime editors contain a Cas9 nickase rather than the wild-type Cas9 nuclease; the nickase is coupled to a nucleobase deaminase in base editors and to a reverse transcriptase in prime editors. Because a nickase, unlike a nuclease, produces DNA single-strand breaks (nicks) but not double-strand breaks (DSBs), base editors and prime editors are unlikely to induce large deletions at on-target sites or chromosomal rearrangements resulting from non-homologous end joining (NHEJ) repair of concurrent on-target and off-target DSBs. Furthermore, when it comes to gene correction rather than gene disruption, these new types of gene editors are much more efficient and ‘cleaner’ than DSB-producing nucleases because they neither require donor template DNA nor rely on error-prone NHEJ; in human cells, DSBs are preferentially repaired by NHEJ, leading to small insertions or deletions (indels), rather than by homologous recombination involving donor DNA.

Base editors and prime editors are also well suited for germline editing and in utero editing (that is, gene editing in the fetus), which should be done with caution, in full consideration of ethical, legal and societal issues. In principle, CRISPR–Cas9 can be used for the correction of pathogenic mutations in human embryos; however, donor DNA is seldom used as a repair template in human embryos 136 . Recurrent or non-recurrent de novo mutations are responsible for the vast majority of genetic diseases. Because these de novo mutations are absent in the parents, cell-free fetal DNA in the maternal blood can be used to detect them in fetuses. Some de novo mutations are manifested even before birth, leading to miscarriage, disability or early death after birth; by then, it is often too late and inefficient to attempt gene editing in newborns. These mutations could be corrected in utero using base editors or prime editors without inducing unwanted indels and without relying on inefficient homologous recombination. Compared with germline editing or preimplantation genetic diagnosis, in utero editing, if proven safe and effective in the future, should be ethically more acceptable because it does not involve the creation or destruction of human embryos.

As promising and powerful as they are, current versions of base editors and prime editors can be further optimized and improved. For instance, Cas9 evolved in microorganisms as a nuclease rather than a nickase. Current Cas9 nickases used for base editing (D10A SpCas9 variant) and prime editing (H840A variant) can be engineered to increase their activities and specificities. In parallel, deaminase and reverse transcriptase moieties in base editors and prime editors, respectively, can be engineered or replaced with appropriate orthologues to increase the efficiency and scope of genome editing. It has been shown that base editors can cause both guide RNA-dependent and guide RNA-independent DNA or RNA off-target mutations, raising concerns for their applications in medicine. Prime editors may also cause unwanted on-target and off-target mutations, which must be carefully studied before moving on to therapeutic applications.

Biomedical researchers are now equipped with powerful tools for genome editing. I expect that these tools will be developed further and applied more broadly in both research and medicine in the coming years.

References

Collins, F. The director of the NIH lays out his vision of the future of medical science. Time https://time.com/5709207/medical-science-age-of-discovery (2019).

The National Academies of Sciences, Engineering, and Medicine Organizing Committee for the International Summit on Human Gene Editing. On human gene editing: international summit statement. The National Academies of Sciences, Engineering, and Medicine https://www.nationalacademies.org/news/2015/12/on-human-gene-editing-international-summit-statement (2015).

Centers for Disease Control and Prevention. COVID-19 in racial and ethnic minority groups. CDC https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/racial-ethnic-minorities.html (2020).

Edwards, F., Lee, H. & Esposito, M. Risk of being killed by police use of force in the United States by age, race–ethnicity, and sex. Proc. Natl Acad. Sci. USA 116 , 16793–16798 (2019).

Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538 , 161–164 (2016).

Popejoy, A. B. et al. The clinical imperative for inclusivity: race, ethnicity, and ancestry (REA) in genomics. Hum. Mutat. 39 , 1713–1720 (2018).

Artiga, S. & Orgera, K. Key facts on health and health care by race and ethnicity. Kaiser Family Foundation https://www.kff.org/report-section/key-facts-on-health-and-health-care-by-race-and-ethnicity-coverage-access-to-and-use-of-care/ (2019).

Armstrong, K., Micco, E., Carney, A., Stopfer, J. & Putt, M. Racial differences in the use of BRCA1/2 testing among women with a family history of breast or ovarian cancer. JAMA 293 , 1729–1736 (2005).

Bonham, V. L., Callier, S. L. & Royal, C. D. Will precision medicine move us beyond race? N. Engl. J. Med. 374 , 2003–2005 (2016).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581 , 434–443 (2020).

Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50 , 1219–1224 (2018).

The SIGMA Type 2 Diabetes Consortium. Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. Nature 506 , 97–101 (2014).

Wetterstrand, K. A. DNA sequencing costs: data from the NHGRI genome sequencing program (GSP). National Human Genome Research Institute https://www.genome.gov/sequencingcostsdata (2019).

Wall, J. D. et al. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576 , 106–111 (2019).

Kowalski, M. H. et al. Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. PLoS Genet. 15 , e1008500 (2019).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209 (2018).

Gaziano, J. M. et al. Million veteran program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70 , 214–223 (2016).

Nagai, A. et al. Overview of the BioBank Japan Project: study design and profile. J. Epidemiol. 27 , S2–S8 (2017).

Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31 , 1102–1110 (2013).

Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177 , 1080 (2019).

Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51 , 584–591 (2019).

McQuillan, M. A., Zhang, C., Tishkoff, S. A. & Platt, A. The importance of including ethnically diverse populations in studies of quantitative trait evolution. Curr. Opin. Genet. Dev. 62 , 30–35 (2020).

Sohail, M. et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife 8 , e39702 (2019).

Crawford, N. G. et al. Loci associated with skin pigmentation identified in African populations. Science 358 , eaan8433 (2017).

Gasperini, M., Tome, J. M. & Shendure, J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat. Rev. Genet. 21 , 292–310 (2020).

Racimo, F., Sankararaman, S., Nielsen, R. & Huerta-Sánchez, E. Evidence for archaic adaptive introgression in humans. Nat. Rev. Genet. 16 , 359–371 (2015).

Skoglund, P. & Mathieson, I. Ancient genomics of modern humans: the first decade. Annu. Rev. Genomics Hum. Genet. 19 , 381–404 (2018).

Vicente, M. & Schlebusch, C. M. African population history: an ancient DNA perspective. Curr. Opin. Genet. Dev. 62 , 8–15 (2020).

Sherman, R. M. et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat. Genet. 51 , 30–35 (2019).

Durvasula, A. et al. Recovering signals of ghost archaic introgression in African populations. Sci. Adv. 6 , eaax5097 (2020).

Skov, L. et al. The nature of Neanderthal introgression revealed by 27,566 Icelandic genomes. Nature 582 , 78–83 (2020).

Adeyemo, A. A. et al. ZRANB3 is an African-specific type 2 diabetes locus associated with beta-cell mass and insulin response. Nat. Commun. 10 , 3195 (2019).

Cohen, J. et al. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat. Genet. 37 , 161–165 (2005).

Gurdasani, D. et al. Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell 179 , 984–1002.e36 (2019).

Gurdasani, D. et al. Genomics of disease risk in globally diverse populations. Nat. Rev. Genet. 20 , 520–535 (2019).

Gulsuner, S. et al. Genetics of schizophrenia in the South African Xhosa. Science 367 , 569–573 (2020).

Shriner, D. & Rotimi, C. N. Whole-genome-sequence-based haplotypes reveal single origin of the sickle allele during the Holocene wet phase. Am. J. Hum. Genet. 102 , 547–556 (2018).

Wu, Y. et al. Highly efficient therapeutic gene editing of human haematopoietic stem cells. Nat. Med. 25 , 776–783 (2019).

Geard, A. et al. Clinical and genetic predictors of renal dysfunctions in sickle cell anaemia in Cameroon. Br. J. Haematol. 178 , 629–639 (2017).

Lebeko, K. et al. Targeted genomic enrichment and massively parallel sequencing identifies novel nonsyndromic hearing impairment pathogenic variants in Cameroonian families. Clin. Genet. 90 , 288–290 (2016).

Genovese, G. et al. Association of trypanolytic ApoL1 variants with kidney disease in African Americans. Science 329 , 841–845 (2010).

Sierra, B. et al. OSBPL10, RXRA and lipid metabolism confer African-ancestry protection against dengue haemorrhagic fever in admixed Cubans. PLoS Pathog. 13 , e1006220 (2017).

Wonkam, A. & de Vries, J. Returning incidental findings in African genomics research. Nat. Genet. 52 , 17–20 (2020).

Provine, W. B. The Origins of Theoretical Population Genetics (University of Chicago Press, 1971).

Fisher, R. A. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. 52 , 399–433 (1918).

Altenburg, E. & Muller, H. J. The genetic basis of truncate wing – an inconstant and modifiable character in Drosophila. Genetics 5 , 1–59 (1920).

Morton, N. E. Analysis of family resemblance. I. Introduction. Am. J. Hum. Genet. 26 , 318–330 (1974).

Visscher, P. M. et al. 10 Years of GWAS discovery: biology, function and translation. Am. J. Hum. Genet. 101 , 5–22 (2017).

Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169 , 1177–1186 (2017).

Emison, E. S. et al. A common, sex-dependent mutation in a putative RET enhancer underlies Hirschsprung disease susceptibility. Nature 434 , 857–863 (2005).

Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337 , 1190–1195 (2012).

Davidson, E. Emerging properties of animal gene regulatory networks. Nature 468 , 911–920 (2010).

Chatterjee, S. et al. Enhancer variants synergistically drive dysregulation of the RET gene regulatory network in Hirschsprung disease. Cell 167 , 355–368 (2016).

Segal, E., Raveh-Sadka, T., Schroeder, M., Unnerstall, U. & Gaul, U. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451 , 535–540 (2008).

Chakravarti, A. & Turner, T. N. Revealing rate-limiting steps in complex disease biology: The crucial importance of studying rare, extreme-phenotype families. Bioessays 38 , 578–586 (2016).

Lancaster, M. A. et al. Cerebral organoids model human brain development and microcephaly. Nature 501 , 373–379 (2013).

Rothman, J. & Jarriault, S. Developmental plasticity and cellular reprogramming in Caenorhabditis elegans. Genetics 213 , 723–757 (2019).

Porrello, E. R. et al. Transient regenerative potential of the neonatal mouse heart. Science 331 , 1078–1080 (2011).

Mir, M., Bickmore, W., Furlong, E. E. M. & Narlikar, G. Chromatin topology, condensates and gene regulation: shifting paradigms or just a phase? Development 146 , dev182766 (2019).

Ghavi-Helm, Y. et al. Highly rearranged chromosomes reveal uncoupling between genome topology and gene expression. Nat. Genet. 51 , 1272–1282 (2019).

Despang, A. et al. Functional dissection of the Sox9-Kcnj2 locus identifies nonessential and instructive roles of TAD architecture. Nat. Genet. 51 , 1263–1271 (2019).

Hnisz, D., Shrinivas, K., Young, R. A., Chakraborty, A. K. & Sharp, P. A. A phase separation model for transcriptional control. Cell 169 , 13–23 (2017).

Shrinivas, K. et al. Enhancer features that drive formation of transcriptional condensates. Mol. Cell 75 , 549–561 e547 (2019).

Narlikar, G. J. Phase-separation in chromatin organization. J. Biosci. 45 , 5 (2020).

Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566 , 496–502 (2019).

Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360 , eaar3131 (2018).

Cusanovich, D. A. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555 , 538–542 (2018).

Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6 , 377–382 (2009).

Camp, J. G., Platt, R. & Treutlein, B. Mapping human cell phenotypes to genotypes with single-cell genomics. Science 365 , 1401–1405 (2019).

Lein, E., Borm, L. E. & Linnarsson, S. The promise of spatial transcriptomics for neuroscience in the era of molecular cell typing. Science 358 , 64–69 (2017).

Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33 , 495–502 (2015).

Achim, K. et al. High-throughput spatial mapping of single-cell RNA-seq data to tissue of origin. Nat. Biotechnol. 33 , 503–509 (2015).

Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357 , 661–667 (2017).

Cleary, B., Cong, L., Cheung, A., Lander, E. S. & Regev, A. Efficient generation of transcriptomic profiles by random composite measurements. Cell 171 , 1424–1436 e1418 (2017).

Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361 , 1380–1385 (2018).

Kester, L. & van Oudenaarden, A. Single-cell transcriptomics meets lineage tracing. Cell Stem Cell 23 , 166–179 (2018).

Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20 , 257–272 (2019).

Zhu, C., Preissl, S. & Ren, B. Single-cell multimodal omics: the power of many. Nat. Methods 17 , 11–14 (2020).

Anzalone, A. V., Koblan, L. W. & Liu, D. R. Genome editing with CRISPR-Cas nucleases, base editors, transposases and prime editors. Nat. Biotechnol. (2020).

Loeffler, D. et al. Asymmetric lysosome inheritance predicts activation of haematopoietic stem cells. Nature 573 , 426–429 (2019).

Schmidt, F., Cherepkova, M. Y. & Platt, R. J. Transcriptional recording by CRISPR spacer acquisition from RNA. Nature 562 , 380–385 (2018).

Holliday, R. & Pugh, J. E. DNA modification mechanisms and gene activity during development. Science 187 , 226–232 (1975).

Strahl, B. D. & Allis, C. D. The language of covalent histone modifications. Nature 403 , 41–45 (2000).

Jambhekar, A., Dhall, A. & Shi, Y. Roles and regulation of histone methylation in animal development. Nat. Rev. Mol. Cell Biol. 20 , 625–641 (2019).

Smith, Z. D. et al. Epigenetic restriction of extraembryonic lineages mirrors the somatic transition to cancer. Nature 549 , 543–547 (2017).

Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337 , 816–821 (2012).

Nabet, B. et al. The dTAG system for immediate and target-specific protein degradation. Nat. Chem. Biol. 14 , 431–441 (2018).

Clevers, H. Modeling development and disease with organoids. Cell 165 , 1586–1597 (2016).

Dekker, J., Marti-Renom, M. A. & Mirny, L. A. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14 , 390–403 (2013).

Banani, S. F., Lee, H. O., Hyman, A. A. & Rosen, M. K. Biomolecular condensates: organizers of cellular biochemistry. Nat. Rev. Mol. Cell Biol. 18 , 285–298 (2017).

Lupianez, D. G. et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 161 , 1012–1025 (2015).

Basu, S. et al. Unblending of transcriptional condensates in human repeat expansion disease. Cell 181 , 1062–1079 e1030 (2020).

Grosswendt, S. et al. Epigenetic regulator function through mouse gastrulation. Nature 584 , 102–108 (2020).

Johnson, T. B. & Coghill, R. D. Researches on pyrimidines. CIII. The discovery of 5-methyl-cytosine in tuberculinic acid, the nucleic acid of the tubercle bacillus. J. Am. Chem. Soc. 47 , 2838–2844 (1925).

Heard, E. et al. Ten years of genetics and genomics: what have we achieved and where are we heading? Nat. Rev. Genet. 11 , 723–733 (2010).

Quinn, J. J. & Chang, H. Y. Unique features of long non-coding RNA biogenesis and function. Nat. Rev. Genet. 17 , 47–62 (2016).

Kopp, F. & Mendell, J. T. Functional classification and experimental dissection of long noncoding RNAs. Cell 172 , 393–407 (2018).

Liu, S. J. et al. CRISPRi-based genome-scale identification of functional long noncoding RNA loci in human cells. Science 355 , eaah7111 (2017).

Rubin, A. J. et al. Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell 176 , 361–376.e17 (2019).

Quinn, J. J. et al. Rapid evolutionary turnover underlies conserved lncRNA-genome interactions. Genes Dev. 30 , 191–207 (2016).

Kirk, J. M. et al. Functional classification of long non-coding RNAs by k-mer content. Nat. Genet. 50 , 1474–1482 (2018).

Carter, A. C. et al. Spen links RNA-mediated endogenous retrovirus silencing and X chromosome inactivation. eLife 9 , e54508 (2020).

Lubelsky, Y. & Ulitsky, I. Sequences enriched in Alu repeats drive nuclear localization of long RNAs in human cells. Nature 555 , 107–111 (2018).

Shukla, C. J. et al. High-throughput identification of RNA nuclear enrichment sequences. EMBO J. 37 , e98452 (2018).

Czerminski, J. T. & Lawrence, J. B. Silencing Trisomy 21 with XIST in neural stem cells promotes neuronal differentiation. Dev. Cell 52 , 294–308 e3 (2020).

Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer https://doi.org/10.1038/s41568-020-0290-x (2020).

Wilkinson, M. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3 , 160018 (2016).

Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578 , 94–101 (2020).

Gonzalez-Perez, A., Radhakrishnan, S. & Lopez-Bigas, N. Local determinants of the mutational landscape of the human genome. Cell 177 , 101–114 (2019).

Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348 , 880–886 (2015).

Martincorena, I. et al. Somatic mutant clones colonize the human esophagus with age. Science 362 , 911–917 (2018).

Yokoyama, A. et al. Age-related remodelling of oesophageal epithelia by mutated cancer drivers. Nature 565 , 312–317 (2019).

Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371 , 2477–2487 (2014).

Jaiswal, S. et al. Age related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371 , 2488–2498 (2014).

Sabarinathan, R. et al. The whole-genome panorama of cancer drivers. Preprint at bioRxiv https://doi.org/10.1101/190330 (2017).

Pich, O. et al. The mutational footprints of cancer therapies. Nat. Genet. 51 , 1732–1740 (2019).

Campbell, P. J. et al. Pan-cancer analysis of whole genomes. Nature 578 , 82–93 (2020).

Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550 , 451–453 (2017).

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489 , 57–74 (2012).

Damodaran, S. et al. Cancer Driver Log (CanDL): catalog of potentially actionable cancer mutations. J. Mol. Diagn. 17 , 554–559 (2015).

High, K. A. & Roncarolo, M. G. Gene therapy. N. Engl. J. Med. 381 , 455–464 (2019).

Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26 , 29–38 (2020).

Kotler, E. et al. A systematic p53 mutation library links differential functional impact to cancer mutation pattern and evolutionary conservation. Mol. Cell 71 , 873 (2018).

Swanson, J. M. The UK Biobank and selection bias. Lancet 380 , 110 (2012).

Maini Rekdal, V., Bess, E. N., Bisanz, J. E., Turnbaugh, P. J. & Balskus, E. P. Discovery and inhibition of an interspecies gut bacterial pathway for levodopa metabolism. Science 364 , eaau6323 (2019).

Hernán, M. A. & Robins, J. M. Using big data to emulate a target trial when a randomized trial is not available. Am. J. Epidemiol. 183 , 758–764 (2016).

Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47 , 856–860 (2015).

Kim, J.-S. Genome editing comes of age. Nat. Protoc. 11 , 1573–1578 (2016).

Kim, D. et al. Digenome-seq: genome-wide profiling of CRISPR-Cas9 off-target effects in human cells. Nat. Methods 12 , 237–243 (2015).

Tsai, S. Q. et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat. Biotechnol. 33 , 187–197 (2015).

Wienert, B. et al. Unbiased detection of CRISPR off-targets in vivo using DISCOVER-Seq. Science 364 , 286–289 (2019).

Kosicki, M., Tomberg, K. & Bradley, A. Repair of double-strand breaks induced by CRISPR-Cas9 leads to large deletions and complex rearrangements. Nat. Biotechnol. 36 , 765–771 (2018).

Komor, A. C. et al. Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533 , 420–424 (2016).

Nishida, K. et al. Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science 353 , aaf8729 (2016).

Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576 , 149–157 (2019).

Ma, H. et al. Correction of a pathogenic gene mutation in human embryos. Nature 548 , 413–419 (2017).


Acknowledgements

A.C. acknowledges that the ideas in his contribution were developed through studies on Hirschsprung disease and thanks the many trainees who have contributed to this work over the past 5 years. A.L.M. acknowledges A. Gutierrez, K. Kostick, G. Lazaro, M. Majumder, K. Munoz, S. Pereira, H. Smith and P. Zuk for feedback. A.M. thanks D. Hnisz, Z. D. Smith, J. Charlton and H. Kretzmer for feedback and the Max Planck Society for funding. A.W. is supported by NIH awards U54HG009790, U01HG009716, U01HG007459 and U24HL135600, and Wellcome Trust award H3A/18/001, and states that the funders had no role in study design and analysis, decision to publish or preparation of the manuscript. B.T. acknowledges J. G. Camp for helpful discussions. E.E.M.F. is very grateful to A. Ephrussi, M. Mir, M. Perino, Y. Kherdjemil, T. Pollex and S. Secchia for useful comments. E.E.M.F. is supported by European Research Council (Advanced Grant) agreement no. 787611 (DeCRyPT). E.S. is supported by grants from the European Research Council and the Israel Science Foundation. H.Y.C. is supported by NIH RM1-HG007735 and R35-CA209919. H.Y.C. is an investigator of the Howard Hughes Medical Institute. J.-S.K. is supported by the Institute for Basic Science (IBS-R021-D1). N.L.-B. acknowledges funding from the European Research Council (Consolidator Grant 682398), the Spanish Ministry of Economy and Competitiveness (SAF2015-66084-R, European Regional Development Fund) and the Asociación Española Contra el Cáncer (GC16173697BIGA). S.A.T. is funded by NIH grants R35 GM134957-01 and NIAMS R01AR076241-01A1 and American Diabetes Association Pathway to Stop Diabetes grant #1-19-VSN-02.

Author information

Authors and affiliations.

Center for Medical Ethics and Health Policy, Baylor College of Medicine, Houston, TX, USA

Amy L. McGuire

Broad Institute of MIT and Harvard, Cambridge, MA, USA

Stacey Gabriel & Alexander Meissner

Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA

Sarah A. Tishkoff

Department of Biology, University of Pennsylvania, Philadelphia, PA, USA

Department of Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa

Ambroise Wonkam

Institute of Infectious Diseases and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa

Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA

Aravinda Chakravarti

European Molecular Biology Laboratory, Genome Biology Department, Heidelberg, Germany

Eileen E. M. Furlong

Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

Barbara Treutlein

Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany

Alexander Meissner

Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA

Institute of Chemistry and Biochemistry, Freie Universität Berlin, Berlin, Germany

Center for Personal Dynamic Regulomes, Howard Hughes Medical Institute, Stanford University, Stanford, CA, USA

Howard Y. Chang

Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain

Núria López-Bigas

Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain

Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain

Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel

Center for Genome Engineering, Institute for Basic Science, Daejon, Republic of Korea

Jin-Soo Kim


Corresponding authors

Correspondence to Amy L. McGuire, Stacey Gabriel, Sarah A. Tishkoff, Ambroise Wonkam, Aravinda Chakravarti, Eileen E. M. Furlong, Barbara Treutlein, Alexander Meissner, Howard Y. Chang, Núria López-Bigas, Eran Segal or Jin-Soo Kim.

Ethics declarations

Competing interests.

H.Y.C. is a co-founder of Accent Therapeutics and Boundless Bio and an advisor of 10x Genomics, Arsenal Biosciences and Spring Discovery. J.-S.K. is a co-founder of and holds stock in ToolGen Inc. A.C., A.L.M., A.M., A.W., B.T., E.E.M.F., E.S., N.L.-B., S.G. and S.A.T. declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Beyond 1 Million Genomes : https://b1mg-project.eu/

Blueprint Epigenome : https://www.blueprint-epigenome.eu/

cBioPortal for Cancer Genomics : https://www.cbioportal.org/

ENCODE : https://www.encodeproject.org/

Global Alliance for Genomics and Health : https://www.ga4gh.org/

gnomAD : https://gnomad.broadinstitute.org/

GTEx : https://www.gtexportal.org/home/

GWAS Catalog : https://www.ebi.ac.uk/gwas

H3Africa : https://h3africa.org

Roadmap Epigenomics Project : http://www.roadmapepigenomics.org/


About this article

Cite this article.

McGuire, A.L., Gabriel, S., Tishkoff, S.A. et al. The road ahead in genetics and genomics. Nat Rev Genet 21 , 581–596 (2020). https://doi.org/10.1038/s41576-020-0272-6


Accepted : 21 July 2020

Published : 24 August 2020

Issue Date : October 2020

DOI : https://doi.org/10.1038/s41576-020-0272-6




  • Open access
  • Published: 05 April 2024

Yak genome database: a multi-omics analysis platform

  • Hui Jiang 1,2
  • Zhi-Xin Chai 3
  • Xiao-Ying Chen 1,2
  • Cheng-Fu Zhang 1,2
  • Yong Zhu 1,2
  • Qiu-Mei Ji 1,2
  • Jin-Wei Xin 1,2

BMC Genomics volume 25, Article number: 346 (2024)

The yak (Bos grunniens) is a large ruminant species that lives in high-altitude regions and is well adapted to plateau environments. To further understand the genetic characteristics and adaptive mechanisms of the yak, we have developed a multi-omics database that includes genome, transcriptome, proteome, and DNA methylation data.

Description

The Yak Genome Database (http://yakgenomics.com/) integrates genome, transcriptome, proteome, and DNA methylation research results and provides an integrated platform for researchers to share and exchange omics data. The database contains 26,518 genes, 62 transcriptomes, 144,309 proteome spectra, and 22,478 methylation sites of yak. The genome module provides access to yak genome sequences, gene annotations and variant information. The transcriptome module offers transcriptome data from various tissues of yak and cattle strains at different developmental stages. The proteome module presents protein profiles from diverse yak organs. The DNA methylation module shows DNA methylation information at each base of the whole genome. The database supports data downloading and browsing, functional gene exploration, and experimental practice.

This comprehensive database provides a valuable resource for further investigations on development, molecular mechanisms underlying high-altitude adaptation, and molecular breeding of yak.


Although a single-omics study provides information and insights into specific biological or molecular processes, one data type alone rarely confirms the molecular mechanisms underlying the functionality of an organism or the relationships between biological processes and environmental factors. Integrating and analyzing multiple omics datasets offers life science researchers an effective and systematic approach. In general, genomics provides DNA sequence information, transcriptomics examines gene transcription patterns under specific conditions, proteomics explores the composition and expression levels of proteins in cells, and DNA methylation analysis examines chemical modifications of DNA molecules [1]. Multi-omics analysis combines data at these different levels to explore biological processes comprehensively. It reveals connections between genomic, transcriptomic, proteomic, and DNA methylation data, helping to show how genomic variation affects gene transcription and protein expression and how DNA methylation is associated with gene activity [2]. This information adds to our knowledge of gene regulatory networks, which are important to the molecular mechanisms underlying biological function, development, metabolism, etiopathology, and environmental adaptation.

The yak (Bos grunniens) is a species unique to the Qinghai-Tibet Plateau and is widely distributed across the high-altitude areas of Western China and neighboring regions. As a large mammal living at the highest altitudes, the yak has adapted to a harsh, cold environment over thousands of years of evolution [3]. Its unique biological features make it an ideal model for studying adaptive evolution and high-altitude ecosystems. The yak also plays important roles in agriculture and economic development. As a significant livestock species, it provides meat, fur, and other economic resources, and its dung is an important source of agricultural fertilizer and energy. Moreover, yak grazing contributes positively to the ecological balance and vegetation restoration of the plateau grasslands [4]. In recent years, we have analyzed yak using different omics approaches. These studies gave a preliminary picture of yak genetic characteristics, gene transcription, protein expression, and DNA methylation patterns, as well as molecular regulatory mechanisms in response to different conditions [5, 6, 7, 8, 9, 10, 11], offering novel insights into the mechanisms underlying evolution and high-altitude adaptation in yak.

Currently, yak omics data are generally stored in raw format in public databases such as NCBI. These databases primarily provide storage and retrieval functions but lack a platform for data integration and in-depth analysis. Hu et al. [12] developed a yak genome database (http://me.lzu.edu.cn/yak) that incorporated genome sequences, predicted genes and associated annotations, non-coding RNA sequences, transposable elements, and single nucleotide variants of yak, as well as three-way whole-genome alignments between human, cattle and yak. However, that database did not include other omics datasets, such as transcriptome, proteome, and DNA methylation data. Given the vast and diverse nature of omics data, traditional database retrieval methods cannot fully explore the relationships between different types of datasets [13]. An integrated platform for different omics data is therefore crucial to facilitate data integration, interaction, and analysis. Such a platform can also offer advanced data mining and machine learning algorithms to help researchers discover the complex relationships among the genomic, transcriptomic, proteomic, and other omics levels in yak, further deepening our understanding of biological processes and diseases in yak.

In this study, we constructed the Yak Genome Database (http://yakgenomics.com/), which is built on a fine-scale, chromosome-level yak genome map assembled using PacBio sequencing, Illumina sequencing, Bionano assembly, and Hi-C three-dimensional genome scaffolding. The platform also integrates transcriptome, proteome, and DNA methylation data of yak, which were not available in the Yak Genome Database developed by Hu et al. [12]. This database provides basic information for future yak research, such as molecular breeding, molecular evolution, and disease prevention and control.

Construction and content

The Yak Genome Database was deployed on the Ubuntu 20.04 operating system using AKKA 2.13 (web server), MySQL 8.0.30 (database server), Scala 2.13.2, and SBT 1.3.9. All data were managed and stored using the MySQL Database Management System. The query function was implemented on the Slick 3.3.2 middleware tier. JBrowse 1.16.11 was used to visualize the genome. The website interfaces were designed and implemented using Bootstrap 4.6.0 and the Play Framework 2.8.7. The software versions and statistical tools used for data analyses and plot preparation are described in Xin et al. [6, 7, 8, 9, 10, 11]. The boxplots and heatmaps were prepared using R 4.2.1. The website has been tested in several popular web browsers, including Firefox, Google Chrome, and Internet Explorer.

Utility and discussion

The yak genome database content.

The multi-omics data in the Yak Genome Database are organized into two central functional domains: data resources and navigation (Fig. 1). The data resources comprise four main modules: genome, transcriptome, proteome, and methylation. The database contains 26,518 genes, 62 transcriptomes, 144,309 proteome spectra, and 22,478 methylation sites of yak. The navigation page consists of the Browse, JBrowse, Search and Blast functions. Currently, the database supports individual download of images and gene data. In the future, we will add functions such as one-click download of whole-genome information.

Fig. 1. The homepage of the Yak Genome Database

Genome module

The Genome module incorporates the complete genomic DNA sequence of yak obtained with a third-generation high-throughput sequencing platform (PacBio RSII) [14]. The yak genome was sequenced at 70× coverage, and second-generation sequencing data were used to correct errors. Bionano-assisted assembly was used to produce a high-quality assembly. A refined physical map of the yak chromosomes was then generated, providing a more readable and complete genome resource than the fragmented information in another yak genome resource (BosGru_v2.0) [15] and contributing a novel genome tool for yak researchers.

When users open the ‘Genome’ section from the homepage, a new page displays information on genes at all locations, such as Gene ID, Chromosome, Start Position, End Position, Strand, GO (Gene Ontology) terms, Interpro, KEGG (Kyoto Encyclopedia of Genes and Genomes), Swissprot, and Trembl, in a user-friendly table format (Fig. 2A). Clicking a gene opens detailed information on that gene, including annotations, transcriptional levels, proteome data, the JBrowse page, and nucleotide sequences associated with the gene (Fig. 2B–2D). The ‘Annotation’ tab provides comprehensive gene annotation information, including GO terms, KEGG pathways, and Interpro annotations, which can be explored further by clicking on them. The ‘Expression’ tab displays gene expression levels across different cattle breeds and tissues, and users can download the images in various formats from the menu in the upper right corner of each image. ‘Jbrowse’ displays integrated information from annotated genomic datasets, while ‘Seqs’ provides the coding sequence (CDS) and protein sequence of the selected gene.

Fig. 2. Features of the genome module. (A) Genome browse. (B) Basic information and annotation of a gene. (C) Gene expression. (D) Gene JBrowse and sequences

Transcriptome module

Previously, comparative transcriptome sequencing was performed on lung, gluteal muscle, and mammary gland tissues of low-altitude cattle (Sanjiang and Holstein cattle), Tibetan cattle (living at a moderate altitude), and yaks (living at a high altitude). In addition, these tissues from yaks of different ages (6, 30, 60, and 90 months) were also subjected to transcriptome sequencing. These analyses identified functional genes in the major biochemical, metabolic, and signal transduction pathways involved in yak development and high-altitude adaptation [10, 11]. These data are included in the transcriptome module on the website, providing a valuable transcriptome resource for tissue-specific biomarkers, molecular research, and breeding of yaks. After clicking the ‘Transcriptome’ button, users can select the strain in the ‘Sample’ dialog box, enter the gene ID in the ‘Gene ID’ dialog box, and click ‘Search’ (Fig. 3A). The website then returns the transcriptional levels of the selected genes in the selected samples as a data table, box plot, line plot, and heatmap (Fig. 3B–D).
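As a rough illustration of what a box plot view of expression summarizes, the sketch below computes a five-number summary per sample group in Python. The sample names and expression values are hypothetical placeholders rather than data from the database, and the published plots were prepared in R, so this is only an assumed equivalent calculation.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of expression values."""
    ordered = sorted(values)
    # "inclusive" treats the data as the full population, so the quartiles
    # always lie between the observed minimum and maximum.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical expression values for one gene in two sample groups.
expression = {
    "yak_lung": [1.0, 2.0, 3.0, 4.0, 5.0],
    "holstein_lung": [2.0, 4.0, 6.0, 8.0, 10.0],
}

summaries = {sample: boxplot_summary(vals) for sample, vals in expression.items()}

for sample, summary in summaries.items():
    print(sample, summary)
```

Each tuple holds the values a box plot draws as whiskers, box edges, and median line for that sample group.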

Fig. 3. Features of the transcriptome module. (A) Transcriptome browse. (B) Box plot, (C) Line plot and (D) Heatmap of gene expression

Proteome module

Using liquid chromatography-mass spectrometry (LC-MS), proteomic analyses were conducted on four specific tissues from four bovine groups (yak, Tibetan cattle, Sanjiang cattle, and Holstein cattle) [7, 8, 9]. All the animals were female and 60 months of age. The proteome module provides two input dialog boxes. Users select two samples and click the ‘Search’ button. The website then returns a comparison of the expression levels of all genes in the two selected samples, including log2(fold change) and statistical parameters (Fig. 4).
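The two-sample comparison described here can be sketched in a few lines of Python. This is a minimal illustration of a log2(fold change) calculation only: the protein identifiers, abundance values, and the pseudocount are assumptions made for the example, and the statistical parameters the website reports are not reproduced.

```python
import math


def log2_fold_change(abundance_a, abundance_b, pseudocount=1.0):
    """log2 of (sample B / sample A); the pseudocount (an assumption here)
    avoids division by zero for proteins undetected in one sample."""
    return math.log2((abundance_b + pseudocount) / (abundance_a + pseudocount))


def compare_samples(sample_a, sample_b, pseudocount=1.0):
    """Compare every protein quantified in both samples."""
    shared = sorted(set(sample_a) & set(sample_b))
    return {
        protein: log2_fold_change(sample_a[protein], sample_b[protein], pseudocount)
        for protein in shared
    }


# Hypothetical abundances keyed by placeholder protein IDs.
holstein_lung = {"protein_1": 3.0, "protein_2": 7.0, "protein_3": 10.0}
yak_lung = {"protein_1": 15.0, "protein_2": 7.0, "protein_4": 2.0}

result = compare_samples(holstein_lung, yak_lung)
for protein, lfc in result.items():
    print(f"{protein}\tlog2FC={lfc:+.2f}")
```

A positive value means higher abundance in the second sample; proteins present in only one sample are skipped in this sketch.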

Fig. 4. Browse of the proteome module

Methylation module

DNA methylation is a critical epigenetic modification that occurs in both animals and plants, playing pivotal roles in chromosome structure, gene expression and regulation [16]. A comprehensive DNA methylation database for yak can significantly advance the understanding of cellular gene expression and regulation, and provide deeper insights into the spatiotemporal specificity of DNA methylation across developmental stages and organs [17]. The DNA methylation database of yak presents single-base methylation maps and tissue-specific methylation maps. The single-base methylation maps include: 1) DNA methylation levels at single-base resolution, 2) DNA methylation levels specific to different base types, 3) DNA methylation levels specific to different gene structures, 4) DNA methylation levels in repetitive sequences, and 5) DNA methylation levels in non-coding sequences and regulatory regions. The tissue-specific methylation maps cover three tissues: mammary gland, lung, and muscle [6]. On the website, users can select ‘Sample’ and ‘Chromosome’ in the Methylation module, set the ‘Start Position’ and ‘End Position’, and click ‘Search’ to obtain the DNA methylation results for the selected region (Fig. 5).
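The region query described above (sample, chromosome, start and end position) amounts to filtering per-base records by coordinates. The Python sketch below shows one way to express that; the record layout, field names, and values are hypothetical and do not reflect the database's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethylationSite:
    sample: str        # e.g. tissue name
    chromosome: str
    position: int      # 1-based coordinate
    level: float       # fraction of reads methylated, 0.0 to 1.0


def query_region(sites, sample, chromosome, start, end):
    """Return sites for one sample inside [start, end], ordered by position."""
    hits = (
        s for s in sites
        if s.sample == sample and s.chromosome == chromosome
        and start <= s.position <= end
    )
    return sorted(hits, key=lambda s: s.position)


def mean_level(sites):
    """Average methylation level, or None when the region has no sites."""
    return sum(s.level for s in sites) / len(sites) if sites else None


# Hypothetical records for illustration only.
sites = [
    MethylationSite("lung", "Chr1", 1500, 0.75),
    MethylationSite("lung", "Chr1", 1200, 0.25),
    MethylationSite("lung", "Chr1", 9000, 0.90),
    MethylationSite("muscle", "Chr1", 1300, 0.10),
]

region = query_region(sites, "lung", "Chr1", 1000, 2000)
print([s.position for s in region], mean_level(region))
```

A production system would push this filter into an indexed database query instead of scanning a list; the list scan is used here only to keep the example self-contained.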

Fig. 5. Features of the methylation module. (B) Box plot, (C) Line plot and (D) Heatmap of methylated gene expression

‘Browse’ allows users to read the yak genome directly. ‘JBrowse’ is a next-generation genome browser built with JavaScript and HTML5. The JBrowse instance of the Yak Genome Database includes tracks describing genes, gene sequences, mRNAs, structure, and other gene-related features, and provides a graphical display of annotations on the yak genome (Fig. 6). Users can browse gene models on chromosomes and unanchored contigs. For example, if a user sets the genomic region from 4,454,001 bp to 5,878,000 bp on Chr1, all genes in this region appear in order (Fig. 6A). Clicking on ‘BmuPB021145’ opens an extra layer with detailed information, such as mRNAs, CDS and other features (Fig. 6B). For more operational details, users can click the ‘Help’ button, which provides comprehensive instructions and guidance.

Fig. 6. Regional view of the genome using JBrowse. (A) A graphic view of the region 4,454,001 bp to 5,878,000 bp on Chr1. (B) The interface after clicking on ‘BmuPB021145’.

The ‘Search’ tab offers two methods for searching the genome: by gene ID or by range. Clicking ‘Blast’ displays three options: ‘Blastn Gene’, ‘Blastn Genome’ and ‘Blastp’. Users can select the Blast type, enter a DNA or protein sequence, and set the ‘Evalue’, ‘Word size’ and ‘Max target seqs’ parameters. After the user clicks the ‘Search’ button, the nucleotide or protein sequences that meet the search conditions are displayed and can be downloaded.
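The two genome search modes (by gene ID or by range) map naturally onto two SQL queries. The sketch below uses Python's built-in sqlite3 with an in-memory table as a stand-in for the MySQL backend; the table name, column names, gene IDs, and coordinates are all hypothetical, and only the Chr1 query range is taken from the browsing example in the text.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory gene table with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (gene_id TEXT PRIMARY KEY, chromosome TEXT, "
        "start_pos INTEGER, end_pos INTEGER, strand TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [
            ("demo_gene_1", "Chr1", 1000000, 1005000, "+"),
            ("demo_gene_2", "Chr1", 4460000, 4470000, "-"),
            ("demo_gene_3", "Chr1", 5800000, 5880000, "+"),
            ("demo_gene_4", "Chr2", 4500000, 4510000, "+"),
        ],
    )
    return conn


def search_by_id(conn, gene_id):
    """Return the full row for one gene ID, or None if it is absent."""
    return conn.execute(
        "SELECT gene_id, chromosome, start_pos, end_pos, strand "
        "FROM gene WHERE gene_id = ?",
        (gene_id,),
    ).fetchone()


def search_by_range(conn, chromosome, start, end):
    """Return IDs of genes overlapping [start, end], in genomic order."""
    rows = conn.execute(
        "SELECT gene_id FROM gene "
        "WHERE chromosome = ? AND end_pos >= ? AND start_pos <= ? "
        "ORDER BY start_pos",
        (chromosome, start, end),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(search_by_id(conn, "demo_gene_2"))
print(search_by_range(conn, "Chr1", 4454001, 5878000))
```

The range query uses the standard interval-overlap condition, so a gene that only partly falls inside the window is still returned. Whether the website applies overlap or strict containment is not stated in the text.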

Additional tools

The Yak Genome Database also provides several convenient online tools: Primer designer, GO enrichment and KEGG enrichment. The ‘Primer designer’ tool designs primers to amplify a selected sequence. The ‘GO enrichment’ and ‘KEGG enrichment’ tools return GO and KEGG enrichment results for a set of genes.
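The text does not state which statistical test the enrichment tools use. A one-sided hypergeometric test is a common choice for GO and KEGG enrichment, and the Python sketch below assumes it purely for illustration. The background size of 26,518 is the gene count reported for the database; the other counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical query: 50 genes of interest, 5 of which carry a term that
# annotates 200 of the 26,518 genes in the background.
p_value = hypergeom_enrichment_p(population=26518, annotated=200, sample=50, hits=5)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice the p-values for all tested terms would also be corrected for multiple testing, which this sketch omits.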

Maintenance of the yak genome database in future

To ensure continuous operation of the Yak Genome Database, we will assign an administrator to manage the website regularly. We will continue omics studies on yak, and all the omics data we obtain will be uploaded to this database. In addition, we will maintain our existing collaborations and seek new collaborators who work on yak. Other researchers are also encouraged to contribute their progress in yak omics to this database.

Conclusions

The Yak Genome Database is a comprehensive platform, built on a genomic physical map, that integrates genome, transcriptome, proteome, and DNA methylation data. Information in the database can be downloaded and shared through the Internet. Users who want to upload their own data can contact the administrator of the website. By providing timely updates on yak research progress, the Yak Genome Database enables efficient and interactive sharing of existing scientific data among researchers worldwide who are interested in yak, cattle, livestock, ruminant animals, and even medical research. Comparative analysis of multidimensional data from key yak tissues aims to uncover the mechanisms underlying high-altitude adaptation, disease resistance, cold tolerance, and starvation resistance of large animals on the plateau. These findings contribute to molecular breeding of livestock animals and to the understanding of human responses to harsh environments.

Data availability

The datasets generated and analyzed in the current study are freely available on the Download page of Yak database with the web link: http://yakgenomics.com/ .

Abbreviations

CDS: Coding Sequence

GO: Gene Ontology

KEGG: Kyoto Encyclopedia of Genes and Genomes

NCBI: National Center for Biotechnology Information

References

1. Manzoni C, Kia DA, Vandrovcova J, Hardy J, Wood NW, Lewis PA, et al. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief Bioinform. 2018;19(2):286–302.

2. Liao Y, Wang J, Zou J, Liu Y, Liu Z, Huang Z. Multi-omics analysis reveals genomic, clinical and immunological features of SARS-CoV-2 virus target genes in pan-cancer. Front Immunol. 2023;14:1112704.

3. Ge Q, Guo Y, Zheng W, Zhao S, Cai Y, Qi X. Molecular mechanisms detected in yak lung tissue via transcriptome-wide analysis provide insights into adaptation to high altitudes. Sci Rep. 2021;11(1):7786.

4. Ayalew W, Chu M, Liang C, Wu X, Yan P. Adaptation mechanisms of yak (Bos grunniens) to high-altitude environmental stress. Animals. 2021;11(8):2344.

5. Gao X, Wang S, Wang YF, Li S, Wu SX, Yan RG, et al. Long read genome assemblies complemented by single cell RNA-sequencing reveal genetic and cellular mechanisms underlying the adaptive evolution of yak. Nat Commun. 2022;13(1):4887.

6. Xin J, Chai Z, Zhang C, Zhang Q, Zhu Y, Cao H, et al. Methylome and transcriptome profiles in three yak tissues revealed that DNA methylation and the transcription factor ZGPAT co-regulate milk production. BMC Genom. 2020;21(1):731.

7. Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Signature of high altitude adaptation in the gluteus proteome of the yak. J Exp Zool B Mol Dev Evol. 2020;334(6):362–72.

8. Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Differences in proteomic profiles between yak and three cattle strains provide insights into molecular mechanisms underlying high-altitude adaptation. J Anim Phys Anim Nutr. 2022;106(3):485–93.

9. Xin JW, Chai ZX, Zhang CF, Yang YM, Zhang Q, Zhu Y, et al. Comparative analysis of skeleton muscle proteome profile between yak and cattle provides insight into high-altitude adaptation. Curr Proteom. 2021;18(1):62–70.

10. Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Transcriptome profiles revealed the mechanisms underlying the adaptation of yak to high-altitude environments. Sci Rep. 2019;9(1):7558.

11. Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Comparisons of lung and gluteus transcriptome profiles between yaks at different ages. Sci Rep. 2019;9(1):14213.

12. Hu Q, Ma T, Wang K, Xu T, Liu J, Qiu Q. The yak genome database: an integrative database for studying yak biology and high-altitude adaption. BMC Genomics. 2012;13:600.

13. Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat Comp Sci. 2021;1(6):395–402.

14. Ji QM, Xin JW, Chai ZX, Zhang CF, Dawa Y, Luo S, et al. A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak. Mol Ecol Res. 2021;21(1):201–11.

15. Jiangfeng F, Yuzhu L, Sijiu Y, Yan C, Gengquan X, Libin W, et al. Transcriptional profiling of two different physiological states of the yak mammary gland using RNA sequencing. PLoS ONE. 2018;13(7):e0201628.

16. Lucibelli F, Valoroso MC, Aceto S. Plant DNA methylation: an epigenetic mark in development, environmental interactions, and evolution. Int J Mol Sci. 2022;23(15):8299.

17. Chai Z, Wu Z, Ji Q, Wang J, Wang J, Wang H, et al. Genome-wide DNA methylation and hydroxymethylation changes revealed epigenetic regulation of neuromodulation and myelination in yak hypothalamus. Front Genet. 2021;12:592135.

This work was supported by the Program of Provincial Department of Finance of the Tibet Autonomous Region, the Major Special Projects of Tibet Autonomous Region (XZ202101ZD0002N-01), the Second Tibetan Plateau Scientific Expedition and Research Program (2019QZKK0501), and the program National Beef Cattle and Yak Industrial Technology System (CARS-37).

Author information

Hui Jiang, Zhi-Xin Chai and Xiao-Ying Chen contributed equally to this work.

Authors and Affiliations

State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China

Hui Jiang, Xiao-Ying Chen, Cheng-Fu Zhang, Yong Zhu, Qiu-Mei Ji & Jin-Wei Xin

Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China

Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization, Sichuan Province and Ministry of Education, Southwest Minzu University, 610041, Chengdu, Sichuan, China

Zhi-Xin Chai


Contributions

HJ and ZXC performed the analysis. XYC conducted the database. CFZ and YZ wrote the paper. QMJ supervised the database and JWX revised the manuscript. All authors have approved the final article.

Corresponding authors

Correspondence to Qiu-Mei Ji or Jin-Wei Xin.

Ethics declarations

Ethics approval and consent to participate.

All procedures and experiments involving animals followed the guidelines for the Care and Use of Laboratory Animals. The Ethics Committee at the Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences (Permit Number: 2015-216) approved this study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Jiang, H., Chai, ZX., Chen, XY. et al. Yak genome database: a multi-omics analysis platform. BMC Genomics 25 , 346 (2024). https://doi.org/10.1186/s12864-024-10274-6


Received : 30 October 2023

Accepted : 31 March 2024

Published : 05 April 2024

DOI : https://doi.org/10.1186/s12864-024-10274-6


  • Multi-omics
  • Plateau environment

BMC Genomics

ISSN: 1471-2164


ORIGINAL RESEARCH article

Evaluation of stripe rust resistance and genome-wide association study in wheat varieties derived from the International Center for Agricultural Research in the Dry Areas

Zhonghao Gao

  • 1 School of Ecological and Environmental Engineering, Qinghai University, Xining, Qinghai, China
  • 2 Qinghai Academy of Agriculture and Forestry Science, Qinghai University, Xining, Qinghai, China
  • 3 National Crop Germplasm Resources Duplicate, Xining, Qinghai, China

A total of 159 wheat varieties obtained from ICARDA were evaluated for resistance to stripe rust races CYR32, CYR33, and CYR34. Seedling resistance was assessed in the greenhouse at the two-leaf stage. Adult-plant resistance was assessed in 2022 and 2023 in Xining and Guide. A total of 24,151 high-quality SNP loci were obtained from 55K SNP chip data, and a genome-wide association study was carried out between the SNP loci and stripe rust resistance. Seedling resistance screening revealed that 91.8% (146) of the wheat varieties were resistant to CYR32 and CYR33, while only 50.3% (80) were resistant to CYR34. At the adult-plant stage, 153 (96.2%) germplasms showed resistance in 2022, while only 85 (53.4%) showed resistance in 2023. The association study, which combined the 55K SNP chip data with disease ratings of the 159 materials at both the seedling and adult-plant stages, discovered 593 loci related to stripe rust resistance (P ≤ 0.0001). These loci exhibited contribution rates ranging from 11.1% to 18.7%. Among them, 71 were significantly related to resistance against CYR32 at the seedling stage, with contribution rates of 12.7%-17.2%. The vast majority, 518 loci distributed across 21 chromosomes, were significantly related to resistance against CYR33 at the seedling stage, with contribution rates of 12.6%-18.7%. Fewer loci were associated with disease resistance in adult plants. In the E1 environment, a single locus was detected on chromosome 2B with a contribution rate of 14.4%. In the E2 environment, three loci were detected across chromosomes 2B, 4A, and 7B, with contribution rates ranging from 11.1% to 16.9%. A total of 68 multi-effect loci were significantly related to resistance against both CYR32 and CYR33 at the seedling stage, and one stable locus was significantly associated with stripe rust resistance at the adult-plant stage.

1 Introduction

As one of the world’s most important cereal crops, wheat plays a central role in global food security (Juliana et al., 2017). Wheat stripe rust is prevalent in China and is especially troublesome in provinces such as Qinghai, Guizhou, Shaanxi, Shanxi, Chongqing, and Sichuan (Dong et al., 2019). The disease is recognized as a significant threat to Chinese food security because of its wide distribution, rapid transmission, and capacity to cause substantial yield losses (Wang et al., 2016; Dai et al., 2019). Since its initial discovery in Europe in 1777, the pathogen has been reported in over 60 countries (Line, 2002; Chen, 2005). Recently, a total of 86 stripe rust resistance genes (Yr1-Yr86) have been reported, distributed across all chromosomes except chromosome 1A (Klymiuk et al., 2022). An additional 100 genes have been tentatively named but await official adoption. Moreover, researchers have mapped over 300 QTLs controlling wheat stripe rust resistance (Li et al., 2020; Feng et al., 2023; Zhu et al., 2023). Previous virulence analyses revealed that only Yr5 and Yr15 confer high resistance against races CYR32, CYR33, and CYR34 throughout the whole wheat growth period, while Yr10, Yr24, and Yr26 are effective only against CYR32 and CYR33. Yr9, Yr18, and Yr41 were ineffective against all three races at both the seedling and adult stages (Zeng et al., 2015; Wang et al., 2019). Previous research has tested thousands of wheat production varieties, reserve varieties, high-generation strains, and disease-resistant source materials for reactions to stripe rust at the adult-plant stage. Many varieties previously considered resistant to wheat stripe rust were found to be susceptible to CYR34, indicating that this pathotype exhibits strong pathogenicity and high parasitic potential. These findings suggested that CYR34 posed a serious threat to wheat production in China (Huang et al., 2018). Furthermore, a new physiological race of stripe rust, tentatively named “TAS-6”, has been reported to overcome the resistance conferred by Yr5 (Zhang et al., 2020). Therefore, uncovering new sources of genetic disease resistance for use in wheat breeding is key to ensuring global food security (Xu et al., 2020).

Plant breeders often rely on genome-wide association studies (GWAS), which allow them to analyze the genetic basis of traits at the population level by testing the significance of the association between genetic markers and phenotypic variation (Liu et al., 2017). Since 2001, GWAS has quickly become popular with plant researchers because of its speed, breadth, and accuracy (Thornsberry et al., 2001; Flint-Garcia et al., 2005). In 2006, researchers used a GWAS to discover influential QTLs in wheat for the first time (Breseghello and Sorrells, 2006). The widespread use of the technology has broadened our understanding of the plant. In one study, a total of 37 genes/QTLs, including 10 potential new QTLs, were detected using 616 spring wheat varieties and breeding lines (Liu L. et al., 2020). New disease resistance loci have been discovered through GWAS by combining 9K SNPs with phenotypic data at the adult-plant stage (Maccaferri et al., 2015). In an association study utilizing 660K SNPs from 411 spring wheat lines grown over multiple years and environments, researchers identified TraesCS2B01G513100, a candidate gene associated with stripe rust resistance, and then developed molecular markers to assist with disease resistance breeding (Wu et al., 2020). Based on 90K SNP chip data, a separate association study of 375 natural wheat varieties from China and abroad identified 26, 22, and 25 significantly associated SNP loci on chromosome arms 2AS, 2BS, and 2DS of homoeologous group 2, respectively. These loci explained 4.61%-11.75% of the observed phenotypic variation (Zhang et al., 2019). Taken together, these studies provide a basis for the cultivation of new stripe rust-resistant wheat varieties and offer guidance for the discovery of new genes.

The mission of the International Center for Agricultural Research in the Dry Areas (ICARDA) is to address challenges caused by harsh and ever-changing environments. Its breeding program emphasizes the importance of varieties with local adaptability and the necessity of plant biodiversity for human survival. In addition, ICARDA preserves thousands of wheat accessions, which makes it possible for us to introduce wheat materials from ICARDA and cultivate new varieties (Lu, 2022). With the continuous development of molecular markers and the emergence of new technologies, novel stripe rust resistance genes are becoming more plentiful and more accessible to wheat breeders (Luo et al., 2010). Researchers have adopted various molecular marker techniques to analyze stripe rust resistance and to discover new resistance loci (Bulli et al., 2016; Liu et al., 2018). To date, there has been little molecular research on agronomic traits, disease resistance, and stress resistance in wheat varieties introduced into Qinghai Province. In this study, we conducted a GWAS using introduced ICARDA wheat varieties and a 55K SNP chip to uncover potentially significant disease resistance loci in both wheat seedlings and adult plants. The results offer a theoretical reference for the stripe rust resistance of ICARDA wheat varieties and for mining new stripe rust resistance genes or QTLs.

2 Materials and methods

2.1 Test materials

159 wheat varieties derived from ICARDA were obtained and designated as ICARDA001-ICARDA159 ( Supplementary Table 1 ). CYR32, CYR33, and CYR34 were provided by the wheat stripe rust breeding laboratory of Northwest A&F University.

2.2 Disease resistance evaluation at the seedling stage

To assess disease resistance at the seedling stage, 10 seeds of each wheat variety were sown in a container and grown under suitable conditions. Once the plants reached the two-leaf stage, they were uniformly sprayed with a 1:100 mixture of Tween and water. Next, each physiological race of stripe rust was mixed with talc in a 1:20 ratio and shaken onto the test varieties. After the wheat leaves were fully diseased, the infection type was rated according to Line and Qayoum’s 9-level reaction scale.

2.3 Disease resistance evaluation at the adult-plant stage

The stripe rust resistance of adult plants was evaluated in 2022 and 2023 in experimental fields in Xining and Guide, Qinghai Province. In 2022, the trials in Xining and Guide were designated E1 and E2, respectively, and were inoculated with a mixture of CYR32 and CYR33. In 2023, the trials in Xining and Guide were designated E3 and E4, respectively, and were inoculated with a mixture of CYR32 and CYR34. Wheat lines were scored by recording the number of rust lesions on each leaf at the height of infection.

2.4 Genotypic analysis

The 159 wheat varieties were genotyped using a 55K SNP chip developed on the Affymetrix Axiom platform and provided by ZhongYuJin Marker (Beijing) Biotechnology Co., Ltd. Affymetrix Axiom Analysis Suite software (Thermo Fisher) was employed for preliminary screening and genotyping. A total of 53,063 loci were obtained from the raw data under the criteria DQC (Dish Quality Control) ≥0.82 and CR (Call Rate) ≥0.95 ( https://datadryad.org/stash/share/aphTKrewmqNQnQgdj7vgUlw7I8jPMCT0GYiCyDNneSw ). VCFtools software (Danecek et al., 2011) was then used to remove sites with a missing rate >20% or a minor allele frequency (MAF) <0.05, leaving 24,151 high-quality SNP sites. The marker density map of these high-quality SNP loci was plotted using the CMplot package in R.
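
The two site-level thresholds described above can be illustrated with a minimal sketch. It does not call VCFtools; it only re-implements the stated cut-offs (missing rate at most 20%, MAF at least 0.05) in plain Python. The 0/1/2 genotype coding, the function names, and the toy data are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Genotypes coded as the count of the alternate allele (0, 1, 2); None marks a missing call.
Genotypes = List[Optional[int]]

def missing_rate(calls: Genotypes) -> float:
    """Fraction of samples with no genotype call at this site."""
    return sum(c is None for c in calls) / len(calls)

def minor_allele_frequency(calls: Genotypes) -> float:
    """MAF of a biallelic site, computed from the non-missing calls only."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)

def filter_sites(sites: Dict[str, Genotypes],
                 max_missing: float = 0.20,
                 min_maf: float = 0.05) -> List[str]:
    """Keep sites whose missing rate is <= max_missing and whose MAF is >= min_maf."""
    return [
        name for name, calls in sites.items()
        if missing_rate(calls) <= max_missing
        and minor_allele_frequency(calls) >= min_maf
    ]

# Hypothetical toy data: one well-behaved site, one with 30% missing calls, one monomorphic.
toy = {
    "snp_ok": [0, 1, 2, 2, 0, 1, 0, 2, 1, 0],
    "snp_missing": [0, None, None, None, 2, 1, 0, 2, 1, 0],
    "snp_rare": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
print(filter_sites(toy))  # ['snp_ok']
```

In the toy data only `snp_ok` passes both filters, which mirrors how the 53,063 raw loci would be reduced to a smaller high-quality set.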

2.5 Genetic diversity and population structure

Genetic diversity was calculated with PowerMarker V3.25 software (Liu and Muse, 2005) using the 24,151 high-quality SNP sites. The polymorphic information content (PIC) was calculated as $PIC = 1 - \sum_{j} P_{ij}^{2}$, where $P_{ij}$ is the frequency of the $j$th allele at the $i$th locus. PLINK 1.9 software (Purcell et al., 2007) was used to perform LD filtering on the SNP sites, resulting in 2,821 markers. Admixture software (Alexander et al., 2009) was used to analyze the population structure. The number of subgroups K was tested over the range 2-15 with six repetitions per K value. The average cross-validation (CV) value was calculated for each K value, and the K value with the lowest average CV value was selected as the optimal number of subgroups. Finally, a subgroup map was drawn using R software.
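
As a worked illustration of the PIC expression given above, the following sketch evaluates 1 - Σ P_ij² for a single locus from a list of observed alleles. It implements only the formula as written in this section and is not a reproduction of PowerMarker's internals; the function name and the toy allele list are hypothetical.

```python
from collections import Counter
from typing import Sequence

def pic_simplified(alleles: Sequence[str]) -> float:
    """Return 1 - sum_j p_j^2 for one locus, where p_j is the frequency of allele j."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# A biallelic SNP with allele frequencies 0.75 (A) and 0.25 (G):
# 1 - (0.75^2 + 0.25^2) = 1 - 0.625 = 0.375
print(round(pic_simplified(["A"] * 6 + ["G"] * 2), 4))  # 0.375
```

Under this expression a biallelic SNP cannot exceed 0.5 (reached when both alleles have frequency 0.5), which gives a scale for reading the chromosome-level values reported in the results.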

2.6 GWAS of stripe rust resistance

A GWAS was conducted on the 159 wheat varieties using 24,151 high-quality SNPs, phenotypic stripe rust resistance data collected at both the seedling and adult plant stages, a kinship matrix (K-matrix) generated by Tassel 5.0 software ( Bradbury et al., 2007 ) and a Q-matrix created by Admixture.

3 Results and analysis

3.1 Stripe rust resistance identification at seedling and adult-plant stages

During disease resistance screening of wheat seedlings, 21 (13.2%) varieties showed immunity to CYR32 and 125 (78.6%) showed high resistance, while the remaining 13 (8.2%) lines were susceptible to stripe rust race CYR32: IR002, IR017, IR037, IR053, IR055, IR061, IR073, IR077, IR078, IR125, IR128, IR135, and IR149. For CYR33, 18 (11.3%) varieties showed immunity, 128 (80.5%) showed high resistance, 1 (0.6%, IR133) showed moderate resistance, and 12 (7.5%) were susceptible (the 13 materials listed above, except IR135). For CYR34, 11 (6.9%) materials showed immunity, 68 (42.8%) showed high resistance, 1 (0.6%, IR107) showed moderate resistance, and 79 (49.7%, including the 13 materials listed above) were susceptible, most of them highly susceptible (Supplementary Table 2; Figure 1). When challenged with the prevalent race CYR34, about half of the wheat varieties lost resistance at the seedling stage.

Figure 1 Venn plot of the resistance distribution of wheat materials to the CYR races. The numbers in the figure represent the number of wheat materials resistant to each CYR race, and the overlaps represent the number of wheat materials resistant to two or three CYR races.

In the adult-plant evaluation in Xining, 6 materials (3.8%; IR017, IR055, IR125, IR128, IR135, and IR149) were susceptible in the E1 environment, 111 (69.8%) showed high resistance, and 42 (26.4%) showed moderate resistance. In the E3 environment, 74 (46.5%) materials were susceptible, 68 of which had been resistant in 2022, while 72 (45.3%) showed high resistance and 13 (8.2%) showed moderate resistance. Comparing E1 with E3, 50 materials (31.4%) showed enhanced resistance in E3, 12 (7.5%) showed unchanged resistance, and 23 (14.5%) showed reduced resistance (Supplementary Table 3).

In the adult-plant evaluation in Guide, 6 (3.8%) materials were susceptible in the E2 environment (the same materials as in E1), 113 (71.1%) showed high resistance, and 40 (25.2%) showed moderate resistance. In the E4 environment, 74 (46.5%, as in E3) materials were susceptible, 68 of which had been resistant in 2022, while 66 (41.5%) showed high resistance and 19 (11.9%) showed moderate resistance. Comparing E2 with E4, 49 materials (30.8%) showed enhanced resistance in E4, 12 (7.5%) showed unchanged resistance, and 24 (15.1%) showed reduced resistance (Supplementary Table 3).

3.2 Genotype analysis

The SNP density map illustrated a non-uniform distribution of the 24,151 high-quality SNP loci across the three genomes and 21 chromosomes. Some segments carried no SNP markers, and the density was almost entirely concentrated at 5-15 loci/Mb. The SNP density was highest on chromosome 3D at 1.7424/Mb, followed by chromosome 4D at 1.4263/Mb, and lowest on chromosome 6A at 0.3982/Mb. The number of SNPs was highest on chromosome 4B (1,675), followed by chromosome 5B (1,652), and lowest on chromosome 4D (357). There were 9,338 (38.7%), 9,865 (40.8%), and 4,948 (20.5%) polymorphisms identified in genomes A, B, and D, respectively. The overall polymorphism ratio was 45.5% (24,151/53,063), with the highest occurrence in genome B, followed by A and then D. Chromosome 2A had the highest PIC value (0.3526), followed by chromosome 4A (0.3467), and chromosome 3B had the lowest (0.2433). At the genome level, the PIC value was highest for genome A at 0.3264, followed by genome B at 0.2999 and genome D at 0.2937. The average PIC value across all chromosomes was 0.3067 (Table 1; Figure 2).
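
The genome shares and the overall polymorphism ratio quoted above follow directly from the reported counts. The short check below recomputes them; the variable names are illustrative only, and the inputs are the numbers stated in this section.

```python
# Counts reported in the text: polymorphic SNPs per genome and raw loci from the chip.
genome_counts = {"A": 9338, "B": 9865, "D": 4948}
raw_loci = 53063

total = sum(genome_counts.values())
# Share of the high-quality SNPs contributed by each genome, in percent.
shares = {g: round(100 * n / total, 1) for g, n in genome_counts.items()}
# Overall polymorphism ratio: high-quality SNPs over raw loci, in percent.
overall = round(100 * total / raw_loci, 1)

print(total)    # 24151
print(shares)   # {'A': 38.7, 'B': 40.8, 'D': 20.5}
print(overall)  # 45.5
```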

Table 1 SNP markers and polymorphisms of 159 wheat materials.

Figure 2 SNP density map. The map shows the number of SNPs per 1 Mb on each chromosome; different colors indicate different SNP densities.

3.3 Genetic diversity and population structure

The analysis of genetic diversity (GD) revealed an average GD of 0.3744 among the 159 wheat lines, with values ranging from 0.2843 to 0.4387. Genome A showed the highest GD, followed by B and D. Among all chromosomes, chromosome 2A had the highest GD value (0.4387), followed by chromosome 4A (0.4385), and chromosome 3B had the lowest (0.2843).

The population structure analysis of the 159 wheat materials showed that the CV value was lowest at K=11, suggesting that the materials could be divided into 11 subgroups. The Q-matrix generated at K=11 was used as the covariate in the subsequent association study. The subgroups were composed as follows: subgroup I with 7 (4.4%) varieties; subgroup II with 21 (13.2%) varieties; subgroup III with 16 (10.1%) varieties; subgroup IV with 11 (6.9%) varieties; subgroup V with 12 (7.5%) varieties; subgroup VI with 18 (11.3%) varieties; subgroup VII with 15 (9.4%) varieties; subgroup VIII with 22 (13.8%) varieties; subgroup IX with 4 (2.5%) varieties; subgroup X with 10 (6.3%) varieties; and subgroup XI with 23 (14.5%) varieties (Figure 3).
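
The rule used to choose the number of subgroups (the K with the lowest average CV value across repeated runs) can be sketched as follows. This is a minimal illustration rather than the Admixture workflow itself: the function name `best_k` and the CV values are hypothetical and are not the study's data.

```python
from statistics import mean
from typing import Dict, List

def best_k(cv_errors: Dict[int, List[float]]) -> int:
    """Return the K whose mean cross-validation error over replicate runs is lowest."""
    return min(cv_errors, key=lambda k: mean(cv_errors[k]))

# Hypothetical CV errors for three replicate runs per K (illustrative values only).
toy_cv = {
    2: [0.62, 0.63, 0.61],
    3: [0.58, 0.57, 0.59],
    4: [0.60, 0.61, 0.60],
}
print(best_k(toy_cv))  # 3
```

Averaging over replicates before taking the minimum reduces the chance that a single favorable run decides the subgroup number.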

Figure 3 Population structure. (A) CV value for each number of subgroups. The abscissa is the K value and the ordinate is the corresponding CV value; the CV value is smallest at K=11, so 11 is the optimal number of subgroups. (B) Composition of each subgroup. Each bar represents one wheat material, and the colors and their proportions indicate which subgroups the material belongs to and the proportion of ancestry from each.

3.4 Association study

The association study of the ICARDA wheat varieties using 24,151 high-quality SNP loci and phenotypic data identified 593 loci associated with stripe rust resistance, with contribution rates ranging from 11.1% to 18.7%. At the seedling stage, 71 loci were related to stripe rust race CYR32. These loci were distributed across chromosomes 1A, 1D, 2A, 2B, 2D, 3A, 3B, 4A, 5A, 5B, 6B, 6D, 7B, and 7D, with contribution rates ranging from 12.7% to 17.2%. With the exception of AX-109949596, AX-108783340, and AX-111778082, these loci were also involved in resistance against CYR33, with contribution rates of 12.6%-17.2%, suggesting that they were multi-effect loci. An additional 518 loci were related to CYR33 and were distributed across all chromosomes, with contribution rates of 12.6%-18.7%. In the association study of adult plants, we identified one stable locus, AX-109318462, on chromosome 2B, with contribution rates of 14.4%-17.0% in E1 and E2 (Figures 4-6; Supplementary Table 4). An additional three loci on chromosomes 2B, 4A, and 7B were detected in E2, with contribution rates ranging from 11.1% to 16.9%. No significant loci were detected in E3 and E4. The QQ plots indicated that some loci were significantly associated with resistance to CYR32 and CYR33 and with resistance in the E1 and E2 environments. The observed P values deviated from the expected P values in the QQ plots, possibly because of errors in the manual scoring of phenotypes at the seedling and adult-plant stages. This may have had some impact on the results, but overall the QQ plots were consistent with the Manhattan plots.

Figure 4 Association study of disease resistance at the seedling stage. The Manhattan plots (left) show the degree of association between SNP loci and the trait on each chromosome; the stronger the association, the higher the point. The QQ plots (right) show the agreement between the observed P values (ordinate) and the expected P values (abscissa). Significant loci lie above the dashed line and non-significant loci lie below it. (A, B) Manhattan plot and QQ plot for CYR32; loci significantly related to CYR32 are present on most chromosomes. (C, D) Manhattan plot and QQ plot for CYR33; loci significantly related to CYR33 are present on all chromosomes. (E, F) Manhattan plot and QQ plot for CYR34; no loci significantly related to CYR34 were detected in the tested materials.

Figure 5 Association study of disease resistance at the adult-plant stage in Xining. (A, B) Manhattan plot and QQ plot for E1; only one chromosome carries a significantly associated locus. (C, D) Manhattan plot and QQ plot for E3; no significantly associated loci were detected in the tested materials.

Figure 6 Association study of disease resistance at the adult-plant stage in Guide. (A, B) Manhattan plot and QQ plot for E2, with a total of 3 significant loci distributed on 3 chromosomes. (C, D) Manhattan plot and QQ plot for E4; no significantly associated loci were detected in the tested materials.

4 Discussion

With global warming and extreme weather, changes in temperature and precipitation have increasingly aggravated the damage caused by stripe rust. A recent stripe rust pandemic was reported in 2017, and experts attributed it to that year’s extremely warm winter, heavy spring rains, and a lack of resistance diversity in the main wheat breeding areas (Zhang, 2022). In 2020, China included wheat stripe rust in its list of first-class crop diseases and pests, and effective prevention and control is also a key research topic worldwide. Plant protection experts have been updating and improving management systems and technologies for many years, but because stripe rust races continue to evolve and the global climate continues to change abnormally, research on stripe rust and disease-resistant varieties must be ongoing. At present, breeding disease-resistant varieties is the fundamental and most durable means of stripe rust control. However, stripe rust co-evolves with its host, new pathogenic physiological races continue to appear, and most wheat varieties in China become susceptible after 3-5 years of large-scale application, resulting in disease pandemics (Chen et al., 2013). Owing to its special geographical location, Qinghai Province is one of the main summer epidemic areas of wheat stripe rust in China. Because of the climatic conditions, spring wheat is the main wheat crop, and large quantities of stripe rust inoculum can be found on late spring wheat varieties over the summer (Zhang and Hou, 2022). At the same time, wheat is the main food crop in Qinghai Province. Disease resistance identification and a genome-wide association study were therefore carried out on the wheat materials to obtain new disease resistance genes, and the materials were introduced into Qinghai Province to enrich its wheat germplasm resources and lay a foundation for further wheat breeding research.

4.1 Disease resistance identification

In the screenings of the 159 wheat varieties at the seedling stage and during the two-year, two-environment adult-plant field trial, we identified 146 (91.8%) lines that were resistant to both CYR32 and CYR33, while only 80 (50.3%) were resistant to CYR34. These results are consistent with previous studies and suggest that most of the introduced wheat varieties have good resistance to CYR32 and CYR33. The varieties may therefore contain genes, gene combinations, or even new genes that effectively resist CYR32 and CYR33 and could help prevent future epidemics of these two races, whereas the current materials do not have sufficient resistance to the new virulent race CYR34 (Zhang et al., 2022). This may be attributed to the early application of CYR32 and CYR33 promoting resistance in wheat (Zeng et al., 2015). Yao et al. (2014) conducted a preliminary study on the overwintering conditions of wheat stripe rust in the eastern wheat area of Qinghai and showed that the amount of pathogen present before winter was the primary factor affecting overwintering. Therefore, the use of resistance that is effective throughout the whole growth period is of great significance for the prevention and control of stripe rust in Qinghai wheat. In future work, these 159 materials can be subjected to molecular testing to identify the disease resistance genes they may contain, providing a basis for the development of new varieties.

4.2 Association study of disease resistance

The PIC of the 159 materials ranged from 0.243 to 0.353, with an average of 0.307. This indicates that the SNP markers were moderately polymorphic, which is consistent with previous studies (Kumar et al., 2020; Liu YK. et al., 2020). The genetic diversity ranged from 0.284 to 0.439, with an average of 0.374, suggesting a relatively narrow genetic background among the varieties, in line with previous research (Cao et al., 2015; Li et al., 2019). Analysis of the 55K SNP chip data from the 159 ICARDA wheat varieties showed that genome B contained the most polymorphisms and genome D the fewest, in agreement with Zhang et al. (2016) and Tehseen et al. (2021). A total of 593 significant loci were detected, 68 of which were significantly associated with resistance against both CYR32 and CYR33. These loci are distributed across chromosomes 1A, 1D, 2A, 2B, 2D, 3A, 3B, 4A, 5A, 5B, 6B, 6D, 7B, and 7D. The analysis of stripe rust resistance at the adult-plant stage identified AX-109318462 on chromosome 2B as the sole locus associated with resistance in both E1 and E2. With reference to IWGSC_RefSeq_v1.0, loci within a 5 Mb interval on the physical map were treated as one significant locus. The resulting 69 loci (the 68 multi-effect loci plus the stable adult-plant locus) were compared with the QTL genetic map of wheat stripe rust resistance constructed by Cheng et al. (2019) and the comparison diagram by Yao (2022). Nine of these 69 loci may be newly discovered: AX-110974432 (3A), AX-110447030 (3A), AX-108736767 (3A), AX-89776892 (3A), AX-110653920 (5B), AX-109997800 (5B), AX-95252437 (6D), AX-86163952 (7B), and AX-110962394 (7B), where the chromosome carrying each locus is given in parentheses. The remaining loci are the same as or similar to reported QTL or Yr gene positions. The two loci on chromosome 5B partially coincided with Qyr.pd-5B.1, between the markers Xbarc275 and XIWA2095, and were very close to reported QTLs related to stripe rust disease type and infection type, respectively, so genes related to disease type, infection type, or both may lie near these two loci. The two loci on chromosome 7B were close to the Yr63 gene. Yr63 is located in a hotspot for resistance to pests and diseases that is enriched in genes encoding nucleotide-binding leucine-rich repeat (NLR) and kinase domains (Mackenzie et al., 2023), so new defense genes may be discovered near this site.
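
The step of treating SNPs within a 5 Mb physical interval as one locus can be sketched as below. The exact grouping rule is not spelled out in the text, so this sketch assumes that a SNP joins the current group when it lies on the same chromosome and within 5 Mb of that group's first SNP; the function name, marker names, and positions are hypothetical.

```python
from typing import List, Optional, Tuple

Snp = Tuple[str, str, int]  # (marker id, chromosome, physical position in bp)

def merge_loci(snps: List[Snp], window_bp: int = 5_000_000) -> List[List[str]]:
    """Group SNPs on the same chromosome lying within window_bp of a group's first SNP."""
    groups: List[List[str]] = []
    current_chrom: Optional[str] = None
    start = 0
    # Sort by chromosome, then position, so neighbours are visited in physical order.
    for marker, chrom, pos in sorted(snps, key=lambda s: (s[1], s[2])):
        if chrom == current_chrom and pos - start <= window_bp:
            groups[-1].append(marker)      # same locus as the previous SNP(s)
        else:
            groups.append([marker])        # start a new locus
            current_chrom, start = chrom, pos
    return groups

# Hypothetical markers and positions, for illustration only.
toy_hits = [
    ("m1", "2B", 1_000_000),
    ("m2", "2B", 4_500_000),
    ("m3", "2B", 12_000_000),
    ("m4", "7B", 2_000_000),
]
print(merge_loci(toy_hits))  # [['m1', 'm2'], ['m3'], ['m4']]
```

Anchoring the window at the first SNP of each group keeps any merged locus at most 5 Mb wide; chaining on consecutive gaps instead would be another reasonable reading and could produce wider loci.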

5 Conclusion

In this study, 159 ICARDA wheat varieties were tested for stripe rust resistance. The majority of these lines were resistant to races CYR32 and CYR33, suggesting their potential for widespread cultivation in Qinghai Province. These findings further enrich the repository of disease-resistance genes available for local wheat breeding in Qinghai Province. The association study, using 55K SNP chip data and phenotypic disease resistance data from both the seedling and adult-plant stages, uncovered a total of 593 loci, 589 of which were related to CYR32 and CYR33 and distributed across 21 chromosomes. Four loci were specifically related to the adult-plant stage and were distributed on chromosomes 2B, 4A, and 7B. Additionally, we identified 68 multi-effect loci associated with disease resistance at the seedling stage, and only one stable locus related to disease resistance at the adult-plant stage. After analysis and comparison, 60 of the 69 loci were the same as or close to reported QTL or Yr gene positions and are likely to be the same loci, although the specific relationships need further study. The other 9 loci were relatively far from reported positions, which makes it unlikely that they are known loci; they may be newly discovered in this study and require further fine mapping. This study lays a foundation for preventing wheat stripe rust in China by serving as a theoretical reference for the utilization of novel resistance loci and by providing guidance for the discovery of new resistance genes.

Data availability statement

The data presented in the study are deposited in the Dryad Digital Repository, DOI number: 10.5061/dryad.4xgxd25ht , and the dataset has already been published.

Author contributions

ZG: Data curation, Formal analysis, Investigation, Software, Writing – original draft. XW: Data curation, Formal analysis, Investigation, Software, Writing – original draft. YL: Formal analysis, Investigation, Software, Writing – original draft. WH: Investigation, Supervision, Writing – review & editing. XZ: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing, Data curation, Investigation.

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was funded by the National Natural Science Foundation of China (project number 32060612), which was applied for by the corresponding author, Zhang.

Acknowledgments

We would like to thank Professor Han Dejun of Northwest A&F University for providing the experimental materials for this study; Prof. Hou and Prof. Zhang for project management, funding, and supervision of the research and of the writing of the article; and Senior Sister Xie, Senior Sister Li, and Senior Sister Wang, who helped complete this paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2024.1377253/full#supplementary-material

Alexander, D. H., Novembre, J., Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664. doi: 10.1101/gr.094052.109

Bradbury, P. J., Zhang, Z., Kroon, D. E., Casstevens, T. M., Ramdoss, Y., Buckler, E. S. (2007). TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635. doi: 10.1093/bioinformatics/btm308

Breseghello, F., Sorrells, M. E. (2006). Association mapping of kernel size and milling quality in wheat (Triticum aestivum L.) cultivars. Genetics 172, 1165–1177. doi: 10.1534/genetics.105.044586

Bulli, P., Zhang, J., Chao, S., Chen, X., Pumphrey, M. (2016). Genetic architecture of resistance to stripe rust in a global winter wheat germplasm collection. G3 (Bethesda) 6, 2237–2253. doi: 10.1534/g3.116.028407

Cao, Y. J., Xie, J. Z., Wu, Q. H., Chen, Y. X., Wang, Z. Z., Zhao, H. (2015). Genetic diversity of registered wheat varieties in Henan province based on pedigree and single-nucleotide polymorphism. Acta Agronomica Sin. 41, 197–206. doi: 10.3724/SP.J.1006.2015.00197

Chen, W. Q., Kang, Z. S., Ma, Z. H., Xu, S. C., Jin, S. L., Jiang, Y. Y. (2013). Integrated Management of Wheat Stripe Rust Caused by Puccinia striiformis f. sp. tritici in China. Scientia Agricultura Sin. 46, 4254–4262. doi: 10.3864/j.issn.0578-1752.2013.20.008

Chen, X. M. (2005). Epidemiology and control of stripe rust Puccinia striiformis f. sp. tritici on wheat. Can. J. Plant Pathol. 27, 314–337. doi: 10.1080/07060660509507230

Cheng, Y. K., Yao, F. J., Ye, X. L., Jiang, Q. T., Li, W., Deng, M., et al. (2019). Construction of linkage map of the meta quantitative trait loci (MQTL) on stripe rust resistance in wheat (Triticum aestivum L.). Acta Phytopathologica Sin. 49, 632–649. doi: 10.13926/j.cnki.apps.000292

Dai, M. F., Mu, J. M., Wang, X. T., Wang, Q. L., Yu, S. Z., Huang, S., et al. (2019). Screening of stripe rust resistance molecular detection of yr genes of wheat germplasms from ICARDA. J. Triticeae Crops 39, 934–940.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156–2158. doi: 10.1093/bioinformatics/btr330

Dong, N., Hu, H. Y., Hu, T. Z., Li, G., Li, X. J., Chen, X. D., et al. (2019). Molecular detection and distribution of stripe rust resistance genes yr5, yr10 and yr18 among 384 wheat germplasms. Acta Agriculturae Boreali-occidentalis Sin. 28, 1960–1968.

Feng, J., Yao, F., Wang, M., See, D. R., Chen, X. (2023). Molecular mapping of yr85 and comparison with other genes for resistance to stripe rust on wheat chromosome 1B. Plant Dis. 107, 3585–3591. doi: 10.1094/PDIS-11-22-2600-RE

Flint-Garcia, S. A., Thuillet, A. C., Yu, J., Pressoir, G., Romero, S. M., Mitchell, S. E., et al. (2005). Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J. 44, 1054–1064. doi: 10.1111/j.1365-313X.2005.02591.x

Huang, J., Jia, Q. Z., Zhang, B., Sun, Z. Y., Huang, M. M., Jin, S. L., et al. (2018). Epidemic forecasting of new strains G22-9 (CYR34) and G22-14 of Puccinia striiformis f. sp. tritici in wheat in Gansu Province. J. Plant Prot. 45, 101–108. doi: 10.13802/j.cnki.zwbhxb.2018.2018910

Juliana, P., Singh, R. P., Singh, P. K., Crossa, J., Huerta-Espino, J., Lan, C., et al. (2017). Genomic and pedigree-based prediction for leaf, stem, and stripe rust resistance in wheat. Theor. Appl. Genet. 130, 1415–1430. doi: 10.1007/s00122-017-2897-1

Klymiuk, V., Chawla, H. S., Wiebe, K., Ens, J., Fatiukha, A., Govta, L., et al. (2022). Discovery of stripe rust resistance with incomplete dominance in wild emmer wheat using bulked segregant analysis sequencing. Commun. Biol. 5, 826. doi: 10.1038/s42003-022-03773-3

Kumar, D., Chhokar, V., Sheoran, S., Singh, R., Sharma, P., Jaiswal, S., et al. (2020). Characterization of genetic diversity and population structure in wheat using array based SNP markers. Mol. Biol. Rep. 47, 293–306. doi: 10.1007/s11033-019-05132-8

Li, J., Dundas, I., Dong, C., Li, G., Trethowan, R., Yang, Z., et al. (2020). Identification and characterization of a new stripe rust resistance gene Yr83 on rye chromosome 6R in wheat. Theor. Appl. Genet. 133, 1095–1107. doi: 10.1007/s00122-020-03534-y

Li, S. S., Yi, T. F., Xu, K., Zhang, S. H., Zhao, Y., Yang, X. J. (2019). Genetic diversity of wheat varieties in hebei province based on single-nucleotide polymorphism. Mol. Plant Breed. 17, 6850–6859. doi: 10.13271/j.mpb.017.006850

Line, R. F. (2002). Stripe rust of wheat and barley in North America: a retrospective historical review. Annu. Rev. Phytopathol. 40, 75–118. doi: 10.1146/annurev.phyto.40.020102.111645

Liu, J., Feng, B., Xu, Z. B., Fan, X. L., Wang, T. (2017). “Genome-wide association study of wheat plant height traits,” in Summary Collection of the 8th Wheat Genomics and Molecular Breeding Conference. (No. 12 Zhongguancun South Street, Haidian District, Beijing: The Crop Science Society of China), 126.

Liu, K., Muse, S. V. (2005). PowerMarker: an integrated analysis environment for genetic marker analysis. Bioinformatics 21, 2128–2129. doi: 10.1093/bioinformatics/bti282

Liu, L., Wang, M. N., Feng, J. Y., See, D. R., Chao, S. M., Chen, X. M. (2018). Combination of all-stage and high-temperature adult-plant resistance QTL confers high-level, durable resistance to stripe rust in winter wheat cultivar Madsen. Theor. Appl. Genet. 131, 1835–1849. doi: 10.1007/s00122-018-3116-4

Liu, L., Wang, M., Zhang, Z., See, D. R., Chen, X. (2020). Identification of stripe rust resistance loci in U.S. Spring wheat cultivars and breeding lines using genome-wide association mapping and yr gene markers. Plant Dis. 104, 2181–2192. doi: 10.1094/PDIS-11-19-2402-RE

Liu, Y. K., Zhu, Z. W., Chen, L., Zou, J., Tong, H. W., Zhu, G., et al. (2020). Revealing the genetic diversity of wheat varieties (Lines) in China based on SNP markers. Acta Agronomica Sin. 46, 307–314. doi: 10.3724/SP.J.1006.2020.91039

Lu, J. N. (2022). Identification of the drought tolerance at seedling stage and discovery of the excellent germplasm in the ICARDA introduced wheat. (China National Knowledge Infrastructure (CNKI)).

Luo, R., Wu, W. L., Zhang, Y., Li, Y. H. (2010). SSR marker and its application to crop genetics and breeding. Genomics Appl. Biol. 29, 137–143.

Maccaferri, M., Zhang, J., Bulli, P., Abate, Z., Chao, S., Cantu, D., et al. (2015). A genome-wide association study of resistance to stripe rust (Puccinia striiformis f. sp. tritici) in a worldwide collection of hexaploid spring wheat (Triticum aestivum L.). G3 (Bethesda) 5, 449–465. doi: 10.1534/g3.114.014563

Mackenzie, A., Norman, M., Gessese, M., Chen, C., Sørensen, C., Hovmøller, M., et al. (2023). Wheat stripe rust resistance locus YR63 is a hot spot for evolution of defence genes - a pangenome discovery. BMC Plant Biol. 23, 590. doi: 10.1186/s12870-023-04576-2

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575. doi: 10.1086/519795

Tehseen, M. M., Istipliler, D., Kehel, Z., Sansaloni, C. P., da Silva Lopes, M., Kurtulus, E., et al. (2021). Genetic diversity and population structure analysis of Triticum aestivum L. landrace panel from Afghanistan. Genes (Basel) 12 (3), 340. doi: 10.3390/genes1203034

Thornsberry, J. M., Goodman, M. M., Doebley, J., Kresovich, S., Nielsen, D., Buckler, E. S. (2001). Dwarf8 polymorphisms associate with variation in flowering time. Nat. Genet. 28, 286–289. doi: 10.1038/90135

Wang, X., Dong, L., Hu, J., Pang, Y., Hu, L., Xiao, G. (2019). Dissecting genetic loci affecting grain morphological traits to improve grain weight via nested association mapping. Theor. Appl. Genet. 132, 3115–3128. doi: 10.1007/s00122-019-03410-4

Wang, C., Li, X., Gao, Q., Wang, B. T. (2016). The occurrence patterns and control strategies of wheat stripe rust in Ningqiang County. Shaanxi J. Agric. Sci. 62, 88–90.

Wu, J. H., Wei, R., Liu, S. J., Wang, Q. L., Nie, X. J., Song, W. N., et al. (2020). “Genome-wide association study combined with multi-environmental phenotypes for rapid identification of candidate genes for wheat stripe rust resistance,” in Summary Collection of the 19th Academic Annual Conference of the Crop Science Society of China. (No. 12 Zhongguancun South Street, Haidian District, Beijing: The Crop Science Society of China), 191.

Xu, X., Yang, X. C., Xu, R. H. (2020). Molecular detection of resistance to stripe rust in 29 wheat germplasm resources. J. Guiyang Univ. (Natural Sciences) 15, 80–83. doi: 10.16856/j.cnki.52-1142/n.2020.03.018

Yao, F. J. (2022). Identification and genome-wide association studies of stripe rust resistance in wheat germplasm. (China National Knowledge Infrastructure (CNKI)).

Yao, Q., Guo, Q. Y., Yan, J. H., Zhang, G., Hou, S. Y., Chen, W. Q. (2014). Survey on overwintering Puccinia striiformis f. sp. tritici at different altitudes in eastern Qinghai. J. Plant Prot. 41, 578–583. doi: 10.13802/j.cnki.zwbhxb.2014.05.030

Zeng, Q. D., Shen, C., Yuan, F. P., Wang, Q. L., Wu, J. H., Xue, W. B., et al. (2015). The resistance evaluation of the Yr genes to the main prevalent pathotypes of Puccinia striiformis f. sp. tritici in China. Acta Phytopathologica Sin. 45, 641–650. doi: 10.13926/j.cnki.apps.2015.06.011

Zhang, Y. B. (2022). Evaluation and genetic testing of stripe rust resistance of candidate varieties in winter wheat production areas in China. China National Knowledge Infrastructure (CNKI). doi: 10.27409/dcnki.gxbnu.2022.000214

Zhang, X. F., Chen, J. X., Liu, Y. X., Yao, Q. (2022). Stripe rust resistance evaluation and gene detection of 44 wheat varieties (Lines) in Qinghai. J. Northwest. A & F Univ. (Natural Science Edition) 50 (11), 101–109. doi: 10.13207/j.cnki.jnwafu.2022.11.012

Zhang, X. J., Hou, W. W. (2022). Identification and molecular detection of stripe rust resistance of spring wheat in qinghai province. Mol. Plant Breed. 20, 149–154. doi: 10.13271/j.mpb.020.000149

Zhang, Z. J., Ren, Y. M., Ni, Y., Wang, D., Zhang, H. Y., Chen, J. (2019). “Evaluation of wheat resistance to stripe rust and exploration of resistance sites,” in Summary Collection of the 2019 Annual Conference of the Crop Science Society of China. (No. 12 Zhongguancun South Street, Haidian District, Beijing: The Crop Science Society of China), 319.

Zhang, D. Q., Song, X. P., Feng, J., Lian, J. F., Sun, D. J. (2016). Genetic dissection on the derived lines from wheat cultivar Zhou 8425B and widely grown cultivars in Huang-huai region. J. Triticeae Crops 36, 1328–1334.

Zhang, G. S., Zhao, Y. Y., Kang, Z. S., Zhao, J. (2020). First Report of a Puccinia striiformis f. sp. tritici Race Virulent to Wheat Stripe Rust Resistance Gene Yr5 in China. Plant Dis. 104, 284. doi: 10.1094/pdis-05-19-0901-pdn

Zhu, Z., Cao, Q., Han, D., Wu, J., Wu, L., Tong, J., et al. (2023). Molecular characterization and validation of adult-plant stripe rust resistance gene Yr86 in Chinese wheat cultivar Zhongmai 895. Theor. Appl. Genet. 136, 142. doi: 10.1007/s00122-023-04374-2

Keywords: ICARDA, wheat varieties, stripe rust, 55K SNP, genome-wide association study

Citation: Gao Z, Wang X, Li Y, Hou W and Zhang X (2024) Evaluation of stripe rust resistance and genome-wide association study in wheat varieties derived from the International Center for Agricultural Research in the Dry Areas. Front. Plant Sci. 15:1377253. doi: 10.3389/fpls.2024.1377253

Received: 27 January 2024; Accepted: 21 March 2024; Published: 09 April 2024.

Copyright © 2024 Gao, Wang, Li, Hou and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaojuan Zhang, [email protected]

† These authors have contributed equally to this work and share first authorship

Gene Editing Breakthrough: CRISPR-Cas9 Successfully Excises HIV from T-cells in Major Research Leap

The CRISPR-Cas9 genome editing tool has been shown to eliminate HIV from cells cultured in the laboratory, according to preliminary research slated for presentation at this year’s European Congress of Clinical Microbiology and Infectious Diseases (ECCMID) in Barcelona.

The researchers emphasize that these findings are at the “proof-of-concept” stage and remain far from clinical application, but the work represents a significant step in the pursuit of an HIV cure.

The recipients of the 2020 Nobel Prize in Chemistry were recognized for their pioneering work in uncovering the CRISPR-Cas9 genome editing system.

Functioning akin to molecular scissors, this editing tool precisely excises and alters segments of an organism’s genetic sequence. Its applications encompass a range of uses, including the removal of undesirable or defective genes and the insertion of fresh genetic material in their stead.

Progress in CRISPR-based therapies is already underway, with the inaugural treatment receiving approval from the US Food and Drug Administration (FDA) in 2023 for the management of sickle cell disease.

The researchers, whose findings are slated for presentation at ECCMID, harnessed this technology to target HIV, a virus known to infect immune cells such as T cells, macrophages, and dendritic cells.

Currently, HIV infection is manageable with lifelong antiviral therapy, which reduces the viral load to undetectable levels but does not offer a cure. This is primarily attributed to the virus’s ability to integrate its genetic material into the host’s DNA, rendering it challenging to eradicate. Moreover, discontinuation of antiviral treatment can lead to HIV rebounding from reservoirs of other infected cells.

In a press release, the researchers articulated their goal of developing a robust and safe combinatorial CRISPR-Cas regimen aimed at achieving an inclusive “HIV cure for all.” Their objective is to deactivate diverse strains of HIV across various cellular contexts.

The researchers pursued a broad strategy, using CRISPR to target two regions of the HIV genome that are conserved across all identified strains of the virus. However, the vehicle carrying the CRISPR-Cas9 reagents to the infected cells proved too large, so they had to reduce its size to make delivery easier.

Relevant articles:

– CRISPR Gene Editing Eliminates HIV From Infected Cells

– Gene Editing Can Eliminate HIV From Cell Culture, Researchers Claim, Technology Networks, Sat, 23 Mar 2024 07:00:00 GMT

– Preclinical safety and biodistribution of CRISPR targeting SIV in non-human primates, Nature.com, Thu, 17 Aug 2023 07:00:00 GMT

– CRISPR gene therapy appears safe, but claims of an imminent HIV cure are premature, aidsmap, Fri, 03 Nov 2023 07:00:00 GMT

Plant genome evolution in the genus Eucalyptus driven by structural rearrangements that promote sequence divergence

  • Scott Ferguson 1,4,
  • Ashley Jones 1 ,
  • Kevin Murray 2 ,
  • Rose Andrew 3 ,
  • Benjamin Schwessinger 1 and
  • Justin O Borevitz 1
  • 1 The Australian National University;
  • 2 Max Planck Institute for Biology Tübingen;
  • 3 University of New England
  • ↵ * Corresponding author; email: scott.ferguson.papers{at}gmail.com

Genomes have a highly organized architecture (nonrandom organization of functional and nonfunctional genetic elements within chromosomes) that is essential for many biological functions, particularly gene expression and reproduction. Despite the need to conserve genome architecture, a high level of structural variation has been observed within species. As species separate and diverge, genome architecture also diverges, becoming increasingly poorly conserved as divergence time increases. However, within plant genomes, the processes of genome architecture divergence are not well described. Here we use long-read sequencing and de novo assembly of 33 phylogenetically diverse, wild and naturally evolving Eucalyptus species, covering 1–50 million years of diverging genome evolution, to measure genome architectural conservation and describe architectural divergence. The investigation of these genomes revealed that, following lineage divergence, genome architecture is highly fragmented by rearrangements. As genomes continue to diverge, the accumulation of mutations and subsequent divergence beyond recognition of rearrangements becomes the primary driver of genome divergence. The loss of syntenic regions also contributes to genome divergence, but at a slower pace than rearrangements. We hypothesise that duplications and translocations are potentially the greatest contributors to Eucalyptus genome divergence.

  • Received April 19, 2023.
  • Accepted March 22, 2024.
  • Published by Cold Spring Harbor Laboratory Press

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

This Article

  • Published in Advance April 8, 2024, doi: 10.1101/gr.277999.123. Genome Res. 2024. Published by Cold Spring Harbor Laboratory Press
  • Profile for Ferguson, S. http://orcid.org/0000-0002-4821-7490
  • Profile for Jones, A. http://orcid.org/0000-0002-7368-1666
  • Profile for Murray, K. http://orcid.org/0000-0002-2466-1917
  • Profile for Andrew, R. http://orcid.org/0000-0003-0099-8336
  • Profile for Schwessinger, B. http://orcid.org/0000-0002-7194-2922
  • Profile for Borevitz, J. O. http://orcid.org/0000-0001-8408-3699

Copyright © 2024 by Cold Spring Harbor Laboratory Press

  • Print ISSN: 1088-9051
  • Online ISSN: 1549-5469

COMMENTS

  1. Genome Research

    Genome Research is a peer-reviewed journal that publishes original research and reviews on the genome biology of all organisms, including genomic medicine. The journal covers topics such as long-read sequencing, single-cell/nucleus RNA-seq, transcriptional repression, plant genomics, and more.

  2. A Brief Guide to Genomics

    The Human Genome Project, which was led at the National Institutes of Health (NIH) by the National Human Genome Research Institute, produced a very high-quality version of the human genome sequence that is freely available in public databases. That international project was successfully completed in April 2003, under budget and more than two ...

  3. National Human Genome Research Institute Home

    NHGRI is a leader in genomics research and technology development. Learn about its mission, organization, strategic vision, and news on the human genome and its applications.

  4. Introduction to Genomics

    Each genome contains the information needed to build and maintain that organism throughout its life. Your genome is the operating manual containing all the instructions that helped you develop from a single cell into the person you are today. It guides your growth, helps your organs to do their jobs, and repairs itself when it becomes damaged ...

  5. Genomics

    Genomics is the study of the full genetic complement of an organism (the genome). It employs recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyse the ...

  6. The Human Genome Project changed everything

    The joint announcement of the release of the human 'draft' genome sequences occurred 20 years ago, at a ceremony in the White House. The first analyses by two groups, the publicly funded ...

  7. Genome

    The genome is the full genetic complement of an organism, encoded in either DNA or, in many viruses, RNA. It includes both the genes and non-coding sequences. Genetic variation. Interspersed ...

  8. Human Molecular Genetics and Genomics

    In 1987, the New York Times Magazine characterized the Human Genome Project as the "biggest, costliest, most provocative biomedical research project in history." 2 But in the years between the ...

  9. National Human Genome Research Institute (NHGRI)

    NHGRI is a leading authority in genomics, driving research, developing technologies, and studying the impact of genomics on society. Learn about its mission, history, and important events from the Human Genome Project to the present.

  10. Scientists Finish the Human Genome at Last

    In 2019, two scientists — Adam Phillippy, a computational biologist at the National Human Genome Research Institute, and Karen Miga, a geneticist at the University of California, Santa Cruz ...

  11. Genome Research

    Genome Research is a peer-reviewed scientific journal published by Cold Spring Harbor Laboratory Press. Disregarding review journals, Genome Research ranks 2nd in the category 'Genetics and Genomics' after Nature Genetics. The focus of the journal is on research that provides novel insights into the genome biology of all organisms, including ...

  12. Genome

    A genome sequence is the complete list of the nucleotides (A, C, G, and T for DNA genomes) ... Since genomes are very complex, one research strategy is to reduce the number of genes in a genome to the bare minimum and still have the organism in question survive.

  13. Home

    This resource organizes information on genomes including sequences, maps, chromosomes, assemblies, and annotations.

  14. Guide to genomics and genetics

    Learn the basics of genetics and genomics, the study of heredity and the human genome. Explore how genetics and genomics apply to disease research, genetic testing and gene therapy. Find out the future trends and challenges of this field.

  15. About Genomics

    Your genome contains many important clues about you—from your ancestry to the way your body responds to diseases, medications, and aging. ... NHGRI offers access to reliable and timely information about genomics research and the human genome. Our resources and partnerships help spark scientific curiosity, improve genomic literacy, and foster ...

  16. Genomics in medicine: A new era in medicine

    Core Tip: The field of Genomics is the future of medicine, as evidenced by the unprecedented research and clinical application which pushed the time boundaries for the coronavirus disease 2019 mRNA vaccines. However the path to unleashing the potential from genomic tools is far from perfect. A thorough research with international collaboration and cooperation is a necessity and the need of the ...

  17. A roadmap to increase diversity in genomic studies

    Two decades ago, the sequence of the first human genome was published. Since then, advances in genome technologies have resulted in whole-genome sequencing and microarray-based genotyping of ...

  18. Home page

    Genome Biology publishes high-quality research articles on all aspects of genome biology and biomedicine. Browse the latest articles, collections, reviews, and blogs on topics such as epigenetics, transcriptomics, genomics, and more.

  19. The genomic architecture of inherited DNA variants

    This work was supported in part by the U.S. National Human Genome Research Institute /National Heart Lung and Blood Institute grant number UM1HG006542 to the Baylor Hopkins Center for Mendelian ...

  20. Scientist who gene-edited babies is back in lab and 'proud' of past

    In an interview with a Japanese newspaper, He Jiankui said he had resumed research on human embryo genome editing, despite the controversy over the ethics of artificially rewriting genes, ...

  21. The Human Genome Project

    Learn about the Human Genome Project, a scientific feat that generated the first sequence of the human genome and provided fundamental information about human biology and medicine. Explore the project's timeline, fact sheet, video, poster, and commentary from prominent scientists involved in the project.

  22. Can language models read the genome? This one decoded mRNA to make

    The research team used the trained model to create a library of 211 new sequences. Each was optimized for a desired function, primarily an increase in the efficiency of translation. Those proteins, like the spike protein targeted by COVID-19 vaccines, drive the immune response to infectious disease.

  23. The road ahead in genetics and genomics

    However, in recent years, the sequencing cost curve has flattened, as is apparent in publicly reported cost estimates provided by the US National Human Genome Research Institute 13. The cost per ...

  24. Yak genome database: a multi-omics analysis platform

    Background The yak (Bos grunniens) is a large ruminant species that lives in high-altitude regions and exhibits excellent adaptation to the plateau environments. To further understand the genetic characteristics and adaptive mechanisms of yak, we have developed a multi-omics database of yak including genome, transcriptome, proteome, and DNA methylation data. Description The Yak Genome Database ...

  25. Frontiers

    An analysis of the 55K SNP chip data from 159 ICARDA wheat varieties combined with the phenotypic stripe rust resistance data revealed that genome B contained the highest quantity of polymorphisms, while genome D exhibited the least. These findings reflect the works of Zhang (Zhang et al., 2016) and Tehseen (Tehseen et al., 2021). A total of ...

  26. Gene Editing Breakthrough: CRISPR-Cas9 Successfully Excises HIV ...

    The CRISPR-Cas9 genome editing tool has demonstrated the capability to eradicate HIV viruses from cells cultured in laboratory settings, as per preliminary research slated for presentation at this ...

  27. Plant genome evolution in the genus Eucalyptus driven by structural

    Here we use long-read sequencing and de novo assembly of 33 phylogenetically diverse, wild and naturally evolving Eucalyptus species, covering 1-50 million years of diverging genome evolution to measure genome architectural conservation and describe architectural divergence. The investigation of these genomes revealed that following lineage ...