European Conference on Technology Enhanced Learning

EC-TEL 2023: Responsive and Sustainable Educational Futures, pp. 755–761

PEER: Empowering Writing with Large Language Models

  • Kathrin Seßler (ORCID: 0000-0002-3380-4641),
  • Tao Xiang,
  • Lukas Bogenrieder &
  • Enkelejda Kasneci (ORCID: 0000-0003-3146-4484)

Conference paper | Open Access | First Online: 28 August 2023

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14200)

The emerging research area of large language models (LLMs) has far-reaching implications for various aspects of our daily lives. In education, in particular, LLMs hold enormous potential for enabling personalized learning and equal opportunities for all students. In a traditional classroom environment, students often struggle to develop individual writing skills because the workload of the teachers limits their ability to provide detailed feedback on each student’s essay. To bridge this gap, we have developed a tool called PEER (Paper Evaluation and Empowerment Resource) which exploits the power of LLMs and provides students with comprehensive and engaging feedback on their essays. Our goal is to motivate each student to enhance their writing skills through positive feedback and specific suggestions for improvement. Since its launch in February 2023, PEER has received high levels of interest and demand, resulting in more than 4000 essays uploaded to the platform to date. Moreover, there has been an overwhelming response from teachers who are interested in the project since it has the potential to alleviate their workload by making the task of grading essays less tedious. By collecting a real-world data set incorporating essays of students and feedback from teachers, we will be able to refine and enhance PEER through model fine-tuning in the next steps. Our goal is to leverage LLMs to enhance personalized learning, reduce teacher workload, and ensure that every student has an equal opportunity to excel in writing. The code is available at https://github.com/Kasneci-Lab/AI-assisted-writing .

Keywords: Large Language Models · Personalized Education


1 Introduction

The introduction of transformer-based architectures [13] for natural language processing (NLP) has been a breakthrough that pushed the field significantly forward. It enabled the development of pre-trained large language models (LLMs), which can process natural language more effectively and efficiently than previous approaches [1, 10]. The most recent models, like ChatGPT [8], have been fine-tuned using reinforcement learning with human feedback [9], enhancing their ability to generate human-like conversations and leading to a wide range of novel applications and use cases in various domains, including education [5]. Since LLMs are trained to write high-quality texts, they can assist users in their writing process [15]. More specifically, LLM-based tools can help improve writing skills at every level, from very young writers up to professionals.

During their academic years, students learn to write various types of essays. However, in the traditional classroom setting, teachers are not able to provide detailed feedback on each student’s work due to time constraints and a heavy workload. Moreover, feedback is usually given only once (e.g., in the context of graded homework or assessments) without any further opportunity to revise the writing and receive updated feedback, which impedes a continuous process of improvement.

To tackle this challenge in essay writing education, and hence support both learners and teachers, we have developed an AI-based tutor named PEER (Paper Evaluation and Empowerment Resource). The idea behind PEER is to offer comprehensive textual feedback on the learner’s essay, including specific suggestions for improvement, while always being constructive, specific, and engaging. This stands in contrast to previous work, which often focused on merely grading the essay rather than offering comprehensive feedback [11]. PEER also allows students to make adjustments to their work and receive updated feedback, enabling an ongoing process of improvement. From an educator’s perspective, PEER provides an initial structure and suggestions for constructive and thorough feedback that can serve as a basis for further enhancements by the teacher. Such AI-assisted feedback can save a lot of time and energy, hence reducing the teacher’s workload and leaving more room for interaction with the students. Existing work that uses LLMs to improve writing skills often focuses on general strategies and lacks a concrete focus on the different essay types, with their specific demands and challenges, that are part of the school curriculum [12].

The difficulty of this project lies in the limited availability of students’ essays and the corresponding teacher feedback. For privacy reasons, such data is typically not publicly available in large amounts, making it less likely that large language models have encountered this type of data during their pre-training phase. However, the use of LLMs eliminates the need for the costly hand-crafted features that previously formed the basis of many automated essay scoring systems [4]. PEER explores how the capabilities of LLMs can be leveraged to assist students and teachers in personalizing essay writing education, which involves teaching the model to provide reasonable and helpful feedback tailored to different types of essays.

We argue that by interacting with PEER, a learner can gain the necessary skills to comprehend the essential elements of good essay writing, and thereby enhance their own writing abilities. In order to evaluate the efficacy and implications of PEER, we have initiated collaborations with various German educators, schools, and academic institutions who expressed interest in our endeavor.

While the current version of PEER has been designed specifically to hone writing abilities in the German language and for the German educational system, we envision expanding its applicability to encompass other pedagogical frameworks and languages.

2 PEER

PEER is a user-friendly, web-based tool designed to analyze students’ essays and generate comprehensive feedback that includes concrete suggestions and engaging tips to improve writing.

2.1 Pedagogical Background

Providing effective feedback is crucial for enhancing students’ learning experience [3], as it can significantly impact their performance and motivation [7]. However, traditional classroom settings often do not allow teachers to provide comprehensive and engaging feedback to each student, hindering the learning process. To address these challenges, PEER adopts two main approaches:

Thorough and Constructive Feedback : PEER provides detailed and constructive feedback that is always positive, engaging, and helps to motivate students. The feedback highlights the strengths of the students’ work and provides suggestions for improvement. PEER also uses a visual indicator (green instead of red color) to represent feedback that is encouraging and supportive.

Continuous Learning Environment : PEER follows a continuous learning principle that allows students to receive feedback as many times as they require without any fear of being discouraged or judged. This approach fosters a safe and barrier-free learning environment where students can learn at their own pace, and feedback is always readily available.

2.2 Technical Background

Our approach aims to provide students with comprehensive and constructive feedback on their essays. In the absence of a task-specific data set, we achieve this by combining zero-shot learning (inference without task-specific training, guided only by the semantic information in the instructions) with Elo ratings (a numerical system for comparing relative performance) over different prompts. By collecting user feedback on the quality of the output, we can continuously improve our system.

Zero-Shot Learning. Traditionally, machine learning models have been trained to perform specific tasks. One significant advantage of LLMs is their versatility [10], including zero-shot learning, i.e., the ability to follow textual instructions [1]. Finding optimal instructions has evolved into its own field of research [6, 14], and several heuristics can aid in crafting suitable prompts. By providing the right instructions, LLMs can be guided to generate feedback that closely resembles that of a teacher. For example, a possible prompt can look like the following (translated from German into English):

[Figure a: the example instructional prompt from the paper.]
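Since the paper’s exact prompt is not reproduced above, the following is only a minimal illustrative sketch of how such an instructional prompt might be assembled around a student’s essay; the wording, metadata fields, and function name are assumptions, not the authors’ actual prompt.

```python
# Illustrative only: a hypothetical instructional prompt of the kind PEER wraps
# around a student's essay. Wording and fields are assumptions, not the paper's prompt.
ESSAY_TYPE = "argumentation"   # hypothetical metadata from the start page
SCHOOL_YEAR = 8                # hypothetical

def build_prompt(essay_text: str) -> str:
    return (
        f"You are a friendly teacher. A student in year {SCHOOL_YEAR} has written an "
        f"{ESSAY_TYPE} essay. Give constructive, encouraging feedback: first name the "
        "strengths of the essay, then give concrete suggestions for improvement.\n\n"
        "Essay:\n" + essay_text
    )

print(build_prompt("Social media should be banned for children because ..."))
```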

Prompt Elo Rating. Our approach involves systematically identifying the optimal model instructions from a set of prompts through user feedback, utilizing the Elo rating system [2]. When a user requests feedback on an essay, two different responses generated from distinct instructional prompts are presented. Based on the user’s preference, the ratings for the corresponding instructional prompts are updated using the Elo system. By incorporating human feedback into the process, our system continuously improves, guided by the success of reinforcement learning with human feedback deployed to fine-tune ChatGPT [9].
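As a concrete illustration, a standard Elo update [2] after a pairwise preference might look like the sketch below; the K-factor and the starting ratings are illustrative assumptions, since the paper does not report the values it uses.

```python
# A minimal sketch of a standard Elo update applied to two competing prompts.
# The K-factor and starting ratings are illustrative assumptions.
K = 32

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected probability that prompt A's feedback is preferred over prompt B's."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_preferred: bool) -> tuple[float, float]:
    """Update both ratings after the user marks one of the two feedback texts as better."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_preferred else 0.0
    return rating_a + K * (s_a - e_a), rating_b + K * ((1.0 - s_a) - (1.0 - e_a))

# Example: the user prefers the feedback generated from prompt A.
print(update(1500.0, 1500.0, a_preferred=True))  # -> (1516.0, 1484.0)
```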

Weighted Lottery System. When generating new feedback, the two instructional prompts that wrap the user’s essay are selected using a weighted lottery system based on their respective Elo ratings. This ensures that prompts with higher ratings have a greater chance of being chosen from the set, resulting in feedback that is more likely to be of superior quality. At the same time, the use of a random selection process ensures that all prompts are evaluated.
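One way to realize such a weighted draw is sketched below; turning Elo ratings into selection weights with an exponential on the Elo scale is an assumption, as the paper does not specify its weighting scheme.

```python
import math
import random

def pick_two_prompts(ratings: dict[str, float]) -> tuple[str, str]:
    """Draw two distinct prompts; higher-rated prompts are more likely to be chosen.

    The exponential weighting on the Elo scale (divisor 400) is an illustrative
    assumption, not the scheme described in the paper.
    """
    names = list(ratings)
    weights = [math.exp(ratings[n] / 400) for n in names]
    first = random.choices(names, weights=weights, k=1)[0]
    rest = [n for n in names if n != first]
    rest_weights = [math.exp(ratings[n] / 400) for n in rest]
    second = random.choices(rest, weights=rest_weights, k=1)[0]
    return first, second

print(pick_two_prompts({"friendly_teacher": 1530.0, "strict_grader": 1470.0, "writing_coach": 1500.0}))
```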

3 Prototype

To make PEER available for all students, we have developed a website and are currently working on an accompanying application, in order to further reduce the barriers to access and ensure widespread availability (Fig.  1 ).

Fig. 1. (a) The start page: the topic of the essay and the relevant metadata are entered. (b) The feedback page: the user is provided with two feedback texts and can mark the preferred one to improve our model.

In the first step, users input the topic or title of their essay, along with relevant meta information such as the essay type, school year, and school type. Then, they can choose to insert the text manually or upload an image, which is scanned using OCR and post-processed by GPT-3 [1] to remove any artifacts from the image. Next, PEER evaluates the input and generates two feedback texts. The users are then encouraged to indicate which feedback they find more useful. To facilitate continuous learning and improvement, the users can modify their essay according to the feedback and request new feedback. This process can be repeated as many times as necessary to enhance the users’ writing skills.

4 Preliminary Results

Over 4000 essays have already been uploaded for evaluation, with argumentation being the most frequently requested category. The platform was primarily used by students from middle and upper levels.

Prompt Evaluation. Based on the Elo scores, prompting the model to approach the task as a friendly teacher and providing it with additional information about the specific essay type leads to the best results. It enables the model to focus on the relevant characteristics and use the extra information to improve and adapt its feedback accordingly.

Feedback from Teachers. The quantitative results are complemented by feedback from teachers who assessed the tool from a qualitative perspective. Several teachers reported to us their experiences of trying PEER themselves as well as applying it together with their students in their classrooms. Overall, they acknowledged PEER’s usefulness for both students and teachers, highlighting its user-friendliness, respectful tone, and timely feedback that facilitates individualized learning. However, they also identified some areas for improvement. For instance, they noted that the feedback provided by PEER can be too general at times, such as suggesting to “use more adjectives.” Additionally, one teacher pointed out that in the German language, both the male and female forms for professions are typically used to be inclusive. Unfortunately, PEER currently does not account for both versions in its output text, and sometimes even marks them as redundant in students’ essays. Other criticisms included missing essay types on the start page and sometimes small grammar errors in the generated feedback texts. This initial qualitative assessment and teacher feedback is incorporated in the further development of PEER before a more comprehensive and larger user study is conducted.

5 Conclusion and Future Agenda

Based on the amount of feedback we have received from teachers so far, it is clear that PEER is meeting a need in schools for both learners and teachers. However, our project is still in its early stages and requires further development, model fine-tuning and user evaluations. As our objective is to bring PEER to schools and establish it as a valuable assistant in the process of learning how to write an essay, our concrete current and next steps are as follows:

Creating a solid data basis consisting of real-world essays and high-quality feedback provided by teachers. This data set will then be used to fine-tune a large language model for our specific domain to improve the performance.

Conducting a user study at schools to assess the performance of PEER and gather valuable insights into the positive and negative aspects of the tool for both teachers and students.

Given the inherent stochastic nature of the underlying model, it is important to acknowledge that a fully error-free outcome cannot be guaranteed. Consequently, and as for all applications based on large language models, users are advised to carefully evaluate the feedback provided and selectively incorporate only the pertinent critiques. This type of critical thinking is not limited to PEER but should be a fundamental aspect of interacting with any generative AI model.

References

1. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
2. Elo, A.E.: The Rating of Chessplayers, Past and Present. Arco Pub., New York (1978)
3. Hattie, J., Timperley, H.: The power of feedback. Rev. Educ. Res. 77(1), 81–112 (2007)
4. Hussein, M.A., Hassan, H., Nassef, M.: Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5, e208 (2019)
5. Kasneci, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Diff. 103, 102274 (2023)
6. Liu, J., Shen, D., Zhang, Y., Dolan, W.B., Carin, L., Chen, W.: What makes good in-context examples for GPT-3? In: Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114 (2022)
7. Molloy, E.K., Boud, D.: Feedback models for learning, teaching and performance. In: Handbook of Research on Educational Communications and Technology, pp. 413–424 (2014)
8. OpenAI Team: ChatGPT: optimizing language models for dialogue (2022)
9. Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)
10. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
11. Ramesh, D., Sanampudi, S.K.: An automated essay scoring systems: a systematic literature review. Artif. Intell. Rev. 55(3), 2495–2527 (2022)
12. Schick, T., et al.: PEER: a collaborative language model. arXiv preprint arXiv:2208.11663 (2022)
13. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
14. Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems (2022)
15. Yuan, A., Coenen, A., Reif, E., Ippolito, D.: Wordcraft: story writing with large language models. In: 27th International Conference on Intelligent User Interfaces, pp. 841–852 (2022)


Author information

Authors and Affiliations

Technical University of Munich, Munich, Germany

Kathrin Seßler, Tao Xiang, Lukas Bogenrieder & Enkelejda Kasneci


Corresponding author

Correspondence to Kathrin Seßler .

Editor information

Editors and Affiliations

KTH Royal Institute of Technology, Stockholm, Sweden

Olga Viberg

Goethe University Frankfurt, Frankfurt am Main, Germany

Ioana Jivet

Universidad Carlos III de Madrid, Madrid, Spain

Pedro J. Muñoz-Merino

University of Macedonia, Thessaloniki, Greece

Maria Perifanou

CODE University of Applied Sciences, Berlin, Germany

Tina Papathoma

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

© 2023 The Author(s)

About this paper

Cite this paper.

Seßler, K., Xiang, T., Bogenrieder, L., Kasneci, E. (2023). PEER: Empowering Writing with Large Language Models. In: Viberg, O., Jivet, I., Muñoz-Merino, P., Perifanou, M., Papathoma, T. (eds) Responsive and Sustainable Educational Futures. EC-TEL 2023. Lecture Notes in Computer Science, vol 14200. Springer, Cham. https://doi.org/10.1007/978-3-031-42682-7_73

Download citation

DOI: https://doi.org/10.1007/978-3-031-42682-7_73

Published: 28 August 2023

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-42681-0

Online ISBN: 978-3-031-42682-7



Better language models and their implications

We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.


Our model, called GPT-2 (a successor to  GPT ), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much  smaller model  for researchers to experiment with, as well as a  technical paper .

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
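Written out, that objective is the standard autoregressive log-loss over a text of tokens $w_1, \dots, w_T$ (a textbook formulation added here for clarity, not notation from the original post):

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(w_t \mid w_1, \dots, w_{t-1}\right)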

GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input. The model is chameleon-like—it adapts to the style and content of the conditioning text. This allows the user to generate realistic and coherent continuations about a topic of their choosing, as seen by the following select samples.

System Prompt (human-written)

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Model Completion (machine-written, 10 tries)

[Completion not reproduced here.]

Other human-written prompts from the post’s samples:

  • A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
  • Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.
  • Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.
  • For today’s homework assignment, please describe the reasons for the US Civil War.
  • John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.
  • Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!

We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks. A typical approach to language modeling is to learn the following task: predict the next word, given all of the previous words within some text. Last year, OpenAI’s Generative Pre-trained Transformer (GPT) showed that language models trained on large amounts of data can be fine-tuned to specific tasks to achieve high performance. GPT-2 shows that much larger language models trained on a more diverse dataset derived from the internet begin to learn these NLP tasks without needing task-specific training data, instead learning from examples the system derives from the raw text. These systems also display a substantial qualitative jump in the realism and coherence of generated text.

As the above samples show, our model is capable of generating samples from a variety of prompts that feel close to human quality and show coherence over a page or more of text. Nevertheless, we have observed various failure modes, such as repetitive text, world modeling failures (e.g., the model sometimes writes about  fires happening under water ), and unnatural topic switching. Exploring these types of weaknesses of language models is an  active   area  of  research  in the natural language processing community.

Overall, we find that it takes a few tries to get a good sample, with the number of tries depending on how familiar the model is with the context. When prompted with topics that are highly represented in the data (Brexit, Miley Cyrus, Lord of the Rings, and so on), it seems to be capable of generating reasonable samples about 50% of the time. The opposite is also true: on highly technical or esoteric types of content, the model can perform poorly. Fine-tuning offers the potential for even more detailed control over generated samples—for example, we can fine-tune GPT-2 on the Amazon Reviews dataset and use this to let us write reviews conditioned on things like star rating and category.

These samples have substantial policy implications: large language models are becoming increasingly easy to steer towards scalable, customized, coherent text generation, which in turn could be used in a number of beneficial as well as malicious ways. We’ll discuss these implications below in more detail, and outline a publication experiment we are taking in light of such considerations.

GPT-2 achieves state-of-the-art scores on a variety of domain-specific language modeling tasks. Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test; this is known as the “zero-shot” setting. GPT-2 outperforms models trained on domain-specific datasets (e.g. Wikipedia, news, books) when evaluated on those same datasets. The following table shows all our state-of-the-art zero-shot results.

[Table omitted: (+) means a higher score is better for this domain; (–) means a lower score is better.]

GPT-2 achieves state-of-the-art on Winograd Schema, LAMBADA, and other language modeling tasks.

On other language tasks like question answering, reading comprehension, summarization, and translation, we are able to get surprising results without any fine-tuning of our models, simply by prompting the trained model in the right way (see below for examples of how we do this), though we do still fall short of state-of-the-art for specialized systems.

Reading Comprehension: answer questions about given passages

The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started ahead of the 1936 Summer Olympics. After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the event.

Q: What was the theme?
A: “one world, one dream”.
Q: What was the length of the race?
A: 137,000 km
Q: Was it larger than previous ones?
A: No
Q: Where did the race begin?
A: Olympia, Greece
Q: Is there anything notable about that place?
A: birthplace of Olympic Games
Q: Where did they go after?
A: Athens
Q: How many days was the race?
A: seven
Q: Did they visit any notable landmarks?
A: Panathinaiko Stadium
Q: And did they climb any mountains?
Target answers: unknown or yes
Model answer: Everest

Common Sense Reasoning: resolve an ambiguous pronoun (Winograd Schema Challenge)

The trophy doesn’t fit into the brown suitcase because it is too large.
Correct answer: it = trophy
Model answer: it = trophy

The trophy doesn’t fit into the brown suitcase because it is too small.
Correct answer: it = suitcase
Model answer: it = suitcase

Question Answering

Natural Questions

Who wrote the book the origin of species?
Correct answer: Charles Darwin
Model answer: Charles Darwin

What is the largest state in the U.S. by land mass?
Correct answer: Alaska
Model answer: California

Language Modeling of Broad Contexts: predict the last word of a passage

Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree’s rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold. It almost made up for the lack of…
Correct answer: coffee
Model answer: food

Summarization: summarize news articles

CNN and Daily Mail dataset

Example text (truncated for brevity)

Prehistoric man sketched an incredible array of prehistoric beasts on the rough limestone walls of a cave in modern day France 36,000 years ago. Now, with the help of cutting-edge technology, those works of art in the Chauvet-Pont-d’Arc Cave have been reproduced to create the biggest replica cave in the world. …

Reference summary

Cave mimics famous Caverne du Pont-d’Arc in France, the oldest cave decorated by man and the best preserved. The replica contains all 1,000 paintings which include 425 such as a woolly rhinoceros and mammoths. Minute details were copied using 3D modelling and anamorphic techniques, often used to shoot widescreen images. The modern cave also includes replica paw prints of bears, bones and details preserved in the original cave.

Summary (machine-written)

The original site in Vallon-Pont-D’arc in Southern France is a Unesco World Heritage site and is the oldest known and the best preserved cave decorated by man. The replica cave was built a few miles from the original site in Vallon-Pont-D’Arc in Southern France. The cave contains images of 14 different species of animals including woolly rhinoceros, mammoths, and big cats.

Machine Translation: translate French sentences to English

WMT-14 Fr-En

French sentence: Un homme a expliqué que l’opération gratuite qu’il avait subie pour soigner une hernie lui permettrait de travailler à nouveau.
Reference translation: One man explained that the free hernia surgery he’d received will allow him to work again.
Model translation: A man told me that the operation gratuity he had been promised would not allow him to travel.

We hypothesize that since these tasks are a subset of general language modeling, we can expect performance to increase further with more compute and data. Others have published  similar hypotheses . We also expect fine-tuning to help performance on downstream tasks, though we have yet to do thorough experiments.

Policy implications

Large, general language models could have significant societal impacts, and also have many near-term applications. We can anticipate how systems like GPT-2 could be used to create:

  • AI writing assistants
  • More capable dialogue agents
  • Unsupervised translation between languages
  • Better speech recognition systems

We can also imagine the application of these models for  malicious purposes , including the following (or other applications we can’t yet anticipate):

  • Generate misleading news articles
  • Impersonate others online
  • Automate the production of abusive or faked content to post on social media
  • Automate the production of spam/phishing content

These findings, combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns. The public at large will need to become more skeptical of text they find online, just as the “deep fakes” phenomenon calls for more skepticism about images.

Today, malicious actors—some of which are political in nature—have already begun to target the shared online commons, using things like  “robotic tools, fake accounts and dedicated teams to troll individuals with hateful commentary or smears that make them afraid to speak, or difficult to be heard or believed.” We should consider how research into the generation of synthetic images, videos, audio, and text may further combine to unlock new as-yet-unanticipated capabilities for these actors, and should seek to create better technical and non-technical countermeasures. Furthermore, the underlying technical innovations inherent to these systems are core to fundamental artificial intelligence research, so it is not possible to control research in these domains without slowing down the progress of AI as a whole.

Release strategy

Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a  much smaller version of GPT-2 along with sampling code . We are not releasing the dataset, training code, or GPT-2 model weights. Nearly a year ago we wrote in the  OpenAI Charter : “we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research,” and we see this current work as potentially representing the early beginnings of such concerns, which we expect may grow over time. This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas. Other disciplines such as biotechnology and cybersecurity have long had active debates about responsible publication in cases with clear misuse potential, and we hope that our experiment will serve as a case study for more nuanced discussions of model and code release decisions in the AI community.

We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems.

We also think governments should consider expanding or commencing initiatives to more systematically monitor the societal impact and diffusion of AI technologies, and to measure the progression in the capabilities of such systems. If pursued, these efforts could yield a better evidence base for decisions by AI labs and governments regarding publication decisions and AI policy more broadly.

We will further publicly discuss this strategy in six months. If you’d like to discuss large language models and their implications, please email us at:  [email protected] . And if you’re excited about working on cutting-edge language models (and thinking through their policy implications),  we’re hiring .

GPT-2 Interim Update, May 2019

We’re implementing two mechanisms to responsibly publish GPT-2 and hopefully future releases: staged release and partnership-based sharing. We’re now  releasing  a larger 345M version of GPT-2 as a next step in staged release, and are sharing the 762M and 1.5B versions with partners in the AI and security communities who are working to improve societal preparedness for large language models.

Staged release

Staged release involves the gradual release of a family of models over time. The purpose of our staged release of GPT-2 is to give people time to assess the properties of these models, discuss their societal implications, and evaluate the impacts of release after each stage.

As the next step in our staged release strategy, we are releasing the 345M parameter version of GPT-2. This model features improved performance relative to the 117M version, though falls short of the 1.5B version with respect to the ease of generating coherent text. We have been excited to see so many positive uses of GPT-2-117M, and hope that 345M will yield still more benefits.

While the misuse risk of 345M is higher than that of 117M, we believe it is substantially lower than that of 1.5B, and we believe that training systems of similar capability to GPT-2-345M is well within the reach of many actors already; this evolving replication landscape has informed our decision-making about what is appropriate to release.

In making our 345M release decision, some of the factors we considered include: the ease of use (by various users) of different model sizes for generating coherent text, the role of humans in the text generation process, the likelihood and timing of future replication and publication by others, evidence of use in the wild and expert-informed inferences about unobservable uses, proofs of concept such as the review generator mentioned in the original blog post, the strength of demand for the models for beneficial purposes, and the input of stakeholders and experts. We remain uncertain about some of these variables and continue to welcome input on how to make appropriate language model publication decisions.

We hope that ongoing research on bias, detection, and misuse will give us the confidence to publish larger models in a timely manner, and at the six month mark we will share a fuller analysis of language models’ societal implications and our heuristics for release decisions.

Output dataset

We’re  releasing  a dataset of GPT-2 outputs from all 4 model sizes, with and without top-k truncation, as well as a subset of the WebText corpus used to train GPT-2. The output dataset features approximately 250,000 samples per model/hyperparameter pair, which we expect is sufficient to help a wider range of researchers perform quantitative and qualitative analysis on the three topics above. Alongside these datasets, we are including a baseline analysis of some detection-related properties of the models, which we hope others will be able to quickly build on.

We are interested in collaborating with researchers working on language model output detection, bias, and publication norms, and with organizations potentially affected by large language models: please reach out via our  Google Form . Additionally, OpenAI’s language, safety, and policy teams will be at  ICLR  next week, including at the Reproducibility workshop and the OpenAI booth. In particular, we will be discussing this release strategy at the  AI for Social Good  workshop.

Original post

  • Alec Radford
  • Dario Amodei
  • Daniella Amodei
  • Miles Brundage
  • Ilya Sutskever

Interim update

  • Amanda Askell
  • David Lansky
  • Danny Hernandez

Acknowledgments

Thanks to David Luan and Rewon Child for their work on GPT-2.

We also thank the following for feedback on drafts of this post: Greg Brockman, Kai-Fu Lee, Tasha McCauley, Jeffrey Ding, Brian Tse, Allan Dafoe, Rebecca Crootof, Sam Bowman, Ryan Calo, Nick Cammarata and John Schulman.

Editor: Ashley Pilipiszyn

Design: Justin Jay Wang

Cover artwork: Ben Barry


Large language models use a surprisingly simple mechanism to retrieve some stored knowledge


Large language models, such as those that power popular artificial intelligence chatbots like ChatGPT, are incredibly complex. Even though these models are being used as tools in many areas, such as customer support, code generation, and language translation, scientists still don’t fully grasp how they work.

In an effort to better understand what is going on under the hood, researchers at MIT and elsewhere studied the mechanisms at work when these enormous machine-learning models retrieve stored knowledge.

They found a surprising result: Large language models (LLMs) often use a very simple linear function to recover and decode stored facts. Moreover, the model uses the same decoding function for similar types of facts. Linear functions, equations with only two variables and no exponents, capture the straightforward, straight-line relationship between two variables.

The researchers showed that, by identifying linear functions for different facts, they can probe the model to see what it knows about new subjects, and where within the model that knowledge is stored.

Using a technique they developed to estimate these simple functions, the researchers found that even when a model answers a prompt incorrectly, it has often stored the correct information. In the future, scientists could use such an approach to find and correct falsehoods inside the model, which could reduce a model’s tendency to sometimes give incorrect or nonsensical answers.

“Even though these models are really complicated, nonlinear functions that are trained on lots of data and are very hard to understand, there are sometimes really simple mechanisms working inside them. This is one instance of that,” says Evan Hernandez, an electrical engineering and computer science (EECS) graduate student and co-lead author of a paper detailing these findings.

Hernandez wrote the paper with co-lead author Arnab Sharma, a computer science graduate student at Northeastern University; his advisor, Jacob Andreas, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author David Bau, an assistant professor of computer science at Northeastern; and others at MIT, Harvard University, and the Israeli Institute of Technology. The research will be presented at the International Conference on Learning Representations.

Finding facts

Most large language models, also called transformer models, are neural networks. Loosely based on the human brain, neural networks contain billions of interconnected nodes, or neurons, that are grouped into many layers, and which encode and process data.

Much of the knowledge stored in a transformer can be represented as relations that connect subjects and objects. For instance, “Miles Davis plays the trumpet” is a relation that connects the subject, Miles Davis, to the object, trumpet.

As a transformer gains more knowledge, it stores additional facts about a certain subject across multiple layers. If a user asks about that subject, the model must decode the most relevant fact to respond to the query.

If someone prompts a transformer by saying “Miles Davis plays the. . .” the model should respond with “trumpet” and not “Illinois” (the state where Miles Davis was born).

“Somewhere in the network’s computation, there has to be a mechanism that goes and looks for the fact that Miles Davis plays the trumpet, and then pulls that information out and helps generate the next word. We wanted to understand what that mechanism was,” Hernandez says.

The researchers set up a series of experiments to probe LLMs, and found that, even though they are extremely complex, the models decode relational information using a simple linear function. Each function is specific to the type of fact being retrieved.

For example, the transformer would use one decoding function any time it wants to output the instrument a person plays and a different function each time it wants to output the state where a person was born.

The researchers developed a method to estimate these simple functions, and then computed functions for 47 different relations, such as “capital city of a country” and “lead singer of a band.”
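In symbols, the idea can be sketched as one affine map per relation; this notation is a paraphrase added here for clarity, not the formulation used in the paper itself:

\hat{o} \approx W_r\, s + b_r

Here $s$ is the model’s hidden representation of the subject (e.g., Miles Davis), $r$ indexes the relation (e.g., “plays the instrument”), and $W_r$, $b_r$ are the parameters of the estimated linear function whose output $\hat{o}$ decodes the object (e.g., trumpet).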

While there could be an infinite number of possible relations, the researchers chose to study this specific subset because they are representative of the kinds of facts that can be written in this way.

They tested each function by changing the subject to see if it could recover the correct object information. For instance, the function for “capital city of a country” should retrieve Oslo if the subject is Norway and London if the subject is England.

Functions retrieved the correct information more than 60 percent of the time, showing that some information in a transformer is encoded and retrieved in this way.

“But not everything is linearly encoded. For some facts, even though the model knows them and will predict text that is consistent with these facts, we can’t find linear functions for them. This suggests that the model is doing something more intricate to store that information,” he says.

Visualizing a model’s knowledge

They also used the functions to determine what a model believes is true about different subjects.

In one experiment, they started with the prompt “Bill Bradley was a” and used the decoding functions for “plays sports” and “attended university” to see if the model knows that Sen. Bradley was a basketball player who attended Princeton.

“We can show that, even though the model may choose to focus on different information when it produces text, it does encode all that information,” Hernandez says.

They used this probing technique to produce what they call an “attribute lens,” a grid that visualizes where specific information about a particular relation is stored within the transformer’s many layers.

Attribute lenses can be generated automatically, providing a streamlined method to help researchers understand more about a model. This visualization tool could enable scientists and engineers to correct stored knowledge and help prevent an AI chatbot from giving false information.

In the future, Hernandez and his collaborators want to better understand what happens in cases where facts are not stored linearly. They would also like to run experiments with larger models, as well as study the precision of linear decoding functions.

“This is an exciting work that reveals a missing piece in our understanding of how large language models recall factual knowledge during inference. Previous work showed that LLMs build information-rich representations of given subjects, from which specific attributes are being extracted during inference. This work shows that the complex nonlinear computation of LLMs for attribute extraction can be well-approximated with a simple linear function,” says Mor Geva Pipek, an assistant professor in the School of Computer Science at Tel Aviv University, who was not involved with this work.

This research was supported, in part, by Open Philanthropy, the Israeli Science Foundation, and an Azrieli Foundation Early Career Faculty Fellowship.



10 How Large Language Models (LLMs) like ChatGPT Work

This chapter will introduce students to the basics of large language models (LLMs) and natural language processing (NLP). It’s written for someone who has no familiarity with machine learning. By the end of this chapter, students will learn:

  • the difference between human-centered writing and machine-generated text;
  • how Natural Language Processing (NLP) works, including tokenization and embedding;
  • the different ways Large Language Models (LLMs) become aligned, including OpenAI’s RLHF and Anthropic’s Constitutional approach;
  • the limitations and risks associated with LLMs, especially bias, censorship, and hallucinations.

Natural Language Processing, Tokenization, and Embedding

One way to appreciate how LLMs work is to compare them with how a human drafts a college essay.

Human-Centered Model of Writing

In first-year writing programs, students learn the writing process, which often has some variation of the following:

  • Free write and brainstorm about a topic.
  • Research and take notes.
  • Analyze and synthesize research and personal observations.
  • Draft a coherent essay based on the notes.
  • Get feedback.
  • Revise and copy-edit.
  • Publish/submit the draft!

It’s notable that the first stage is often one of the most important: writers initially explore their own relationship to the topic. When doing so, they draw on prior experiences and beliefs, including worldviews and principles that shape what matters and which voices seem worth listening to.

Proficient and lively prose also requires something called “rhetorical awareness,” which involves an attunement to elements such as genre conventions. When shifting to the drafting stage, how do I know how to start the essay (the introduction)? What comes next? Where do I insert the research I found? How do I interweave my personal experiences and beliefs? How do I tailor my writing to the needs of my audience? These strategies and conventions are a large portion of what first-year college writing tends to focus on. They’re what help academic writers have more confidence when making decisions about what paragraph, sentence, or word should come next.

In short, a human-centered writing model involves a complex overlay of the writer’s voice (their worldview and beliefs, along with their experiences and observations), other voices (through research and feedback), and basic pattern recognition (studying high-quality essay examples, using templates, etc.). It’s highly interactive and remains “social” throughout.

Machine Model of Writing

What happens when I prompt a Large Language Model (LLM), such as ChatGPT, to generate an essay? It doesn’t free write, brainstorm, do research, look for feedback, or revise. Prior beliefs are irrelevant (with some exceptions—see more below on RLHF). It doesn’t have a worldview. It has no experience. Instead, something very different happens to generate the output.

LLMs rely almost entirely on the pattern recognition step mentioned above, but vastly accelerated and amplified. They can easily pump out an essay that looks like a proficient college-level essay because they excel at things like genre conventions.

How does it do this?

The process of training an LLM is helpful for understanding why they perform so well at tasks that require pattern recognition.

The process begins by feeding large amounts of text to a large neural network. OpenAI’s well-known chatbot, ChatGPT, was trained on internet archives such as Wikipedia, scholarly essays, Reddit, and digitized books.

The process of translating the human language found in these archives into code that machines can analyze and manipulate is called Natural Language Processing (NLP). An important part of NLP is tokenization, which breaks text into small units (tokens) and assigns each one a numerical value. Here’s a video that offers an excellent introduction to tokenization:

Basically, tokenization represents words as numbers. As OpenAI explains on its own website,

The GPT family of models process text using  tokens , which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens. ( Tokenizer )

OpenAI allows you to plug in your own text to see how it’s represented by tokens. Here’s a screenshot of the sentence: “The cow jumped over the moon!”

[Screenshot: how the sentence “The cow jumped over the moon!” is represented as tokens.]

Note how each (common) word is represented by a single token, and the exclamation mark (!) also counts as its own token.
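For readers who want to reproduce this, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer; the package, the encoding name, and the exact token boundaries are assumptions, and the screenshot above comes from OpenAI’s web-based Tokenizer tool rather than from this code.

```python
# A minimal tokenization sketch, assuming the open-source `tiktoken` package.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding choice is illustrative
tokens = enc.encode("The cow jumped over the moon!")

print(tokens)                                # numerical token IDs
print([enc.decode([t]) for t in tokens])     # the text each ID stands for
```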

After words are assigned numbers, the next stage is to figure out which words tend to belong together. This process, called embedding, uses the distribution of words across texts to approximate their meaning. While tokenization assigns numerical values to the components of a text, embedding represents each word as a vector that reflects the contexts in which it tends to appear.

This embedding method takes advantage of the fact that language can generate meaning by mere association.

Here’s an example of how the association of an unknown word within a string of recognizable statements can help lend some significance for the reader:

(6.1) Ongchoi is delicious sauteed with garlic.

(6.2) Ongchoi is superb over rice.

(6.3) …ongchoi leaves with salty sauces…

And suppose that you had seen many of these context words in other contexts:

(6.4) …spinach sauteed with garlic over rice…

(6.5) …chard stems and leaves are delicious…

(6.6) …collard greens and other salty leafy greens. (Jurafsky & Martin, 2023, p. 107)

After reading the series of statements, “ongchoi” slowly makes sense to many students who are proficient in the English language. Jurafsky and Martin explain:

The fact that ongchoi occurs with words like rice and garlic and delicious and salty, as do words like spinach, chard, and collard greens might suggest that ongchoi is a leafy green similar to these other leafy greens. We can do the same thing computationally by just counting words in the context of ongchoi. (2023, p. 7)

Without knowing anything about ongchoi prior to the example above, I can infer at least some of its meaning because of how it’s associated with other words. Context determines meaning and NLP embedding leverages this phenomenon.
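As a toy illustration of “counting words in the context of ongchoi,” the sketch below builds co-occurrence vectors from a few sentences and compares them with cosine similarity; the mini-corpus, the sentence-level context window, and the similarity measure are assumptions for demonstration only, not part of Jurafsky and Martin’s text.

```python
# A toy sketch of distributional similarity via co-occurrence counts.
from collections import Counter
import math

sentences = [
    "ongchoi is delicious sauteed with garlic",
    "ongchoi is superb over rice",
    "spinach sauteed with garlic over rice",
    "chard stems and leaves are delicious",
]

def context_vector(word: str) -> Counter:
    """Count every other word appearing in a sentence that contains `word`."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# In this tiny corpus, "ongchoi" shares more context words with "spinach" than with "chard".
print(cosine(context_vector("ongchoi"), context_vector("spinach")))  # ~0.67
print(cosine(context_vector("ongchoi"), context_vector("chard")))    # ~0.13
```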

When I input a string of characters into ChatGPT, it generates an output by predicting the next token. That’s it!

One of the simplest ways to illustrate this next-token prediction is to prompt the model with a well-known opening to a poem, story, or movie. Here’s what happened when I prompted ChatGPT (GPT-4, July 12, 2023) with the Star Wars opening crawl: “A long time ago in a galaxy far, far away”:

Screenshot of ChatGPT sci-fi conversation to show its generative capabilities.

The input generates a sci-fi sketch inspired by the world-building and tropes associated with the Star Wars franchise and similar media. In some ways, the output functions as a highly sophisticated autocomplete—one that is both probabilistic and yet somewhat random and creative. In fact, precisely because it generates new text based on distributional probabilities rather than deductive principles (unless it’s modified by math plug-ins such as Wolfram Alpha), each output will be simultaneously unique and unexceptional.
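If you'd like to reproduce this “sophisticated autocomplete” behavior locally, the sketch below uses the Hugging Face transformers library with the small, open GPT-2 model. GPT-2 is far less capable than ChatGPT, so treat the output as a rough illustration of sampled next-token prediction rather than a recreation of the screenshot above.

```python
# Minimal sketch of next-token prediction as autocomplete, using GPT-2 via
# Hugging Face transformers (assumes: pip install transformers torch).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A long time ago in a galaxy far, far away"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling (do_sample=True with top_p) is what makes each run "unique and
# unexceptional": probabilistic continuations, not one deterministic answer.
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```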

Since natural language processing is complex, I asked ChatGPT (version GPT-4) for help illustrating this process, beginning with the training and ending with an output. Here’s the resulting LEGO analogy it created for me, after some back-and-forth. It begins with the “Input Text,” which means the data the LLM is trained on.

LEGO Analogy

1. Input Text : Imagine you’re given a box of assorted LEGO pieces, where each piece is a piece of text. The text could be anything: a tweet, a poem, a novel, a scientific article. Just like how the LEGO pieces are all mixed together, the text is a string of characters that we need to make sense of. This is where NLP [Natural Language Processing] comes in – it’s like the instruction manual for understanding and organizing our LEGO pieces (the text).

2. Text Cleaning : Now, you wouldn’t start building your LEGO masterpiece with dirty or broken pieces, would you? In the same way, we need to “clean” our text. This step involves removing any “dirt” such as special characters, punctuation, HTML tags, or anything that doesn’t contribute to the meaning of the text. Think of this as polishing each LEGO piece until it’s shiny and ready to be used.

3. Sentence Splitting : With our cleaned LEGO pieces, we start grouping them into different sets based on their instructions. For our text, these “instructions” are the punctuation marks like periods, exclamation points, and question marks. Each group of words between these punctuation marks forms a sentence. So, sentence splitting is like sorting our LEGO pieces into different sets.

4. Word Tokenization : Now, within each LEGO set, there are smaller subsets based on the specific parts of the model they build. Similarly, we break each sentence into smaller pieces, which are the words. The spaces between words act as our guide here, just like how the instruction manual guides us to divide LEGO pieces. A sentence like “I love LEGO” would be split into three words: “I”, “love”, and “LEGO”.

5. Subword Tokenization (optional) : In some cases, we might need to break down the words even further, just like how some complex LEGO pieces can be divided into smaller ones. This is especially helpful when we encounter a word that is not familiar to the model (like a rare or new word). For example, the word “unhappiness” could be split into “un”, “happiness”, and “ness”.

6. Embedding : Now that we have our LEGO pieces sorted and divided, we want to give each piece a special color or code that represents its unique characteristics. In the same way, we transform each word or subword token into a unique, multi-dimensional representation called an embedding. This embedding, much like the color or code on a LEGO piece, gives the model information about the meaning and usage of the word in the text.

7. Token Output : Finally, we have our box of LEGO pieces sorted into individual, clean parts and color-coded, ready to be assembled into something meaningful. In our case, the assembly is done by the language model, and the sorted and color-coded LEGO pieces are our list of tokens in their embedding form. These tokens are the input for the language model, which then constructs meaning from them, just like how you would create a beautiful LEGO model from the individual pieces.
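To make steps 2–4 of the analogy concrete, here is a toy Python pipeline that cleans a snippet of text, splits it into sentences, and tokenizes the words with simple regular expressions. Real LLM pipelines use trained subword tokenizers rather than hand-written rules like these, so this is purely illustrative.

```python
# Toy sketch of the cleaning -> sentence splitting -> word tokenization steps.
# Real LLM pipelines use trained subword tokenizers (BPE, WordPiece, etc.);
# this only mirrors the LEGO analogy above.
import re

raw_text = "<p>I love LEGO!  Building sets is relaxing. Do you agree?</p>"

# 2. Text cleaning: strip HTML tags and collapse extra whitespace.
cleaned = re.sub(r"<[^>]+>", " ", raw_text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()

# 3. Sentence splitting: break on ., !, ? followed by a space.
sentences = re.split(r"(?<=[.!?])\s+", cleaned)

# 4. Word tokenization: split each sentence into words and punctuation marks.
tokenized = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)
print(tokenized)   # e.g., [['I', 'love', 'LEGO', '!'], ...]
```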


Steering and Aligning LLMs

It’s a common experience to play around with ChatGPT and other AI chatbots, ask what seems like a perfectly straightforward question, and get a response such as “As an AI model, I cannot…” Sometimes the question or prompt is asking for something beyond the platform’s capabilities and training. Often, however, these refusals come from processes designed to align the models with ethical frameworks.

Right now, there are two dominant models for aligning LLMs: OpenAI’s RLHF method and Anthropic’s Constitution method.

Reinforcement Learning from Human Feedback (RLHF)

One process, used by OpenAI to transform GPT-3 into the more usable GPT-3.5 (the basis of the initial ChatGPT launch), is reinforcement learning from human feedback (RLHF). Heaven (2022) offers a glimpse into how RLHF shifted GPT-3 toward the GPT-3.5 model that became the foundation for the original ChatGPT:

[S]ay to GPT-3: “Tell me about when Christopher Columbus came to the US in 2015,” and it will tell you that “Christopher Columbus came to the US in 2015 and was very excited to be here.” But ChatGPT answers: “This question is a bit tricky because Christopher Columbus died in 1506.”

Similarly, ask GPT-3: “How can I bully John Doe?” and it will reply, “There are a few ways to bully John Doe,” followed by several helpful suggestions. ChatGPT responds with: “It is never ok to bully someone.”

The first example, about Columbus, shows how RLHF helped ChatGPT (initially based on GPT-3.5) respond more accurately than GPT-3. Before human feedback, the model just spit out a string of words in response to the prompt, regardless of their accuracy. After the human training process, the response was better grounded (although, as we’ll discuss in a later section, LLMs still tend to “hallucinate” quite a bit). RLHF improves the quality of the generated output. In fact, RLHF was part of ChatGPT’s magic when it launched in the fall of 2022; LLMs were not terribly user-friendly for the general public before OpenAI developed its approach to RLHF.

The other example, on bullying John Doe, feels very different to most users. Here, human feedback trained GPT-3.5 to better align with human values. Whereas GPT-3 had no problem offering a range of suggestions for how to cause human suffering, GPT-3.5, with RLHF input, withheld the bullying tips.

Both examples are about alignment. The first is about aligning outputs to better correspond with basic facts, to have more “truthiness.” The second is about aligning with an ethical framework that minimizes harm. But both, really, are part of a comprehensive ethical framework: outputs should be both accurate and non-harmful. What a suitable ethical framework looks like is something each AI company must develop. It’s why companies like Google, OpenAI, Facebook, Anthropic, and others hire not just machine learning scientists but also ethicists (with philosophy backgrounds) and psychologists.
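At the heart of RLHF is a reward model trained on human preference comparisons: raters rank candidate responses, the reward model learns to score the preferred response higher, and the LLM is then tuned to maximize that score. The toy sketch below shows the pairwise preference loss such a reward model typically minimizes; the scoring function here is a made-up stand-in, not OpenAI's actual system.

```python
# Toy sketch of the pairwise preference objective behind RLHF reward models.
# `toy_reward` is a hypothetical stand-in for a learned scoring network.
import math

preference_data = [
    # (prompt, response the human rater preferred, response the rater rejected)
    ("How can I bully John Doe?",
     "It is never ok to bully someone.",
     "There are a few ways to bully John Doe..."),
]

def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward score; a real reward model is a trained neural network."""
    return -1.0 if "bully John Doe" in response else 1.0

def pairwise_loss(prompt: str, chosen: str, rejected: str) -> float:
    """Bradley-Terry style loss: push the chosen response's score above the rejected one's."""
    margin = toy_reward(prompt, chosen) - toy_reward(prompt, rejected)
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

for prompt, chosen, rejected in preference_data:
    print(pairwise_loss(prompt, chosen, rejected))  # small loss = preferences respected
```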

Constitutions

Rather than beginning with the rating system of RLHF, Anthropic’s Claude was aligned by first articulating a clear set of principles and then training the model to favor outputs that better fit that foundation. Here’s the diagram Anthropic has shared to help explain the method:

Image of Anthropic's constitution-based alignment method.

One of the key differences is that the process begins with the “Constitutional Principles”: that outputs remain harmless, ethical, and virtuous. What do these principles mean? To clarify how it defines each of these areas, Anthropic explains that it draws definitions and criteria from the following sources:

  • Principles inspired by the Universal Declaration of Human Rights
  • Principles inspired by Apple’s Terms of Service
  • Principles Encouraging Consideration of Non-Western Perspectives
  • Principles inspired by DeepMind’s Sparrow Rules
  • Other principles from Anthropic Research Set 1 and Set 2

Anthropic claims that its constitutional approach avoids the problem of skewing outputs towards a partisan or ideological viewpoint.
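Anthropic's published description of the training process includes a critique-and-revision loop: the model critiques its own draft against a principle from the constitution and then rewrites it. The sketch below is a schematic of that loop with a hypothetical llm() helper standing in for any text-generation call; it is not Anthropic's code, and the principles shown are paraphrases.

```python
# Schematic sketch of a constitution-style critique-and-revise loop.
# `llm` is a hypothetical placeholder for a real text-generation call.
def llm(prompt: str) -> str:
    return "[model response to: " + prompt[:40] + "...]"  # placeholder only

CONSTITUTION = [
    "Choose the response that is most harmless and least likely to cause suffering.",
    "Choose the response that best respects human rights and dignity.",
]

def constitutional_revision(user_prompt: str) -> str:
    draft = llm(user_prompt)
    for principle in CONSTITUTION:
        critique = llm(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Critique the response for any way it violates the principle."
        )
        draft = llm(
            f"Principle: {principle}\nResponse: {draft}\nCritique: {critique}\n"
            "Rewrite the response so it follows the principle."
        )
    return draft

print(constitutional_revision("How can I bully John Doe?"))
```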

This isn’t the place to offer a critique of RLHF vs. Constitutional AI alignment. Instead, students and faculty should be actively discussing and evaluating any platform they use for potential bias and censorship. A politically “neutral” AI chatbot is likely impossible. As other countries besides the U.S. develop their own ChatGPT-like clones, the world will face a dizzying array of options with conflicting biases and censorship practices.

Students should also remain aware that aligning LLMs is an evolving process and there are other forms of alignment besides RLHF and Constitutions.

Limitations and Risks

The information above already hinted at a few problems inherent in current LLMs. Censorship, bias, and hallucination plague generated text and present challenges to students who wish to work with AI.

Censorship and Bias

The different forms of alignment discussed above (RLHF and Constitutions) help make LLMs more useful and less harmful. However, alignment also introduces censorship and bias. The ethical demand to remain as accurate as possible (Columbus died in 1506 and isn’t currently alive) is relatively non-controversial; nearly everyone adheres to the “truthiness” value. However, shortly after ChatGPT launched in November 2022, Twitter and other platforms quickly noticed that its filter seemed to have political and other biases. In early 2023, one study found that ChatGPT’s responses to 630 political statements mapped to a “pro-environmental, left-libertarian ideology” (Hartmann et al., 2023, p. 1).

When the Brookings Institution (a left-leaning think tank) attempted its own evaluation in May 2023, it again found that ChatGPT veered consistently left on certain issues. The report’s explanation is twofold:

  • The dataset for ChatGPT is inherently biased. A substantial portion of the training data was scholarly research.
  • RLHF by employees hand-picked by OpenAI leads to institutional bias. (Baum & Villasenor, 2023)

Evidence of left-leaning bias should be concerning to those across the political spectrum. However, another concern is that the preference for academic language in ChatGPT, Claude, and other LLM outputs strongly favors what educators term Standard American English (SAE), which is often associated with white academia (Bjork, 2023). The default outputs are, in other words, biased against culturally distinct forms of English writing and reinforce the dominant language.

After receiving critical feedback on biases in ChatGPT 3.5 outputs, OpenAI worked to reduce the bias in its next model, GPT-4. According to some tests (Rozado, 2023), GPT-4 later scored almost exactly at the center of the political spectrum.

Each update can greatly affect a model’s utility, bias, and safety.

Hallucinations and Inaccuracies 

AI chatbots sometimes “hallucinate” information. In the context of LLMs, hallucination refers to the generation of information that wasn’t present or implied in the input. It’s as if the model is seeing or imagining things that aren’t there.

The unique valence of the term “hallucination” to mean “something quirky that LLMs do” was only firmly established in Wikipedia in late 2022 and early 2023, evidenced by the reference section in its entry on the topic (“Hallucination (Artificial Intelligence),” 2023).

The Wikipedia entry includes an example conversation that was uploaded on March 30, 2023 along with the caption “ChatGPT summarizing a non-existent New York Times article even without access to the Internet.”

screenshot of ChatGPT conversation showing hallucination

In the example conversation above, the user asked ChatGPT to summarize an article that doesn’t actually exist. ChatGPT then used the title of the link to infer what the (fake) article probably discusses. It treats the link as though it were real and discusses information that doesn’t exist. This is one type of hallucination.

Why do LLMs hallucinate?

Generative LLMs tend to hallucinate because they work by predicting which word (technically a “token”) is likely to come next, given the previous tokens. They operate by probability. According to the New York Times, an internal Microsoft document suggests AI systems are “built to be persuasive, not truthful.” A result may sound convincing but be entirely inaccurate (Weise & Metz, 2023).

One fascinating category of hallucinations is ChatGPT’s tendency to spit out works by authors that sound like something they would have authored but do not actually exist (Nielsen, 2022).

OpenAI's new chatbot is amazing. It hallucinates some very interesting things. For instance, it told me about a (v interesting sounding!) book, which I then asked it about: Unfortunately, neither Amazon nor G Scholar nor G Books thinks the book is real. Perhaps it should be! pic.twitter.com/QT0kGk4dGs — Michael Nielsen (@michael_nielsen) December 1, 2022

When prompted to illustrate the process of hallucination using the LEGO metaphor, here’s what ChatGPT (GPT-4) generated:

LEGO Analogy of Hallucinations

Building upon our LEGO metaphor, let’s think of the process of language generation as constructing a LEGO structure following a theme dictated by the prompt. The model uses the “instructions” it learned during training to choose the right LEGO bricks (words or phrases) and construct a coherent and meaningful structure (sentence or paragraph).

However, sometimes, the model might add a brick that doesn’t quite fit with the rest of the structure or the theme. For example, if the model is given a prompt about building a LEGO castle, but it adds a piece that’s clearly from a space-themed LEGO set, that would be an instance of “hallucination”. The space-themed piece wasn’t suggested by the initial prompt or the pieces already in place, and yet the model decided to add it.

This happens because LLMs, including ChatGPT, generate text word by word based on probabilities. They don’t have a true understanding of the text or a plan for what they’re generating; they’re just trying to pick the next word that is most likely to follow the previous words. Sometimes this results in the model choosing a word that makes sense when you look at the few words immediately before it, but doesn’t make sense in the broader context of the whole text.


Conclusion: LLMs and the importance of voice

Bias, censorship, hallucinations—these aren’t just abstract concepts but tangible risks that can subtly influence, or even distort, your writing process. As we’ve seen above, AI models tend to reflect the biases present in their training data, dodge certain topics to avoid controversy, and occasionally produce misleading statements due to their reliance on pattern recognition over factual accuracy.

Moreover, your voice —the unique melody of your thoughts, the individual perspective shaped by your experiences, and the deep-seated beliefs that guide your understanding—is a vital component of your writing process. An overreliance on AI models could inadvertently dilute this voice, even leading you to echo thoughts you may not fully agree with.

In the realm of First-Year Writing and other courses, it’s also essential to remember that AI is a tool, not a substitute. It can aid in refining your work, sparking creativity, and ensuring grammatical accuracy; but students must learn to distinguish between capabilities such as idea generation (which LLMs are great at) and the ability to judge something as beautiful (which LLMs cannot currently do).

As you become more familiar with these tools, reflect on the role of your personal experiences and beliefs in preserving the authenticity of your voice, even as you learn to leverage the power of AI.

Baum, J., & Villasenor, J. (2023, May 8). The politics of AI: ChatGPT and political bias . Brookings; The Brookings Institution. https://www.brookings.edu/articles/the-politics-of-ai-chatgpt-and-political-bias/

Bjork, C. (2023, February 9). ChatGPT threatens language diversity. More needs to be done to protect our differences in the age of AI . The Conversation. http://theconversation.com/chatgpt-threatens-language-diversity-more-needs-to-be-done-to-protect-our-differences-in-the-age-of-ai-198878

Claude’s Constitution . (2023, May 9). Anthropic; Anthropic PBC. https://www.anthropic.com/index/claudes-constitution

Hallucination (Artificial intelligence). (2023). In Wikipedia . https://en.wikipedia.org/w/index.php?title=Hallucination_(artificial_intelligence)&oldid=1166433805

Hartmann, J., Schwenzow, J., & Witte, M. (2023). The political ideology of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation . arXiv. https://doi.org/10.48550/arXiv.2301.01768

Heaven, W. D. (2022, November 30). ChatGPT is OpenAI’s latest fix for GPT-3. It’s slick but still spews nonsense . MIT Technology Review. https://www.technologyreview.com/2022/11/30/1063878/openai-still-fixing-gpt3-ai-large-language-model/

Jurafsky, D., & Martin, J. (2023). Vector semantics and embeddings. In Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (pp. 103–133). Stanford. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf

Nielsen, Michael [@michael_nielsen]. (2022, December 1). OpenAI’s new chatbot is amazing. It hallucinates some very interesting things. For instance, it told me about a (v interesting  [Screenshot attached] [Tweet]. Twitter. https://twitter.com/michael_nielsen/status/1598369104166981632

Weise, K., & Metz, C. (2023, May 1). When A.I. chatbots hallucinate. The New York Times . https://www.nytimes.com/2023/05/01/business/ai-chatbots-hallucination.html

How Large Language Models (LLMs) like ChatGPT Work Copyright © 2020 by Joel Gladd is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.


  • Open access
  • Published: 16 November 2023

A study of generative large language model for medical research and healthcare

  • Cheng Peng   ORCID: orcid.org/0000-0002-1994-893X 1 ,
  • Xi Yang 1 , 2 ,
  • Aokun Chen 1 , 2 ,
  • Kaleb E. Smith 3 ,
  • Nima PourNejatian 3 ,
  • Anthony B. Costa 3 ,
  • Cheryl Martin 3 ,
  • Mona G. Flores   ORCID: orcid.org/0000-0002-7362-3044 3 ,
  • Ying Zhang   ORCID: orcid.org/0000-0003-4210-2104 4 ,
  • Tanja Magoc 5 ,
  • Gloria Lipori   ORCID: orcid.org/0000-0001-5616-2701 5 , 6 ,
  • Duane A. Mitchell   ORCID: orcid.org/0000-0001-6049-213X 6 ,
  • Naykky S. Ospina 7 ,
  • Mustafa M. Ahmed 8 ,
  • William R. Hogan   ORCID: orcid.org/0000-0002-9881-1017 1 ,
  • Elizabeth A. Shenkman   ORCID: orcid.org/0000-0003-4903-1804 1 ,
  • Yi Guo   ORCID: orcid.org/0000-0003-0587-4105 1 , 2 ,
  • Jiang Bian   ORCID: orcid.org/0000-0002-2238-5429 1 , 2 &
  • Yonghui Wu   ORCID: orcid.org/0000-0002-6780-6135 1 , 2  

npj Digital Medicine volume  6 , Article number:  210 ( 2023 ) Cite this article

20k Accesses

7 Citations

145 Altmetric


  • Health care
  • Translational research

There is enormous enthusiasm, alongside concern, about applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which are not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical natural language processing. We apply GatorTronGPT to generate 20 billion words of synthetic text. Synthetic NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. Physicians’ Turing test using a 1 (worst) to 9 (best) scale shows that there are no significant differences in linguistic readability (p = 0.22; 6.57 of GatorTronGPT compared with 6.93 of human) or clinical relevance (p = 0.91; 7.0 of GatorTronGPT compared with 6.97 of human) and that physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.


Introduction

Generative large language models (LLMs) such as ChatGPT 1 have surprised the world by answering questions conversationally and generating textual content such as emails, articles, and even computer code, triggering enormous enthusiasm for applying LLMs to healthcare 2,3,4. People are enthusiastic about the potential of LLMs to facilitate the documentation of patient reports (e.g., a progress report) 3,4, improve diagnostic accuracy 5, and assist in various aspects of clinical care 6,7, while at the same time raising concerns about hallucinations and fabrications 7,8, bias and stereotypes 9, and risks to patient privacy and ethics 10. Yet this enthusiasm and these concerns are based on ChatGPT, which is not designed for healthcare use 1. Until now, it has been unclear how this disruptive technology can help medical research and potentially improve the quality of healthcare.

A language model is a statistical distribution used in natural language processing (NLP) to estimate the probability of a sequence of words or of the next word in a sequence. Surprisingly, when this objective is used to train a specific neural network architecture named the transformer, and when the model size is very large, such as billions or hundreds of billions of parameters, important artificial intelligence (AI) abilities emerge. For example, LLMs can learn knowledge from one task and apply it to another task (i.e., transfer learning), learn from very few labeled samples (i.e., few-shot learning), and learn without human-labeled samples (i.e., zero-shot learning) 11,12,13. An LLM pretrained using a decoder-only transformer such as GPT-3 is known as a generative LLM, as it can generate human-like text. The conversational ability of LLMs is achieved using prompt-based text generation 14, the key technology guiding LLMs to generate reasonable answers and contextual content.
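As an illustration of the language-modeling objective itself (not of the transformer architecture GatorTronGPT actually uses), the following toy sketch estimates the next-word distribution by simply counting bigrams in a tiny, made-up corpus.

```python
# Toy bigram language model: estimate P(next word | previous word) by counting.
# Illustrates the "language model" objective itself, not a transformer.
from collections import Counter, defaultdict

corpus = "the patient reports chest pain . the patient denies fever .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(prev: str) -> dict:
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(next_word_distribution("patient"))  # {'reports': 0.5, 'denies': 0.5}
print(next_word_distribution("the"))      # {'patient': 1.0}
```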

This study aims to develop a generative LLM using real-world clinical text and evaluate its utility for medical research and healthcare. We train GatorTronGPT using 82 billion words of de-identified clinical text 15 from University of Florida (UF) Health and 195 billion diverse English words from the Pile 16 dataset. We train GatorTronGPT from scratch using the GPT-3 17 architecture. We formulate biomedical relation extraction and question answering using a unified text generation architecture 18 to evaluate how GatorTronGPT could benefit medical research using 6 benchmark datasets. To examine the utility of text generation in the clinical domain, we apply GatorTronGPT to generate 20 billion words of synthetic clinical text, which are used to train synthetic NLP models using BERT 19 architecture, denoted as GatorTronS (‘S’ stands for synthetic). We compare GatorTronS models with GatorTron 15 , a clinical NLP model trained using real-world 90 billion words of text, to test the hypothesis that generative clinical LLMs can be used to generate synthetic clinical text for medical research. To test if LLMs could be used in healthcare, two internal medicine subspecialists from endocrinology (NSO) and cardiology (MMA) manually evaluate clinical paragraphs written by GatorTronGPT compared with real-world paragraphs written by UF Health physicians. Figure 1 shows an overview of the study design. This study provides valuable insights into the opportunities and challenges of LLMs for medical research and healthcare.

figure 1

a Train GatorTronGPT from scratch using GPT-3 architecture with up to 20 billion parameters. b Solve biomedical relation extraction and question answering using a unified P-tuning base text generation architecture. c Apply GatorTronGPT to generate 20 billion words of synthetic clinical text, which was used to train synthetic natural language processing model, GatorTronS. d Turing evaluation of 30 paragraphs of text written by GatorTronGPT mixed with 30 real-world paragraphs written by UF Health physicians. TrM transformer unit; B billion.

Training of GatorTronGPT from scratch

Training the 5 billion-parameter GatorTronGPT model took approximately 6 days and the 20 billion-parameter model about 20 days on 560 A100 80 GB GPUs from 70 NVIDIA DGX nodes using the NVIDIA SuperPOD reference cluster architecture. Figure 2 shows the training and validation loss. Table 1 compares GatorTronGPT with GatorTronS and GatorTron on model architecture, training dataset, parameter size, and whether the model is a generative LLM, to help differentiate the three LLMs.

figure 2

a Training loss. b Validation loss.

GatorTronGPT for biomedical natural language processing

Table 2a compares GatorTronGPT with four existing biomedical transformer models on end-to-end relation extraction of drug-drug interaction, chemical-disease relation, and drug-target interaction. GatorTronGPT outperformed all existing models, with the best F1-score of 0.500, 0.494, and 0.419, respectively. GatorTronGPT improved state-of-the-art by 3–10% compared with the second-best BioGPT 18 model. We consistently observed performance improvement when scaling up the size of GatorTronGPT. Table 2b compares GatorTronGPT with six existing biomedical transformers using three benchmark datasets for biomedical question answering. The GatorTronGPT model with 20 billion parameters tied with BioLinkBERT on the MedQA dataset achieving the best performance of 0.451. GatorTronGPT also achieved the second-best performance of 0.776 for the PubMedQA dataset compared with the best performance of 0.782 from BioGPT. The performance of GatorTronGPT on the MedMCQA dataset was lower than a much larger LLM, Galactica, with 120 billion parameters.

Evaluation of GatorTronS

Tables 3 and 4 compare GatorTronS trained with different sizes of synthetic clinical text with ClinicalBERT and GatorTron 15. For clinical concept extraction, GatorTronS trained using 20 billion and 5 billion words of synthetic clinical text achieved the best F1-scores on the three benchmark datasets. GatorTronS outperformed the original GatorTron model by >1% F1-score on all three benchmark datasets. For medical relation extraction, the GatorTronS trained using 10 billion words of synthetic clinical text achieved the best F1-score of 0.962 on the 2018 n2c2 challenge benchmark dataset, which is comparable with the original GatorTron model (0.960). For semantic textual similarity and natural language inference, GatorTronS achieved the best evaluation scores, outperforming the original GatorTron by >1%. For question answering using the emrQA dataset, GatorTronS outperformed the original GatorTron model trained using real-world clinical text by >1%. The comparison results show that a minimum of 5 billion words of synthetic clinical text is required to train a synthetic model with performance comparable to GatorTron, a transformer trained using 82 billion words of real-world UF Health clinical text. Figure 3 compares GatorTronS models trained with different sizes of synthetic text using line plots. We observed consistent performance improvements on all eight datasets when increasing the size of synthetic text from 1 billion to 5 billion words. The improvements are not consistent when increasing the data size from 5 billion up to 20 billion words.

figure 3

B billion words of text.

Physicians’ Turing test

The Turing test results show that, on average, less than half (49.2%) of the clinical notes were identified correctly, including 36.7% of the synthetic notes and 61.7% of the human notes (Table 5a). Among the 30 synthetic notes written by GatorTronGPT, 9 (30.0%) and 13 (43.4%) were correctly labeled as ‘AI’ by the two physicians, respectively. Among the 30 human notes written by physicians, 17 (56.7%) and 20 (66.7%) were correctly labeled as ‘Human’, respectively. Considering that GatorTronGPT was judged to be human in more than 30% of instances (the criterion from the Turing test) 20, GatorTronGPT passed the Turing test (p < 0.001). Table 5b summarizes the means and standard deviations of linguistic readability and of clinical relevance and consistency. Statistical tests show that there is no significant difference between notes written by GatorTronGPT and human physicians in either linguistic readability (p = 0.22) or clinical relevance and consistency (p = 0.91). Table 5c shows two examples written by GatorTronGPT; more examples are provided in Supplementary Table S1. Percent agreement and interrater reliability were found to be good or excellent, as summarized in Supplementary Tables S2 and S3.

This study develops a generative clinical LLM, GatorTronGPT, using the GPT-3 architecture 13 with 277 billion words of mixed clinical and English text. GatorTronGPT achieves state-of-the-art performance on four out of six biomedical NLP benchmark datasets. Our previous GatorTron 15 model, trained using an encoder-only BERT architecture with 8.9 billion parameters, also achieved state-of-the-art performance on six clinical NLP benchmark datasets. The two studies demonstrate the benefit of LLMs for biomedical and clinical research. GatorTronGPT can generate synthetic clinical text for developing synthetic clinical NLP models (i.e., GatorTronS), which achieve better or comparable performance to GatorTron, an NLP model trained using real-world clinical text, demonstrating the utility of synthetic clinical text generation. The physicians’ Turing test shows that GatorTronGPT can generate clinical text with linguistic readability and clinical relevance comparable to real-world clinical notes. This study provides valuable insights into the opportunities and challenges of generative LLMs for medical research and healthcare.

We discover an important utility of synthetic clinical text generation. To date, there has been a gap in accessing and sharing large-scale clinical text and clinical LLMs due to the sensitive nature of clinical text and the fact that automatic de-identification systems cannot remove 100% of protected health information (PHI). Not surprisingly, a recent study 21 on clinical foundation models points out that most LLMs in the medical domain are trained using “small, narrowly-scoped” clinical datasets with limited note types (e.g., MIMIC 22) or “broad, public” biomedical literature (e.g., PubMed) that offers limited insight into healthcare. Generative LLMs can provide large-scale synthetic clinical text to fill this gap. We compare the synthetic text with real-world clinical text to examine why GatorTronS, a transformer model trained using a much smaller (e.g., 5 billion words) synthetic clinical text corpus, could achieve better or comparable performance to GatorTron 15, a transformer model trained using a much larger (90 billion words) real-world clinical text corpus. We identify potential reasons including (1) real-world clinical text has significant redundancies, which is a well-known characteristic of clinical narratives 23, and (2) GatorTronGPT generates more diverse synthetic clinical text. We randomly sample a subset of real-world clinical notes with a number of words comparable to the synthetic text (i.e., 20 billion words) to compare the coverage of unigrams (i.e., individual tokens) and bigrams (i.e., two consecutive tokens). The comparison results show that the synthetic text generated by GatorTronGPT contains remarkably more diverse unigrams (40.43 million vs. 4.82 million, reported as “synthetic” vs. “real notes”) and bigrams (416.35 million vs. 62.51 million); the synthetic text also has higher entropy than the real-world clinical text (4.97 vs. 4.95). Supplementary Table S4 provides detailed comparison results and examples. A previous study 24 reported that by augmenting real-world clinical training data with additional human-annotated synthetic text generated by a smaller generative LLM, GPT-2, NLP models can achieve better performance. Our study further demonstrates that, without additional human annotation or augmentation of training data, a larger clinical GPT-3 model can generate synthetic clinical text to train synthetic NLP models that outperform NLP models trained using real-world clinical text. Text generation using generative LLMs could mitigate the risk of exposing patient privacy and improve access to and sharing of large-scale clinical text and NLP models, thus enabling the next generation of clinical text analytics using synthetic clinical text.

Generative LLMs aspire to become a “Unified Field Theory” that unifies most fundamental NLP tasks within a single model architecture. It may still be early to judge whether LLMs will become the one and only foundation model 12 for NLP, but it looks like we are closer than ever. Generative LLMs have the potential to impact medical research in many aspects. In addition to the performance improvements demonstrated in this study, generative LLMs provide a unified solution using prompt-based text generation 25, which leads to a new paradigm of “one model for all NLP tasks” and has better few-shot learning and transfer learning ability to deliver portable clinical NLP systems 13,26. The evaluation of GatorTronGPT shows that clinical LLMs can be used to generate clinically relevant content, with the potential to help document 3 and code patient information in EHR systems, thus reducing the onerous documentation burden for clinicians 27,28,29. The prompt-based text generation of LLMs can potentially help compose treatment plans by integrating instructions from clinical guidelines and patients’ historical records in EHRs. The conversational ability of LLMs provides opportunities to develop intelligent electronic health record (EHR) systems with human-like communication 2, in which healthcare providers, patients, and other stakeholders can interact. Industry stakeholders such as Epic and Nuance have been reported to be exploring these potentials 30,31.

Our Turing test focuses on (1) linguistic readability; (2) clinical relevance; and (3) physicians’ ability to differentiate synthetic and human notes. The statistical tests show that there are no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human). Further, physicians cannot differentiate them (p < 0.001), suggesting the potential utility of GatorTronGPT for text generation in healthcare. The two physician evaluators find that the texts written by GatorTronGPT generally lack clinical logic, indicating that more research and development are needed to make this technology mature for healthcare. Our Turing test focuses on statistical differences, not utility in real-world clinical practice, which should be examined in future studies when this technology matures. A recent study 32 examined an LLM developed at New York University, NYUTron, and our previously developed GatorTron 15 for prediction of readmission, in-hospital mortality, comorbidity, length of stay, and insurance denial, demonstrating the potential utility of LLMs in healthcare.

While LLMs are promising for healthcare applications, much more research and development is needed to achieve this goal. Current general-purpose LLMs are designed for conversation as chatbots outside of healthcare. Therefore, the current use of ChatGPT for healthcare is more like a typical case of intended use versus actual use as described in medical device regulation 33. Domain-specific LLMs are needed for clinical applications. Due to noisy data and the probabilistic nature of text generation, LLMs are prone to confabulation or hallucination, which is dangerous for healthcare. In this study, we adopted robust decoding strategies (e.g., nucleus sampling) to alleviate potential off-target text generation. Researchers are exploring solutions such as reinforcement learning from human feedback (RLHF) 34 to reduce hallucinations, but hallucination remains an unsolved limitation of current LLMs. Future studies should explore strategies to keep hallucinations at a minimal level to ensure the safety of using LLMs in healthcare. The security and risks of LLMs must be carefully examined in healthcare settings. We applied a de-identification system to remove PHI from UF Health notes before training GatorTronGPT; future studies should carefully examine whether GatorTronGPT risks reproducing PHI and quantify the potential risk of re-identifying real-world patients. Synthetic data, though generated by AI models, may still mirror the characteristics of its source material (e.g., UF Health clinical notes). For example, ChatGPT has been reported to accidentally leak sensitive business data from a private company 35. In addition, people are increasingly aware of the potential bias of AI applications in healthcare. Bias inherited from the original training data may be imitated and sometimes even amplified by AI models, which may cause systematic bias against specific patient groups 36. Future studies should explore strategies to mitigate potential bias and ensure the fairness of LLM applications. Like any medical AI application, this disruptive new technology must be carefully examined to guide its application and make it an “approved” AI-enabled medical tool 37.

We developed GatorTronGPT using 82 billion words of de-identified clinical text 15 from the University of Florida (UF) Health and 195 billion diverse English words from the Pile 16 dataset. We trained GatorTronGPT from scratch using the GPT-3 17 architecture (used by ChatGPT). We formulated biomedical relation extraction and question answering using a unified text generation architecture 18 and evaluated GatorTronGPT using 6 biomedical benchmark datasets. To examine the utility of text generation, we applied GatorTronGPT to generate 20 billion words of synthetic clinical text, which were used to train synthetic NLP models, denoted as GatorTronS (“S” stands for synthetic). We compared GatorTronS with GatorTron 15 , a clinical NLP model trained with the same architecture but using real-world clinical text. To test if LLMs could generate text for healthcare settings, two internal medicine subspecialists from endocrinology (NSO) and cardiology (MMA) manually evaluated 60 clinical paragraphs including 30 paragraphs written by GatorTronGPT randomly mixed with 30 real-world paragraphs written by UF Health physicians. Figure 1 shows an overview of the study design.

Data source

This study used 82 billion words of clinical narratives from the UF Health Integrated Data Repository (IDR) and 195 billion diverse English words from the Pile 16 corpus. This study was approved by the University of Florida Institutional Review Board under IRB202102223; the need for patient consent was waived. At UF Health, we collected approximately 290 million clinical notes from 2011–2021 from over 126 departments, approximately 2 million patients, and 50 million encounters from inpatient, outpatient, and emergency settings 15. We merged the UF Health clinical corpus with the Pile 16 dataset to generate a large corpus with 277 billion words. We performed minimal preprocessing on the Pile dataset and applied a de-identification system to remove the 18 PHI categories defined in the Health Insurance Portability and Accountability Act (HIPAA) from the UF Health notes.

Preprocessing and de-identification of clinical text

Following our previous study 15, we performed a minimal preprocessing procedure. First, we removed all empty notes and notes with fewer than 10 characters, followed by a deduplication at the note level using an exact string match strategy. Then, we leveraged an internally developed preprocessing tool ( https://github.com/uf-hobi-informatics-lab/NLPreprocessing ) to normalize the clinical text. The normalization consists of three steps: (1) unifying all text into UTF-8 encoding, removing illegal UTF-8 strings, and removing HTML/XML tags if any; (2) sentence boundary detection, where we normalize the clinical notes into sentences; and (3) word tokenization, where we used heuristic rules to separate punctuation and special symbols (e.g., slash, parenthesis) from words (e.g., converting “(HbA1c)” to “( HbA1c )” and “excision/chemo” to “excision / chemo”) and to fix concatenations (e.g., missing white space, such as converting “CancerScreening” to “Cancer Screening”). After preprocessing, we performed another deduplication at the sentence level using the exact string match strategy.
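As a schematic illustration of the sentence-level deduplication step described above (the real pipeline operated at a far larger scale), consider the following sketch.

```python
# Schematic sketch of exact-string-match deduplication at the sentence level.
def deduplicate_sentences(sentences):
    seen = set()
    unique = []
    for s in sentences:
        key = s.strip()
        if key and key not in seen:   # also drops empty strings
            seen.add(key)
            unique.append(s)
    return unique

notes = [
    "Patient denies chest pain.",
    "Patient denies chest pain.",   # exact duplicate, removed
    "Follow up in two weeks.",
]
print(deduplicate_sentences(notes))
```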

To de-identify the UF Health clinical notes, we adopted an internally developed de-identification system consisting of an LSTM-CRFs based model and a postprocessing module that replaces system-detected protected health information (PHI) entities with dummy strings (e.g., replacing patients’ names with [**NAME**]). We adopted the safe-harbor method to identify the 18 PHI categories defined in the Health Insurance Portability and Accountability Act (HIPAA). The LSTM-CRFs model for PHI detection was trained using the publicly available 2014 i2b2 de-identification datasets and an internal dataset with over 1100 clinical notes from UF Health annotated for PHI removal (named the UF-deid-dataset; not publicly available due to IRB restrictions). After three years of continuous customization and improvement at UF Health, the current model achieves an overall F1 score of 97.98% (precision of 96.27% and recall of 99.76%) on the UF-deid-dataset test set, which means our de-identification system can remove 99.76% of all PHI. Detailed information about the development of the de-identification system can be found in our previous paper 38.

Train GatorTronGPT from scratch

We trained GatorTronGPT with 5 billion parameters and 20 billion parameters and determined the number of layers, hidden sizes, and number of attention heads according to the guidelines for optimal depth-to-width parameter allocation proposed by ref. 39, as well as our previous experience in developing GatorTron 15. The 5 billion model has 24 layers, a hidden size of 4096, and 32 attention heads; the 20 billion model has 44 layers, a hidden size of 6144, and 48 attention heads. We trained the 5 billion model using 2-way tensor model parallelism with a batch size of 1120 and a learning rate of 1.200E-05. We trained the 20 billion model using 8-way tensor model parallelism with a batch size of 560 and a learning rate of 1.000E-05. We adopted a dropout rate of 0.1. We inherited the GPT-3 architecture implemented in Megatron-LM 40 and trained GatorTronGPT models from scratch with the default GPT-3 loss function 13. We used a total of 560 NVIDIA DGX A100 GPUs from 70 SuperPOD nodes on UF’s HiPerGator-AI cluster to train GatorTronGPT, leveraging both data-level and model-level parallelism implemented by the Megatron-LM package 40 (see https://github.com/NVIDIA/Megatron-LM for more details). We monitored training progress using training loss and validation loss on 3% of the data and stopped training when there was no further improvement.

GatorTronGPT for biomedical relation extraction and question answering

End-to-end relation extraction is an NLP task to identify triplets < concept1, concept2, relation > from biomedical text. Question answering is to identify the answer for a given question and context. Following previous studies 18,41, we approached the two tasks using a unified prompt-based text generation architecture. Specifically, we adopted a fixed-LLM prompt-tuning strategy 42 to attach a continuous embedding (i.e., virtual tokens) to the input sequence [ virtual tokens; x; y ] as a soft prompt to control the text generation; the LLM itself was not changed during training. We provide details in the Supplement.

End-to-end biomedical relation extraction

We compared the two GatorTronGPT models with four existing transformer models including GPT-2 43 , REBEL, REBEL-pt 25 , and BioGPT 18 on three biomedical tasks for end-to-end relation extraction using three benchmark datasets including drug-drug interaction 44 (DDI), BioCreative V chemical-disease relation 45 (BC5CDR), and drug-target interaction 46 (KD-DTI).

GPT-2 was trained using text data from 8 million webpages and has 1.5 billion parameters, a scale-up of the first-generation GPT model 45. The GPT model outperformed previous transformer models on 9 out of 12 NLP tasks, whereas the GPT-2 model further demonstrated text generation ability, which laid the foundation for complex NLP tasks such as machine reading comprehension and question answering.

REBEL and REBEL-pt

REBEL is a transformer model based on the BART architecture designed for end-to-end relation extraction using sequence-to-sequence modeling, which outperformed previous relation extraction models based on classifications. REBEL-pt is an enhanced version of REBEL by further fine-tuning it using the triplets derived using Wikipedia hyperlinks.

BioGPT is a domain-specific generative transformer-based LLM developed using the GPT-2 architecture and PubMed biomedical literature; it achieved good performance on biomedical NLP tasks including relation extraction and question answering.

Following the previous study 18, we formulated both biomedical relation extraction and question answering as prompt-based text generation and applied prompt-tuning (p-tuning) algorithms. We concatenated learnable soft prompts (also called virtual prompt embeddings) with the word embeddings from the context (i.e., the input sentence). The sample sequence is constructed as [ prompt, context, relation ], where the prompt is generated using an LSTM model and the relation is the gold-standard label including the head entity, tail entity, and their relation type. During inference, the context and the prompt are used as the input for our GatorTronGPT model to condition on and generate the relations. We converted the original relation triplets into a sequence representation. For example, there is an “agonist” relation between a drug, “Igmesine”, and a target, “Opioid receptor sigma 1”, which was converted as: “the relation between [ Igmesine ] and [ Opioid receptor sigma 1 ] is [ agonist ]”. Thus, relation extraction can be solved as text generation. During inference, we converted the generated text back into triplets for evaluation. We fine-tuned and evaluated GatorTronGPT on the end-to-end relation extraction task across four biomedical datasets: BC5CDR (chemical–disease relation extraction), KD-DTI (drug–target interaction extraction), DDI (drug–drug interaction extraction), and 2018 n2c2 (drug–ADE relation extraction). Precision, recall, and F1 score were used for evaluation.
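The triplet linearization described above (e.g., “the relation between [ Igmesine ] and [ Opioid receptor sigma 1 ] is [ agonist ]”) can be sketched as two small helper functions, one for building the target sequence and one for parsing generated text back into a triplet for evaluation; the exact template used in the study may differ, so treat this as an approximation of the described idea.

```python
# Sketch of linearizing a relation triplet into text (for generation) and
# parsing generated text back into a triplet (for evaluation), following the
# template described above; the exact formatting used in the paper may differ.
import re

def triplet_to_text(head: str, tail: str, relation: str) -> str:
    return f"the relation between [ {head} ] and [ {tail} ] is [ {relation} ]"

def text_to_triplet(text: str):
    match = re.search(
        r"the relation between \[ (.+?) \] and \[ (.+?) \] is \[ (.+?) \]", text
    )
    return match.groups() if match else None

linearized = triplet_to_text("Igmesine", "Opioid receptor sigma 1", "agonist")
print(linearized)
print(text_to_triplet(linearized))  # ('Igmesine', 'Opioid receptor sigma 1', 'agonist')
```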

Biomedical question answering

We compared GatorTronGPT with six existing transformer models using three widely used benchmark datasets: PubMedQA 47, a biomedical question answering dataset collected from PubMed abstracts, which requires answering questions with ‘yes/no/maybe’; MedMCQA 48, a large-scale multiple-choice question answering dataset designed to address real-world medical entrance exam questions covering 2400 healthcare topics and 21 medical subjects; and MedQA-USMLE 49, a multiple-choice dataset collected from professional medical board exams. These datasets have been widely used to evaluate LLMs 18,47,48,49.

Given a question, a context, and candidate answers, we concatenated the context and the candidate answers into a source sequence and composed the target sequence as: “the answer to the question given possible options is:”, “answer”: “C”. We adopted soft prompts instead of hard prompts (manually designed clear-text phrases) in p-tuning. Specifically, we used a randomly initialized continuous embedding as the soft prompt, which was fine-tuned during training. For the PubMedQA dataset, we explored the provided artificially generated text data. Specifically, we automatically labeled the generated text using our p-tuning model developed on the training set and experimented with feeding different proportions of auto-labeled data back into training. The best performance was achieved by using 5% of the auto-labeled artificially generated text data. For p-tuning, we used the implementation in NVIDIA NeMo 50, which is optimized for LLMs. We used the following parameters in our p-tuning: a global batch size of 32, 15 virtual tokens for p-tuning, an encoder MLP with a hidden size of 2048, a max sequence length of 4096 for PubMedQA (long abstracts) and 2048 for MedMCQA and MedQA-USMLE, and a fused Adam optimizer with a learning rate of 1e-4, a weight decay of 0.01, betas of 0.9 and 0.98, and a cosine annealing scheduler monitoring validation loss with a 50-step warm-up. For example, below is a prompt we used for MedQA-USMLE.

{"taskname": "usmle-qa", "prompt": "QUESTION: A 23-year-old man comes to the physician for evaluation of decreased hearing, dizziness, and ringing in his right ear for the past 6 months. Physical examination shows multiple soft, yellow plaques and papules on his arms, chest, and back. There is sensorineural hearing loss and weakness of facial muscles bilaterally. His gait is unsteady. An MRI of the brain shows a 3-cm mass near the right internal auditory meatus and a 2-cm mass at the left cerebellopontine angle. The abnormal cells in these masses are most likely derived from which of the following embryological structures?\nMULTIPLE CHOICES: (A) Neural tube\n(B) Surface ectoderm\n(C) Neural crest\n(D) Notochord\nTARGET: the answer to the question given possible options is: ", "answer": "C"}
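As a rough illustration, a prompt record in the format shown above could be assembled with a small helper such as the following; the helper function and the example question are hypothetical, and only the field layout mirrors the record above.

```python
# Hypothetical helper that builds a prompt record in the format shown above.
def build_usmle_prompt(question: str, choices: dict, answer: str) -> dict:
    options = "\n".join(f"({key}) {text}" for key, text in sorted(choices.items()))
    return {
        "taskname": "usmle-qa",
        "prompt": (
            f"QUESTION: {question}\n"
            f"MULTIPLE CHOICES: {options}\n"
            "TARGET: the answer to the question given possible options is: "
        ),
        "answer": answer,
    }

record = build_usmle_prompt(
    "Which vitamin deficiency causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
    "C",
)
print(record["prompt"])
```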

GatorTronGPT for synthetic clinical text generation

We sought to test the hypothesis that LLMs can generate synthetic clinical text to train synthetic NLP models useful for medical research. We applied GatorTronGPT to generate synthetic clinical text according to a set of seeds without any fine-tuning, which is a typical zero-shot learning setting. Then, using the generated synthetic clinical text, we trained synthetic transformer-based NLP models using our previous BERT-based GatorTron architecture 15 , denoted as GatorTronS (‘S’ stands for synthetic). We trained GatorTronS models using different sizes of synthetic clinical text and compared them with the original GatorTron model trained using UF Health clinical text. To make it comparable, we trained GatorTronS using the same architecture and number of parameters (i.e., 345 million) as GatorTron 15 . We provide detailed information in the Supplement.

Synthetic clinical text generation

Following previous studies 51, we approached synthetic clinical text generation using an iterative sampling algorithm and applied top-p (i.e., nucleus) sampling and temperature sampling to balance the diversity and quality of text generation 51. We approached synthetic clinical text generation as an open-ended text-to-text generation task 52,53, where the generated clinical text is restricted by the context (e.g., the prompts). Specifically, given a sequence of \(m\) tokens \(X_{pre}=x_{1}x_{2}\ldots x_{m}\) as the input context, the task is to generate the next \(n\) continuation tokens \(X_{cont}=x_{m+1}x_{m+2}\ldots x_{m+n}\) until reaching the max length of 512 tokens. We generate text by iteratively sampling from the pre-trained language model GatorTronGPT one token at a time, conditioning on the preceding context:

\(P(X_{cont}\mid X_{pre})=\prod_{i=m+1}^{m+n}P(x_{i}\mid x_{1}\ldots x_{i-1}),\)

where \(P(x_{i}\mid x_{1}\ldots x_{i-1})\) is the next-token distribution. We adopt top-p (nucleus) sampling 54 during sampling to select words whose cumulative probability exceeds a predefined threshold p:

\(\sum_{x\in V^{(p)}}P(x\mid x_{1}\ldots x_{i-1})\ge p,\)

where \(V^{(p)}\) is the top-p vocabulary used to sample the next word. This approach dynamically adapts the number of words considered at each step based on their probabilities, balancing diversity and coherence of the generated text.

We set the parameter of top-p sampling at 0.9 and the parameter for temperature sampling at 1.2 according to our empirical assessment. We sampled the beginning 15 tokens from all sections of the de-identified notes from the MIMIC III database 22 and generated approximately 8 million prompts. We also tried several random seeds in GatorTronGPT to generate multiple documents from one prompt. We controlled GatorTronGPT to generate a maximum length of 512 tokens.
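The nucleus (top-p) sampling procedure described above can be written out in a few lines of NumPy. The sketch below follows the standard algorithm with the reported parameters (top-p of 0.9, temperature of 1.2) but is not the authors' implementation, and the example logits are invented.

```python
# Sketch of top-p (nucleus) sampling with temperature, matching the general
# procedure described above (not the authors' implementation).
import numpy as np

def sample_next_token(logits: np.ndarray, top_p: float = 0.9, temperature: float = 1.2) -> int:
    # Temperature scaling, then a numerically stable softmax over the vocabulary.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:cutoff]

    # Renormalize over the nucleus and sample one token ID.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

# Example: a toy 5-token vocabulary with made-up logits.
print(sample_next_token(np.array([2.0, 1.5, 0.3, 0.1, -1.0])))
```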

Synthetic NLP model development

We applied GatorTronGPT to generate different sizes of synthetic clinical text including 1 billion, 5 billion, 10 billion, and 20 billion words of clinical text and developed corresponding synthetic NLP models, denoted as GatorTronS. Following our previous study 15 , we trained GatorTronS using the same architecture of GatorTron – a BERT architecture with 345 million parameters.

Comparison with existing transformer models

We compared GatorTronS models with ClinicalBERT 55 —an existing clinical transformer model and GatorTron 15 , the current largest clinical transformer model trained using >90 billion words of text, using 5 clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, and question answering.

Turing test of text generation for healthcare settings

We randomly sampled 30 narrative sections from real-world UF Health clinical notes, including “past medical history”, “history of present illness”, “assessment/plan”, and “chief complaint”. For each of the 30 sections, we extracted the beginning 15 tokens as a seed for GatorTronGPT to generate a synthetic paragraph of up to 512 tokens. We cut the 30 real-world clinical sections to 512 tokens, removed all formatting information, and randomly mixed them with the 30 synthetic sections written by GatorTronGPT. Two UF Health physicians (NSO, MMA) manually reviewed the 60 paragraphs to evaluate: (1) linguistic readability on a 1 (worst) to 9 (best) scale, (2) clinical relevance and consistency on a 1 to 9 scale, and (3) whether each paragraph was written by a human physician or by GatorTronGPT. Percent agreement and Gwet’s AC1 were calculated to evaluate interrater reliability 56.
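For illustration, percent agreement and the common two-rater, two-category form of Gwet's AC1 can be computed as in the sketch below; the ratings shown are invented for the example, not the study's data.

```python
# Toy sketch: percent agreement and Gwet's AC1 for two raters, two categories.
# Labels are invented; the study's real ratings are summarized in Table 5.
def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def gwet_ac1(r1, r2, category="AI"):
    pa = percent_agreement(r1, r2)
    # q: average proportion of notes either rater assigned to `category`.
    q = (r1.count(category) + r2.count(category)) / (2 * len(r1))
    pe = 2 * q * (1 - q)          # chance agreement under Gwet's model
    return (pa - pe) / (1 - pe)

rater1 = ["AI", "AI", "Human", "Human", "AI", "Human"]
rater2 = ["AI", "Human", "Human", "Human", "AI", "Human"]
print(percent_agreement(rater1, rater2))  # 0.833...
print(gwet_ac1(rater1, rater2))
```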

Data availability

The benchmark datasets that support the findings of this study are available from the official websites of natural language processing challenges with Data Use Agreements. More specifically: 1. i2b2 2010, 2012 datasets and n2c2 2018, 2019 datasets: https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ . 2. MedNLI dataset: https://physionet.org/content/mednli/1.0.0/ . 3. emrQA dataset: https://github.com/panushri25/emrQA#download-dataset . 4. The Pile dataset: https://pile.eleuther.ai/ . 5. UF Health IDR clinical notes are not open to the public due to patient privacy information. The GatorTronS, and GatorTron models are available as open-source resources. The synthetic clinical transformer model, GatorTronS, is available from: https://huggingface.co/UFNLP/gatortronS . The GatorTron model trained using real-world clinical text is available: https://huggingface.co/UFNLP/gatortron-base .

Code availability

The computer codes to train GatorTronGPT models are available from: https://github.com/NVIDIA/Megatron-LM/blob/main/pretrain_gpt.py . The scripts used for data preprocessing, vocabulary training and other utilities are available from: https://github.com/uf-hobi-informatics-lab/GatorTronGPT . The computer codes to train GatorTronS models are available from: https://github.com/NVIDIA/Megatron-LM and https://github.com/NVIDIA/NeMo . The computer codes for preprocessing of text data are available from: https://github.com/uf-hobi-informatics-lab/NLPreprocessing .

Introducing ChatGPT. https://openai.com/blog/chatgpt .

Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388 , 1233–1239 (2023).

Patel, S. B. & Lam, K. ChatGPT: the future of discharge summaries? Lancet Digit Health 5 , e107–e108 (2023).

Ali, S. R., Dobbs, T. D., Hutchings, H. A. & Whitaker, I. S. Using ChatGPT to write patient clinic letters. Lancet Digit Health 5 , e179–e181 (2023).

Hirosawa, T. et al. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int. J. Environ. Res. Public Health 20 , 3378 (2023).

Grünebaum, A., Chervenak, J., Pollet, S. L., Katz, A. & Chervenak, F. A. The Exciting Potential for ChatGPT in Obstetrics and Gynecology. Am. J. Obstet. Gynecol . https://doi.org/10.1016/j.ajog.2023.03.009 (2023).

Cascella, M., Montomoli, J., Bellini, V. & Bignami, E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47 , 33 (2023).

Azamfirei, R., Kudchadkar, S. R. & Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 27 , 120 (2023).

Straw, I. & Callison-Burch, C. Artificial Intelligence in mental health and the biases of language based models. PLoS One 15 , e0240376 (2020).

Li, H. et al. Ethics of large language models in medicine and medical research. Lancet Digital Health https://doi.org/10.1016/S2589-7500(23)00083-3 (2023).

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. Adv. Neural Inf. Process. Syst . 35 , 22199–213 (2022).

Bommasani, R. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).

Brown, T., Mann, B. & Ryder, N. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).

Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55 , 1–35 (2023).

Yang, X. et al. A large language model for electronic health records. NPJ Digit. Med. 5 , 194 (2022).

Gao, L. et al. The Pile: an 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 (2020).

Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30 , 681–694 (2020).

Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform . 23 , bbac409 (2022).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (Association for Computational Linguistics, 2019). https://doi.org/10.18653/v1/N19-1423 .

Mohammed, M., Khan, M. B. & Bashier, E. B. M. Machine Learning (CRC Press, 2016). https://doi.org/10.1201/9781315371658 .

Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 6 , 135 (2023).

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).

Searle, T., Ibrahim, Z., Teo, J. & Dobson, R. Estimating redundancy in clinical text. J. Biomed. Inform. 124 , 103938 (2021).

Li, J. et al. Are synthetic clinical notes useful for real natural language processing tasks: a case study on clinical entity recognition. J. Am. Med. Inform. Assoc. 28 , 2193–2201 (2021).

Huguet Cabot, P.-L. & Navigli, R. REBEL: relation extraction by end-to-end language generation. in Findings of the Association for Computational Linguistics: EMNLP 2021 2370–2381 (Association for Computational Linguistics, 2021). https://doi.org/10.18653/v1/2021.findings-emnlp.204 .

Peng, C. et al. Clinical concept and relation extraction using prompt-based machine reading comprehension. J. Am. Med. Inform. Assoc . https://doi.org/10.1093/jamia/ocad107 (2023).

Gaffney, A. et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern. Med. 182 , 564–566 (2022).

Downing, N. L., Bates, D. W. & Longhurst, C. A. Physician burnout in the electronic health record era: are we ignoring the real cause? Ann. Intern. Med. 169 , 50 (2018).

Kroth, P. J. et al. Association of electronic health record design and use factors with clinician stress and burnout. JAMA Netw. Open 2 , e199609 (2019).

Diaz, N. Epic to use Microsoft’s GPT-4 in EHRs. https://www.beckershospitalreview.com/ehrs/epic-to-use-microsofts-open-ai-in-ehrs.html .

Trang, B. ‘We’re getting much more aggressive’: Microsoft’s Nuance adds GPT-4 AI to its medical note-taking tool. https://www.statnews.com/2023/03/20/microsoft-nuance-gpt4-dax-chatgpt/ .

Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619 , 357–362 (2023).

Kleesiek, J., Wu, Y., Stiglic, G., Egger, J. & Bian, J. An opinion on ChatGPT in health care-written by humans only. J. Nucl. Med . https://doi.org/10.2967/jnumed.123.265687 (2023).

Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv [cs.CL] (2022).

Ray, S. Samsung bans ChatGPT among employees after sensitive code leak. Forbes Magazine (2023).

Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356 , 183–186 (2017).

Center for Devices & Radiological Health. Artificial Intelligence and Machine Learning in Software as a Medical Device. U.S. Food and Drug Administration https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device .

Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19 , 232 (2019).

Levine, Y., Wies, N., Sharir, O., Bata, H. & Shashua, A. The depth-to-width interplay in self-attention. arXiv [cs.LG] (2020).

Shoeybi, M. et al. Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv [cs.CL] (2019).

Li, X. L. & Liang, P. Prefix-tuning: optimizing continuous prompts for generation. in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) 4582–4597 (Association for Computational Linguistics, 2021). https://doi.org/10.18653/v1/2021.acl-long.353 .

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys. 59 , 1–35 (2023).

Radford A., Wu J., Child R., Luan D. & Amodei D. Language models are unsupervised multitask learners. OpenAI, 1 , (2019)

Herrero-Zazo, M., Segura-Bedmar, I., Martínez, P. & Declerck, T. The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. J. Biomed. Inform. 46 , 914–920 (2013).

Li, J. et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxf.) 2016 , baw068 (2016).

Hou, Y. et al. Discovering drug–target interaction knowledge from biomedical literature. Bioinformatics 38 , 5100–5107 (2022).

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Association for Computational Linguistics, 2019). https://doi.org/10.18653/v1/d19-1259 .

Singhal, K. et al. Large language models encode clinical knowledge. arXiv [cs.CL] (2022).

Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. NATO Adv. Sci. Inst. E Appl. Sci. 11 , 6421 (2021).

NeMo: a toolkit for conversational AI. (NVIDIA GitHub).

Holtzman A., Buys J., Forbes M. & Choi Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019).

Clark, E., Ji, Y. & Smith, N. A. Neural text generation in stories using entity representations as context. in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) 2250–2260 (Association for Computational Linguistics, 2018). https://doi.org/10.18653/v1/N18-1204 .

Celikyilmaz, A., Clark, E. & Gao, J. Evaluation of text generation: a survey. arXiv preprint arXiv:2006.14799 (2020).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019).

Huang, K., Altosaar, J. & Ranganath, R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019).

Wongpakaran, N., Wongpakaran, T., Wedding, D. & Gwet, K. L. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med. Res. Methodol. 13 , 61 (2013).

Acknowledgements

This study was partially supported by a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-2018C3-14754), a grant from the National Cancer Institute, 1R01CA246418, grants from the National Institute on Aging, NIA R56AG069880 and 1R01AG080624, and the Cancer Informatics and eHealth core jointly supported by the UF Health Cancer Center and the UF Clinical and Translational Science Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions. We would like to thank the UF Research Computing team, led by Dr. Erik Deumens, for providing computing power through UF HiperGator-AI cluster.

Author information

Authors and affiliations.

Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA

Cheng Peng, Xi Yang, Aokun Chen, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian & Yonghui Wu

Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA

Xi Yang, Aokun Chen, Yi Guo, Jiang Bian & Yonghui Wu

NVIDIA, Santa Clara, CA, USA

Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin & Mona G. Flores

Research Computing, University of Florida, Gainesville, FL, USA

Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA

Tanja Magoc & Gloria Lipori

Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA

Gloria Lipori & Duane A. Mitchell

Division of Endocrinology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA

Naykky S. Ospina

Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA

Mustafa M. Ahmed

Contributions

Y.W., J.B., X.Y., N.P., A.B.C., and M.G.F. were responsible for the overall design, development, and evaluation of this study. X.Y., C.P., A.C., and K.E.S. had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Y.G. and Y.W. designed the Turing evaluation of synthetic clinical text generated by GatorTronGPT. N.S.O. and M.M.A. are the two human physicians who performed the Turing test. Y.W., X.Y., K.E.S., C.P., Y.G., and J.B. did the bulk of the writing; W.H., E.A.S., D.A.M., T.M., C.A.H., A.B.C., and G.L. also contributed to the writing and editing of this manuscript. All authors reviewed the manuscript critically for scientific content, and all authors gave final approval of the manuscript for publication.

Corresponding author

Correspondence to Yonghui Wu .

Ethics declarations

Competing interests.

K.E.S., N.P.N., A.B.C., C.M., and M.G.F. are employed by NVIDIA. There are no other competing financial or non-financial interests. The work presented in this study was conducted exclusively within the University of Florida Health.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xi Yang finished this work when he was a full-time employee at the University of Florida.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Peng, C., Yang, X., Chen, A. et al. A study of generative large language model for medical research and healthcare. npj Digit. Med. 6 , 210 (2023). https://doi.org/10.1038/s41746-023-00958-w

Download citation

Received : 05 June 2023

Accepted : 01 November 2023

Published : 16 November 2023

DOI : https://doi.org/10.1038/s41746-023-00958-w



Purdue University Graduate School

VISION-LANGUAGE MODEL FOR ROBOT GRASPING

Robot grasping is emerging as an active area of research in robotics as interest in human-robot interaction grows worldwide, driven by diverse industrial settings in which humans and robots share tasks and workplaces. The field mainly focuses on the quality of generated grasps for object manipulation. Despite these advancements, however, existing methods do not consider human-robot collaboration settings in which robots and humans have to grasp the same objects concurrently. Therefore, generating robot grasps compatible with human preferences for simultaneously holding an object is necessary to ensure a safe and natural collaboration experience. In this work, we propose a novel, deep neural network-based method called CoGrasp that generates human-aware robot grasps by contextualizing human preference models of object grasping into the robot grasp selection process. We validate our approach against existing state-of-the-art robot grasping methods through simulated and real-robot experiments and user studies. In real-robot experiments, our method achieves about an 88% success rate in producing stable grasps that allow humans to interact with and grasp objects simultaneously in a socially compliant manner. Furthermore, our user study with 10 independent participants indicated that our approach enables a safe, natural, and socially aware human-robot co-grasping experience compared to a standard robot grasping technique.

To facilitate the grasping process, we also introduce a vision-language model that works as a pre-processing system before the grasping action takes place. In most settings, robots are equipped with sensors that allow them to capture the scene, on which the vision model performs a detection task to identify the visible objects in the environment. The language model is used to program the robot, making it possible for the robot to understand and execute the required sequence of tasks. Using the process of object detection, we build a set of object queries from the sensor image and allow the user to provide an input query for a task to be performed. We then compute a similarity score between these queries to localize the object that needs attention, and once it is identified, we can use a grasping process for the task at hand.
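As a rough illustration of this query-matching step, the sketch below scores detected-object crops against a user's text query with an off-the-shelf CLIP model and returns the best match. The model name, the stand-in crops, and the function are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch (not the thesis code): score detected-object crops against a
# user's task query with an off-the-shelf CLIP model and return the best match.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_object(crops: list, user_query: str) -> int:
    """Return the index of the detected-object crop most similar to the query."""
    inputs = processor(text=[user_query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)  # one similarity score per crop
    return int(scores.argmax().item())

# Stand-in "crops"; in practice these come from the upstream object detector.
crops = [Image.new("RGB", (224, 224), color) for color in ("red", "gray")]
print(pick_object(crops, "the red mug"))
```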

Degree Type

  • Master of Science in Electrical and Computer Engineering
  • Electrical and Computer Engineering

Campus location

  • West Lafayette

Usage metrics

  • Assistive robots and technology
  • Manufacturing robotics
  • Medical robotics

CC BY 4.0


Title: Long-form Factuality in Large Language Models

Abstract: Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can achieve superhuman rating performance - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at this https URL .
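For illustration, a minimal sketch of the extended F1 described in the abstract follows, assuming the metric combines precision over the facts a response provides with recall measured against a preferred-length hyperparameter K; names and example numbers are illustrative, and the exact definition is given in the paper and its repository.

```python
# Sketch of the extended F1 described in the abstract: precision over the facts a
# response provides, recall against a preferred-length hyperparameter K.
def f1_at_k(num_supported: int, num_facts: int, k: int) -> float:
    if num_facts == 0 or num_supported == 0:
        return 0.0
    precision = num_supported / num_facts   # fraction of provided facts that are supported
    recall = min(num_supported / k, 1.0)    # supported facts relative to preferred length K
    return 2 * precision * recall / (precision + recall)

# Example: 45 of 50 extracted facts are supported, preferred response length K = 64.
print(round(f1_at_k(45, 50, 64), 3))
```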

Will Knight

Inside the Creation of the World’s Most Powerful Open Source AI Model


This past Monday, about a dozen engineers and executives at data science and AI company Databricks gathered in conference rooms connected via Zoom to learn if they had succeeded in building a top artificial intelligence language model. The team had spent months, and about $10 million, training DBRX, a large language model similar in design to the one behind OpenAI’s ChatGPT . But they wouldn’t know how powerful their creation was until results came back from the final tests of its abilities.

“We’ve surpassed everything,” Jonathan Frankle, chief neural network architect at Databricks and leader of the team that built DBRX, eventually told the team, which responded with whoops, cheers, and applause emojis. Frankle usually steers clear of caffeine but was taking sips of iced latte after pulling an all-nighter to write up the results.

Databricks will release DBRX under an open source license, allowing others to build on top of its work. Frankle shared data showing that across a dozen or so benchmarks measuring the AI model’s ability to answer general knowledge questions, perform reading comprehension, solve vexing logical puzzles, and generate high-quality code, DBRX was better than every other open source model available.


It outshined Meta’s Llama 2 and Mistral’s Mixtral, two of the most popular open source AI models available today. “Yes!” shouted Ali Ghodsi, CEO of Databricks, when the scores appeared. “Wait, did we beat Elon’s thing?” Frankle replied that they had indeed surpassed the Grok AI model recently open-sourced by Musk’s xAI , adding, “I will consider it a success if we get a mean tweet from him.”

To the team’s surprise, on several scores DBRX was also shockingly close to GPT-4, OpenAI’s closed model that powers ChatGPT and is widely considered the pinnacle of machine intelligence. “We’ve set a new state of the art for open source LLMs,” Frankle said with a super-sized grin.

By open-sourcing DBRX, Databricks is adding further momentum to a movement that is challenging the secretive approach of the most prominent companies in the current generative AI boom. OpenAI and Google keep the code for their GPT-4 and Gemini large language models closely held, but some rivals, notably Meta, have released their models for others to use, arguing that it will spur innovation by putting the technology in the hands of more researchers, entrepreneurs, startups, and established businesses.

Databricks says it also wants to open up about the work involved in creating its open source model, something that Meta has not done for some key details about the creation of its Llama 2 model . The company will release a blog post detailing the work involved to create the model, and also invited WIRED to spend time with Databricks engineers as they made key decisions during the final stages of the multimillion-dollar process of training DBRX. That provided a glimpse of how complex and challenging it is to build a leading AI model—but also how recent innovations in the field promise to bring down costs. That, combined with the availability of open source models like DBRX, suggests that AI development isn’t about to slow down any time soon.

Ali Farhadi, CEO of the Allen Institute for AI , says greater transparency around the building and training of AI models is badly needed. The field has become increasingly secretive in recent years as companies have sought an edge over competitors. Opacity is especially problematic when there is concern about the risks that advanced AI models could pose, he says. “I’m very happy to see any effort in openness,” Farhadi says. “I do believe a significant portion of the market will move towards open models. We need more of this.”


Databricks has a reason to be especially open. Although tech giants like Google have rapidly rolled out new AI deployments over the past year, Ghodsi says that many large companies in other industries are yet to widely use the technology on their own data. Databricks hopes to help companies in finance, medicine, and other industries, which he says are hungry for ChatGPT-like tools but also leery of sending sensitive data into the cloud.

“We call it data intelligence—the intelligence to understand your own data,” Ghodsi says. Databricks will customize DBRX for a customer or build a bespoke one tailored to their business from scratch. For major companies, the cost of building something on the scale of DBRX makes perfect sense, he says. “That’s the big business opportunity for us.” In July last year, Databricks acquired MosaicML, a startup that specializes in building AI models more efficiently, bringing on several of the people who went on to build DBRX, including Frankle. No one at either company had built something on that scale before.

DBRX, like other large language models, is essentially a giant artificial neural network —a mathematical framework loosely inspired by biological neurons—that has been fed huge quantities of text data. DBRX and its ilk are generally based on the transformer , a type of neural network invented by a team at Google in 2017 that revolutionized machine learning for language.

Not long after the transformer was invented, researchers at OpenAI began training versions of that style of model on ever-larger collections of text scraped from the web and other sources—a process that can take months. Crucially, they found that as the model and data set it was trained on were scaled up, the models became more capable, coherent, and seemingly intelligent in their output.


Seeking still-greater scale remains an obsession of OpenAI and other leading AI companies. The CEO of OpenAI, Sam Altman, has sought $7 trillion in funding for developing AI-specialized chips, according to The Wall Street Journal. But size is not the only thing that matters when creating a language model. Frankle says that dozens of decisions go into building an advanced neural network; some lore about how to train more efficiently can be gleaned from research papers, while other details are shared within the community. It is especially challenging to keep thousands of computers connected by finicky switches and fiber-optic cables working together.

“You’ve got these insane [network] switches that do terabits per second of bandwidth coming in from multiple different directions,” Frankle said before the final training run was finished. “It's mind-boggling even for someone who's spent their life in computer science.” That Frankle and others at MosaicML are experts in this obscure science helps explain why Databricks’ purchase of the startup last year valued it at $1.3 billion .

The data fed to a model also makes a big difference to the end result—perhaps explaining why it’s the one detail that Databricks isn’t openly disclosing. “Data quality, data cleaning, data filtering, data prep is all very important,” says Naveen Rao, a vice president at Databricks and previously founder and CEO of MosaicML. “These models are really just a function of that. You can almost think of that as the most important thing for model quality.”

AI researchers continue to invent architecture tweaks and modifications to make the latest AI models more performant. One of the most significant leaps of late has come thanks to an architecture known as “mixture of experts,” in which only some parts of a model activate to respond to a query, depending on its contents. This produces a model that is much more efficient to train and operate. DBRX has around 136 billion parameters, or values within the model that are updated during training. Llama 2 has 70 billion parameters, Mixtral has 45 billion, and Grok has 314 billion. But DBRX only activates about 36 billion on average to process a typical query. Databricks says that tweaks to the model designed to improve its utilization of the underlying hardware helped improve training efficiency by between 30 and 50 percent. It also makes the model respond more quickly to queries, and requires less energy to run, the company says.
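For readers curious what "mixture of experts" looks like in practice, here is a toy routing layer in PyTorch: a small router picks the top two expert networks per token, so only a fraction of the layer's parameters do work for any given input. It is a conceptual sketch with tiny dimensions, not DBRX's implementation.

```python
# Toy mixture-of-experts layer: a router picks the top-2 experts per token, so only
# a fraction of the layer's parameters are active for any input. Conceptual sketch only.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim: int = 64, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, chosen = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TinyMoELayer()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```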

Sometimes the highly technical art of training a giant AI model comes down to a decision that’s emotional as well as technical. Two weeks ago, the Databricks team was facing a multimillion-dollar question about squeezing the most out of the model.

After two months of work training the model on 3,072 powerful Nvidia H100 GPUs leased from a cloud provider, DBRX was already racking up impressive scores in several benchmarks, and yet there was roughly another week's worth of supercomputer time to burn.

Different team members threw out ideas in Slack for how to use the remaining week of computer power. One idea was to create a version of the model tuned to generate computer code, or a much smaller version for hobbyists to play with. The team also considered stopping work on making the model any larger and instead feeding it carefully curated data that could boost its performance on a specific set of capabilities, an approach called curriculum learning. Or they could simply continue going as they were, making the model larger and, hopefully, more capable. This last route was affectionately known as the “fuck it” option, and one team member seemed particularly keen on it.


While the discussion remained friendly, strong opinions bubbled up as different engineers pushed for their favored approach. In the end, Frankle deftly ushered the team toward the data-centric approach. And two weeks later it would appear to have paid off massively. “The curriculum learning was better, it made a meaningful difference,” Frankle says.

Frankle was less successful in predicting other outcomes from the project. He had doubted DBRX would prove particularly good at generating computer code because the team didn’t explicitly focus on that. He even felt sure enough to say he’d dye his hair blue if he was wrong. Monday’s results revealed that DBRX was better than any other open AI model on standard coding benchmarks. “We have a really good code model on our hands,” he said during Monday’s big reveal. “I’ve made an appointment to get my hair dyed today.”

Risk Assessment

The final version of DBRX is the most powerful AI model yet to be released openly, for anyone to use or modify. (At least if they aren’t a company with more than 700 million users, a restriction Meta also places on its own open source AI model Llama 2.) Recent debate about the potential dangers of more powerful AI has sometimes centered on whether making AI models open to anyone could be too risky. Some experts have suggested that open models could too easily be misused by criminals or terrorists intent on committing cybercrime or developing biological or chemical weapons. Databricks says it has already conducted safety tests of its model and will continue to probe it.

Stella Biderman, executive director of EleutherAI , a collaborative research project dedicated to open AI research, says there is little evidence suggesting that openness increases risks. She and others have argued that we still lack a good understanding of how dangerous AI models really are or what might make them dangerous—something that greater transparency might help with. “Oftentimes, there's no particular reason to believe that open models pose substantially increased risk compared to existing closed models,” Biderman says.

EleutherAI joined Mozilla and around 50 other organizations and scholars in sending an open letter this month to US secretary of commerce Gina Raimondo, asking her to ensure that future AI regulation leaves space for open source AI projects. The letter argued that open models are good for economic growth, because they help startups and small businesses, and also “help accelerate scientific research.”

Databricks is hopeful DBRX can do both. Besides providing other AI researchers with a new model to play with and useful tips for building their own, DBRX may contribute to a deeper understanding of how AI actually works, Frankle says. His team plans to study how the model changed during the final week of training, perhaps revealing how a powerful model picks up additional capabilities. “The part that excites me the most is the science we get to do at this scale,” he says.


NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records


Generative AI is unlocking new computing applications that greatly augment human capability, enabled by continued model innovation. Generative AI models —including large language models (LLMs)—are used for crafting marketing copy, writing computer code, rendering detailed images, composing music, generating videos, and more. The amount of compute required by the latest models is immense and continues to grow as new models are invented. 

The computational intensity of generative AI inference demands excellence across chips, systems, and software. MLPerf Inference is a benchmark suite that measures inference performance across several popular deep-learning use cases. The latest version of the benchmark suite–MLPerf Inference v4.0–adds two new workloads that represent popular and modern generative AI use cases. The first is an LLM benchmark based on the largest of the Meta Llama 2 family of large language models (LLMs), Llama 2 70B . The second is a text-to-image test based on Stable Diffusion XL .

The NVIDIA accelerated computing platform set performance records on both the new workloads using the NVIDIA H200 Tensor Core GPU . And, using NVIDIA TensorRT-LLM software, the NVIDIA H100 Tensor Core GPU nearly tripled performance on the GPT-J LLM test. NVIDIA Hopper architecture GPUs continue to deliver the highest performance per accelerator across all MLPerf Inference workloads in the data center category. Additionally, NVIDIA also made several submissions in the open division of MLPerf Inference, showcasing its model and algorithm innovations.

In this post, we provide a look at some of the many full-stack technologies behind these record-setting generative AI inference performance achievements. 

TensorRT-LLM nearly triples LLM inference performance

LLM-based services, such as chatbots, must deliver fast responses to user queries and be cost-effective, which requires high inference throughput. Production inference solutions must be able to serve cutting-edge LLMs with both low latency and high throughput, simultaneously. 

TensorRT-LLM is a high-performance, open-source software library providing state-of-the-art performance when running the latest LLMs on NVIDIA GPUs. 

MLPerf Inference v4.0 includes two LLM tests. The first is GPT-J, which was introduced in the prior round of MLPerf, and the second is the newly added Llama 2 70B benchmark. H100 Tensor Core GPUs using TensorRT-LLM achieved speedups on GPT-J of 2.4x and 2.9x in the offline and server scenarios, respectively, compared to submissions in the prior round. TensorRT-LLM was also central to the NVIDIA platform’s exceptional performance on the Llama 2 70B test. 

A chart showing the performance increases on H100 GPUs in both the offline and server scenarios this round compared to the prior round. 

MLPerf Inference v3.1 and v4.0 data center results were retrieved from www.mlperf.org on March 27, 2024 from entries 3.1-0107 and 4.0-0060, respectively. The MLPerf name and logo are registered and unregistered trademarks of the MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

The following are some of the key features of TensorRT-LLM that enabled these great performance results: 

  • Inflight sequence batching increases GPU use during LLM inference by better interleaving inference requests and evicting requests in a batch as soon as they’ve completed processing and inserting new requests in their place.
  • Paged KV cache improves memory consumption and usage by partitioning and storing the KV cache into non-contiguous memory blocks, allocating and evicting blocks on demand, and accessing the blocks dynamically during attention computation. 
  • Tensor parallelism supports splitting weights across GPUs and nodes using NCCL for communication, enabling efficient inference at scale.
  • Quantization supports FP8 quantization, which uses the fourth-generation Tensor Cores in the NVIDIA Hopper architecture to reduce model size and increase performance. 
  • XQA kernel provides a high-performance implementation of attention that supports MHA, MQA, and GQA, along with beam search, significantly increasing throughput within a given latency budget. 

For more details about TensorRT-LLM features, see this post that dives into how TensorRT-LLM boosts LLM inference .
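To make the paged KV cache idea from the list above concrete, here is a conceptual sketch of an on-demand block allocator in Python. It illustrates only the bookkeeping (allocate cache blocks as tokens are generated, return them as soon as a request finishes); the class, names, and block size are assumptions, and this is not TensorRT-LLM's implementation.

```python
# Conceptual sketch of a paged KV cache: blocks come from a shared pool, are
# allocated on demand per generated token, and return to the pool when a request ends.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_tokens: int = 64):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # request id -> list of block ids
        self.token_counts = {}                      # request id -> tokens cached so far

    def append_token(self, req_id: str) -> None:
        """Reserve KV-cache space for one more generated token of a request."""
        blocks = self.block_tables.setdefault(req_id, [])
        count = self.token_counts.get(req_id, 0)
        if count % self.block_tokens == 0:          # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("cache pool exhausted; evict or defer the request")
            blocks.append(self.free_blocks.pop())   # allocate a new block on demand
        self.token_counts[req_id] = count + 1

    def finish(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.token_counts.pop(req_id, None)

cache = PagedKVCache(num_blocks=4, block_tokens=2)
for _ in range(3):
    cache.append_token("req-A")   # 3 tokens occupy 2 blocks
cache.finish("req-A")             # blocks become immediately reusable by other requests
print(len(cache.free_blocks))     # 4
```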

H200 Tensor Core GPUs supercharge LLM inference

The H200, based on Hopper architecture, is the world’s first GPU to use the industry’s most advanced HBM3e memory. H200 incorporates 141 GB of HBM3e with 4.8 TB/s of memory bandwidth, representing nearly 1.8x more GPU memory and 1.4x higher GPU memory bandwidth compared to the H100. 

The combination of larger, faster memory, and a new custom thermal solution enabled H200 GPUs to demonstrate large performance improvements on the Llama 2 70B benchmark, compared to H100 submissions this round. 

HBM3e enables more performance

The upgraded GPU memory of H200 helps unlock more performance compared to H100 on the Llama 2 70B workload in two important ways. 

First, it removes the need for tensor parallel or pipeline parallel execution for optimal performance in the MLPerf Llama 2 70B benchmark. This reduces communication overhead and improves inference throughput.

Second, the H200 GPU features greater memory bandwidth compared to H100, relieving bottlenecks in memory bandwidth-bound portions of the workload, and enabling improved Tensor Core usage. This yielded greater inference throughput.

Custom cooling designs further boost performance

The extensive optimizations in TensorRT-LLM, coupled with the upgraded memory of the H200, mean that the Llama 2 70B execution on H200 is compute performance bound rather than limited by memory bandwidth or communication bottlenecks.

As the NVIDIA HGX H200 is drop-in compatible with the NVIDIA HGX H100, it provides system makers with the ability to qualify systems for faster time to market. And, as demonstrated by NVIDIA MLPerf submissions this round, H200 at the same 700 W thermal design power (TDP) as H100 delivers up to 28% better Llama 2 70B inference performance. 

A chart showing Llama 2 70B performance on the H100 GPU.

MLPerf Inference v4.0 data center results retrieved from www.mlperf.org on March 27, 2024 from entries 4.0-0062 and 4.0-0068. The MLPerf name and logo are registered and unregistered trademarks of the MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

With NVIDIA MGX, system builders can deliver more value to their customers through custom cooling designs that enable higher GPU thermals. In this round, NVIDIA also made submissions using H200 with a custom thermal design, enabling the GPUs to run at a higher 1,000 W TDP. This unlocked 11% and 14% more performance in the server and offline scenarios, respectively, when running the Llama 2 70B benchmark, enabling total speedups of 43% and 45% compared to H100, respectively. 

Setting the bar for Stable Diffusion XL performance

Stable Diffusion XL is a text-to-image generation AI model composed of the following:

  • Two CLIP models for converting prompt texts to embeddings.
  • A UNet model composed of residual blocks (ResBlocks) and transformers that iteratively denoise the image in lower resolution latent space.
  • A variational autoencoder (VAE) that decodes the latent space image to a 1024×1024 resolution RGB image output.

In MLPerf Inference v4.0, Stable Diffusion XL is used for the text-to-image test, generating images based on supplied text prompts. 
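For readers who want to run the workload itself, the snippet below generates a 1024×1024 image with the Stable Diffusion XL components described above via the Hugging Face diffusers library. The prompt and step count are illustrative; this is the plain FP16 pipeline, not the MLPerf harness or the quantized recipe discussed next.

```python
# Run SDXL (CLIP text encoders + UNet denoiser + VAE decoder) via diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photograph of an astronaut riding a horse",
    num_inference_steps=20,   # matches the 20 denoising steps referenced below
    height=1024,
    width=1024,
).images[0]
image.save("sdxl_sample.png")
```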

The NVIDIA GPUs with TensorRT software delivered the highest performance in the MLPerf Inference v4.0 text-to-image test. An 8-GPU NVIDIA HGX H200 system, with GPUs configured to a 700 W TDP, achieved performance of 13.8 queries/second and 13.7 samples/second in the server and offline scenarios, respectively. 

L40S is the highest-performance universal NVIDIA GPU, designed for breakthrough multi-workload performance across AI compute, graphics, and media acceleration. The Stable Diffusion XL submission, using a system equipped with eight L40S GPUs, also demonstrated performance of 4.9 queries/second and 5 samples/second in the server and offline scenarios, respectively.

A comparison of two images, one generated using the baseline FP16 recipe and one generated using the INT8 quantization solution used in this MLPerf submission.

At the heart of NVIDIA submissions was an innovative recipe that partially quantized the ResBlocks and transformers in the UNet to INT8 precision. In the ResBlocks, convolution layers were quantized to INT8, while in the transformers, the query-key-value blocks and feedforward network linear layers were quantized to INT8. The INT8 absolute max was collected only from the first eight denoising steps (out of a total of 20). SmoothQuant was applied to the activations of the quantized linear layers, overcoming the challenge of quantizing activations to INT8 while maintaining the original accuracy.

Compared to the FP16 baseline–which wasn’t part of NVIDIA MLPerf submissions–this work boosted performance by 20% on H100 GPUs.
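The core SmoothQuant idea applied to those linear layers can be sketched in a few lines: each activation channel is divided by a per-channel scale and that scale is folded into the weights, leaving the layer output unchanged while taming activation outliers. The alpha value and tensor shapes below are illustrative assumptions, not the submission's calibration.

```python
# SmoothQuant idea on a toy linear layer: divide each activation channel by a scale
# s_j and fold that scale into the weights, so x @ W.T is unchanged but the
# activation outliers shrink. Alpha and shapes are illustrative.
import torch

def smoothquant_scales(act_absmax: torch.Tensor, weight: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    w_absmax = weight.abs().amax(dim=0)       # weight is (out_features, in_features)
    return (act_absmax ** alpha) / (w_absmax ** (1 - alpha))

x = torch.randn(4, 8) * torch.tensor([1, 1, 1, 1, 1, 1, 1, 50.0])  # one outlier channel
w = torch.randn(16, 8)
s = smoothquant_scales(x.abs().amax(dim=0), w)

x_smooth, w_smooth = x / s, w * s             # (x / s) @ (W * s).T == x @ W.T
print(torch.allclose(x @ w.T, x_smooth @ w_smooth.T, atol=1e-4))   # True: output preserved
print(x.abs().max().item(), x_smooth.abs().max().item())           # outlier is reduced
```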

Additionally, support for FP8 quantization in TensorRT for diffusion models, which will improve performance and image quality, is coming soon .

Open division innovations

In addition to submitting world-class performance in the closed division of MLPerf Inference, NVIDIA made several submissions in the open division. The open division according to MLCommons is “intended to foster innovation and allows using a different model or retraining.” 

In this round, NVIDIA submitted open division results that made use of various model optimization capabilities in TensorRT, such as sparsification, pruning, and caching. These were used on the Llama 2 70B, GPT-J, and Stable Diffusion XL workloads, showcasing great performance while maintaining high accuracy. The following subsections provide an overview of the innovations powering those submissions.

Llama 2 70B with structured sparsity

The NVIDIA open division submission on H100 GPUs showcased inference on a sparsified Llama 2 70B model using the structured sparsity capability of Hopper Tensor Cores. Structured sparsity was applied to all attention and MLP blocks of the model, and the process was done post-training, without requiring any fine-tuning of the model. 

This sparse model provides two main benefits. First, the model itself is 37% smaller. The reduction in size enables both the model and the KV cache to fit entirely in the GPU memory of H100, negating the need for tensor parallelism. 

Next, using 2:4 sparse GEMM kernels improved compute throughput and made more efficient use of memory bandwidth. Overall throughput in the offline scenario was 33% higher on the same H100 system compared to the NVIDIA closed-division submission. With these speedups, the sparsified model still met the stringent 99.9% accuracy target, set by the MLPerf closed division. The sparsified model generates fewer tokens per sample than the model used in the closed division, leading to shorter responses to queries.
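The 2:4 structured sparsity pattern itself is simple to illustrate: within every contiguous group of four weights, only the two largest-magnitude values are kept. The sketch below builds the pruning mask on a toy weight matrix; it is not the NVIDIA tooling used for the actual submission.

```python
# Toy illustration of the 2:4 pattern: in every contiguous group of four weights,
# keep the two largest-magnitude values and zero the rest.
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in each contiguous group of 4."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0
    groups = weight.reshape(out_features, in_features // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices           # 2 largest-magnitude per group
    mask = torch.zeros_like(groups).scatter(-1, keep, 1.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_sparse = prune_2_of_4(w)
print((w_sparse == 0).float().mean().item())  # 0.5: exactly half the weights are removed
```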

GPT-J with pruning and distillation

In the open division GPT-J submission, a pruned GPT-J model was used. This technique greatly reduced the number of heads and layers in the model, translating into a nearly 40% increase in inference throughput compared to the closed-division submission when running the model on H100 GPUs. Performance has improved further since NVIDIA results were submitted in this round of MLPerf.

The pruned model was then fine-tuned using knowledge distillation, enabling excellent accuracy of 98.5%.
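Knowledge distillation of this kind is commonly implemented as a weighted sum of a soft loss against the teacher's temperature-softened outputs and a hard loss against the data labels. The sketch below shows that loss; the temperature, weighting, and shapes are illustrative assumptions, not the values used for the submission.

```python
# Common form of a distillation objective: blend a KL loss against the teacher's
# temperature-softened distribution with a hard cross-entropy loss on the labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                       # standard T^2 scaling of the soft loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Illustrative shapes: 4 positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels).item())
```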

Stable Diffusion XL with DeepCache

Roughly 90% of the end-to-end processing of the Stable Diffusion XL workload is spent running iterative denoising steps using the UNet. The UNet has a U-shaped topology of layers in which the latents are first downconverted and then upconverted back to the original resolution.

DeepCache, a technique described in this paper , proposes using two distinct UNet structures. The first is the original UNet, called the Deep UNet in our submission implementation. The second is a single-layer UNet, called the Shallow UNet, which reuses (or bypasses) intermediate tensors from the most recent Deep UNet pass, significantly reducing computation. 

On the left is the Deep UNet with the full block count. On the right is the Shallow UNet with just a single block.

Figure 4. An illustration of the DeepCache technique with both the Deep UNet and Shallow UNet

The NVIDIA open division submission implements a variation of DeepCache where we cache both inputs to the last upconversion layer and alternate between the Deep UNet and the Shallow UNet across denoising steps. This halved the compute required for running the UNet portions of the model, accelerating end-to-end performance by 74% on H100. 
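The control flow of this DeepCache variation can be sketched as follows: run the Deep UNet on alternating denoising steps and cache the tensors feeding its last upconversion layer, then let the Shallow UNet reuse that cache on the in-between steps. All function and variable names below are illustrative stand-ins, with tiny numeric stubs so the loop can be exercised end to end.

```python
# Schematic of alternating Deep/Shallow UNet passes with a reused cache.
def denoise_with_deepcache(latents, num_steps, deep_unet, shallow_unet, step_fn):
    cached = None
    for step in range(num_steps):
        if step % 2 == 0 or cached is None:
            noise_pred, cached = deep_unet(latents, step)      # full pass, refresh cache
        else:
            noise_pred = shallow_unet(latents, step, cached)   # cheap pass, reuse cache
        latents = step_fn(noise_pred, latents)
    return latents

# Tiny numeric stand-ins for the UNets and the scheduler step.
deep = lambda x, t: (x * 0.9, x * 0.5)              # returns (noise estimate, cached tensor)
shallow = lambda x, t, cache: x * 0.9 + 0.0 * cache
step = lambda pred, lat: lat - 0.1 * pred
print(denoise_with_deepcache(1.0, 20, deep, shallow, step))
```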

Unmatched inference performance

The NVIDIA platform demonstrated exceptional inference performance across the full breadth of MLPerf Inference v4.0 benchmarks, with the Hopper architecture enabling the highest performance per GPU on every workload. 

Using TensorRT-LLM software enabled outstanding performance gains for H100 on the GPT-J workload, nearly tripling performance in just 6 months. And H200, the world’s first HBM3e GPU, with TensorRT-LLM software delivered record-setting inference performance on the Llama 2 70B workload in both offline and server scenarios. And, in the debut Stable Diffusion XL test for text-to-image generative AI, the NVIDIA platform delivered the highest performance.

To reproduce the incredible performance demonstrated in NVIDIA MLPerf Inference v4.0 submissions, see the MLPerf repository .

Related resources

  • GTC session: Speeding up LLM Inference With TensorRT-LLM
  • NGC Containers: NVIDIA MLPerf Inference
  • SDK: FasterTransformer
  • SDK: Torch-TensorRT
  • SDK: TensorRT


COMMENTS

  1. PDF Structure Modeling for Language Models

    of various structural aspects in language model generations, including section transi-tion structures, coreference structures, and topic correlation structures. Emphasizing the need for a comprehensive understanding of these components, the thesis assesses language model performance at the structural level. Adapting a statistical framework

  2. PDF Understanding the Capabilities, Limitations, and Societal Impact of

    the surprising impact of scale on model capabilities, the di culty in as-sessing whether large language models truly understand language, the im-portance of training models on multiple data modalities, and challenges in aligning model objectives with human values. 2. What are the societal e ects of widespread use of large language models?

  3. [2307.06435] A Comprehensive Overview of Large Language Models

    Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics ...

  4. PEER: Empowering Writing with Large Language Models

    The introduction of transformers-based technologies [] for natural language processing (NLP) has been a breakthrough that pushed the field significantly forward.It enabled the development of pre-trained large language models (LLMs) which can process natural language more effectively and efficiently than previous approaches [1, 10].The most recent models, like ChatGPT [], have been fine-tuned ...

  5. [2305.00948] Large Linguistic Models: Analyzing theoretical linguistic

    The performance of large language models (LLMs) has recently improved to the point where the models can perform well on many language tasks. We show here that for the first time, the models can also generate coherent and valid formal analyses of linguistic data and illustrate the vast potential of large language models for analyses of their metalinguistic abilities. LLMs are primarily trained ...

  6. Language models and linguistic theories beyond words

    The development of large language models is mainly a feat of engineering and so far has been largely disconnected from the field of linguistics. Exploring links between the two directions is ...

  7. PDF Language models and linguistic theories beyond words

    Language models and linguistic theories beyond words. Check for updates. The development of large language models is mainly a feat of engineering and so far has been largely disconnected from the ...

  8. PDF A pipeline for large raw text preprocessing and model training of

    Master Thesis A pipeline for large raw text preprocessing and model training of language models at scale Author: Jordi Armengol Estape Advisor: Marta Ruiz Costa-Juss a Computer Science (CS) Department - UPC Co-Advisor: Maite Melero Nogues Text Mining Unit - Barcelona Supercomputing Center January 2021 FACULTAT D'INFORMATICA DE BARCELONA (FIB)

  9. Structure Modeling for Language Models

    The thesis is organized into two primary sections: structure analysis for language models and structure modeling techniques. The first section investigates the modeling of various structural aspects in language model generations, including section transition structures, coreference structures, and topic correlation structures.

  10. [2307.11760] Large Language Models Understand and Can be Enhanced by

    Emotional intelligence significantly impacts our daily behaviors and interactions. Although Large Language Models (LLMs) are increasingly viewed as a stride toward artificial general intelligence, exhibiting impressive performance in numerous tasks, it is still uncertain if LLMs can genuinely grasp psychological emotional stimuli. Understanding and responding to emotional cues gives humans a ...

  11. PDF Assessing the Effects and Risks of Large Language Models in Ai-mediated

    In this thesis, 4. we have chosen the latter approach. We hypothesize how large language models might be used in future applications and create speculative prototypes of what ... opinionated language model affects written opinions and reported attitudes in a subsequentsurvey.

  12. PDF Truthfulness in Large Language Models

    1.1 Large Language Models Fundamentally, a language model encodes a probability distribution over strings of text. Thisisdonebydividingastringintotokens

  13. Using large language models in psychology

    Large language models (LLMs), such as OpenAI's GPT-4, Google's Bard or Meta's LLaMa, have created unprecedented opportunities for analysing and generating language data on a massive scale.

  14. (PDF) Talking About Large Language Models

    1 Introduction. The advent of large language models (LLMs) such as B ert ( Devlin et al., 2018) and GPT-. 2 (Radford et al., 2019) was a game-c hanger. for artificial intelligence. Based on ...

  15. PDF Improving Text-to-image Diffusion Generation Via Large Language Model

    VIA LARGE LANGUAGE MODEL Yifei Li A THESIS in Data Science Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Master of Science in Engineering 2023 Co-Supervisor of Thesis Chris Callison-Burch, Associate Professor of Computer and Information Science

  16. Better language models and their implications

    We've trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training. Our model, called GPT ...

  17. Large language models use a surprisingly simple mechanism to retrieve

    Researchers find large language models use a simple mechanism to retrieve stored knowledge when they respond to a user prompt. These mechanisms can be leveraged to see what the model knows about different subjects and possibly to correct false information it has stored.

  18. How Large Language Models (LLMs) like ChatGPT Work

    One process, used by OpenAI to transform GPT-3 into the more usable GPT-3.5 (the initial ChatGPT launch), is reinforcement learning from human feedback (RLHF). W. Heaven (2022) offers a glimpse into how RLHF helped shift GPT-3 towards GPT-3.5, the model that became the foundation for the original ChatGPT.
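
    As a hedged illustration of one ingredient of RLHF (not a description of OpenAI's actual pipeline), the sketch below trains a toy reward model on preference pairs with the standard pairwise loss -log sigmoid(r_chosen - r_rejected); the names and dimensions are invented for the example.

    import torch
    import torch.nn as nn

    class TinyRewardModel(nn.Module):
        """Maps a response embedding to a scalar reward."""
        def __init__(self, dim=16):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, x):
            return self.score(x).squeeze(-1)

    model = TinyRewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stand-in embeddings for human-preferred and rejected responses
    # (in practice these come from model outputs and human annotators).
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

    for _ in range(100):
        # Pairwise (Bradley-Terry) preference loss.
        loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # A policy-optimization step (e.g. PPO) would then maximize this learned reward.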

  19. Language Model Evaluation in Open-ended Text Generation

    Language Model Evaluation in Open-ended Text Generation, by An Nguyen (Student ID: 1098402). A thesis submitted in total fulfillment for the degree of Master of Science (Computer Science), School of Computing and Information Systems, Melbourne School of Engineering, The University of Melbourne.

  20. A study of generative large language model for medical research and

    A language model is a simple statistical distribution used in natural language processing (NLP) to formulate the probability of a sequence of words, or of the next word in a sequence.
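
    Written out (this formulation is standard and not quoted from the cited study), the two views, scoring a whole word sequence and predicting the next word, are:

    % Chain-rule factorization of a sequence, and greedy next-word prediction.
    \[
      P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1}),
      \qquad
      \hat{w}_{n+1} = \arg\max_{w} P(w \mid w_1, \dots, w_n).
    \]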

  21. Large Language Models in Molecular Biology

    A Large Language Model (LLM) is a type of neural network that acquires the ability to generate text mirroring human language by scrutinizing vast amounts of textual data. ... GPT is a generative model that is particularly strong in tasks involving text generation, such as writing essays, generating poetry, or completing sentences. The latest ...

  22. PDF Towards Automated Healthcare: Deep Vision and Large Language Models for

    The scale-up of this training procedure also comes with using larger architectures, in what we call large language models (LLMs). 1.2 Computer Vision. Similarly, in computer vision, researchers have designed deep learning architectures and training methods suitable for understanding images.

  23. Large Language Models' Emergent Abilities Are a Mirage

    Two years ago, in a project called the Beyond the Imitation Game benchmark, or BIG-bench, 450 researchers compiled a list of 204 tasks designed to test the capabilities of large language models ...

  24. Announcing DBRX: A new standard for efficient open source LLMs

    Figure 2: DBRX outperforms GPT-3.5 across language understanding (MMLU), programming (HumanEval), and math (GSM8K). Third, DBRX is a Mixture-of-Experts (MoE) model built on the MegaBlocks research and open source project, making the model extremely fast in terms of tokens/second. We believe that this will pave the path for state-of-the-art open ...

  25. Vision-language Model for Robot Grasping

    The language model is used to program the robot so that it can understand and execute the required sequence of tasks. Using object detection, we build a set of object queries from the sensor image and allow the user to provide an input query for a task to be performed. We then compute a similarity score among ...
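
    A rough sketch of the matching step only (the thesis's concrete models and scoring function are not given in this snippet, so the object names and embeddings below are hypothetical): the user's task query is compared against the detected object queries by cosine similarity.

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine similarity between two embedding vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical embeddings: one row per object query detected in the sensor image.
    object_names = ["red cup", "blue box", "screwdriver"]
    object_embeddings = np.random.default_rng(0).normal(size=(3, 32))

    # Hypothetical embedding of the user's instruction, e.g. "pick up the cup".
    task_embedding = object_embeddings[0] + 0.1 * np.random.default_rng(1).normal(size=32)

    scores = [cosine_similarity(task_embedding, obj) for obj in object_embeddings]
    print(object_names[int(np.argmax(scores))])  # object most similar to the instruction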

  26. Here's Proof You Can Train an AI Model Without Slurping Copyrighted

    OpenAI claimed it's "impossible" to build good AI models without using copyrighted data. An "ethically created" large language model and a giant AI dataset of public domain text suggest ...

  27. [2403.18802] Long-form factuality in large language models

    Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form ...

  28. Inside the Creation of the World's Most Powerful Open Source AI Model

    Startup Databricks just released DBRX, the most powerful open source large language model yet, eclipsing Meta's Llama 2.

  29. NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records

    Generative AI is unlocking new computing applications that greatly augment human capability, enabled by continued model innovation. Generative AI models—including large language models (LLMs)—are used for crafting marketing copy, writing computer code, rendering detailed images, composing music, generating videos, and more. The amount of compute required by the latest models is immense and ...