
research articles on corpus linguistics


journal: Corpus Linguistics and Linguistic Theory

Corpus Linguistics and Linguistic Theory

  • Online ISSN: 1613-7035
  • Print ISSN: 1613-7027
  • Type: Journal
  • Language: English
  • Publisher: De Gruyter Mouton
  • First published: May 20, 2005
  • Publication Frequency: 3 Issues per Year
  • Audience: Researchers from different theoretical backgrounds and with different areas of interest that share a commitment to the systematic and exhaustive analysis of naturally occurring language



Open call for papers.

Articles falling within one of the four categories published in RiCL are welcome through the whole year and will be evaluated according to the journal's editorial policies

Current Issue

Issue editor: Sara Laviosa

Book Reviews

ISSN: 2243-4712

Abstracting & indexing

Google Scholar

Index Copernicus International

Internet Archive Scholar

Linguistic Bibliography Online

MLA International Bibliography

Norwegian List


Publication Forum


Scimago Journal Rank

Ulrich's Periodicals Directory


Asociación Española de Lingüística de Corpus /  Spanish Association for Corpus Linguistics Departamento de Filología Inglesa Facultad de Letras | Campus de La Merced Universidad de Murcia, 30003 Murcia, Spain


Review of Corpora in Applied Linguistics (2nd Edition), Susan Hunston, Cambridge University Press, Cambridge 2022, ISBN: 9781108616218

  • Book Review
  • Published: 13 November 2023
  • Mehrdad Vasheghani Farahani 1 &
  • Mahsa Pahlevanzade Fini 2  




Author information

Authors and affiliations

Applied Linguistics & Translation Studies, Leipzig University, Leipzig, Germany

Mehrdad Vasheghani Farahani

Allame Tabataba’i University, Tehran, Iran

Mahsa Pahlevanzade Fini



Corresponding author.

Correspondence to Mehrdad Vasheghani Farahani .


About this article

Cite this article.

Vasheghani Farahani, M., Fini, M. Review of Corpora in Applied Linguistics (2nd Edition), Susan Hunston, Cambridge University Press, Cambridge 2022, ISBN: 9781108616218. Corpus Pragmatics (2023). https://doi.org/10.1007/s41701-023-00160-z


Received : 02 June 2023

Accepted : 31 October 2023

Published : 13 November 2023

DOI : https://doi.org/10.1007/s41701-023-00160-z

The Oxford Handbook of Applied Linguistics (2nd edn)


38 Research in Corpus Linguistics

Douglas Biber is Regents' Professor of English (Applied Linguistics) at Northern Arizona University. His research efforts have focused on corpus linguistics, English grammar, and register variation (in English and cross-linguistic; synchronic and diachronic). His publications include books on register variation and corpus linguistics published by Cambridge University Press (1988, 1995, 1998, to appear), the co-authored Longman Grammar of Spoken and Written English (1999), and more recent studies of language use in university settings and discourse structure investigated from a corpus perspective (both published by Benjamins: 2006 and 2007).

Randi Reppen is professor of applied linguistics in the Department of English at Northern Arizona University. Her research interests include exploring how corpus linguistics can inform language teaching and materials development. She can be reached at [email protected].

Eric Friginal is assistant professor in the Department of Applied Linguistics and English as a Second Language at Georgia State University. His main research interest lies in using corpus linguistics to explore linguistic variation in professional, cross-cultural discourse in the context of outsourced call centers in the Philippines serving American customers. He is the author of The Language of Outsourced Call Centers: A Corpus-Based Study of Cross-Cultural Interaction. He can be reached at [email protected].

  • Published: 18 September 2012

Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language. Rather, it can be regarded as primarily a methodological approach; it is empirical, analyzing the actual patterns of use in natural texts. It utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis. At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic research offers strong support for the view that language variation is systematic and can be described using empirical, quantitative methods.

1. Introduction

Corpus linguistics is a research approach that has developed over the past several decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language. Rather, it can be regarded as primarily a methodological approach:

It is empirical, analyzing the actual patterns of use in natural texts.

It utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis.

It makes extensive use of computers for analysis, employing both automatic and interactive techniques.

It depends on both quantitative and qualitative analytical techniques. (Biber, Conrad & Reppen, 1998: 4)
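The four characteristics listed above can be illustrated with a minimal sketch. It assumes a toy two-register "corpus" of invented sentences (not drawn from any real corpus), and the function names `tokenize`, `rate_per_thousand`, and `concordance` are hypothetical helpers introduced here for illustration. The normalized rate stands in for the quantitative side of the analysis; the keyword-in-context lines stand in for the qualitative, interactive side.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rate_per_thousand(tokens, word):
    """Frequency of `word` normalized to a rate per 1,000 tokens."""
    return 1000 * Counter(tokens)[word] / len(tokens)


def concordance(tokens, word, window=3):
    """Return keyword-in-context lines for manual, qualitative reading."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines


# Toy data: invented sentences standing in for two registers (assumption).
toy_corpus = {
    "conversation": "well I think that is fine oh well now we can go",
    "news": "the minister said that the report was published now",
}

for register, text in toy_corpus.items():
    toks = tokenize(text)
    print(register, round(rate_per_thousand(toks, "well"), 1))
    for line in concordance(toks, "well"):
        print("   ", line)
```

Normalizing to a common base matters because real corpora differ in size; raw counts from a large corpus and a small one would not be comparable.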

At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic research offers strong support for the view that language variation is systematic and can be described using empirical, quantitative methods. Variation often involves complex patterns consisting of the interaction among several different linguistic parameters, but, in the end, it is systematic. Beyond this, the major contribution of corpus linguistics is to document the existence of linguistic constructs that are not recognized by current linguistic theories. Research of this type—referred to as a corpus-driven approach—identifies strong tendencies for words and grammatical constructions to pattern together in particular ways, whereas other theoretically possible combinations rarely occur.
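The idea that some word combinations pattern together strongly while other possible combinations rarely occur can be sketched with a simple association score. The example below uses pointwise mutual information (PMI) over adjacent word pairs; PMI is one common choice swapped in here for illustration, not a measure named in this chapter, and the token list and the function name `bigram_pmi` are invented for the sketch.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than `min_count` times are skipped, because PMI is
    unreliable for very rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy data: an invented token sequence, far too small for real conclusions.
toy_tokens = (
    "strong tea and strong coffee but powerful computer and strong tea"
).split()

scores = bigram_pmi(toy_tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs more often than chance would predict. Combinations that never occur, such as "powerful tea" in this toy data, simply receive no score.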

A novice student of linguistics could be excused for believing that corpus linguistics evolved only recently, as a reaction against the standard practice of intuition-based linguistics. Introductory linguistics textbooks tend to present linguistic analysis (especially syntactic analysis) as it has been practiced over the past 50 years, employing the analyst's intuitions rather than being based on empirical analysis of natural texts. Against that background, it would be easy for a student to imagine that corpus linguistics developed only in the 1980s and 1990s, responding to the need to base linguistic descriptions on actual language use.

This view is far from accurate. In fact, intuition-based linguistics developed as a reaction to corpus-based linguistics. That is, the standard practice in linguistics up until the 1950s was to base language descriptions on analyses of collections of natural texts: precomputer corpora. Dictionaries have long been based on empirical analysis of word use in natural sentences. For example, Samuel Johnson's Dictionary of the English Language , published in 1755, was based on approximately 150,000 natural sentences recorded on slips of paper, to illustrate the natural usage of words. The Oxford English Dictionary , published in 1928, was based on approximately 5,000,000 citations from natural texts (totaling around 50 million words), compiled by over 2,000 volunteers over a 70-year period. (See the discussion in G. D. Kennedy, 1998 : 14–15.) West's ( 1953 ) creation of the General Service List from a preelectronic corpus of newspapers was one of the first empirical vocabulary studies not motivated by the goal of creating a dictionary.

Grammars were also sometimes based on empirical analyses of natural text corpora before 1960. For example, Jespersen's grammars of English (1909–1949) used natural sentences from newspapers and novels to illustrate the various structures. An even more noteworthy example of this type is the work of C. C. Fries, who wrote two corpus-based grammars of American English. The first, published in 1940, had a focus on usage and social variation, based on a corpus of letters written to the government. The second is essentially a grammar of conversation: It was published in 1952, based on a 250,000-word corpus of telephone conversations. It includes authentic examples taken from the corpus and discussion of grammatical features that are especially characteristic of conversation (e.g., the words well, oh, now , and why when they initiate a “response utterance unit”; Fries, 1952 : 101–102).

In the 1960s and 1970s, most research in linguistics shifted to intuition-based methods, arguing that language was a mental construct and that empirical analyses of corpora were not relevant for describing language competence. However, even during this period, some linguists continued the tradition of empirical linguistic analysis. For example, in the early 1960s, Randolph Quirk began the Survey of English Usage, a precomputer collection of 200 spoken and written texts (each around 5,000 words) that was subsequently used for descriptive grammars of English (e.g., Quirk et al., 1972). Functional linguists like Prince and Thompson also continued this descriptive tradition, arguing that (noncomputerized) collections of natural texts could be studied to identify systematic differences in the functional use of linguistic variants. For example, Prince (1978) compares the discourse functions of WH-clefts and IT-clefts in spoken and written texts. Thompson has been especially interested in the study of grammatical variation in conversation; for example, Thompson and Mulac (1991) analyzed factors influencing the retention versus omission of the complementizer that in conversation, whereas Fox and Thompson (1990) studied variation in the realization of relative clauses in conversation.

What changed in the 1980s was the widespread availability of large electronic corpora and the increasing availability of computational tools that facilitated the linguistic analysis of those corpora. Work on large electronic corpora began in the 1960s, when Kucera and Francis (1967) compiled the Brown Corpus (a one-million-word corpus of published AmE written texts). This was followed by a parallel corpus of BrE written texts: the LOB Corpus, published in the 1970s.

It was not until the 1980s, though, that major studies of language use based on large electronic corpora began to appear. Thus, in 1982, Francis and Kucera provide a frequency analysis of the words and grammatical part-of-speech categories found in the Brown Corpus, followed in 1989 by a similar analysis of the LOB Corpus (Johansson and Hofland, 1989 ). Book-length descriptive studies of linguistic features began to appear in this period (e.g., Granger, 1983 , on passives; de Haan, 1989 , on nominal postmodifiers) as did the first multidimensional studies of register variation (e.g., Biber, 1988 ). During this same period, English language learner dictionaries based on the analysis of large electronic corpora began to appear, such as the Collins CoBuild English Language Dictionary (1987) and the Longman Dictionary of Contemporary English (1987). Since that time, most descriptive studies of linguistic variation and use in English have been based on analysis of an electronic corpus, either a large standard corpus (such as the British National Corpus) or a small corpus designed for a specific study (e.g., a corpus of 20 biology research articles constructed for a genre analysis). Within applied linguistics, the subfields of English for specific purposes and English for academic purposes have been especially influenced by corpus research, so that nearly all articles published in these areas employ some kind of corpus analysis.

Studies in this research tradition have adopted the tools and techniques available from computer-based corpus linguistics, with its emphasis on the representativeness of the text collection, and its computational tools for investigating distributional patterns across registers and across discourse contexts in large text collections. The textbook treatments by Kennedy (1998), Biber, Conrad, and Reppen (1998), and McEnery, Xiao, and Tono (2006) provide good introductions to the methods used for these studies as well as surveys of previous research.

In the ensuing sections, we survey many of the most important linguistic studies over the past 25 years that have employed corpus analysis. These studies have been motivated by two major research goals (see Biber, Conrad, and Reppen, 1998 : 5–8):

To describe linguistic features, such as vocabulary, lexical combinations, or grammatical features. These studies focus on variation in the choice among related linguistic features (e.g., the simple past tense versus present perfect aspect) or on the discourse functions of a single linguistic feature.

To describe the overall characteristics of a variety: a register or dialect. These studies provide relatively comprehensive linguistic descriptions of a single variety or of a set of related varieties.

Section 2, which follows, introduces studies of the first type, whereas section 3 surveys studies of the second type. Studies of both types have been undertaken for many of the world's languages. However, to limit the scope of the chapter, we survey only studies of English. Then, in section 4, we survey pedagogical applications of these descriptive corpus-based studies, discussing how classroom teaching and materials development have been influenced by the corpus revolution.

2. Descriptive Linguistic Studies

2.1. Corpus Studies with a Lexical Focus

Many of the earliest uses of corpora were designed to provide word lists ranked by frequency, comparing the most frequent words in different varieties. For example, Francis and Kucera (1982) and Johansson and Hofland (1989) catalog the most frequent words in the Brown and LOB Corpora, comparing word frequencies in the fiction versus nonfiction components of the corpora.
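A ranked frequency list of the kind described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two tiny "components" are invented sentences rather than Brown or LOB data, the tokenizer is a simple regular expression, and the function names `tokenize` and `frequency_list` are hypothetical. Rates are normalized per 1,000 tokens so that components of different sizes can be compared.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(texts, top_n=5):
    """Return the top_n (word, rate per 1,000 tokens) pairs for a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    total = sum(counts.values())
    return [(word, round(1000 * n / total, 1))
            for word, n in counts.most_common(top_n)]

# Invented stand-ins for a fiction and a nonfiction component (not real corpus data).
fiction = [
    "She said she would go, and he said nothing.",
    "He looked at her and she smiled.",
]
nonfiction = [
    "The results of the study show the effect of the treatment.",
    "The analysis of the data is presented in the table.",
]

print("fiction:", frequency_list(fiction, top_n=3))
print("nonfiction:", frequency_list(nonfiction, top_n=3))
```

Even in this toy comparison the top-ranked items differ between the two components, which is the kind of contrast that frequency lists for real subcorpora are meant to expose.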

One of the major contributions of corpus-based lexical studies has been the insight that collocational associations are a central consideration for describing the meaning of a word. For example, the copular verbs turn, come, and go all have the same dictionary meaning: “to become, or to change to another state.” However, corpus research (Biber et al., 1999: 444–445) shows that these three verbs have very different collocational associations: The most common adjectives following turn are color terms, like black, brown, red, and white. The most common adjectives following come describe processes representing a change to a more dynamic condition, such as alive, awake, clean, loose, and unstuck. And in contrast to both other verbs, the most common adjectives following go are all negative: crazy, mad, and wrong. It is not clear whether differences like these should be regarded as part of the core connotational meaning of a word, but it seems uncontroversial that this kind of information is crucially important for language learners.
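The kind of count behind such a finding can be illustrated with a minimal Python sketch that tallies the word immediately following each form of a node verb. The sentences are invented for illustration, the function name `right_collocates` is hypothetical, and the sketch counts every following word; an actual study would work from a large corpus and would normally use part-of-speech tagging to restrict the count to adjectives.

```python
from collections import Counter, defaultdict
import re

def right_collocates(sentences, node_forms):
    """Count the word immediately following any form of each node verb.

    node_forms maps a lemma to the set of inflected forms to match.
    """
    form_to_lemma = {form: lemma
                     for lemma, forms in node_forms.items()
                     for form in forms}
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Pair each token with the token that follows it.
        for current, following in zip(tokens, tokens[1:]):
            if current in form_to_lemma:
                counts[form_to_lemma[current]][following] += 1
    return counts

# Invented example sentences (not corpus data).
sentences = [
    "The leaves turned brown in October.",
    "His face turned red.",
    "Her hair turned white overnight.",
    "The knot came loose.",
    "The city came alive at night.",
    "Everything went wrong.",
    "The plan went wrong again.",
    "He went mad.",
]
node_forms = {
    "turn": {"turn", "turns", "turned", "turning"},
    "come": {"come", "comes", "came", "coming"},
    "go": {"go", "goes", "went", "going", "gone"},
}

counts = right_collocates(sentences, node_forms)
for lemma in node_forms:
    print(lemma, counts[lemma].most_common())
```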

There have been numerous corpus-based studies of collocation. Probably the best known is Sinclair (1991), who provides detailed descriptions of the collocations of decline, yield, and set in. Another excellent book-length introduction to the corpus-based study of collocation is by Partington (1998). For example, in chapter 2 of his book, Partington discusses the word sheer and its supposed synonyms pure, complete, and absolute, showing how these words are not at all interchangeable when considered from the perspective of their frequent collocates. Mahlberg (2005) provides a book-length treatment of general nouns in English (e.g., time, day, man, woman, people, thing, way), describing their meanings and use with respect to their collocational associations.

Most studies of collocation have disregarded register differences. One exception to this practice appears in a work by Biber, Conrad, and Reppen (1998: 43–53), which shows how the near-synonyms big, large, and great co-occur with very different sets of collocates (e.g., big enough versus large number versus great deal), and further shows how the collocational associations are very different in fiction versus academic writing. Other collocational studies taking a register perspective include those by Gledhill (2000) and Marco (2000), which both describe the functions of collocations in academic research writing.

Studies of collocation have in turn led to development of the notion of semantic prosody (Louw, 1993; Partington, 1998): the positive or negative connotations shared by the set of collocates that co-occur with a word. For example, the copular verb go (previously discussed) has a strong negative semantic prosody, whereas the copular verb come has a positive semantic prosody. Partington (1998: 66–67) discusses another example of this type: the verb commit, which has a strong negative semantic prosody, co-occurring with nouns like crime, suicide, and offenses. Similarly, Sinclair (1991: 74–75) notes that the nouns that co-occur as the subject of set in are mostly unpleasant states of affairs, such as rot, decay, malaise, despair, infection, disillusion, and so on. Studies have tended to focus on words with negative prosodies rather than positive prosodies. Other examples include cause (Stubbs, 1995), signs of (Stubbs, 2001: 458), and sit through (Hunston, 2002b: 60–62).

A related productive area of research has been the corpus-based (and corpus-driven) investigation of formulaic language in spoken and written registers. The methods and research goals of this line of research are quite different from the typical study of collocation. That is, studies of collocation have typically been case studies focused on a few particular words. These studies have typically disregarded register differences, and they have not attempted to generalize to the textual use of collocational combinations generally. In contrast, corpus studies of longer formulaic expressions are normally carried out in the context of a particular register or for the purposes of describing patterns of variation among multiple registers; in addition, the goals of these studies are to generalize about the use of formulaic language in the target registers rather than to present case studies restricted to one or two particular formulaic sequences. For example, Simpson (2004) and Simpson and Mendis (2003) describe the functions of idioms in academic spoken registers.

Many other studies have taken a corpus-driven approach to this research domain, identifying the sequences of words that are most common in different spoken and written registers (rather than starting with a set of formulaic expressions identified a priori based on their perceptual salience). These common word sequences, often referred to as lexical bundles , are usually not idiomatic and are not complete structures, but they are important building blocks of discourse. Thus, for example, Altenberg ( 1998 ) focuses on the recurrent word sequences in spoken English, whereas Biber et al. (1999 , chapter 13 ) compare the lexical bundles in conversation and academic writing. Applying that framework, several studies have considered the types and functions of lexical bundles in additional registers: university classroom teaching and textbooks (Biber, Conrad, and Cortes, 2004 ; Nesi and Basturkmen, 2006 ), university student writing (Cortes, 2004 ), university institutional and advising registers (Biber and Barbieri, 2007 ), and political debate (Partington and Morley, 2004). N. Ellis et al. ( 2008 ) begin with a corpus analysis to identify a set of word sequences that are either frequent or that have strong collocational associations; they then test the psycholinguistic status of those sequences with respect to their perceptual salience and for their role in language production and comprehension (cf. Schmitt, Grandage, and Adolphs, 2004 ).
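The corpus-driven identification of recurrent word sequences can be sketched as follows. This is a minimal illustration, not the procedure of any particular study cited here: the function name `lexical_bundles`, the sequence length of four words, and the cutoffs `min_freq` and `min_texts` are assumptions chosen for the toy data, and the texts are invented. Requiring a sequence to recur in more than one text is a common guard against idiosyncratic repetition by a single speaker or writer.

```python
from collections import Counter, defaultdict
import re

def lexical_bundles(texts, n=4, min_freq=3, min_texts=2):
    """Return n-word sequences that occur at least min_freq times overall
    and appear in at least min_texts different texts."""
    freq = Counter()
    text_ids = defaultdict(set)
    for text_id, text in enumerate(texts):
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram] += 1
            text_ids[gram].add(text_id)
    return {gram: count for gram, count in freq.items()
            if count >= min_freq and len(text_ids[gram]) >= min_texts}

# Invented texts (not corpus data).
texts = [
    "I don't know what you mean. I don't know what to do.",
    "Well I don't know what he said.",
    "As a result of the change, prices rose.",
]

print(lexical_bundles(texts))
```

Note that the only sequence returned for the toy data is neither idiomatic nor a complete structural unit, which matches the characterization of lexical bundles given above. Real studies normalize the frequency cutoff to corpus size (for example, a rate per million words) rather than using a raw count.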

Corpus studies have shown that the types and functions of lexical bundles are very different among spoken and written registers (see, e.g., Biber, Conrad, and Cortes, 2004 ). First of all, there are generally more lexical bundles used in spoken registers than written registers. In terms of their structural characteristics, the bundles in speech tend to be composed of verb phrase and clause fragments, whereas the bundles in writing tend to be composed of noun phrase and prepositional phrase fragments. Those differences correspond to different discourse functions: The bundles in speech tend to be used for stance and discourse organizing functions, whereas the bundles in writing tend to have referential functions.

Of all subareas of applied linguistics, corpus research has probably had the greatest impact on lexical research and vocabulary studies. As previously noted, West (1953) created the General Service List of important vocabulary items based on analysis of a preelectronic corpus, and that list has been used in countless studies of vocabulary acquisition. One of the central concerns has been efforts to estimate the number of different words that a learner needs to know for different communicative purposes. Waring and Nation (1997) use corpus analysis to estimate the number of words needed to comprehend general written texts, whereas Coxhead (2000) analyzed a corpus of academic texts from several disciplines to develop a word list specifically for written academic language. Adolphs and Schmitt (2003) utilize analyses of spoken corpora to estimate the number of words required to understand conversational interactions.
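The logic of such estimates can be illustrated with a short Python sketch that asks how many of the most frequent words are needed to account for a given proportion of the running words in a text. The function name `words_needed_for_coverage`, the coverage targets, and the sample sentence are all assumptions for illustration; the sketch counts individual word forms, whereas published estimates are usually stated in terms of lemmas or word families and are based on large corpora.

```python
from collections import Counter
import re

def words_needed_for_coverage(text, target=0.95):
    """Return how many of the most frequent word forms are needed to cover
    at least `target` (a proportion between 0 and 1) of the tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    # Walk down the frequency-ranked list, accumulating token coverage.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)

# Invented sample text (not corpus data).
sample = "the cat sat on the mat and the dog sat on the rug"

for target in (0.5, 0.8, 1.0):
    print(target, words_needed_for_coverage(sample, target))
```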

Corpus research is similarly accepted as the standard practice in lexicography, so that all major ELT dictionaries are currently based on analysis of actual word use in large corpora (e.g., the Collins CoBuild English Language Dictionary [1987], the Longman Dictionary of Contemporary English [1987], and the Cambridge Advanced Learner's Dictionary [2005]). In sum, it would not be an overstatement to say that corpus research has revolutionized the way that lexicography, vocabulary acquisition, and word use in general are approached in linguistics.

2.2. Corpus Studies with a Grammatical Focus

Within descriptive linguistics, there have been numerous book-length studies over the past 20 years reporting corpus-based investigations of grammar and discourse: for example, Tottie (1991) on negation, Collins (1991) on clefts, Mair (1990) on infinitival complement clauses, Meyer (1992) on apposition, several books on nominal structures (e.g., de Haan, 1989; Geisler, 1995; Johansson, 1995), Mindt (1995) on modal verbs, Hunston and Francis (2000) on pattern grammar, Lindquist and Mair (2004) on grammaticalization, and Mair (2006) on recent grammatical change within American English and British English (in other words, during the twentieth century).

Most corpus-based grammatical studies take a register perspective. Many of these focus on the linguistic variants associated with a feature, using register differences as one factor to account for the patterns of linguistic variation. However, there are an even larger number of studies that have focused on the use of a particular linguistic feature in a single register; in this case, the goals of the study are to describe both the discourse functions of the linguistic feature as well as the target register itself. Studies of both types can be further subdivided according to the linguistic level of the target feature (e.g., grammatical class, dependent clause type). In addition, both types of studies include descriptions of synchronic patterns of use as well as descriptions of historical patterns of variation.

Corpus-based studies of linguistic features using register as a predictor have investigated linguistic variation from all grammatical levels, from simple part of speech categories to variation in the realization of syntactic phrase and clause types. These studies have shown that descriptions of grammatical variation and use are not valid for the language as a whole. Rather, characteristics of the textual environment interact with register differences so that strong patterns of use in one register often represent only weak patterns in other registers. The Longman Grammar of Spoken and Written English (Biber et al., 1999 ) and Cambridge Grammar of English (Carter and McCarthy, 2006 ) are comprehensive reference works with this goal, applying corpus-based analyses to show how any grammatical feature can be described for structural characteristics as well as patterns of use across spoken and written registers.

As previously noted, many corpus-based studies use register differences as a predictor of linguistic variation, whereas others study linguistic features in the context of a single register. Thus, for example, Tottie (1991) contrasts the choices between synthetic and analytic negation, as in

He could find no words to express his pain. versus He couldn't find any words to express his pain.

Among other factors, Tottie shows that synthetic negation is strongly preferred in written rather than in spoken registers, whereas analytic negation is more commonly used in spoken registers. In contrast, Hyland (1998a) focuses on the single register of scientific research articles, describing variation in the use of hedges within that register.

As noted earlier, these studies have documented the use of lexico-grammatical features at all linguistic levels. Several studies analyze a single part-of-speech category, documenting the patterns of variation and use in particular registers. Studies taking the perspective of register variation include Barbieri (2005) on quotative verbs and Römer (2005a) on progressive verbs.

Several other studies describe linguistic variation within the context of a single spoken register, such as conversation. Quaglio and Biber (2006) survey the distinctive grammatical characteristics of conversation identified through corpus research, whereas other studies provide detailed descriptions of a particular feature in conversation. For example, McCarthy (2002) describes nonminimal response tokens; Aijmer (2002) provides a book-length description of discourse particles; Carter and McCarthy (2006) describe the discourse functions of the get passive; Tao and McCarthy (2001) focus on nonrestrictive which clauses; and Norrick (2008) describes the discourse functions of interjections in conversational narratives. Other studies of a single spoken register have focused on academic speech in university settings, based on analysis of the Michigan Corpus of Academic Spoken English (MICASE). For example, Fortanet (2004) focuses on the pronoun we in university lectures; Lindemann and Mauranen (2001) describe the use of just in academic speech; and Swales (2001) provides a detailed description of the discourse functions served by point and thing in university academic speech.

A much larger number of studies have described linguistic variation within the context of a particular written register, most often a type of academic writing. Many of these have focused on the kinds of verbs used in research writing (e.g., Thomas and Hawes, 1994), or the referring expressions in research articles (e.g., Hyland, 2001, on the use of self-mentions and Kuo, 1999, on the role relationships expressed by personal pronouns). Other studies deal with simple grammatical structures, but again most often within the context of academic writing. For example, Hyland (2002a) and Swales et al. (1998) describe variation in the use of imperatives and the expression of directives, whereas Hyland (2002b) and Marley (2002) focus on the use of questions in written registers.

The study of linguistic variation related to the expression of stance and modality has been especially popular in corpus-based research. Several of these studies compare the ways in which stance is expressed in spoken versus written registers. Biber and Finegan (1988) and Conrad and Biber (2001) focus on adverbial markers of stance in speech and writing, whereas Biber and Finegan (1989a, 1989b) and Biber et al. (1999, chapter 12) survey variation in the use of numerous grammatical stance devices (including modal verbs, stance adverbials, and stance complement clause constructions), again contrasting the patterns of use in spoken versus written registers. Biber (2006a, 2006b) and Keck and Biber (2004) take a similar approach but applied to university spoken and written registers.

Many other studies focus exclusively on the expression of stance and modality in written registers (usually academic writing). These include Vohla's (1999) study of modality in medical research writing, the studies of stance by Charles (2003 , 2006 , 2007) on academic writing from different disciplines, and several studies that focus on hedging in academic writing (e.g., Grabe and Kaplan, 1997 ; Hyland, 1996 , 1998a ; Salager, 1994 ). Related studies have been carried out under the rubric of evaluation , again usually focusing on academic writing (e.g., Hunston and Thompson, 2000 ; Hyland and Tse, 2005 ; Römer, 2005b ; Stotesbury, 2003 ; Tucker, 2003 ; cf. Bednarek's 2006 study of evaluation in newspaper language). Fewer studies have described the linguistic devices used to express stance and evaluation in spoken registers; some of these have focused on conversation (e.g., McCarthy and Carter, 1997 , 2004 ; Tao, 2007 ), whereas others have focused on academic spoken registers (e.g., Mauranen, 2003 , 2004 ; Mauranen and Bondi, 2003 ; Swales and Burke, 2003 ).

Dependent clauses and more complex syntactic structures have also been the focus of numerous corpus-based studies that consider register differences. Several studies contrast the patterns of use in spoken and written registers: Collins (1991) on cleft constructions, de Haan (1989) on nominal postmodifiers, Geisler (1995) on relative infinitives, Johansson (1995) on relative pronoun choice, and Biber et al. (1999) on complement clause constructions. Other studies have focused on the use of a syntactic construction in a particular register, like the study of conditionals in medical discourse (G. Ferguson, 2001) or the study of extraposed constructions in university student writing (Hewings and Hewings, 2002).

All of the kinds of studies surveyed in the preceding paragraphs can be approached from a historical (or diachronic) perspective rather than a synchronic perspective, and numerous studies have taken that approach. For example, many of the papers in the edited volumes by Nevalainen and Kahlas-Tarkka (1997) and Kytö, Rydén, and Smitterberg (2006) incorporate register comparisons to describe historical change for linguistic features like existential clauses, adverbial clauses, and relative clauses. Biber and Clark (2002) contrast the kinds of noun modifiers common in academic versus popular written registers. Several historical studies of stance and modality have included analysis of register differences, such as Kytö (1991) on modal verbs in written and speech-based registers, Culpeper and Kytö (1999) on hedges in Early Modern English dialogues, Salager-Meyer and Defives (1998) on hedges in academic writing over the last two centuries, Fitzmaurice (2002b, 2003) on stance and politeness in early eighteenth-century letters, and Biber (2004) on historical change in the use of stance and modal features across a range of speech-based and written registers. A few studies have focused on recent (i.e., twentieth-century) historical change; for example, Hundt and Mair (1999) contrast the rapid grammatical change observed in “agile” registers (like newspaper writing) with the much slower pace of change observed in “uptight” registers like academic prose. Leech, Hundt, Mair, and Smith (in press) track historical change in the twentieth century using the register categories distinguished in the Brown/LOB family of corpora.

3. Descriptions of Varieties

3.1. Register Descriptions

The studies surveyed in the preceding section focus on a particular linguistic feature, using register to describe the use of that feature. In the present section, the analytical perspective is reversed: These studies focus on the overall description of a register, considering a suite of linguistic features that are characteristic of the register.

Many studies of this type describe spoken registers, including conversation (e.g., Biber, 2008; Carter and McCarthy, 1997, 2004; Quaglio and Biber, 2006; Biber and Conrad, in press: chapter 4), service encounters (e.g., McCarthy, 2000), call center interactions (Friginal, 2009a, 2009b), spoken business English (McCarthy and Handford, 2004), television dialogue (Quaglio, 2009; Rey, 2001), spoken media discourse (O'Keeffe, 2006), and spoken university registers like classroom teaching, office hours, and teacher-mentoring sessions (e.g., Biber, 2006a; Biber, Conrad, and Leech, 2002; Csomay, 2005; Reppen and Vásquez, 2007). Ädel and Reppen (2008) include several papers that use corpus analysis to describe different registers from academic, workplace, and television settings.

However, written registers have received considerably more attention than spoken registers. Academic prose has been the best described written register (see, e.g., Biber, 2006a; Biber, Connor, and Upton, 2007; Connor and Mauranen, 1999; Connor and Upton, 2004b; Conrad, 1996, 2001; Freddi, 2005; McKenna, 1997; Tognini-Bonelli and Del Lungo Camiciotti, 2005). But many other written registers have also been described using corpus-based analysis, including personal letters (e.g., Connor and Upton, 2003; Fitzmaurice, 2002a; Precht, 1998), written advertisements (e.g., Bruthiaux, 1994, 1996, 2005), newspaper discourse (e.g., Bednarek, 2006; Herring, 2003; Jucker, 1992), and fiction (e.g., Thompson and Sealey, 2007; Mahlberg, in press; Semino and Short, 2004). Electronic registers that have emerged over the past few decades, from e-mail communication to weblogs and texting, have been an especially interesting and productive area of research (see, e.g., Biber and Conrad, in press: chapter 7; Danet and Herring, 2003; Gains, 1999; Herring and Paolillo, 2006; Hundt, Nesselhauf, and Biewer, 2007; Morrow, 2006).

3.2. Multidimensional Analyses of Register Variation

Most of the studies previously listed have the primary goal of describing a single register. However, corpus analysis can also be used to describe the overall patterns of variation among a set of spoken and/or written registers. Perhaps the best known approach used for descriptions of this type is multidimensional (MD) analysis: a corpus-driven methodological approach that identifies the frequent linguistic co-occurrence patterns in a language, relying on inductive empirical/quantitative analysis (see, e.g., Biber, 1988 , 1995 ; Biber and Conrad, in press: chapter 8 ). Frequency plays a central role in the analysis, because each dimension represents a constellation of linguistic features that frequently co-occur in texts. These dimensions of variation can be regarded as linguistic constructs not previously recognized by linguistic theory. Thus, MD analysis is a corpus-driven (as opposed to corpus-based) methodology, in that the linguistic constructs—the dimensions—emerge from analysis of linguistic co-occurrence patterns in the corpus. The set of co-occurring linguistic features that comprise each dimension is identified quantitatively. That is, based on the actual distributions of linguistic features in a large corpus of texts, statistical techniques (specifically, factor analysis) are used to identify the sets of linguistic features that frequently co-occur in texts.

The original MD analyses (Biber, 1986 , 1988 ) investigated the relations among general spoken and written registers in English, based on analysis of the Lancaster-Oslo/Bergen (LOB) Corpus (15 written registers) and the London-Lund Corpus (6 spoken registers). Sixty-seven different linguistic features were analyzed computationally in each text of the corpus. Then, the co-occurrence patterns among those linguistic features were analyzed using factor analysis, identifying the underlying parameters of variation—in other words, the factors or dimensions.
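The input to such an analysis can be sketched in Python. The example below only prepares the co-occurrence information that a factor analysis would start from: it normalizes raw feature counts to rates per 1,000 words and computes pairwise Pearson correlations between features across texts. It does not perform the factor analysis itself, which would normally be run with a statistics package. The four texts, their counts, the choice of three features, and the function names are invented for illustration and are not taken from the 1988 study.

```python
import math

def rate_per_thousand(count, text_length):
    """Normalize a raw feature count to a rate per 1,000 words."""
    return 1000 * count / text_length

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_matrix(feature_rates):
    """Map each ordered pair of feature names to the correlation of their rates."""
    names = list(feature_rates)
    return {(a, b): pearson(feature_rates[a], feature_rates[b])
            for a in names for b in names}

# Invented counts for four hypothetical texts (not data from any real corpus).
texts = [
    {"length": 2000, "nouns": 600, "prepositions": 280, "first_person_pronouns": 10},
    {"length": 1500, "nouns": 420, "prepositions": 195, "first_person_pronouns": 15},
    {"length": 1000, "nouns": 180, "prepositions": 80, "first_person_pronouns": 60},
    {"length": 2500, "nouns": 400, "prepositions": 175, "first_person_pronouns": 175},
]
features = ["nouns", "prepositions", "first_person_pronouns"]

feature_rates = {
    f: [rate_per_thousand(t[f], t["length"]) for t in texts] for f in features
}
corr = correlation_matrix(feature_rates)

for (a, b), r in corr.items():
    if a < b:  # print each unordered pair once
        print(f"{a} ~ {b}: {r:.2f}")
```

In the invented data, nouns and prepositions rise and fall together while first-person pronouns pattern in the opposite direction. A factor analysis groups features with shared patterns of this kind into a single underlying dimension, with the opposed features at the other pole.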

In the 1988 MD analysis, the 67 linguistic features were reduced to 7 underlying dimensions. (The technical details of the factor analysis are given in Biber, 1988 : chapters 4 – 5 ; see also Biber, 1995 : chapter 5 ). The dimensions are interpreted functionally, based on the assumption that linguistic co-occurrence reflects underlying communicative functions; that is, linguistic features occur together in texts because they serve related communicative functions. For example, table 38.1 lists the important co-occurring features for dimensions 1 and 2 from the 1988 MD analysis, together with the labels reflecting the functional interpretation.

Many subsequent studies have applied the 1988 dimensions of variation to study the linguistic characteristics of other more specialized registers and discourse domains (Conrad and Biber, 2001). However, other MD studies have undertaken new corpus-driven analyses to identify the distinctive sets of co-occurring linguistic features that appear in a particular discourse domain or in a language other than English. The following section surveys some of those studies.

3.2.1 Comparison of the Multidimensional Patterns across Discourse Domains and Languages

Numerous other studies have undertaken complete MD analyses, using factor analysis to identify the dimensions of variation operating in a particular discourse domain in English rather than applying the dimensions from the 1988 MD analysis (e.g., Biber, 1992 , 2001 , 2006a , 2008 ; Biber, Connor, and Upton, 2007 ; Biber and Jones, 2005 ; Biber and Kurjian, 2007 ; Friginal 2006 , 2009b ; Kanoksilapatham, 2005 , 2007 ; Reppen, 2001 ).

Given that each of these studies is based on a different corpus of texts, representing a different discourse domain, it is reasonable to expect that they would each identify a unique set of dimensions. This expectation is reinforced by the fact that the more recent studies have included additional linguistic features not used in earlier MD studies (e.g., semantic classes of nouns and verbs). However, despite these differences in design and research focus, there are certain striking similarities in the set of dimensions identified by these studies.

Most important, in nearly all of these studies, the first dimension identified by the factor analysis is associated with a literate, informational focus (e.g., nouns, prepositional phrases, attributive adjectives, longer words) versus an oral, involved focus (personal involvement/stance, interactivity, and/or real time production features). For example, the MD studies of university spoken and written registers (Biber, 2006a), elementary school spoken and written registers (Reppen, 2001), and eighteenth-century written and speech-based registers (Biber, 2001) all identified a first dimension of this type. More surprisingly, a similar dimension has emerged even in MD studies that have focused exclusively on spoken registers, such as that of M. White (1994), which investigated register variation within the domain of job interviews, and of Biber (2008), which investigated register variation among the different types of conversation. A second parameter found in most MD analyses corresponds to narrative discourse, reflected by the co-occurrence of features like past tense, third-person pronouns, perfect aspect, and communication verbs (see, e.g., the Biber, 2006a, study of university registers; Biber, 2001, on eighteenth-century registers; and the Biber, 2008, study of conversation text types).

However, most of these studies have also identified some dimensions that are unique to the particular discourse domain. For example, Reppen's (1994) factor analysis identified a dimension of “other-directed idea justification” in elementary student registers. The study of university spoken and written registers (Biber, 2006a) identified two dimensions that are specialized to the university discourse domain: “Procedural versus content-focused discourse” and “academic stance.”

In sum, corpus-driven MD studies of English registers have uncovered both surprising similarities and notable differences in the underlying dimensions of variation. Two parameters seem to be fundamentally important, regardless of the discourse domain: a dimension associated with informational focus versus (inter)personal focus and a dimension associated with narrative discourse. At the same time, these MD studies have uncovered dimensions particular to the communicative functions and priorities of each different domain of use.

These same general patterns have emerged from MD studies of languages other than English, including Nukulaelae Tuvaluan (Besnier, 1988), Korean (Kim and Biber, 1994), Somali (Biber and Hared, 1992, 1994), Taiwanese (Jang, 1998), Spanish (Biber, Davies, Jones, and Tracy-Ventura, 2006; Biber and Tracy-Ventura, 2007; Parodi, 2007), and Dagbani (Purvis, 2008). Taken together, these studies provide the first comprehensive investigations of register variation in non-English languages.

Biber (1995) synthesizes several of these studies to investigate the extent to which the underlying dimensions of variation and the relations among registers are configured in similar ways across languages. These languages show striking similarities in their basic patterns of register variation, as reflected by the co-occurring linguistic features that define the dimensions of variation in each language, the functional considerations represented by those dimensions, and the linguistic/functional relations among analogous registers. For example, similar to the full MD analyses of English, these MD studies have all identified dimensions associated with informational versus (inter)personal purposes and with narrative discourse.

At the same time, each of these MD analyses has identified dimensions that are unique to a language, reflecting the particular communicative priorities of that language and culture. For example, the MD analysis of Somali identified a dimension interpreted as “distanced, directive interaction,” represented by optative clauses, first- and second-person pronouns, directional preverbal particles, and other case particles. Only one register is especially marked for the frequent use of these co-occurring features in Somali—personal letters. This dimension reflects the particular communicative priorities of personal letters in Somali, which are typically interactive as well as explicitly directive.

The cross-linguistic comparisons further show that languages as diverse as English and Somali have undergone similar patterns of historical evolution following the introduction of written registers. For example, specialist written registers in both languages have evolved over time to styles with an increasingly dense use of noun phrase modification. Historical shifts in the use of dependent clauses are also surprising: in both languages, certain types of clausal embedding—especially complement clauses—turn out to be associated with spoken registers rather than with written registers.

These synchronic and diachronic similarities raise the possibility of universals of register variation. Synchronically, such universals reflect the operation of underlying form/function associations tied to basic aspects of human communication; diachronically, such universals relate to the historical development of written registers in response to the pressures of modernization and language adaptation.

3.3. Corpus-Based Studies of Historical Registers

Corpus analysis has been especially important for historical descriptions of registers (see Biber and Conrad, in press: chapter 6). Multidimensional analysis has been used to document historical patterns of register variation (e.g., Atkinson, 1992, 1996, 1999; Biber, 2001; Biber and Finegan, 1989a, 1997; Geisler, 2002). However, there has been an even larger number of studies that provide a detailed description of a single historical register. A few MD studies have focused on a specific register, such as the study of historical change in fictional dialogue by Biber and Burges (2000) or the study of recent changes in television dialogue (Rey, 2001). But most of these studies provide detailed descriptions of the linguistic characteristics of a historical register. Several of these studies analyze spoken registers from earlier historical periods (e.g., Culpeper and Kytö, 2000, forthcoming; Kahlas-Tarkka and Rissanen, 2007; Kryk-Kastovsky, 2000, 2006; Kytö and Walker, 2003). The great majority, though, focus on written historical registers, such as letters (Fitzmaurice, 2002a; Nevala, 2004), medical recipes and herbals (Mäkinen, 2002; Taavitsainen, 2001), and medical and scientific writing (e.g., Taavitsainen and Pahta, 2000, 2004).

3.4. World Englishes and English as a Lingua Franca (ELF)

In general, sociolinguistics has been resistant to the application of corpus-based analyses, and so most studies of social and regional dialect variation continue to employ traditional methodologies. However, a few research projects have studied regional dialect variation from a corpus perspective. For the most part, these projects have been conducted in European universities (Freiburg, Helsinki, Newcastle) and have focused on British English dialects, resulting in the Newcastle Electronic Corpus of Tyneside English, the Helsinki Corpus of British English Dialects (see Ihalainen, 1990), and the Freiburg English Dialect Corpus (FRED; see Kortmann and Wagner, 2005; Anderwald and Wagner, 2005). We are aware of only one study to date that has applied a corpus approach to analyze American English regional dialects: Grieve's (2009) study of variation in a 50-million-word corpus of letters to the editor collected from 200 cities from across the United States.

In contrast, the linguistic study of global varieties of English, or “World Englishes,” is almost always carried out from a corpus perspective. The strengths of the corpus approach make it ideal for describing new varieties that have emerged as English adapts to changing circumstances of use and contact with local languages and cultures (see Breiteneder, 2008). Research efforts in this area have focused on two major subareas: the study of World Englishes (indigenous varieties of English) and the study of English as a Lingua Franca (ELF; English used by nonnative English speakers). (See J. Jenkins, 2006, for a full discussion of this topic.)

Corpus development efforts in the arena of World Englishes are best represented by the International Corpus of English (ICE) project. The ICE project is an attempt to construct comparable corpora for all varieties of English spoken around the world (see Greenbaum, 1988, 1990a, 1990b, 1990c, 1991, 1996; Greenbaum and Nelson, 1996). Each corpus in ICE ideally has the same design: a total size of one million words, with 500 texts of approximately 2,000 words each from the same registers (news, conversation, etc.). The texts in the corpus date from 1990 or later. The authors and speakers of the texts are aged 18 or over, are educated through the medium of English, and either were born in the target country or moved there at an early age (Nelson, 1996).
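
The design parameters just described can be checked with simple arithmetic. The following minimal Python sketch is illustrative only: the class name `CorpusDesign` and the function `meets_ice_criteria` are hypothetical and are not part of any ICE tooling, and the eligibility check covers the stated criteria only in simplified form.

```python
from dataclasses import dataclass


@dataclass
class CorpusDesign:
    """Hypothetical container for the ideal design of one ICE component corpus."""
    n_texts: int
    words_per_text: int

    def total_words(self) -> int:
        # Total corpus size is the number of texts times the approximate text length.
        return self.n_texts * self.words_per_text


def meets_ice_criteria(text_year: int, speaker_age: int,
                       educated_in_english: bool) -> bool:
    """Simplified check of the criteria stated in the text (an assumption of this sketch).

    The birthplace / early-arrival criterion is omitted for brevity.
    """
    return text_year >= 1990 and speaker_age >= 18 and educated_in_english


# 500 texts of roughly 2,000 words each
ice_component = CorpusDesign(n_texts=500, words_per_text=2000)
print(ice_component.total_words())
print(meets_ice_criteria(text_year=1994, speaker_age=25, educated_in_english=True))
```

Because every component corpus shares these parameters, counts taken from different national corpora can in principle be compared directly.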

As part of the ICE project or other related efforts, individual corpora have been constructed for many of the varieties of English used around the world. These include corpora for the “inner-circle” varieties of English (e.g., for Australia, Canada, Great Britain, New Zealand, the United States; see http://www.ucl.ac.uk/english-usage/ice/) as well as corpora for numerous other varieties of English spoken around the world, such as Caribbean English, East African English, Fiji English, Filipino English, Hong Kong English, Indian English, Jamaican English, Nigerian English, Singaporean English, and Xhosa English (see, e.g., Banjo, 1996; Bolt and Kingsley, 1996; Bolton, 2000; Burridge and Kortmann, 2008; Friginal, 2009b; Holmes, 1996; Hundt, 1998, 2006; Hundt and Biewer, 2007; Kortmann, 2006; Mair, 1992; Mair and Sand, 1998; Ooi, 1997; Rogers, 2002, 2003; Sand, 1998, 1999; Schmied, 1990, 1994, 2004a, 2004b, 2005, 2006, 2007; Schmied and Hudson-Ettle, 1996; Tent and Mugler, 1996, 2004).

A parallel research effort has focused on English as a lingua franca (ELF). Two especially important projects in this area have been the Vienna Oxford International Corpus of English (VOICE; see Seidlhofer, 2006, 2007; Seidlhofer, Breiteneder, and Pitzl, 2006; Breiteneder et al., 2006) and the corpus of English as Lingua Franca in Academic Settings (ELFA corpus; see Mauranen, 2003, 2006, 2007).

4. Corpus Linguistics, Language Learning, and Language Pedagogy

Explorations into the pedagogical applications of corpus linguistics continue to match ongoing advancements in corpus-based technology and classroom research. Vocabulary acquisition and the mastery of grammar for language learners have traditionally been the preferred areas of investigation by many corpus researchers involved in the design and creation of language teaching materials (Conrad, 1999, 2000; Hinkel, 2002). However, in recent years, corpus tools have been utilized in the teaching of specific skills, particularly genre-based writing (Hyland, 2004b; Swales, 2002) and speaking in various academic and professional contexts.

There are several points of intersection between corpus linguistics and directly applied issues that involve language teaching and learning. In the following sections, we address four of these:

The compilation and analysis of learner corpora

The use of corpora for language teaching and learning

Applications of corpus research in ESP/EAP

The extent to which corpus findings can be integrated into textbooks and other teaching materials

4.1. Learner Corpora

One major application of corpus methods has been in the construction of learner corpora and the analysis of those corpora to document differences across L1 backgrounds. The most important project of this type is the International Corpus of Learner English (ICLE), a collection of corpora produced by learners from several different language backgrounds (see, e.g., Granger, 1993, 1994, 1996, 1998a, 2003a, 2003b). Many studies have compared the patterns of use in learner corpora to those found in native-English corpora to document patterns of overuse or underuse by learners. Studies have focused on a wide range of grammatical features, such as passives, participle clauses, connectors, and so on (see Aarts and Granger, 1998; Granger, 1997a, 2004; Granger and Tyson, 1996; Granger, Hung, and Petch-Tyson, 2002). Many studies in this tradition have also focused on formulaic sequences and the lexico-grammatical patterns associated with different learner groups (see, e.g., Altenberg and Granger, 2001; De Cock, 1998; De Cock et al., 1998; Granger, 1998b; Meunier and Granger, 2008). Although most corpus studies of learner language have been based on the ICLE, there have also been major studies with similar research goals undertaken from other perspectives (e.g., Hinkel, 2002, 2003; Reder, Harris, and Setzler, 2003).
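
One common way to quantify overuse or underuse is to compare normalized rates of occurrence in a learner corpus and in a native-speaker reference corpus. The sketch below illustrates the idea under simple assumptions: the two toy texts, the connector list, and the function names `rate_per_1000` and `usage_ratio` are invented for illustration and are not drawn from ICLE or from any of the studies cited.

```python
import re


def tokenize(text: str) -> list[str]:
    # Very rough tokenizer (an assumption of this sketch): lowercase alphabetic strings.
    return re.findall(r"[a-z']+", text.lower())


def rate_per_1000(text: str, targets: set[str]) -> float:
    """Occurrences of the target words per 1,000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in targets)
    return 1000 * hits / len(tokens)


def usage_ratio(learner_text: str, reference_text: str, targets: set[str]) -> float:
    """Ratio above 1 suggests overuse by learners; below 1 suggests underuse."""
    reference_rate = rate_per_1000(reference_text, targets)
    if reference_rate == 0:
        raise ValueError("target words do not occur in the reference text")
    return rate_per_1000(learner_text, targets) / reference_rate


# Toy data, invented for illustration only
connectors = {"however", "moreover"}
learner = "however the result is good however the test is hard moreover it is long"
reference = "the result is good but the test is hard and however it is long"

print(round(usage_ratio(learner, reference, connectors), 2))
```

Normalizing to a rate per 1,000 words keeps the comparison meaningful when the two corpora differ in size; real studies would also test whether the difference is statistically significant.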

4.2. Corpora for Language Teaching and Learning

An even larger number of studies address the use of corpora for language teaching, introducing the approaches and discussing potential pedagogical benefits. These include numerous book-length treatments (e.g., Aston, 2001a; Aston, Bernardini, and Stewart, 2004; Botley, McEnery, and Wilson, 2000; Burnard and McEnery, 2000; Ghadessy et al., 2001; Lewandowska-Tomaszczyk, 2003, 2004; McEnery and Wilson, 1997; Mukherjee and Rohrbach, 2006; O'Keeffe, McCarthy, and Carter, 2007; Sinclair, 2004; Thomas and Short, 1996; Tribble and Jones, 1997; Wichmann, Fligelstone, McEnery, and Knowles, 1997) as well as an even larger number of journal articles and book chapters (e.g., Alderson, 1996; Aston, 1995, 1997, 2001b; Barbieri and Eckhardt, 2007; Braun, 2005; Brodine, 2001; Donley and Reppen, 2001; Fligelstone, 1993; Huckin and Coady, 1999; Kaltenböck and Mehlmauer-Larcher, 2005; Leech, 1997, 2000; McCarthy and Carter, 2001; McEnery and Wilson, 1993, 1997, 2001; Meunier, 2002; Milton, 1998; Mindt, 1996; Mudraya, 2006; Murphy, 1996; O'Keeffe and Farr, 2003; Partington, 2001; Salsbury and Crummer, 2008; Shirato and Stapleton, 2007; Thompson and Tribble, 2001; Tribble, 2001; Yoon and Hirvela, 2004; Zorzi, 2001).

One especially common topic of these studies is the use of concordancing activities in the classroom, especially for inductive, data-driven learning (in addition to many of the studies previously cited, see Cobb, 1997; Flowerdew, 2001; Gaskell and Cobb, 2004; Gavioli, 1997, 2001; Johns, 1994, 1997; Nesselhauf, 2003; Qiao and Sussex, 2001; Sinclair, 2003; Stevens, 1993; Todd, 2001; Wichmann, 1995). For instance, Cobb (1997) and Horst, Cobb, and Nicolae (2005) report specific learning gains in the transfer of vocabulary knowledge of language learners that are attributable to the use of concordance programs and corpus-based tools. Similar studies by Chan and Liou (2005), Charles (2005), and Friginal (2006) illustrate how web-based concordancing instruction and the use of concordancers in editing laboratory reports significantly help students' learning and use of verb-noun collocations, reporting verbs, passive and active sentence structures, and linking adverbials. Most participants in these studies see the use of concordancers as helpful. Innovative corpus tools that aid in the introduction of new words, collocations, and lexical bundles help learners to improve their awareness of word meanings and of the uses of words in various contexts. Hands-on concordancing also aids in successful learning of new academic vocabulary and enhances students' performance in activities and on tests (Altenberg and Granger, 2001; McCarthy and Carter, 2002; Nesselhauf, 2005).
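
To make concrete what a concordancing activity shows learners, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not the implementation of any concordance program mentioned above; the function name `kwic`, the tokenizer, and the sample sentence are assumptions made for illustration.

```python
import re


def kwic(text: str, keyword: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit of `keyword`.

    `width` is the number of tokens shown on each side. Punctuation is dropped
    by the simple tokenizer used here.
    """
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, for illustration only
sample = ("The results suggest that learners make strong claims. "
          "Writers suggest caution.")

for left, node, right in kwic(sample, "suggest", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Aligning the keyword in a central column is what lets learners scan the words to its left and right and induce recurring patterns for themselves.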

Other studies focus more on the unexpected research findings that result from corpus investigations, discussing how such findings often indicate that we should be using radically different pedagogical approaches and different teaching materials than those traditionally used for language teaching (see, e.g., Carter and McCarthy, 1995; Conrad, 1999, 2000; Henry and Roseberry, 2001; Hughes and McCarthy, 1998; Hunston, 2002b; Hunston and Francis, 1998; Liu, 2003; Nesselhauf, 2003). For example, Biber and Reppen (2002) present corpus findings that identify the most common verbs in English conversation and then survey ESL grammar books to show that most of them fail to illustrate the use of those verbs.

4.3. Corpora and ESP/EAP

Research in the subfields of English for specific purposes (ESP) and English for academic purposes (EAP) has become almost entirely corpus based over the past 10 to 20 years. For example, a survey of articles in any recent issue of English for Specific Purposes or the Journal of English for Academic Purposes shows that recent linguistic descriptions of special/academic varieties in English are almost always based on corpus analysis.

Similarly, corpus approaches have become commonplace for ESP/EAP pedagogy. For example, Gilquin, Granger, and Paquot (2007), Hyland (2004b), Flowerdew (2005), and Gavioli (2005) all acknowledge the invaluable contribution of corpus approaches in the teaching of ESP/EAP, especially in increasing learners' awareness of the textual features of the target language. Yoon and Hirvela (2004) and Lee and Swales (2006) explore the use of corpora and corpus tools in EAP courses. For example, Lee and Swales piloted an innovative 13-week course in corpus-informed EAP, in which students were able to compare their writing with the linguistic patterns in a corpus of professional, published academic papers. These studies indicate that the corpus approach to academic writing facilitates the development of writing skills and contributes to learners' increased confidence; a majority of the participants in these studies reported that they would recommend corpus-informed writing classes to other foreign students.

4.4. Corpus-Informed Language Textbooks

In contrast to the extremely large number of books and research papers that advocate the application of corpus approaches for language teaching, there are surprisingly few language textbooks that are based on corpus research. ELT dictionaries, which have been based on corpus research since the 1980s, are the major exception here (see sections 1 and 2). However, publishers have been more reluctant to break with tradition in ELT textbooks for vocabulary and grammar.

There are a few notable exceptions to this generalization. In some cases, textbooks have been shaped by corpus analysis, even though this influence is not acknowledged on the book cover or in the introduction. Such books include the series Vocabulary in Use (McCarthy and O'Dell, 2001, 2004, 2005) and Natural Grammar (Thornbury, 2004). In more recent years, though, publishers have become more willing to market ESL textbooks that are directly shaped by the results of corpus research. For example, the four-level EFL/ESL Touchstone series by McCarthy, McCarten, and Sandiford (2006) is advertised as drawing on “the Cambridge International Corpus … to build a syllabus based on how people actually use English” (back cover). Vocabulary books like those by Schmitt and Schmitt (2005) and Huntley (2006) are corpus based in two major respects:

They teach the words on the “Academic Word List”: a list of the most common vocabulary items that occur in a large corpus of written academic texts (see Coxhead, 2000, previously discussed in section 2.1).

They provide practice in the typical “collocations” of those words, derived from further corpus analysis.
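
As a rough illustration of how typical collocations might be derived from a corpus, the following Python sketch counts the words that occur within a fixed window of a node word. This is a simplified assumption, not the procedure used by the textbook authors: the function name `collocates`, the window size, and the sample text are hypothetical, and real collocation studies usually apply association measures rather than raw counts.

```python
import re
from collections import Counter


def collocates(text: str, node: str, window: int = 2) -> Counter:
    """Count words appearing within `window` tokens to either side of `node`.

    Note: this simple version lets the window cross sentence boundaries.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sample text, for illustration only
sample_text = ("We conduct research. They conduct research on corpora. "
               "Research methods vary.")

print(collocates(sample_text, "research", window=1).most_common(3))
```

With a large academic corpus in place of the toy text, the most frequent neighbours of each word on a vocabulary list would give candidate collocations for practice exercises.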

Corpus-based EAP curricula are widely used throughout Europe and Asia, but they are usually based on locally created materials rather than on a major textbook. One exception to this is the corpus-informed textbook on chemistry research writing by Robinson, Stoller, Costanza-Robinson, and Jones (2008). This book is actually targeted at all students of chemistry, because native speakers of English encounter many of the same challenges in learning advanced disciplinary writing skills as do language learners.

It is possible to make a distinction between corpus-informed textbooks and corpus-based textbooks: The former incorporate natural examples taken from a corpus, whereas in the latter, decisions about inclusion/exclusion of topics and the sequence of topics are made based on the results of prior corpus analysis. In many cases, a corpus-based book will present linguistic patterns of use that would not have even been acknowledged in a traditional textbook. The vocabulary books by Schmitt and Schmitt (2005) and Huntley (2006) are corpus based in this sense. The grammar book by Thornbury (2004) also seems to be corpus based in this sense, although there is nothing in the book introduction that acknowledges the role of corpus analysis.

Two recent books provide corpus-based introductions to English grammar for advanced students training to become language teachers: The Longman Student Grammar of Spoken and Written English (and the accompanying workbook; Biber, Conrad, and Leech, 2002; Conrad, Biber, and Leech, 2002) and the Teacher's Grammar of English (Cowan, 2008). Finally, Conrad and Biber (in press) identifies 50 of the most important and surprising corpus research findings from the Longman Grammar of Spoken and Written English, presenting those as grammar units for ESL/EFL students.

5. Future Directions

The present chapter has surveyed the extensive body of research using corpus analysis to describe the patterns of language use in English (and other languages). In addition, there is no shortage of studies that advocate the application of corpus approaches for language teaching. However, as described in the last section, there has been much less effort given to the actual implementation of corpus research findings to develop teaching materials, especially textbooks that can provide the basis for a curriculum. Several such books are currently in the works, though, and we anticipate that this state of affairs will change dramatically over the next few years.

One specific area that is currently receiving attention is the analysis of spoken corpora annotated for prosody in addition to lexico-grammatical information. Interestingly, the very first large spoken corpus of English, the London-Lund Corpus, included detailed coding to reflect pitch, length, and pausing phenomena (see Svartvik, 1990). However, this information was mostly disregarded in linguistic analyses of that corpus. More recently, though, spoken corpora are being analyzed to document systematic patterns of discourse intonation. Cheng, Greaves, and Warren's (2008; cf. Warren, 2004) study of the Hong Kong Corpus of Spoken English is one notable example of this type. Similarly, the C-ORAL-ROM project (Cresti and Moneglia, 2005) is a major research effort to develop acoustically analyzed spoken corpora for Italian, French, Spanish, and Portuguese.

Finally, multimodal annotation of spoken interactions should be another important area for future research (see, e.g., Gu, 2002, 2007). In addition to enhanced prosodic and acoustic transcriptions of spoken corpora, these projects link video recordings to nonlinguistic features that play a crucial role in communication, such as facial expressions, hand gestures, and body position (see, e.g., Carter and Adolphs, 2008; Dahlmann and Adolphs, in press; Knight and Adolphs, 2008). Studies like these indicate that the strengths of corpus analysis can be extended to include aspects of communication beyond the analysis of the lexico-grammatical fabric of spoken and written texts.

Copyright © 2023 Oxford University Press

Corpus Linguistics and Corpus-Based Research and Its Implication in Applied Linguistics: A Systematic Review

PAROLE: Journal of Linguistics and Education

This article presents a systematic survey of notable progress in corpus research conducted by researchers affiliated with institutions all over the world. The overview covers 20 notable studies selected from research-intensive institutions around the world. These projects employ corpus techniques and technology to address a wide range of research questions relevant to linguistic studies, language teaching and learning, cultural studies, and discourse analysis. These varied applications of corpus techniques and advances illustrate the considerable value and opportunities that corpus linguistics can offer to those who research, teach, and learn language.

Related Papers

Ali Alhaidary

In the domain of English as a foreign language, many learners still face some difficulties in articulating vowel sounds of English, particularly the front ones. This study attempts to describe Yemeni EFL learners' errors while pronouncing the English front vowels. The subject of this study was the eighth-semester students studying at the English department, Faculty of Education, Mahweet branch, Sana'a University. The descriptive qualitative approach was utilized in this study. The data was collected by recording learners' pronunciation of English front vowel minimal pairs. The data was then compared with the ones produced by an English native speaker. The findings were confirmed by the spectrogram technique to identify the formants of F1 and F2 made by the learners in Praat Software Device. English front vowels /i /-/ I /-/ Ԑ / and / ae / were the target sounds. The study's findings revealed that Yemeni EFL learners' pronunciation differs from those produced by native English speakers. It was also revealed that Yemeni EFL learners are still struggling in articulating English front vowel sounds due to the influence of the sound system of their native language that causes such pronunciation problems in the target language.

Ali Mohammed Saleh Al-Hamzi

The sound system of English is somewhat distinct from that of Arabic. As a result, Yemeni novices have many difficulties pronouncing English words. For example, we often hear English learners pronounce "sight [sait]" as "[saig]" or "know [nəʊ]" as "[knəʊ]." This study is concerned with identifying widespread errors in the pronunciation of silent letters in English words among Yemeni EFL learners. A qualitative descriptive design was used as the methodology. Using purposive sampling, a sample of 5 EFL novices was selected from the 1st level of the English Department, Faculty of Education, Sana'a University, Yemen, and classified as a group of novice high-level learners (beginners). An analytical descriptive method was used to analyze the data after they were collected through tests. The result of this study showed that in pronouncing words containing the silent letters like b,

layli hamida

Teachers' views influence how they choose pedagogical methods and how they execute teaching practices in the classroom. An investigation of teachers' views regarding the notion of "language socialization" (LS) and how it may be used to enhance pedagogical application via the use of social networking technologies (SNTs) in a Yemeni EFL setting is the subject of this study. In this study, 30 EFL teachers participated in a survey. Five males were interviewed out of the total number of participants. The findings revealed that the vast majority of teachers had good associations with LS as a teaching method and positive reactions to the use of SNTs in the EFL classrooms even though most participants had never used SNTs in their own classrooms. Because of a lack of access to and training in the usage of SNTs, teachers have been prevented from integrating them into their classrooms when teaching English. This suggests that instructors recognize the important functions that SNTs play in improving students' language learning and socialization but that they lack first-hand experience and training in how to use them in their classrooms effectively. That there is a shortage of provision must be addressed as soon as possible to guarantee that Yemeni English as a foreign language teachers and their students have the chance to engage with new technologies in order to enhance their educational experiences critically.

Dian Inayati

The pivotal role of readers/audience in helping shape students’ writing has led to attempts to engage them in their composition. This reader awareness is further enhanced with the integration of Web 2.0 technology, say blogs, into writing pedagogy. This study aims to investigate the engagement strategies EFL writers employ in their blogs, and the readers’ responses toward the strategies. In this content analysis study, twenty-seven learning English blog entries retrieved from three student bloggers were analyzed using Hyland’s (2005) and Myers’ (2010) engagement strategies. The findings reveal the student bloggers’ use of almost all the eleven engagement strategies proposed in the model albeit the extent of which varied depending on the type of the blog post, their personal background, and writing purposes. The use of such strategies as “Pronoun” and “Image” drew numerous comments from readers on the blogs, implying a strong sense of readership. It is thus recommended for EFL writin...

Yulianeta Yulianeta

The Reading to Learn (R2L) pedagogy, which requires preparing for reading, detailed reading, joint construction, and independent construction steps, was invented in Australia to assist Australian indigenous secondary school students in reading texts and finding useful information inside the texts. This one group pre-test and post-test design was aimed at unfolding the effectiveness of R2L pedagogy in helping the students in constructing well-structured news report texts. A total of 51 ninth-grade students from a junior high school, Bandung, Indonesia participated in the research. The students received the R2L pedagogy in four meetings within a month, excluding the pre-test and post-test.

Parole: Journal of Linguistics and Education

Rosaria Amalia

Debating is recognized as a form of persuasive speech built on four pillars: assertion, reasoning, evidence, and link back. Over the last two decades, national debate champions have gathered to compete against each other at a high level as their countries' representatives for the world champion title. Against this background, the researcher analyzed the world's most prestigious annual debating championship, WUDC 2016, which took place in Thessaloniki, Greece. A descriptive qualitative method was used in this research. The data source was a video streamed on YouTube showing the ESL grand final round between University of Indonesia team A (Indonesia) as Prime Minister (PM) and RWTH Aachen team A (Germany) as Leader of Opposition (LO). The result shows that hedges are used to make the debate performance run smoothly and to make it thoroughly powerful and convincing.

yusuf hamzah


Idha Nurhamidah

A common problem in Teaching English as a Foreign Language (TEFL) in an expanding-circle country like Indonesia is that students are accustomed to passively following what the teacher tells them to do. Task-Based Language Teaching (TBLT) is an effective method that lets students develop their knowledge and creativity by learning the materials with their peers. It develops students’ self-learning habits, which help them expand their knowledge, improve their competence, and build their self-confidence as they take pride in their own achievements. This research shows the integration of TBLT into a listening and speaking class for second-grade students of a State Vocational Senior High School in Central Java, majoring in electrical power engineering, to fulfill their needs as technicians, with certain adaptations to suit local custom without decreasing its function and effectiveness.

indah arvianti

The 2013 curriculum at the elementary level emphasizes student competence with a focus on character education. However, this curriculum does not include English as a means of mastering global knowledge. The purposes of this research are therefore to address the absence of English as a subject, which has not been integrated thematically into the 2013 curriculum; to examine the effectiveness of an English learning model using local wisdom-based thematic integrative educational games for 1st grade elementary school students; and to maintain local wisdom as a strategy for building the character of the nation’s youth. Data collection methods are interview, observation, questionnaire, and literature study, together with a pre-test and post-test experimental design. Respondents of this study are 148 1st grade students in 17 elementary schools in Central Semarang. The results of both tests show that this model has proven to be effective in increasing the En...



dinamika soraya

siti salmiah muhammad

Peter Suwarno

Markus Deli Girik Allo

Sheila Wijayanti

Siti Mariam

Selviana Napitupulu

dwi wulandari

Endah Dewi Muliandari

PAROLE Journal of Linguistics and Education

Alfelia Nugky Permatasari

Asian Journal of Humanities and Social Studies

Richmond Ngula

Yusuf Al Arief

Deroey

Sarwar Morshed

Language Literacy: Journal of Linguistics, Literature, and Language Teaching

Attapol Khamkhien

Cogent Arts & Humanities

Ebrahim Mohammadkhah

SSRN Electronic Journal

Ibrahim Bashir

neneng islamiah

English Language Teaching

Jayakaran Mukundan

mohsin tajummal

Research Paper

Hamed Babaie Shalmani

Scientific Journal of National Pedagogical Dragomanov University. Series 9. Current Trends in Language Development

Iryna Strashko

Topics in Linguistics

Tereza Guziurová

International Journal of English Linguistics

haroon shafique

English Studies at NBU

Tatiana Szczygłowska

eric nicaise


Shengyu Liao

GEMA Online® Journal of Language Studies


Ninuk Krismanti

Mercedes Querol-Julián

Sulis Triyono

International Journal of Corpus Linguistics

Lynne Flowerdew

Journal of Languages and Language Teaching

May Mahdi Alramadan

Teaching and Language Corpora

Tony McEnery


  • JMIR Med Inform
  • PMC10662676

How People With a Bipolar Disorder Diagnosis Talk About Personal Recovery in Peer Online Support Forums: Corpus Framework Analysis Using the POETIC Framework

Glorianna Jagfeld

Division of Health Research, Spectrum Centre for Mental Health Research, Lancaster University, Lancaster, United Kingdom

UCREL Research Centre, School of Computing and Communications, Lancaster University, Lancaster, United Kingdom

Fiona Lobban

Chloe Humphreys

Faculty of Arts and Social Sciences, Department of Linguistics and English Language, Lancaster University, Lancaster, United Kingdom

Paul Rayson

Steven Huntley Jones

Associated Data

Supplementary reports and documents.

Personal recovery is of particular value in bipolar disorder, where symptoms often persist despite treatment. We previously defined the POETIC (Purpose and Meaning, Optimism and Hope, Empowerment, Tensions, Identity, Connectedness) framework for personal recovery in bipolar disorder. So far, personal recovery has only been studied in researcher-constructed environments (eg, interviews and focus groups). Support forum posts can serve as a complementary naturalistic data resource to understand the lived experience of personal recovery.

This study aimed to answer the question “What can online support forum posts reveal about the experience of personal recovery in bipolar disorder in relation to the POETIC framework?”

By integrating natural language processing, corpus linguistics, and health research methods, this study analyzed public, bipolar disorder support forum posts relevant to the lived experience of personal recovery. By comparing 4462 personal recovery–relevant posts by 1982 users to 25,197 posts not relevant to personal recovery, we identified 130 significantly overused key lemmas. Key lemmas were coded according to the POETIC framework.

Personal recovery–related discussions primarily focused on 3 domains: “Purpose and meaning” (particularly reproductive decisions and work), “Connectedness” (romantic relationships and social support), and “Empowerment” (self-management and personal responsibility). This study confirmed the validity of the POETIC framework to capture personal recovery experiences shared on the web and highlighted new aspects beyond previous studies using interviews and focus groups.


This study is the first to analyze naturalistic data on personal recovery in bipolar disorder. By indicating the key areas that people focus on in personal recovery when posting freely and the language they use, this study provides helpful starting points for formal and informal carers to understand the concerns of people diagnosed with a bipolar disorder and to consider how to best offer support.


Bipolar disorder (BD) is a severe mental health (MH) problem characterized by recurring episodes of depressed and elevated mood [ 1 ]. Its lifetime prevalence ranges from 0.1% to 2.6% internationally [ 2 ]. BD is associated with lower quality of life [ 3 ] and high suicide risk [ 4 ]. Therefore, fostering recovery and living well with BD are important societal tasks.

MH care agendas increasingly focus on enhancing personal recovery (PR), defined as “a way of living a satisfying, hopeful life even with the limitations caused by the illness” [ 5 ]. This contrasts with a previously narrower focus on reducing symptoms (clinical recovery). PR might be of particular value in BD [ 6 ], where symptoms often persist despite treatment, but has been underresearched to date [ 7 ]. Jagfeld et al [ 8 ] (hereafter the POETIC review) recently synthesized 12 qualitative studies to develop the first conceptual framework for PR in BD. The POETIC (Purpose and Meaning, Optimism and Hope, Empowerment, Tensions, Identity, Connectedness) framework, based on the CHIME (Connectedness, Hope and Optimism, Identity, Meaning and Purpose, Empowerment) framework [ 9 ], comprises the following processes: “Purpose and meaning,” “Optimism and hope,” “Empowerment,” “Tensions,” “Identity,” and “Connectedness” (see Table 1 ).

Table 1. The POETIC framework [ 8 ]: lived experience of personal recovery in bipolar disorder.

POETIC: Purpose and Meaning, Optimism and Hope, Empowerment, Tensions, Identity, Connectedness.

Current research on PR in BD has several limitations. First, it is mainly based on qualitative studies with few participants [ 10 ] and expert opinions, lacking quantitative evidence from larger samples [ 11 ]. Second, data collection is limited to structured settings (semistructured interviews, focus groups, and structured measures), which are not naturalistic and are subject to either interviewer bias [ 12 ] or constrained responses in structured measures. Third, recruitment is biased toward people who want to talk about PR and are in contact with services or researchers [ 8 ].

Naturalistic data collection, where “participants are not aware that they are being studied” [ 13 ], overcomes many of these limitations. Online forum posts are a source of naturalistic data, which can offer potential insights into “an experience as it is lived rather than as it is enacted in the researcher constructed environment” [ 14 ]. Some natural language processing (NLP) studies have analyzed large numbers of BD online forum posts via automatic quantitative methods such as content analysis [ 15 ] or emotion analysis [ 16 , 17 ] to identify forum topics or language differences between people with different or no MH diagnoses. Qualitative studies have applied conversation analysis [ 18 ], thematic analysis [ 19 ], grounded theory [ 20 ], and content analysis [ 21 ] to BD online forum posts. Such studies offer rich, nuanced accounts of web-based discussions on BD but include only a few, often handpicked, posts.

Corpus linguistics [ 22 ] provides a mix of quantitative and qualitative methods informed by linguistic theory for analyzing large amounts of text data with depth and richness that can overcome some of the shortcomings of previous NLP and qualitative studies. Semino et al [ 23 ] analyzed interviews and online forum posts of patients with cancer and their carers to learn about their lived experience and the metaphors they use for dealing with cancer. Hunt and Brookes [ 24 ] applied a combination of corpus linguistics and discourse analysis [ 25 ] to MH forum posts. Two corpus-linguistics studies have focused on BD specifically: Abdo et al [ 26 ] studied linguistic types of judgments, and McDonald and Woodward-Kron [ 27 ] studied forum users’ roles and identities.

A systematic review strongly recommended considering web-based content from individuals with lived experience in PR research [ 10 ], which has not yet been done. Therefore, the main aim of this paper was to gain further insights into the experience of PR in BD from online forum posts via a combination of NLP, corpus linguistics, and qualitative health research methods. Furthermore, the POETIC framework, synthesized from data collected via interviews or focus groups, has not been applied to new data yet. Hence, the secondary aim of this paper was to validate the framework by exploring to what extent it captures experiences shared on the web. The research question covering both aims is “What can online support forum posts reveal about the experience of PR in BD in relation to the POETIC framework?”

Data Source

This study analyzed posts from the international web-based discussion platform Reddit [ 28 ], which hosts subforums (subreddits) for various topics, including BD. Several reasons motivated the choice of this site: Reddit is one of the most visited internet sites worldwide with an international user base [ 29 ]; in contrast to other online support communities, everyone can read all public posts without a user account; and Reddit allows data analysis by third parties.

Reddit users with a self-reported BD diagnosis (S-BiDD) were automatically identified by matching phrases such as “I was diagnosed with bipolar” in all posts between January 2005 (the inception of Reddit) and March 2019 (see Jagfeld et al [ 30 ]). All posts of the identified users form the S-BiDD data set. Naturalistic data collection required subsequent filtering for content relevant for PR in BD, as an exploratory study revealed that the posts in the S-BiDD data set covered many other topics (see Report S1 in Multimedia Appendix 1 ). Figure 1 displays a flow chart for construction of the PR-BD corpus. In linguistics, a corpus is a sampled collection of texts representing a particular language variety [ 31 ]. The basis for the corpus was only posts in BD subreddits [ 32 ] (fourth level=“bipolar”), because a second exploratory study found that references to “recovery” and associated word forms were almost exclusively in relation to BD in BD subreddits (see Report S2 in Multimedia Appendix 1 ). Furthermore, only posts mentioning BD [ 33 ] were selected because only two-thirds (66%) of MH-related “recovery” mentions in BD subreddits referred to BD.

Figure 1. Flowchart of the 4 steps to create the PR-BD corpus and reference corpus. BD: bipolar disorder; MH: mental health; PR: personal recovery; S-BiDD: self-reported bipolar disorder diagnosis.

To select PR-relevant posts, a list of PR terms (comprising both single words and multiword phrases; n=562) [ 34 ] was compiled using corpus-linguistics methods (Document S2 in Multimedia Appendix 1 ). BD subreddit posts that mentioned BD were ranked according to their similarity with the PR terms list via term frequency–inverse document frequency–weighted cosine similarity, a standard information retrieval approach [ 35 , 36 ] (see Document S3 in Multimedia Appendix 1 ). To determine the cosine similarity cutoff, GJ coded whether 90 posts pertained to PR in BD using a preliminary codebook based on the second exploratory study and the POETIC codebook. SHJ audited the coding. To select the 90 posts, 10 posts were randomly sampled from every 10% quantile of the cosine similarity scores, taking only 10 posts from the first 2 quantiles that all scored 0. Following this, a minimum length of 94 words was set, as 5 posts shorter than this length lacked context to decide on their PR relevance (see Figure S5 in Multimedia Appendix 1 ). The codebook was refined to its final version (Document S4 in Multimedia Appendix 1 ). GJ and CH blindly coded 120 additional posts, again randomly sampled from each quantile of the cosine scores.
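The ranking step described above can be illustrated with a minimal sketch. It is not the study's code: the tokenizer, the smoothed IDF formula, the function names, and the toy posts are all assumptions made for illustration, and multiword PR terms are ignored for simplicity.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, no lemmatization.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_posts(posts, pr_terms):
    """Rank posts by similarity to the term list, treated as one pseudo-document."""
    docs = [tokenize(p) for p in posts]
    vectors = tfidf_vectors(docs + [list(pr_terms)])
    query = vectors[-1]
    scores = [cosine(vec, query) for vec in vectors[:-1]]
    return sorted(zip(scores, posts), reverse=True)

# Hypothetical toy data, not taken from the corpus.
toy_posts = ["recovery gave me hope and purpose", "my cat sleeps all day", "hope"]
for score, post in rank_posts(toy_posts, ["recovery", "hope", "purpose"]):
    print(round(score, 3), post)
```

A post that shares no term with the list scores exactly 0, which mirrors the observation above that the lowest quantiles of the similarity scores were all 0.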

Ethical Considerations

The Lancaster University Faculty of Health and Medicine research ethics committee approved this research in May 2019 (reference FHMREC18066), which follows ethics guidelines for internet-mediated research [ 37 ]. It was infeasible to seek individual informed consent from the large number of included forum users, but quotes were paraphrased to protect users’ anonymity (see Document S5 in Multimedia Appendix 1 ). We recognize that some people may object to the use of web-based posts as research data without individual consent (eg, [ 38 ]). Users generally post to share information or seek support and do not directly provide their content for research. However, we believe that on balance, the benefits of this research to better understand PR make it worthwhile while acknowledging these potential objections.

The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.

Involvement of People With Lived Experience

In all, 4 volunteers with lived experience of BD who use online forums were recruited via People in Research [ 39 ] and provided input on the study design, results, and subsequent plans in individual web-based meetings. All 4 volunteers (1 man and 3 women) were UK-based and in their 30s to 40s; 3 (75%) reported a bipolar II disorder diagnosis, 1 (25%) did not further specify their BD diagnosis, and at least 2 (50%) had a migrant background. Importantly, all volunteers were very supportive of the project, and none raised ethical concerns. After study completion, the volunteers were reinvited to provide feedback on our interpretations of the results.


Reflexivity is important to highlight how subjectivity may have impacted on research findings [ 40 ]. The research team embraces a PR approach in BD. GJ, FL, and SHJ previously developed the POETIC framework for PR in BD. They anticipated that it would capture many aspects shared on the web, but data analysis would reveal new aspects and deeper insights into the experience of PR in everyday life.

Corpus Framework Analysis

Data analysis drew on methods from corpus linguistics [ 22 ] and qualitative framework analysis [ 41 ], which we call corpus framework analysis. Quantitative corpus-linguistics methods derive frequency lists of the words in the corpus; identify keywords that occur statistically significantly more frequently in the corpus compared to other language samples; and find collocations, that is, words a target word co-occurs with more frequently than by chance. The main qualitative method is to analyze the context of specific words or phrases in so-called concordances. Key lemmas in the PR-BD corpus were identified by comparing it to a reference corpus of posts with low similarity to the PR terms list via #LancsBox (version 6.0; Lancaster University) [ 42 ]. A lemma is the dictionary form of a word; for example, “recovering” and “recovered” are word forms of the lemma “recover.” To focus on the most salient topics of the PR-BD corpus, key lemmas overused at least twice at a significance level of P<.0001 [ 43 ] and used by at least 5% of users were analyzed. See Document S6 in Multimedia Appendix 1 for methodological details.
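To make the keyness criteria concrete, the following sketch combines a relative-frequency ratio with a significance test. It assumes the log-likelihood statistic, a common choice for keyness; the text above does not state which statistic was used, so this is an illustration and not the study's procedure. The lemma frequencies are hypothetical, and the filter on the share of users is omitted.

```python
import math

# Chi-square critical value for 1 degree of freedom at p < .0001.
CRITICAL_LL_P0001 = 15.13

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-corpus log-likelihood (G2) for one lemma."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll

def frequency_ratio(freq_target, size_target, freq_ref, size_ref):
    """Relative frequency in the target corpus divided by that in the reference."""
    rel_ref = freq_ref / size_ref
    if rel_ref == 0:
        return math.inf
    return (freq_target / size_target) / rel_ref

def is_key(freq_target, size_target, freq_ref, size_ref,
           min_ratio=2.0, critical=CRITICAL_LL_P0001):
    """True if the lemma is overused at least min_ratio times and significantly so."""
    ratio = frequency_ratio(freq_target, size_target, freq_ref, size_ref)
    ll = log_likelihood(freq_target, size_target, freq_ref, size_ref)
    return ratio >= min_ratio and ll >= critical

# Corpus sizes as reported for the PR-BD and reference corpora;
# the lemma frequencies (400 and 500) are hypothetical.
print(is_key(400, 1_337_080, 500, 4_700_834))
```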

The key lemmas were coded into the POETIC framework via concordance analysis. First, overall impressions of all concordances were noted after sorting them according to the lemma, left, and right context (20 words each) in #LancsBox. Subsequently, 30 randomly sampled concordances for each key lemma were coded into the second-level POETIC categories (see the codebook in Appendix B of Jagfeld et al [ 8 ]). The coders read the full post if the 40 words did not provide enough context and noted impressions for every key lemma again. Finally, concordances that did not fit into an existing POETIC category were coded inductively. GJ coded all key lemmas, and SHJ and FL audited 6 key lemmas each.

Finally, new key lemmas that were not in the PR terms list and absent PR terms [ 44 ] were analyzed. Absence was defined as 0 frequency in the PR-BD corpus or a lower relative frequency than in the reference corpus. Additionally, collocations were analyzed via the #LancsBox GraphColl tool for some key lemmas. To do so, content words (noun, verb, adjective, and adverb) within a context of 5 words left and right of the target term and a minimum collocation frequency of 5 were ranked according to cubed mutual information [ 24 ].
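The collocation ranking can be sketched as follows. Cubed mutual information is taken here as log2(O^3 / E), where O is the observed co-occurrence frequency and E the expected one. The window-corrected formula for E, the function names, and the example counts are assumptions for illustration; #LancsBox may compute the expected frequency differently.

```python
import math

def mi3(observed, node_freq, collocate_freq, corpus_size, window=10):
    """Cubed mutual information: log2(observed**3 / expected).

    Assumption: expected co-occurrence is approximated as
    node_freq * collocate_freq * window / corpus_size, with
    window = 5 words left + 5 words right.
    """
    expected = node_freq * collocate_freq * window / corpus_size
    return math.log2(observed ** 3 / expected)

def rank_collocates(cooccurrence, node_freq, corpus_size, min_freq=5, window=10):
    """Rank collocates by MI3, dropping those below the minimum frequency.

    cooccurrence maps each candidate word to a pair
    (observed co-occurrence frequency, overall frequency of the word).
    """
    scored = [
        (word, mi3(obs, node_freq, coll_freq, corpus_size, window))
        for word, (obs, coll_freq) in cooccurrence.items()
        if obs >= min_freq
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical counts for one target lemma, not taken from the corpus.
candidates = {"support": (10, 100), "rare": (4, 50), "feel": (6, 1000)}
for word, score in rank_collocates(candidates, node_freq=100, corpus_size=1_000_000):
    print(word, round(score, 2))
```

Cubing the observed frequency gives more weight to frequent co-occurrences than plain mutual information does, which reduces the bias of plain mutual information toward rare word pairs.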

The S-BiDD data set [ 45 ] contains 21,407,595 posts by 19,685 users (available for noncommercial research after signing a data usage agreement). The programming code is publicly available [ 46 ].

Coding the PR Relevance of Posts and Constructing the PR-BD Corpus

Following a blind trial of coding the PR relevance of 20 posts and a subsequent discussion, GJ and CH achieved moderate agreement (Cohen κ=0.51; 77/100, 77% observed agreement) in coding the remaining 100 posts (see Table S20 in Multimedia Appendix 1 ). The team resolved all disagreements. In total, 66 (31%) of the 210 posts were coded as PR relevant (see Table S21 in Multimedia Appendix 1 ). Based on this, the PR-BD corpus comprises posts with a PR score above 0.025 to balance precision in selecting PR-relevant posts and corpus size (see Table S22 in Multimedia Appendix 1 ). The PR-BD corpus has 4462 posts with 1,337,080 words by 1982 users. The reference corpus of posts with a PR score below 0.013 (see Table S23 in Multimedia Appendix 1 ) comprises 25,197 posts with 4,700,834 words by 6075 users.
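As a worked illustration of the agreement statistic, Cohen κ is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. Rearranging with the reported values (κ=0.51, p_o=0.77) gives an implied chance agreement of roughly 0.53. The sketch below computes κ from a contingency table; the table shown is hypothetical, because the actual coding table is only available in the appendix.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square contingency table.

    table[i][j] is the number of items that coder A assigned to
    category i and coder B assigned to category j.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)

def implied_chance_agreement(observed, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e."""
    return (observed - kappa) / (1 - kappa)

# Hypothetical 2x2 table (PR relevant vs not relevant), for illustration only.
example = [[20, 11],
           [12, 57]]
print(round(cohen_kappa(example), 2))

# Using the figures reported above: 77% observed agreement, kappa 0.51.
print(round(implied_chance_agreement(0.77, 0.51), 2))
```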

Concordance Analysis With the POETIC Framework

In all, 130 lemmas met the prespecified keyness criteria. Figure 2 shows the domain and category frequencies, extrapolated from the 30 concordance lines coded for each key lemma and color coded according to Tol’s [ 47 ] light scheme for color-blindness accessibility. Table S28 in Multimedia Appendix 1 lists the key lemmas coded into each category, and Table S29 in Multimedia Appendix 1 lists the categories that each key lemma was coded into. Overall, the POETIC framework captured the experiences in the PR-BD corpus very well: there was evidence for all categories. Only 16% (9303/59,199) of key lemma instances fell into the new “Not POETIC” domain rather than the existing framework. The text below briefly reviews each domain with key lemmas in italics, highlighting the differences between the original framework and the web-based data. Tables S30 and S31 in Multimedia Appendix 1 provide illustrative quotes for all categories.
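One plausible reading of the extrapolation mentioned above is that, for each key lemma, the share of each category among the 30 coded concordance lines is scaled by the total frequency of that lemma in the corpus. The sketch below follows that assumption; the exact procedure is not spelled out here, and the lemma names, codes, and frequencies are hypothetical.

```python
from collections import Counter

def extrapolate(sample_codes, lemma_frequency):
    """Scale category shares in the coded sample up to the lemma's total frequency."""
    counts = Counter(sample_codes)
    n = len(sample_codes)
    return {category: lemma_frequency * count / n for category, count in counts.items()}

def aggregate(per_lemma):
    """Sum the extrapolated category frequencies over all key lemmas."""
    totals = Counter()
    for estimates in per_lemma.values():
        totals.update(estimates)
    return dict(totals)

# Hypothetical coding of 30 sampled lines for two lemmas.
coded = {
    "job": (["Work"] * 24 + ["Not POETIC"] * 6, 900),
    "friend": (["Relationships"] * 27 + ["Not POETIC"] * 3, 600),
}
estimates = {lemma: extrapolate(codes, freq) for lemma, (codes, freq) in coded.items()}
print(aggregate(estimates))
```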

Figure 2. Frequency of POETIC domains and categories and new categories. MH: mental health; POETIC: Purpose and Meaning, Optimism and Hope, Empowerment, Tensions, Identity, Connectedness.

Purpose and Meaning

“Purpose and meaning” was the most frequent domain and contained the most salient topic differences between the POETIC review and web-based data. Although participants in the POETIC review mainly discussed the meaningful life and social role of being a parent, web-based discussions focused on reproductive decisions. Participants discussed perceived risks they might be responsible for, for example, the possibility for their child to develop MH issues, which affected their decision or choice to bear and raise kids. In the “Work” category, extended to include formal education, many discussions focused on struggles around studying and graduating college. No participant in the POETIC review reported financial or housing issues, whereas several web-based users complained about a low quality of life due to money problems, causing homelessness or inability to afford treatment. Spirituality was discussed more frequently and in richer detail than in the POETIC review. Users often wondered whether to regard their experiences as truly spiritual or rather as (hypo)manic symptoms.

Optimism and Hope

Reddit users differed in their “Belief” in the possibility of recovery. The mainstream opinion on Reddit was that BD is a “chronic condition that cannot be cured, only managed.” Users questioned whether feeling fully recovered was not just a temporary experience caused by (hypo)mania. In the “Positive thinking and valuing success” category, many users were grateful for aspects of their BD experiences; for example, challenges provide opportunities for growth and demonstrating strength.


As in the POETIC review, “Self-management and personal responsibility” was the most frequent and richest category. Forum users generally considered (taking steps towards) maintaining a healthy lifestyle (including routines or schedules, diet, exercise, and coping skills) as an individual’s responsibility to reach recovery. In contrast, experiencing MH symptoms or feeling stuck in their recovery was regarded as outside of someone’s responsibility if they followed professional or mainstream forum advice. The “Controversial role of medication” category included concerns about drug effects on the baby during pregnancy or nursing and alternative non–evidence-based treatments such as the keto diet or cannabis, which were not present in the POETIC review.

Experiences coded in the “Tensions” domain were similar to the POETIC review. Several participants shared feeling more comfortable discussing “Ambivalence” around (hypo)mania on the web. Some asked if there was a possibility to enjoy increased motivation and confidence to make progress in their goals without the hypomania getting out of control.

Some participants shared rich success stories in the “Rebuilding positive sense of self” category, in which they moved away from shame and guilt by forgiving themselves for past behaviors and toward accepting themselves, whereas others were struggling with this process. Shame associated with stigma in the society was another focus of discussions, and some participants shared creative ways for overcoming stigma.


Regarding “Connectedness,” users mainly discussed relationships and support from others. Although there were positive accounts, participants often discussed struggles with romantic relationships or marriage and friendships and complained about issues with professional and family support, similar to the POETIC review. However, the web-based accounts, particularly those of relationship and family problems, appeared more candid, for example, with them also discussing sexual issues, trauma, and shame.

Inductive coding of the 645 concordance lines that did not fit into the POETIC framework revealed that they were unrelated to individuals’ PR or lived experience. Most quotes discussed other MH issues without PR relevance (symptoms, genetics and heredity, treatment, diagnosis, societal issues, and scientific research), followed by storytelling of their own or others’ situation without PR relevance; direct interactions between forum users, for example, giving advice or congratulating; and discussions of non-MH issues.

New PR Terms

Although 99 (76%) of 130 key lemmas were PR terms, 31 (24%) key lemmas were new. Of these 31 lemmas, 15 (48%) conveyed similar meanings to PR terms; for example, brother resembled other family members in the PR terms list, such as son or nephew. Another 7 (23%) new key lemmas introduced aspects not covered by PR terms. For example, baby, raise, and bear were related to reproductive decision-making; childhood was related to making sense of MH issues via early traumatic experiences; and environment was related to a focus on structural or societal circumstances rather than the individual (see Table S32 in Multimedia Appendix 1 ).

Absent PR Terms

Only 13% (n=54) of the 416 unique PR terms (after removing spelling and phraseological variants) were absent: 46 (11%) did not appear in the PR-BD corpus and 8 (2%) were underused compared to the reference corpus. The underused PR terms referred to symptoms (high mood, mania, manic, and sleep) or medical MH professionals (doctor, pdoc [psychiatrist], and psychiatrist; see Table S33 in Multimedia Appendix 1 ). These terms were relevant for some PR domains but also were strongly associated with clinical recovery. All PR terms missing in the PR-BD corpus were also missing in the reference corpus. They were mostly complex phrases, for example, brush yourself off, and none indicated aspects that were not covered by other key lemmas (see Table S34 in Multimedia Appendix 1 ).

Feedback From People With Lived Experience

Two volunteers who had commented on the first exploratory study provided feedback on the main study results. Overall, they valued the results and agreed with our findings but indicated limitations of the data, as reflected in the Discussion section. One volunteer argued that categorizations of experiences can be problematic for masking individual differences. Conversely, the other volunteer had found it particularly helpful to align some of her behaviors with CHIME categories because this gave her a sense of being on the right track.

This study analyzed Reddit posts of people with a BD diagnosis via corpus framework analysis to learn about the lived experience of PR in BD and validate the POETIC framework.

Key Findings in Relationship to Previous Work

The primary study aim was to provide new insights on PR in BD. Indeed, the web-based data contained candid, in-the-moment experiences that traditional qualitative data collection is unlikely to retrieve. For example, 1 user posted about their experiences in a current manic episode on 2 subsequent days: “Yesterday I posted here about the realization that I’ve entered a manic episode.” Other users shared things on the web that they had not shared elsewhere: “Talking about this part of my inner world to a psychiatrist would require a lot of trust for me.” The users had different interpretations of elated mood as signs of recovery, spiritual experiences, helpful motivational boosts, or dangerous MH symptoms to avoid. Quantitative [ 48 , 49 ] and qualitative [ 50 , 51 ] evidence shows that web-based anonymity affords personal self-disclosures and discussions of sensitive and stigmatized issues.

The results show that 3 POETIC domains were featured the most in Reddit discussions: “Purpose and meaning” (particularly reproductive decision making, work, and formal education), “Connectedness” (romantic relationships and social support), and “Empowerment” (self-management and personal responsibility). In line with a recent quantitative review [ 52 ], the concerns raised on Reddit pointed to a wide range of social and occupational functioning among people with a BD diagnosis: some were not working or leaving their house and therefore sought support on the web, whereas others asked for specific advice to further improve their already functional lifestyle. The popularity of the “Self-management and personal responsibility” category agrees with recent quantitative findings. A review by Mezes et al [ 53 ] found positive associations between PR and psychological characteristics focusing on control and personal agency, and a longitudinal study identified positive impacts of adaptive coping and balanced risk-taking on PR [ 54 ].

Importantly, the analysis highlighted PR issues that exclusively or more frequently came up on the web. This might be due to differences in sample demographics and data collection methods between this study and those included in the POETIC review. Users in the S-BiDD data set were younger than those in the studies included in the POETIC review: the S-BiDD data set users had a mean age of 32 years versus 45 years in the POETIC review, 30% (5866/19,685) versus 17% (18/163) of participants were aged between 18-29 years, and 7% (1299/19,685) versus 34% (36/163) were aged between 50-64 years [ 8 , 30 ]. This might explain why perspectives on transitioning into adulthood with BD, challenges of college education, and reproductive decision-making exclusively surfaced in the web-based data. Sahota and Sankar [ 20 ] summarized their qualitative analysis of discussions of genetic risk and reproductive decision-making in 2 BD subreddits as centering around the manageability of parenting a child for people with a BD diagnosis, which aligns well with the experiences found in this study.

Moreover, users in the S-BiDD data set were overwhelmingly from the United States [ 30 ], whereas all POETIC review studies stemmed from countries that provide at least a basic level of free public MH care and social security (the United Kingdom, Norway, Australia, Canada, China, Spain, and Turkey). This may explain why existential financial issues such as (threat of) homelessness and the inability to afford treatment surfaced only in the web-based data. Since health insurance in the United States (except for Medicare for those aged 65+ y) is either employer provided or privately paid, individuals who cannot work due to their MH issues lose their insurance and in turn access to professional support, often causing MH issues to exacerbate, for example, by abruptly stopping medication. One Reddit user described this as a “vicious cycle.” It also appears plausible that Reddit users stem from a different socioeconomic group than the participants recruited into the POETIC review studies.

The secondary aim of this study was to validate the POETIC framework. Results confirmed that the framework usefully captured PR experiences shared on the web. Web-based users discussed all second-level POETIC categories, and only 645 of the 3900 analyzed concordance lines (about 17%) could not be accommodated in the framework, demonstrating its comprehensiveness.

Strengths and Limitations

Three aspects of this study constitute both strengths and limitations. First, using online forums as a data source provided rich, candid, and in-the-moment experiences. However, there is limited background and demographic information on the online forum users (but see Jagfeld et al [ 30 ] for an analysis of these properties in the users in the S-BiDD data set), and they are not representative of the general population with a BD diagnosis. One user in the PR-BD corpus posted “My hunch is that r/bipolarreddit overrepresents those who are struggling, who, understandably, may be more pessimistic about everything.” One volunteer shared his experience that discussions on Reddit MH forums mainly followed a mainstream opinion and that dissenting opinions were ignored or suppressed. McDonald and Woodward-Kron [ 27 ] support this with corpus-linguistics evidence that BD forum users over time shifted from advice seeking to advice giving and used more medicalized language. Similarly, Vayreda and Antaki [ 18 ] showed that established BD forum users urged new members to seek a formal diagnosis and reinforced a biomedical view of BD. Our Reddit study provides one lens on the lived experience of some people that can complement studies of other MH forums and other sources, such as one-on-one interviews.

Second, the list of PR terms facilitated focusing on the concept of interest among the wealth of data, yet it arguably biased the data selection. Nevertheless, 52% (16/31) of the key lemmas that were not PR terms contributed new PR aspects. Moreover, explicitly stating our expectations of PR aspects via the terms list enabled us to identify absent aspects in the data.

Third, corpus-linguistics methods, particularly the coding of key lemmas, allowed the analysis of more data than traditional qualitative methods. However, single words probably more readily capture topic-like (eg, “Relationships”) rather than theme-like (eg, “Balancing acceptance with ambitions”) categories. Therefore, the relative category frequencies should be interpreted with some caution.
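To illustrate what a concordance line for a key word looks like, the following is a minimal sketch and not the authors' pipeline. The function name `concordance`, the `width` parameter, and the sample sentences are hypothetical and invented for illustration. The sketch matches a surface word form only, whereas the study worked with lemmas.

```python
import re


def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines for whole-word matches of `keyword`.

    Hypothetical helper: it matches one surface form, case-insensitively,
    and shows up to `width` characters of context on each side.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword forms a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, not taken from the PR-BD corpus.
sample = (
    "My relationships improved after treatment. "
    "Work strains relationships too. "
    "A relationship takes effort."
)

for line in concordance(sample, "relationships"):
    print(line)
```

Each printed line shows one occurrence of the word with its immediate context. A human coder would then assign each line to a framework category, which is why single words tend to surface topics more readily than themes.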

Research Implications

This study has at least 4 research implications. First, it demonstrates the usefulness of analyzing online forum posts to tap into authentic and candid accounts of lived experience of MH issues. Second, this study serves as the first validation of the POETIC framework. Ideally, this encourages other researchers to apply it in their research. Third, the combination of corpus linguistics and qualitative framework analysis allowed the analysis of large amounts of data. Hence, corpus framework analysis may also be useful for future studies of text data, such as therapy transcripts (eg, [ 55 ]). Lastly, the S-BiDD data set and derived corpora are available for future research, for example, on other aspects of the lived experience of BD.

Clinical Implications

This study identifies the key issues relevant to PR in BD shared by people with lived experience on the web and extends previous knowledge from interviews and focus groups. These findings, including the quotes in Tables S30 and S31 in Multimedia Appendix 1 , are a rich resource for understanding more about the experience of PR in BD for individuals living with BD, their loved ones and informal carers, and MH professionals. This is also relevant for recent initiatives to educate MH professionals on the lived experience of severe MH issues, such as the current “Understanding psychosis and BD” training for the UK National Health Service [ 56 ]. Subsequently, issues identified in this study may provide helpful starting points for therapists to collaboratively consider them with their clients, for example, in recovery-focused therapy [ 57 , 58 ].

Individuals discussed issues on the web that they considered contentious and personal and were not comfortable sharing offline, such as sexuality, spirituality, and (hypo)mania. Recovery-focused therapies that are free to work with whatever model the clients bring for their BD experiences [ 58 ] may be particularly suitable to create a therapeutic environment where clients feel comfortable to discuss such sensitive issues. Moreover, Jones et al [ 59 ] showed that recovery-focused therapy reduces the positive self-appraisal of hypomanic experiences.

Reproductive decision-making surfaced as a major issue for young adults living with BD, and dedicated counseling on this topic may be advisable. Although understanding genetic vulnerability and risk data in MH is challenging, there is evidence that genetic counseling can offer effective support [ 60 ].

Conclusions

This study analyzed 4462 Reddit posts by 1982 people with an S-BiDD within the POETIC framework [ 8 ] for PR in BD. It is the first to analyze online forum data on PR. This study confirmed the validity of the POETIC framework to also capture PR experiences shared on the web and highlighted new aspects in PR that did not come up in previous studies using interviews and focus groups. It also demonstrated the utility of integrating corpus linguistics and qualitative framework analysis to identify key themes within large text data sets. By indicating the key areas that people focus on when posting freely, this study provides rich insights into the lived experience of PR in BD for formal and informal carers of people with a BD diagnosis.


Acknowledgments

The authors wish to thank the volunteers with lived experience who provided helpful insights for conducting and writing up this research. GJ is grateful to Enrica Troiano for insightful discussions on designing and evaluating the coding for personal recovery relevance. She would also like to thank Gavin Brookes and the participants of the 6th Corpora & Discourse International Conference for helpful comments on the corpus construction and analysis. The authors also thank Matthew Coole for testing the code release. This study was completed as part of a PhD studentship for GJ, which was funded by the Faculty of Health and Medicine at Lancaster University, United Kingdom.


Multimedia appendix 1.

Authors' Contributions

GJ led on the study design, supported by SHJ, FL, and PR. GJ collected the data, conducted the analyses, and drafted the manuscript. SHJ and FL audited the analyses (personal recovery relevance of posts and framework analysis of concordance lines). CH double coded the personal recovery relevance of posts and reviewed the paraphrasing of selected quotes. All coauthors discussed and agreed on the results, commented on the draft manuscript, and approved the final version. SHJ, FL, and PR obtained the funding for this study.

Conflicts of Interest

None declared.

What the New York Times’ copyright suit against OpenAI means


Hello Eye on AI readers and Happy 2024!

As many of you know, I was on leave for the past several months, working on a book about the generative AI revolution and all its potential ramifications. The book is due to be published this summer by Simon & Schuster. I’ll be letting you know more about it as the publication date approaches. Now back at Fortune, I’m assuming a new role as our AI editor, helping to build out our coverage of this vital technology. And I’ve got some exciting news: Eye on AI will be coming to your inbox more frequently. We are dedicated to providing you, as business leaders, with all the AI news you need to stay informed. AI is currently one of the hottest topics in the corporate world, and considering its rapid advancements, Eye on AI will now be delivered to you twice a week, on Tuesdays and Thursdays. Imagine, you’ll be twice as knowledgeable as before!

OK, the biggest news in AI this past week has got to be the copyright infringement lawsuit the New York Times filed against Microsoft and OpenAI in federal court on Dec. 27. It’s a doozie, one many think will be precedent-setting. Some commentators speculated it could even spell the end of OpenAI, and perhaps the entire business model on which many generative AI companies have been built. The suit doesn’t include a specific claim for damages but says the two tech companies should be held liable for “billions of dollars in statutory and actual damages.”

OpenAI, which had been in talks with the Times since April over possible licensing terms for the newspaper’s content, said it had thought negotiations were progressing and that it was “surprised and disappointed” by the Times’ suit. “We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from A.I. technology and new revenue models,” OpenAI spokesperson Lindsey Held said. “We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.” Microsoft declined to comment on the lawsuit.

The Times alleges that tens of thousands of its articles were copied, without its permission, in the process of training the GPT models that underpin OpenAI’s ChatGPT and Microsoft’s Copilot (formerly called Bing Chat). It also alleges that ChatGPT and Copilot allow users to further infringe on the Times’ copyrights by producing text that plagiarizes Times articles. It argues that the integration of OpenAI’s GPT models with web browsing and search tools steals commercial referrals and traffic from the newspaper’s own website. In a novel claim for this sort of case, the publisher also alleges its reputation is damaged when OpenAI’s models hallucinate, making up information and falsely attributing it to the Times. Among the reams of evidence that the Times submitted in support of its claims is a 127-page exhibit that includes 100 examples of OpenAI’s GPT-4 outputting lengthy verbatim passages from Times articles when prompted with just a sentence, or part of a sentence, from the original.

The Times lawsuit is certainly the most significant of the copyright infringement claims that have been filed against OpenAI and Microsoft to date. The Times has top copyright lawyers, relatively deep pockets, and a history of pursuing claims all the way to the Supreme Court when it feels an issue presents a threat to not just its own journalism, but to the free press as a whole. The newspaper is claiming here that OpenAI’s copyright infringement undercuts the revenues publications require to serve the public interest through news reporting and investigative journalism. This sets it apart from most of the other copyright infringement claims previously filed against OpenAI, which simply pit the commercial interests of creators against those of OpenAI. But what really differentiates the Times’ case is the clarity of the narrative and exhibits it presents. Many commentators believe these will prove highly persuasive to a jury if the case winds up in front of one.

Gary Marcus, the emeritus New York University cognitive scientist and vocal AI expert, opined, in a series of posts on X (formerly Twitter), that this is OpenAI’s Napster moment. He claims the Times’ lawsuit could wind up bankrupting the high-flying AI startup, just as a landmark 2001 copyright judgment against Napster obliterated the peer-to-peer music-sharing company’s business model and eventually drove it under.

Having done a fair bit of research into AI and copyright for my forthcoming book, I think this is unlikely to happen. For one, this case is likely to settle. The fact that the newspaper was in negotiations with OpenAI for a licensing deal and only filed suit after those talks apparently reached an impasse (probably because the Times was asking for more money than OpenAI wanted to pay) is a good indication that, despite the public interest gloss the Times applied to its complaint, its real motivation here is commercial. OpenAI has signed a deal with the Associated Press to license its content for AI training and last month inked a multiyear deal with publisher Axel Springer, which owns Business Insider and Politico, that gives OpenAI access to its current and archived content. That deal is worth more than $10 million per year, according to one report. OpenAI and Microsoft have a strong incentive to settle rather than deal with years of legal uncertainty; chances are, they will.

Even if this case goes to trial, a ruling might not ultimately go the Times’ way. Microsoft has deeper pockets than the Times and also has access to top-notch legal talent. And there are more precedents here than just Napster. Copyright experts vigorously debate which cases might be most analogous—the Google Books case, the Sega case, the Sony case, or the recent Andy Warhol case. The specifics of these analogies are too complicated to get into here. But the point is, this is far from a settled matter, and OpenAI and Microsoft have decent arguments they can use to try to defend themselves. It isn’t open and shut by any means.

It is also possible that the U.S. Copyright Office or Congress will weigh in before the Supreme Court does. The Copyright Office has just concluded a commentary period on the implications of generative AI. The Senate also recently held hearings on the topic. It is possible Congress will step in and pass a new law that would render the Times’ claim moot. Some legal scholars have suggested Congress should create a “fair learning” law that gives software companies an explicit right to use copyrighted material for AI training. Meanwhile, those sympathetic to rights holders have suggested lawmakers should mandate that creators are compensated for any works used to train AI. Congress could also insist that AI companies apply filters to screen out any model outputs that are identical to copyrighted material used in training. There is a precedent for Congress weighing in this way: The 1992 Audio Home Recording Act exempted sellers of digital audio tape from being sued for copyright infringement. But it also set up a licensing fee that all manufacturers and importers of audio recording devices have to pay to the Copyright Office, which then distributes those funds as royalty payments to music rights holders. Congress could wind up establishing a similar licensing and royalty regime for generative AI software.

Finally, even if OpenAI is ultimately forced to pay creators licensing fees, it can probably afford it. The company is, according to some news accounts, currently bringing in revenue at a $1.6 billion per year clip, with some insiders predicting that this figure will hit $5 billion before 2024 is out. With this kind of cash machine, OpenAI can probably survive. While copyright infringement claims sank Napster, Spotify was eventually able to reach a settlement with music rights holders. And while those payments crimped Spotify’s profits, and the company has lately struggled to sell stock investors on a convincing growth story, Spotify is also not about to go bust.

So, no, I don’t think OpenAI will go under. But I do think the Times’ lawsuit signifies that the era of freely using copyrighted material for AI training is coming to an end. The threat of lawsuits will push most companies building AI models to license any data they use. For instance, there are reports that Apple is currently in discussions to do exactly this for the data it is seeking to train its own AI models. In image generation, artists are also increasingly turning to masking technology that makes it impossible to effectively train AI models on their work without consent. Similar technology does not yet exist for text or music, but researchers are working on it. And plenty of publishers have now taken steps to prevent their websites from being freely scraped by web crawlers. Pretty soon, the only way companies are going to be able to obtain the data they need to train good generative AI models is if they pay to license it. One way or another, the sun is setting on the Wild West of generative AI.

And with that, more AI news below.

Jeremy Kahn
@jeremyakahn


Ex-Trump lawyer blames AI for fake precedents cited in legal brief. The former Trump fixer Michael Cohen said in court papers unsealed last week that he accidentally provided his own lawyer with fictitious legal citations used in a filing submitted to a federal judge because he relied on Google's AI chatbot Bard. Cohen said he had not realized Bard could hallucinate, creating realistic-looking but fictitious citations, and had provided these cases to his lawyer not expecting the attorney, David Schwartz, would drop them into his brief without checking them for accuracy, the New York Times reported. Schwartz had filed a motion asking the court to end its supervision of Cohen, now that Cohen has been released from prison after serving time for campaign finance law violations. The Bard hallucinations could factor in the upcoming New York criminal trial of former President Donald Trump where Cohen is expected to serve as a key prosecution witness. Trump’s lawyers have seized on the fake citations as evidence that Cohen is an unreliable and untrustworthy witness.

U.S. Supreme Court Chief Justice offers thoughts on AI and the law. Chief Justice John Roberts offered his thoughts on AI in the legal system in a year-end report published last week, the Independent reported. Roberts said that AI would not replace human judges any time soon but predicted that AI would increasingly be used to help lawyers prepare cases and do legal research. He said that such AI software could help level the playing field, improving access to legal resources for Americans who might not otherwise be able to afford them. However, he cautioned about AI's risks, including the problem of fake citations leading to legal errors, using the Michael Cohen news as an example, and warning about possible data privacy issues. He advised legal professionals to use AI with caution and humility.

U.K. terrorism law monitor warns AI chatbot could radicalize people. A lawyer appointed by the British government to assess its terrorism-related legislation says the country’s laws are insufficient to prevent people from being radicalized by AI chatbots. The lawyer, Jonathan Hall KC, told British newspaper the Telegraph that he chatted with a digital persona created by AI startup character.ai that was designed to mimic the head of the Islamic State and that it tried to recruit him to the terrorist group. He said the country currently had no laws that would hold someone responsible in cases where an AI chatbot, rather than a person, generated text that encouraged terroristic activities. Character.ai's terms and conditions prohibit users from uploading content that promotes violence and extremism but do not prevent the chatbot itself from outputting such content. Character.ai told the newspaper that its products “should never produce responses that encourage users to harm others.”

Nobel-winning economist cautions on STEM emphasis in new AI era. Christopher Pissarides, a Nobel-prize-winning labor market economist who works at the London School of Economics, said computer programmers were now sowing the seeds of their own destruction with the development of AI. He predicted that many coding and engineering roles in the future may be taken over by AI, while the skills that will be in high demand will be the empathetic and creative ones that humanities and liberal arts programs emphasize. He said that jobs requiring face-to-face contact, such as hospitality and health care, would not easily be replicated by AI, according to Bloomberg.


Sharing the burden. Many LLMs require huge amounts of computing power, not just to train, but also for inference. So there is growing interest in how this computing power might be federated, allowing groups of people without access to high-powered GPU clusters to run big AI models using laptops and PCs with a few GPUs available. Researchers from Yandex, Neiro.ai, the University of Washington, and Hugging Face have now proposed a method for distributed inference and for fine-tuning LLMs, an algorithm they call PETALS. They demonstrate that it can work on both Llama 2, which is an open-source 70 billion parameter LLM, and BLOOM, which is a 176 billion parameter model. With PETALS, each computer in the network only has to handle less than 3% of the full model weights, and it can run efficiently despite the latency and information loss that comes from trying to integrate lots of machines across the internet. The paper is available on the non-peer-reviewed research repository arxiv.org.


Boards are woefully unprepared for AI. Here’s how they can start to catch up —by Lila MacLellan

IBM AI chief advises people who want a tech job in 2024 to learn the language and creative thinking skills you get with the liberal arts —by Ryan Hogg

These movies do the best job of accurately capturing AI’s power and nuance, according to 10 AI experts —by Kylie Robison

Queen Latifah feels the same ‘nervousness that everyone feels’ about AI, but she’s monetizing her digital avatar. ‘It’s a bell we can’t un-ring’ —by Rachyl Jones

This is the online version of Eye on AI, Fortune's weekly newsletter on how AI is shaping the future of business.



  1. (PDF) Review of Corpus Linguistics Beyond the Word. Corpus Research

  2. (PDF) Corpus Linguistics: Mixed‐Methods Research

  3. (PDF) Corpus Linguistics and the Automatic Analysis of English

  4. Studies in Corpus Linguistics

  5. (PDF) Current trends in Corpus Linguistics and textual variation

  6. TEXTBOOK Corpus Linguistics


  1. Teaching Young Learners in Corpus linguistics explained in Urdu and Hindi

  2. Linguistics


  4. Conducting Research in the Database Academic Search Complete

  5. Asking linguistics professors questions 📚 Dr Daniel Van Olmen at Lancaster University 📚



  1. Corpus Linguistics

    Series home About this Cambridge Elements series Corpus Linguistics has grown to become part of the mainstream of Linguistics and Applied Linguistics, as well as being used as an adjunct to other forms of discourse analysis in a variety of fields.

  2. Review of Corpus Linguistics for Education: A Guide for Research

    Corpus Linguistics for Education shows that corpus linguistics research is not only useful in the field of linguistics but also in other fields, such as education. Researchers can use this book as a guideline for conducting educational research by adopting a linguistics-based corpus.

  3. PDF Perspectives on Corpus Linguistics: The Methodological Synergy in ...

    3. Meaning and Etymology of Corpus linguistics The term corpus linguistics was coined by Jan Aarts in the early (1980) who was hesitant in using the term, although many other linguists have been less than happy with it at that time (Leech, 2011). Later it is conveniently seen as an umbrella term for linguistic research that depends on the

  4. (PDF) Research Trends in Corpus Linguistics: A Bibliometric Analysis of

    This paper uses a bibliometric analysis to map the field of Corpus Linguistics (CL) research in arts and humanities over the last 20 years, while tracking changes in the popular CL research...

  5. Applied Corpus Linguistics

    Corpus research is no longer confined primarily to the study of linguistics and to generalised language description but is now applied in diverse fields, such as forensic linguistics, social policy studies, food studies, anthropology, writing development studies, translation and interpreting, and the analysis of corporate and government communic...

  6. Corpus Linguistics and Linguistic Theory

    Corpus Linguistics and Linguistic Theory (CLLT) is a peer-reviewed journal publishing high-quality original corpus-based research focusing on theoretically relevant issues in all core areas of linguistic research, or other recognized topic areas.

  7. Research in Corpus Linguistics

    Research in Corpus Linguistics Open Call for Papers 2020-02-02 Articles falling within one of the four categories published in RiCL are welcome through the whole year and will be evaluated according to the journal's editorial policies

  8. The importance of corpus-based research for language teachers

    Corpus-based research has three important characteristics. First, corpus-based studies use a principled collection of naturally-occurring texts; i.e. the corpus. Many corpora are now available, including some on the Web, and teachers can also compile their own corpora for classroom use.

  9. What does corpus linguistics have to offer to language assessment

    While corpus linguistic tools and methods have been used extensively in second language learning research, they have also been used increasingly in the design and validation of language assessments (Callies & Götz, 2015; Deshors, Götz, & Laporte, 2016; Park, 2014). The collection of papers in this special issue represents an intentional and ...

  10. Applied Corpus Linguistics: a hub for data-driven research with real

    In their article Patient feedback and duration of treatment: A corpus-based analysis of written comments on cancer care in England, researchers at Lancaster University explore "the value of using (corpus) linguistic methods to approach seemingly non-linguistic questions." In the study, they analyze a corpus of 214,340 feedback statements ...

  11. Full article: Corpus Linguistic Onomastics: A Plea for a Corpus-Based

    This article seeks to advance the field through a theoretical discussion of onomastic issues from a corpus linguistic point of view. It presents an overview of the linguistic status, meaning and grammar of proper names in order to highlight aspects that lend themselves to corpus linguistic inquiry. Earlier onomastic research is adduced, to ...

  12. Corpus Linguistics

    Corpus linguistics is a relatively new discipline, and a fast-changing one. As computer resources, particularly web-based ones, develop, sophisticated corpus investigations come within the reach of the ordinary translator, language learner, or linguist.

  13. PDF An IntroductIon to corpus LInguIstIcs

    In short, corpus linguistics serves to answer two fundamental research questions: 1. What particular patterns are associated with lexical or grammatical features? 2. How do these patterns differ within varieties and registers?

  14. Review of Corpora in Applied Linguistics (2nd Edition ...

    Review of Corpora in Applied Linguistics (2nd Edition), Susan Hunston, Cambridge University Press, Cambridge 2022, ISBN: 9781108616218. Book review by Mehrdad Vasheghani Farahani & Mahsa Pahlevanzade Fini, Corpus Pragmatics, published 13 November 2023.

  15. Research trends in corpus linguistics

    Abstract This paper uses a bibliometric analysis to map the field of Corpus Linguistics (CL) research in arts and humanities over the last 20 years, tracking changes in popular CL research topics, outlets, highly cited authors, and geographical origins based on the metadata of 5,829 CL-related articles from 429 Scopus-indexed journals. Results reveal an increase in corpus-assisted discourse ...

  16. Corpus Linguistics and Corpus-Based Research and Its Implication in

    Corpus Linguistics and Corpus-Based Research and Its Implication in Applied Linguistics: A Systematic Review CC BY-SA 4.0 Authors: Ali Mohammed Saleh Al-Hamzi Sana'a University Ayoub Gougui...

  17. Research in Corpus Linguistics

    Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language.

  18. Research in Corpus Linguistics

    Research in Corpus Linguistics Published by Research in Corpus Linguistics Online ISSN: 2243-4712 Recent articles Review of Egbert, Jesse, Douglas Biber and Bethany Gray. 2022. Designing and...


    Abstract. This chapter offers an introduction to corpus linguistics as a methodology for studying language, literature, and other fields in the humanities. It defines corpus linguistics, explores ...

  20. CORPORUM: Journal of Corpus Linguistics

    Corporum: Journal of Corpus Linguistics is a biannual and double blind peer reviewed journal that publishes original research on corpus-based approaches to language and society. It is the first journal of its kind in Pakistan and is affiliated with the Corpus Research Centre at Air University, Islamabad. You can browse the current and past issues, submit your manuscripts, and learn more about ...

  21. Corpus linguistic techniques applied to textlinguistics

    1. Introduction. Corpus linguistics is already a well-established area for language description and analysis (Aijmer and Altenberg, 1991). However during the last decade there has been a discernible shift in the use of computerised text corpora from pure linguistic research per se to a more `applied' corpus linguistic perspective where the focus is on the learner in some way.

  22. Using genre analysis and corpus linguistics to teach research article

    Corpus linguistics is another tool for providing evidence of what is both acceptable and commonly used in research writing. Two elements are needed for this approach—a corpus and a concordancing software program. ... Output of a concordance search using AdTAT of a corpus of 30 research articles in the field of biotechnology: search term role; ...

  23. Book Review: The language of mental illness corpus linguistics and the

    The Language of Mental Illness: Corpus Linguistics and the Construction of Mental Illness in the Press is a pioneering monograph by Hazel Price, which offers the first large-scale linguistic analysis of UK news reports on mental illness. It is intended to address a research gap in previous studies into press representations of mental illness, the majority of which focused on the field of ...

  24. Corpus Linguistics and Corpus-Based Research and Its Implication in

    Corpus & Discourse Analysis Discourse Analysis can also profit from corpus linguistics research. Two studies have made the use of corpus linguistic research to reinforce the capacity and efficiency of discourse analysis. One study titled "Using corpora in discourse analysis" by Partington & Marchi (2015) indicates that the most explicit ...

  25. How People With a Bipolar Disorder Diagnosis Talk About Personal

    Ideally, this encourages other researchers to apply it in their research. Third, the combination of corpus linguistics and qualitative framework analysis allowed the analysis of large amounts of data. Hence, corpus framework analysis may also be useful for future studies of text data, such as therapy transcripts (eg, ). Lastly, the S-BiDD data ...

  26. What the New York Times' copyright suit against OpenAI means

    The Times alleges that tens of thousands of its articles were copied, without its permission, in the process of training the GPT models that underpin OpenAI's ChatGPT and Microsoft's CoPilot ...