
Book Review: The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences

Rob Kitchin, Sage, London, 222 pp. ISBN-13 978-1446287484

  • Published: 22 February 2015
  • Volume 24, pages 385–388 (2015)

  • Drew Paine

References

Hey, T., Tansley, S., and Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.

Jirotka, M., Lee, C.P., and Olson, G.M. (2013). Supporting Scientific Collaboration: Methods, Tools and Concepts. Computer Supported Cooperative Work (CSCW), vol. 22, no. 4–6, pp. 667–715.


Jirotka, M., Procter, R., Rodden, T. and Bowker, G. (2006). Special Issue: Collaboration in E-Research. Computer Supported Cooperative Work (CSCW), vol. 15, no. 4, pp. 251–255.

Kitchin, Rob (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. London: Sage.

Ribes, D. and Lee, C.P. (2010). Sociotechnical Studies of Cyberinfrastructure and E-Research: Current Themes and Future Trajectories. Computer Supported Cooperative Work (CSCW), vol. 19, no. 3, pp. 231–244.


Author information

Authors and Affiliations

University of Washington, Seattle, WA, USA


Corresponding author

Correspondence to Drew Paine.


About this article

Paine, D. Book Review: The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences. Comput Supported Coop Work 24, 385–388 (2015). https://doi.org/10.1007/s10606-015-9220-y

Published: 22 February 2015

Issue Date: August 2015

DOI: https://doi.org/10.1007/s10606-015-9220-y



The Data Revolution and Economic Analysis

Liran Einav, Jonathan D. Levin


Many believe that “big data” will transform business, government and other aspects of the economy. In this article we discuss how new data may impact economic policy and economic research. Large-scale administrative datasets and proprietary private sector data can greatly improve the way we measure, track and describe economic activity. They also can enable novel research designs that allow researchers to trace the consequences of different events or policies. We outline some of the challenges in accessing and making use of these data. We also consider whether the big data predictive modeling tools that have emerged in statistics and computer science may prove useful in economics.
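As a concrete, if simplified, illustration of the predictive modeling tools mentioned above, the sketch below fits ordinary least squares and a cross-validated lasso to a simulated dataset with many candidate predictors and compares their out-of-sample fit. It is a minimal sketch under invented assumptions: the data, variable counts and model choices are illustrative and are not drawn from the paper.

```python
# Minimal sketch (illustrative only, not from the paper): comparing OLS and a
# cross-validated lasso for out-of-sample prediction when many candidate
# predictors are recorded but only a few carry signal.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

n_obs, n_features = 5_000, 200                 # many records, many candidate predictors
X = rng.normal(size=(n_obs, n_features))
true_coef = np.zeros(n_features)
true_coef[:10] = rng.normal(size=10)           # only 10 predictors actually matter
y = X @ true_coef + rng.normal(scale=2.0, size=n_obs)   # a stand-in outcome (e.g. spending)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
lasso = LassoCV(cv=5).fit(X_train, y_train)    # penalty strength chosen by cross-validation

print("OLS   out-of-sample R^2:", round(r2_score(y_test, ols.predict(X_test)), 3))
print("Lasso out-of-sample R^2:", round(r2_score(y_test, lasso.predict(X_test)), 3))
print("Predictors retained by lasso:", int(np.sum(lasso.coef_ != 0)))
```

The point of the sketch is the workflow rather than the specific numbers: regularized methods of this kind select a sparse set of predictors and are evaluated on held-out data, which is how predictive tools are typically judged when applied to large administrative or proprietary datasets.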

We thank Susan Athey, Preston McAfee, Erin Scott, Scott Stern and Hal Varian for comments. We are grateful for research support from the NSF, the Alfred P. Sloan Foundation, and the Toulouse Network on Information Technology.

This paper draws on some of my experience in related work that used proprietary data from various companies, which were obtained through contracts that I and my coauthors signed with each of these companies (eBay Research, Alcoa, Safeway, and a subprime lender).

Levin consulted in 2010-11 for eBay Research, and has consulted for other internet companies, and for the US government. He has received research funding from the Alfred P. Sloan Foundation, the National Science Foundation, and the Toulouse Network on Information Technology.


  • Open access
  • Published: 04 August 2020

Moving back to the future of big data-driven research: reflecting on the social in genomics

  • Melanie Goisauf (ORCID: orcid.org/0000-0002-3909-8071)
  • Kaya Akyüz (ORCID: orcid.org/0000-0002-2444-2095)
  • Gillian M. Martin (ORCID: orcid.org/0000-0002-5281-8117)

Humanities and Social Sciences Communications, volume 7, Article number: 55 (2020)

Subject: Science, technology and society

With the advance of genomics, specific individual conditions have received increased attention in the generation of scientific knowledge. This spans the extremes of the aim of curing genetic diseases and identifying the biological basis of social behaviour. In this development, the ways knowledge is produced have gained significant relevance, as the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory. This article argues that an in-depth discussion and critical reflection on the social configurations that are inscribed in, and reproduced by, genomic data-intensive research is urgently needed. This is illustrated by debating a recent case: a large-scale genome-wide association study (GWAS) on sexual orientation that suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). This case is analysed from three angles: (1) the demonstration of how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) the exploration of the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) the demonstration of how the assumption of being 'free from theory' in this case does not mean free of choices made, which are themselves restricted by the data that are available. In questioning how key sociological categories are incorporated in a wider scientific debate on genetic conditions and knowledge production, the article shows how underlying classifications and categorizations, which are inherently social in their production, can have wide-ranging implications. The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.


Introduction

With the advance of genomic research, specific individual conditions received increased attention in scientific knowledge generation. While understanding the genetic foundations of diseases has become an important driver for the advancement of personalized medicine, the focus of interest has also expanded from disease to social behaviour. These developments are embedded in a wider discourse in science and society about the opportunities and limits of genomic research and intervention. With the emergence of the genome as a key concept for ‘life itself’, understandings of health and disease, responsibility and risk, and the relation between present conditions and future health outcomes have shifted, impacting also the ways in which identities are conceptualized under new genetic conditions (Novas and Rose 2000 ). At the same time, the growing literature of postgenomics points to evolving understandings of what ‘gene’ and ‘environment’ are (Landecker and Panofsky 2013 ; Fox Keller 2014 ; Meloni 2016 ). The postgenomic genome is no longer understood as merely directional and static, but rather as a complex and dynamic system that responds to its environment (Fox Keller 2015 ), where the social as part of the environment becomes a signal for activation or silencing of genes (Landecker 2016 ). At the same time, genetic engineering, prominently known as the gene-editing technology CRISPR/Cas9, has received considerable attention, but also caused concerns regarding its ethical, legal and societal implications (ELSI) and governance (Howard et al. 2018 ; Jasanoff and Hurlbut 2018 ). Taking these developments together, the big question of nature vs. nurture has taken on a new significance.

Studies which aim to reveal how biology and culture are being put in relation to each other appear frequently and pursue a genomic re-thinking of social outcomes and phenomena, such as educational attainment (Lee et al. 2018 ) or social stratification (Abdellaoui et al. 2019 ). Yet, we also witness very controversial applications of biotechnology, such as the first known case of human germline editing by He Jiankui in China, which has impacted the scientific community both as an impetus of wide protests and insecurity about the future of gene-editing and its use, but also instigated calls towards public consensus to (re-)set boundaries to what is editable (Morrison and de Saille 2019 ).

Against this background, we are going to debate in this article a particular case that appeared within the same timeframe as these developments: a large-scale genome-wide association study (GWAS) on sexual orientation Footnote 1 , which suggested partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b ). Some scientists have been claiming sexual orientation to be partly heritable and trying to identify genetic basis for sexual orientation for years (Hamer et al. 1993 ); however, this was the first time that genetic variants were identified as statistically significant and replicated in an independent sample. We consider this GWAS not only by questioning the ways genes are associated with “the social” within this research, but also by exploring how the complexity of the social is reduced through specific data practices in research.

The sexual orientation study also constitutes an interesting case for reflecting on how knowledge is produced at a time when the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory (Meloni 2014). Large amounts of genomic data are needed to identify genetic variations and to find correlations with different biological and social factors. The rise of the genome corresponds to the rise of big data, as the collection and sharing of genomic data gains power with the development of big data analytics (Parry and Greenhough 2017). A growing number of correlations, e.g. in the genomics of educational attainment (Lee et al. 2018; Okbay et al. 2016), are being found that link the genome to the social, increasingly blurring the established biological/social divide. These could open up new ways of understanding life and underpin the importance of culture, while, paradoxically, they may also carry the risk of new genetic determinism and essentialism. The changing understanding of the now molecularised and datafied body also illustrates the changing significance of empirical research and sociology (Savage and Burrows 2007) in the era of postgenomics and 'datafication' (Ruckenstein and Schüll 2017). These developments are situated within methodological debates in which the social sciences often appear through the perspective of ELSI.

As the field of genomics is progressing rapidly and intervention in the human genome is no longer science fiction, we argue that it is important to discuss and reflect now on the social configurations that are inscribed in, and reproduced by, genomic data-driven research. These may co-produce the conception of certain potentially editable conditions, i.e. create new, and reproduce existing, classifications that are largely shaped by societal understandings of difference and order. Such definitions could have real consequences—as Thomas and Thomas (1929) remind us—for individuals and societies, and mark what has been described as an epistemic shift in biomedicine from the clinical gaze to the 'molecular gaze', where the processes of "medicalisation and biomedicalisation both legitimate and compel interventions that may produce transformations in individual, familial and other collective identities" (Clarke et al. 2013, p. 23). While Science and Technology Studies (STS) has demonstrated how science and society are co-produced in research (Jasanoff 2004), we want to use the momentum of the current discourse to critically reflect on these developments from three angles: (1) we demonstrate how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) we explore the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) using the GWAS case in focus, we show how the assumption of being 'free from theory' (Kitchin 2014a) in this case does not mean free of choices made, choices which are themselves restricted by the data that are available. We highlight Griffiths' (2016) contention that the material nature of genes, their impact on the biological makeup of individuals and their socially and culturally situated behaviour are not deterministic, and need to be understood within the dynamic, culturally and temporally situated context within which knowledge claims are made. We conclude by making the important point that ignoring the social may lead to a distorted, datafied, genomised body which ignores the key fact that "genes are not stable but essentially malleable" (Prainsack 2015) and that this 'malleability' is rooted in the complex interplay between biological and social environments.

From this perspective, the body is understood through the lens of embodiment, considering humans ‘live’ their genome within their own lifeworld contexts (Rehmann-Sutter and Mahr 2016 ). We also consider this paper as an intervention into the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

In the following reflections, we proceed step by step: First, we introduce the case of the GWAS on same-sex sexual behaviour, as well as its limits, context and impact. Second, we recall key sociological theory on categorizations and their implications. Third, we discuss the emergence of a digital-datafication of scientific knowledge production. Finally, we conclude by cautioning against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

Studying sexual orientation: The case of same-sex sexual behaviour

Currently, a number of studies at the intersection of genetic and social conditions appear on the horizon. Just as in the examples we have already mentioned, such as those on educational attainment (Lee et al. 2018 ), or social stratification (Abdellaoui et al. 2019 ), it is important to note that the limit to such studies is only the availability of the data itself. In other words, once the data is available, there is always the potential that it would eventually be used. This said, an analysis of the entirety of the genomic research on social outcomes and behaviour is beyond the scope of this article. Therefore, we want to exemplify our argument with reference to the research on the genetics of same-sex sexual behaviour.

Based on a sample of half a million individuals of European ancestry, the first large-scale GWAS of its kind claims five genetic variants to be contributing to the assessed "same-sex sexual behaviour" (Ganna et al. 2019b). Among these variants, two are relevant only for male–male sexual behaviour, one only for female–female sexual behaviour, and the remaining two for both. The data that led to this analysis were sourced from biobanks/cohorts with different methods of data collection. The authors conclude that these genetic variations are not predictive of sexual orientation; not only because genetics is supposedly only part of the picture, but also because the variations account for only a small part (<1% of the variance in same-sex sexual behaviour, p. 4) of the approximated genetic basis (8–25% of the variance in same-sex sexual behaviour) that may be identified with large sample sizes (p. 1). The study is an example of how the 'gay gene' discourse that has been around for years gets transformed with the available data accumulating in biobanks and the consequent genomic analysis, offering only one facet of a complex social phenomenon: same-sex sexual behaviour.
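The gap between these two figures can be illustrated with a schematic simulation. The sketch below does not use the study's data or pipeline; it assumes a continuous trait, arbitrary allele frequencies and effect sizes, and a simulated heritability of 15% simply to show how a handful of top-ranked variants can explain well under 1% of trait variance even when all variants together account for a far larger share.

```python
# Schematic simulation (not the study's data or pipeline): why a few genome-wide
# significant variants explain far less variance than the total SNP-based heritability.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_snps = 10_000, 1_000

# Genotypes coded as allele counts 0/1/2, with allele frequency 0.3 for simplicity.
G = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)

# A polygenic, continuous trait: many tiny effects scaled to ~15% heritability.
beta = rng.normal(scale=0.01, size=n_snps)
genetic = G @ beta
genetic *= np.sqrt(0.15 / np.var(genetic))            # set genetic variance to 0.15
trait = genetic + rng.normal(scale=np.sqrt(0.85), size=n_people)

def variance_explained(snp_idx):
    """R^2 from an OLS fit of the trait on the selected SNPs (plus an intercept)."""
    X = np.column_stack([np.ones(n_people), G[:, snp_idx]])
    coef, *_ = np.linalg.lstsq(X, trait, rcond=None)
    resid = trait - X @ coef
    return 1 - resid.var() / trait.var()

top5 = np.argsort(np.abs(beta))[-5:]                   # the five strongest 'hits'
print("Simulated SNP heritability:", round(np.var(genetic) / np.var(trait), 3))
print("Variance explained by the 5 'hits':", round(variance_explained(top5), 4))
```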

The way the GWAS has been conducted was not novel in terms of data collection. Genome-wide studies of similar scale, e.g. on insomnia (Jansen et al. 2019 ) or blood pressure (Evangelou et al. 2018 ), often rely on already collected data in biobanks rather than trying to collect hundreds of thousands of individuals’ DNA from scratch. Furthermore, in line with wider developments, the study was preregistered Footnote 2 with an analysis plan for the data to be used by the researchers. Unlike other GWASes, however, the researchers partnered with an LGBTQIA+ advocacy group (GLAAD) and a science communication charity (Sense About Science), where individuals beyond the research team interpreted the findings and discussed how to convey the results Footnote 3 . Following these engagements, the researchers have produced a website Footnote 4 with potential frequently asked questions as well as a video about the study, highlighting what it does and what it does not claim.

Despite efforts to control the drifting away of the study into genetic deterministic and discriminatory interpretations, the study has been criticized by many Footnote 5 . Indeed, the controversial “How gay are you?” Footnote 6 app on the GenePlaza website utilized the findings of the study, which in turn raised the alarm bells and, ultimately, was taken down after much debate. The application, however, showed how rapidly such findings can translate into individualized systems of categorization, and consequently feed into and be fed by the public imaginary. One of the study authors demands continuation of research by noting “[s]cientists have a responsibility to describe the human condition in a more nuanced and deeper way” (Maxmen, 2019 , p. 610). Critics, however, note that the context of data collected from the individuals may have influence on the findings; for instance, past developments (i.e. decriminalization of homosexuality, the HIV/AIDS epidemic, and legalization of same-sex marriage) are relevant to understand the UK Biobank’s donor profile and if the GWAS were to be redone according to the birth year of the individuals, different findings could have come out of the study (Richardson et al. 2019 , p. 1461).

It has been pointed out that such research should be assessed by a competent ethical review board according to its potential risks and benefits (Maxmen 2019 , p. 610), in addition to the review and approval by the UK Biobank Access Sub-Committee (Ganna et al. 2019a , p. 1461). Another ethical issue of concern raised by critics is that the informed consent form of UK Biobank does not specify that it could be used for such research since “homosexuality has long been removed from disease classifications” and that the broad consent forms allow only “health-related research” (Holm and Ploug 2019 , p. 1460). We do not want to make a statement here for or against broad consent. However, we argue that discussions about informed consent showcase the complexities related to secondary use of data in research. Similarly, the ‘gay gene’ app developed in the wake of the sexual orientation study, revealed the difficulty of controlling how the produced knowledge may be used, including in ways that are openly denounced by the study authors.

To the best of our knowledge, there have not been similar genome-wide studies published on sexual orientation and, while we acknowledge the limitations associated with focusing on a single case in our discussion, we see this case as relevant to opening up the following question: How are certain social categorizations incorporated into the knowledge production practices? We want to answer this by first revisiting some of the fundamental sociological perspectives into categorizations and the social implications these may have.

Categorizing sex, gender, bodies, disease and knowledge

Sociological perspectives on categorizations

Categorizations and classifications take a central role in the sociology of knowledge, social stratifications and data-based knowledge production. Categories like gender, race, sexuality and class (and their intersection, see Crenshaw 1989 ) have become key classifications for the study of societies and in understanding the reproduction of social order. One of the most influential theories about the intertwining of categories like gender and class with power relations was formulated by Bourdieu ( 2010 , 2001 ). He claimed that belonging to a certain class or gender is an embodied practice that ensures the reproduction of social structure which is shaped by power relations. The position of subjects within this structure reflects the acquired cultural capital, such as education. Incorporated dispositions, schemes of perception, appreciation, classification that make up the individual’s habitus are shaped by social structure, which actors reproduce in practices. One key mechanism of social categorization is gender classification. The gender order appears to be in the ‘nature of things’ of biologically different bodies, whereas it is in fact an incorporated social construction that reflects and constitutes power relations. Bourdieu’s theory links the function of structuring classifications with embodied knowledge and demonstrates that categories of understanding are pervaded by societal power relations.

In a similar vein Foucault ( 2003 , 2005 ) describes the intertwining of ordering classifications, bodies and power in his study of the clinic. Understandings of and knowledge about the body follow a specific way of looking at it—the ‘medical gaze’ of separating the patient’s body from identity and distinguishing healthy from the diseased, which, too, is a process pervaded by power differentials. Such classifications evolved historically. Foucault reminds us that all periods in history are characterized by specific epistemological assumptions that shape discourses and manifest in modalities of order that made certain kinds of knowledge, for instance scientific knowledge, possible. The unnoticed “order of things”, as well as the social order, is implemented in classifications. Such categorizations also evolved historically for the discourse about sexuality, or, in particular as he pointed out writing in the late 1970s, distinguishing sexuality of married couples from other forms, such as homosexuality (Foucault 1998 ).

Bourdieu and Foucault offer two influential approaches within the wider field of sociology of knowledge that provide a theoretical framework on how categorizations and classifications structure the world in conjunction with social practice and power relations. Their work demonstrates that such structuration is never free from theory, i.e. they are not existing prediscursively, but are embedded within a certain temporal and spatial context that constitutes ‘situated knowledge’ (Haraway 1988 ). Consequently, classifications create (social) order that cannot be understood as ‘naturally’ given but as a result of relational social dynamics embedded in power differentials.

Feminist theory in the 1970s emphasized the inherently social dimension of male and female embodiment, which distinguished between biological sex and socially rooted gender. This distinction built the basis for a variety of approaches that examined gender as a social phenomenon, as something that is (re-)constructed in social interaction, impacted by collectively held beliefs and normative expectations. Consequently, the difference between men and women was no longer simply understood as a given biological fact, but as something that is, also, a result of socialization and relational exchanges within social contexts (see, e.g., Connell 2005 ; Lorber 1994 ). Belonging to a gender or sex is a complex practice of attribution, assignment, identification and, consequently, classification (Kessler and McKenna 1978 ). The influential concept of ‘doing gender’ emphasized that not only the gender, but also the assignment of sex is based on socially agreed-upon biological classification criteria, that form the basis of placing a person in a sex category , which needs to be practically sustained in everyday life. The analytical distinction between sex and gender became eventually implausible as it obscures the process in which the body itself is subject to social forces (West and Zimmerman 1991 ).

In a similar way, sexual behaviour and sexuality are also shaped by society, as societal expectations influence sexual attraction—in many societies within normative boundaries of gender binary and heteronormativity (Butler 1990 ). This also had consequences for a deviation from this norm, resulting for example in the medicalisation of homosexuality (Foucault 1998 ).

Reference to our illustrative case study on the recently published research into the genetic basis of sexuality brings the relevance of this theorization into focus. The study cautions against the ‘gay gene’ discourse, the use of the findings for prediction, and genetic determinism of sexual orientation, noting “the richness and diversity of human sexuality” and stressing that the results do not “make any conclusive statements about the degree to which ‘nature’ and ‘nurture’ influence sexual preference” (Ganna et al. 2019b , p. 6).

Coming back to categorizations, more recent approaches from STS are also based on the assumption that classifications are a "spatio-temporal segmentation of the world" (Bowker and Star 2000, p. 10), and that classification systems are, similar to concepts of gender theory (e.g. Garfinkel 1967), consistent, mutually exclusive and complete. The "International Classification of Diseases (ICD)", a classification scheme of diseases based on their statistical significance, is an example of such a historically grown knowledge system. How the ICD is utilized in practice points to the ethical and social dimensions involved (Bowker and Star 2000). Such approaches help to unravel current epistemological shifts in medical research and intervention, including the removal of homosexuality from the disease classification half a century ago.

Re-classifying diseases in tandem with genetic conditions creates new forms of 'genetic responsibilities' (Novas and Rose 2000). For instance, this may result in a change of the 'sick role' (described early in Parsons 1951) in creating new obligations not only for diseased but also for actually healthy persons in relation to potential futures. Such genetic knowledge is increasingly produced using large-scale genomic databases and creates new categories based on genetic risk, and consequently, may result in new categories of individuals that are 'genetically at risk' (Novas and Rose 2000). The question now is how these new categories will alter, structure or replace evolved categories, in terms of constructing the social world and medical practice.

While advancement in genomics is changing understandings of bodies and diseases, the meanings of certain social categories for medical research remain rather stable. Developments of personalized medicine go along with “the ‘re-inscription’ of traditional epidemiological categories into people’s DNA” and adherence to “old population categories while working out new taxonomies of individual difference” (Prainsack 2015 , pp. 28–29). This, again, highlights the fact that knowledge production draws on and is shaped by categories that have a political and cultural meaning within a social world that is pervaded by power relations.

From categorization to social implication and intervention

While categorizations are inherently social in their production, their use in knowledge production has wide ranging implications. Such is the case of how geneticisation of sexual orientation has been an issue that troubled and comforted the LGBTQIA+ communities. Despite the inexistence of an identified gene, ‘gay gene’ has been part of societal discourse. Such circulation disseminates an unequal emphasis on the biologized interpretations of sexual orientation, which may be portrayed differently in media and appeal to groups of opposing views in contrasting ways (Conrad and Markens 2001 ). Geneticisation, especially through media, moves sexual orientation to an oppositional framework between individual choice and biological consequence (Fausto-Sterling 2007 ) and there have been mixed opinions within LGBTQIA+ communities, whether this would resolve the moralization of sexual orientation or be a move back into its medicalisation (Nelkin and Lindee 2004 ). Thus, while some activists support geneticisation, others resist it and work against the potential medicalisation of homosexuality (Shostak et al. 2008 ). The ease of communicating to the general public simple genetic basis for complex social outcomes which are genetically more complex than reported, contributes to the geneticisation process, while the scientific failures of replicating ‘genetic basis’ claims do not get reported (Conrad 1999 ). In other words, while finding a genetic basis becomes entrenched as an idea in the public imaginary, research showing the opposite does not get an equal share in the media and societal discourse, neither of course does the social sciences’ critique of knowledge production that has been discussed for decades.

A widely, and often quantitatively, studied aspect of the geneticisation of sexual orientation is how it plays out in the broader understanding of sexual orientation in society. While there are claims that geneticisation of sexual orientation can result in a depoliticization of identities (O'Riordan 2012), it may at the same time lead to a polarization of society. According to social psychologists, genetic attributions to conditions are likely to lead to perceptions of immutability, specificity in aetiology, homogeneity and discreteness, as well as naturalistic fallacy (Dar-Nimrod and Heine 2011). Despite a multitude of surveys suggesting that belief in a genetic basis of homosexuality correlates with acceptance, some studies suggest that learning about a genetic attribution of homosexuality can be polarizing and confirmatory of previously held negative or positive attitudes (Boysen and Vogel 2007; Mitchell and Dezarn 2014). Such conclusions can be taken as a precaution that, just as scientific knowledge production is social, its consequences are, too.

Looking beyond the case

We want to exemplify this argument by taking a detour to another case where the intersection between scientific practice, knowledge production and the social environment is of particular interest. While we have discussed the social implications of geneticisation with a focus on sexual orientation, recent developments in biomedical sciences and biotechnology also have the potential to reframe the old debates in entirely different ways. For instance, while ‘designer babies’ were only an imaginary concept until recently, the facility and affordability of processes, such as in vitro selection of baby’s genotype and germline genome editing, have potentially important impacts in this regard. When CRISPR/Cas9 technique was developed for rapid and easy gene editing, both the hopes and worries associated with its use were high. Martin and others ( 2020 , pp. 237–238) claim gene editing is causing both disruption within the postgenomic regime, specifically to its norms and practices, and the convergence of various biotechnologies such as sequencing and editing. Against this background, He Jiankui’s announcement in November 2018 through YouTube Footnote 7 that twins were born with edited genomes was an unwelcome surprise for many. This unexpected move may have hijacked the discussions on ethical, legal, societal implications of human germline genome-editing, but also rang the alarm bells across the globe for similar “rogue” scientists planning experimentation with the human germline (Morrison and de Saille 2019 ). The facility to conduct germline editing is, logically, only one step away from ‘correcting’ and if there is a correction, then that would mean a return to a normative state. He’s construction of HIV infection as a genetic risk can be read as a placeholder for numerous questions to human germline editing: What are the variations that are “valuable” enough for a change in germline? For instance, there are plans by Denis Rebrikov in Russia to genome edit embryos to ‘fix’ a mutation that causes congenital deafness (Cyranoski 2019 ). If legalized, what would be the limits applied and who would be able to afford such techniques? At a time when genomics research into human sociality is booming, would the currently produced knowledge in this field and others translate into ‘corrective’ genome-editing? Who would decide?

The science, in itself, is still unclear at this stage: for many complex conditions, using gene editing to change one allele to another often has a minuscule effect, considering that numerous alleles altogether may affect a phenotype, while at the same time a single allele may affect multiple phenotypes. In another GWAS case, social genomicists claim there are thousands of variations that are found to be influential for a particular social outcome such as educational attainment (Lee et al. 2018), with each having a minimal effect. It has also been shown in the last few years that, as the same study is conducted with ever larger samples, more genomic variants are associated with the social outcome: 74 single nucleotide polymorphisms (SNPs) associated with the outcome in a sample size of 293,723 (Okbay et al. 2016) and 1271 SNPs associated with the outcome in a sample size of 1.1 million individuals (Lee et al. 2018).
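A back-of-the-envelope power calculation, sketched below with illustrative numbers rather than those of the cited studies, makes the same point: at the conventional genome-wide significance threshold of p < 5e-8, the power to detect a variant that explains a tiny fraction of the variance rises steeply between a sample of roughly 300,000 and one of 1.1 million, so many more variants cross the threshold in the larger study.

```python
# Back-of-the-envelope sketch (illustrative numbers only): power to detect a
# single variant explaining a tiny fraction of trait variance at genome-wide
# significance, as a function of GWAS sample size.
from scipy.stats import norm

ALPHA = 5e-8                        # conventional genome-wide significance threshold
z_crit = norm.isf(ALPHA / 2)        # two-sided critical value (about 5.45)

def power(r2, n):
    """Approximate power of a single-SNP association test.

    r2: fraction of trait variance explained by the SNP (e.g. 0.0001 = 0.01%)
    n:  sample size; the test statistic is roughly Normal with mean sqrt(n * r2).
    """
    ncp = (n * r2) ** 0.5           # expected z-score (non-centrality)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

for n in (300_000, 1_100_000):
    print(f"n = {n:>9,}: power for a SNP explaining 0.01% of variance ~ {power(1e-4, n):.2f}")
```

Under these invented numbers, power roughly doubles from about one half to near certainty, which is enough to turn many borderline associations into reported hits.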

Applying this reasoning to the GWAS on same-sex sexual behaviour, it is highly probable that the findings will be superseded in the following years with similar studies of bigger data, increasing the number of associations.

A genomic re-thinking?

The examples outlined here have served to show how focusing the discussion on “genetic determinism” is fruitless considering the complexity of the knowledge production practices and how the produced knowledge could both mirror social dynamics and shape these further. Genomic rethinking of the social necessitates a new formulation of social equality, where genomes are also relevant. Within the work of social genomics researchers, there has been cautious optimism toward the contribution of findings from genomics research to understanding social outcomes of policy change (Conley and Fletcher 2018 ; Lehrer and Ding 2019 ). Two fundamental thoughts govern this thinking. First, genetic basis is not to be equalized with fate; in other words, ‘genetic predispositions’ make sense only within the broader social and physical environmental frame, which often allows room for intervention. Second, genetics often relates to heterogeneity of the individuals within a population, in ways that the same policy may be positive, neutral or negative for different individuals due to their genes. In this respect, knowledge gained via social genomics may be imagined as a basis for a more equal society in ‘uncovering’ invisible variables, while, paradoxically, it may also be a justification for exclusion of certain groups. For example, a case that has initially raised the possibility that policies affect individuals differently because of their genetic background was a genetic variant that was correlated to being unaffected by tax increases on tobacco (Fletcher 2012 ). The study suggested that raising the taxes may be an ineffective tool for lowering smoking rates below a certain level, since those who are continuing to smoke may be those who cannot easily stop due to their genetic predisposition to smoking. Similar ideas could also apply to a diverse array of knowledge produced in social genomics, where the policies may be under scrutiny according to how they are claimed to variably influence the members of a society due to their genetics.

Datafication of scientific knowledge production

From theory to data-driven science

More than a decade has gone by since Savage and Burrows ( 2007 ) described a crisis in empirical research, where the well-developed methodologies for collecting data about the social world would become marginal as such data are being increasingly generated and collected as a by-product of daily virtual transactions. Today, sociological research faces a widely datafied world, where (big) data analytics are profoundly changing the paradigm of knowledge production, as Facebook, Twitter, Google and others produce large amounts of socially relevant data. A similar phenomenon is taking place through opportunities that public and private biobanks, such as UK Biobank or 23andMe, offer. Crossing the boundaries of social sciences and biological sciences is facilitated through mapping correlations between genomic data, and data on social behaviour or outcomes.

This shift from theory to data-driven science misleadingly implies a purely inductive knowledge production, neglecting the fact that data is not produced free of preceding theoretical framing, methodological decisions, technological conditions and the interpretation of correlations—i.e. an assemblage situated within a specific place, time, political regime and cultural context (Kitchin 2014a ). It glosses over the fact that data cannot simply be treated as raw materials, but rather as “inherently partial, selective and representative”, the collection of which has consequences (Kitchin 2014b , p. 3). How knowledge of the body is generated starts with how data is produced and how it is used and mobilized. Through sequencing, biological samples are translated into digital data that are circulated and merged and correlated with other data. With the translation from genes into data, their meaning also changes (Saukko 2017 ). The kind of knowledge that is produced is also not free of scientific and societal concepts.

Individually assigned categorical variables to genomes have become important for genomic research and are impacting the ways in which identities are conceptualized under (social) genomic conditions. These characteristics include those of social identity, such as gender, ethnicity, educational and socioeconomic status. They are often used for the study of human genetic variation and individual differences with the aim to advance personalized medicine and based on demographic and ascribed social characteristics.

The sexual orientation study that is central to this paper can be read as a case where such categories intersect with the mode of knowledge production. UK Biobank was the largest contributor of data to the study, and the data used in this research are revealing: they are based on answers to the question “Have you ever had sexual intercourse with someone of the same sex?”, accompanied by the statement “Sexual intercourse includes vaginal, oral or anal intercourse.” Footnote 8.

Furthermore, the authors acknowledge having made numerous reductive assumptions and that their study has methodological limitations. For instance, Ganna et al. (2019b) acknowledge both within the article (p. 1) and on an accompanying website Footnote 9 that the research is based on a binary ‘sex’ system that excludes non-conforming groups, as the authors report that they “dropped individuals from [the] study whose biological sex and self-identified sex/gender did not match” (p. 2). However, both categorizing sexual orientation mainly on the basis of practice rather than attraction or desire, and building it on normative assumptions about sexuality, i.e. gender binary and heteronormativity, are problematic, as sexual behaviour is diverse and does not necessarily correspond with such assumptions.

The variations found in the sexual orientation study, as in other genome-wide association studies, are relevant mainly for the populations studied, which in this case largely belong to certain age groups and European ancestry. While the study deflects critique by stating that it concerns the genetics not of sexual orientation but of same-sex sexual behaviour, whether such a genomic study would even be possible is itself questionable. This example demonstrates that, despite the increasing influence of big data, a fundamental problem with the datafication of many social phenomena is whether or not they are amenable to measurement. In the case of sexual orientation, whether the answers to the sexual orientation questions correspond to “homosexuality” or rather to “willingness to reveal homosexuality”/“stated sexual orientation” is debatable, considering the social pressure and stigma that may be at play in certain social contexts (Conley 2009, p. 242).

While our aim is to bring a social scientific perspective, biologists have raised at least two different critical points about the knowledge production practice in the sexual orientation study: first, on the implications of the produced knowledge Footnote 10 and, second, on the problems and flaws of the search for a genetic basis Footnote 11. In STS, genetic differences that were hypothesized to be relevant for health, especially under the category of race in the US, have been a major point of discussion within the genomic ‘inclusion’ debates of the 1990s (Reardon 2017, p. 49; Bliss 2015). In other words, a point of criticism towards knowledge production was the focus on certain “racial” or racialized groups, such as Americans of European ancestry, which supposedly biased the findings and the downstream development of therapies for ‘other’ groups. However, measuring health and medical conditions against the background of groups that are constituted on the basis of social or cultural categories (e.g. age, gender, ethnicity) may also result in a reinscription/reconstitution of the social inequalities attached to these categories (Prainsack 2015) and, at the same time, in health justice becoming a topic seen through a postgenomic lens, where postgenomics is “a frontline weapon against inequality” (Bliss 2015, p. 175). Socio-economic factors may recede into the background, while data, with their own often invisible politics, are foregrounded.

Unlike what Savage and Burrows suggested in 2007, the coming crisis can be seen not only as a crisis of sociology, but of science in general. The shift of focus in the social sciences towards digital data is only one part of the picture; another part is the ongoing genomisation of the social. Considering that censuses and large-scale statistics are not new, what distinguishes the current phenomenon is possibly the opportunity to individualize the data, while the categories themselves are often unable to capture the complexity, despite producing knowledge more efficiently. In that sense, the above-mentioned survey questions do not do justice to the complexity of social behaviour. What is most important to flag within these transformations is the lack of reflexivity regarding how big data comes to represent the world and whether it adds to and/or takes away from the ways of knowing that preceded big data. These developments and directions of genetic-based research and big data go far beyond the struggle of a single discipline, namely sociology, with a paradigm shift in empirical research. They could set the stage for real consequences for individuals and groups. Just as defining an editable condition happens as a social process that relies on socio-political categories, the knowledge acquired from big data relies in a similar way on the same kind of categories.

The data choices and restrictions: ‘Free from theory’ or freedom of choice

Data, broadly understood, have become a fundamental part of our lives, from accepting and granting different kinds of consent for our data to travel on the internet, to gaining the ‘right to be forgotten’ in certain countries, as well as being able to retrieve collected information about ourselves from states, websites, even supermarket chains. While becoming part of our lives, the data collected about individuals in the form of big data are transferred between academic and non-academic research, and between scientific and commercial enterprises. The associated changes in knowledge production have important consequences for the ways in which we understand and live in the world (Jasanoff 2004). The co-productionist perspective in this sense does not relate to whether or how the social and the biological are co-produced; rather, it points to how the knowledge produced in science is both shaped by and shaping societies. Thus, the increasing impact and authority of big data in general, and within the sexual orientation study in focus here, opens up new avenues to claim, as some suggest, that we have reached the end of theory.

The “end of theory” has been actively debated within and beyond science. Kitchin (2014a) locates the recent origin of this debate in a piece in Wired, where the author states “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson 2008). Others call this a paradigm shift towards data-intensive research that leaves behind the empirical and theoretical stages (Gray 2009, p. xviii). While Google and others form the basis for this data-driven understanding through their predictive capacity or by letting the data speak, the idea that knowledge production is ‘free from theory’ in this case seems, at best, to ignore the data infrastructure and how categories are formed within it.

Taking a deeper look at the same-sex sexual behaviour study from this angle suggests that such research cannot be free from theory, as it has to make an assumption regarding the role of genetics in the context of social dynamics. In other words, it has to move sexual orientation, at least partially in the form of same-sex sexual behaviour, out of the domain of the social and towards the biological. In doing so, even as the study concludes that sexual orientation is complex, the authors note in an informative video Footnote 12 on their website that “they found that about a third of the differences between people in their sexual behaviour could be explained by inherited genetic factors. But the environment also plays a large role in shaping these differences.” While the study points to a minuscule component of the biological, it also frames biology as the basis upon which the social, as part of the environment, acts.

Reconsidering how the biological and the social are represented in the study, three theoretical choices are made due to the limitations of the data. First of all, the biological is taken to be “the genome-wide data” in the biobanks that the study relies on. This means sexual orientation is assumed to reside within the SNPs, points on the genome that are common variations across a population, and not in other kinds of variation that are rare or not captured by the genotyped SNPs. These differences include, but are not limited to, large- and small-scale duplications and deletions of genomic regions, as well as rare variants or even common variants in the population that the SNP chips do not capture. Such ignored differences are very important for a number of conditions, from cancer to neurobiology. Similarly, the genomic focus leaves aside the epigenetic factors that could theoretically be the missing link between genomes and environments. In noting this, we do not suggest that the authors of the study are unaware of or uninterested in epigenetics; however, regardless of their interest and/or knowledge, the availability of large-scale genome-wide data puts such data ahead of any other variation in the genome and epigenome. In other words, if the UK Biobank and 23andMe had similar amounts of epigenomic or whole-genome data beyond the SNPs, the study would most probably have relied on those other variations. The search for a genetic basis within SNPs is a theoretical choice, and in this case the choice is pre-determined by the limitations of the data infrastructures.

The second choice the authors make is to treat three survey questions, in the case of the UK Biobank data, as capturing enough of the complexity of sexual orientation for their research. As partly discussed earlier, these questions simply ask about sexual behaviour. Based on the UK Biobank’s definition of sexual intercourse as “vaginal, oral or anal intercourse”, the answers to the following questions were relevant for the research: “Have you ever had sexual intercourse with someone of the same sex?” (Data-Field 2159), “How many sexual partners of the same sex have you had in your lifetime?” (Data-Field 3669), and “About how many sexual partners have you had in your lifetime?” (Data-Field 2149). Answers to such questions do little justice to the complexity of the topic. Considering that they were not included in the biobank for the purpose of identifying a genetic basis of same-sex sexual behaviour, in what capacity they are useful for that purpose deserves scrutiny. It is worth noting here that the UK Biobank is primarily focused on health-related research, and thus these three survey questions could not have been asked with a genomic exploration of ‘same-sex sexual behaviour’ or ‘sexual orientation’ in mind. How successfully they can be used to identify the genetic basis of complex social behaviours is questionable.

The authors of the study consider the UK Biobank sample to comprise relatively old individuals and regard this as a shortcoming Footnote 13. Similarly, the study authors note that the 23andMe sample may be biased because “[i]ndividuals who engage in same-sex sexual behaviour may be more likely to self-select the sexual orientation survey”, which would explain the high percentage of such individuals (18.9%) (Ganna et al. 2019b, p. 1). However, the authors do not problematize the at least three-fold difference between the youngest and oldest generations in the UK Biobank sample in their responses to the same-sex sexual behaviour question (Ganna et al. 2019b, p. 2). The study thus highlights the problematic issue of who should be regarded as a representative sample to be asked about their “same-sex sexual behaviour”. Still, this is a data choice that the authors make in drawing a universal explanation from a very specific and socially constrained collection of self-reported data that encompasses only part of what the researchers are interested in.

The third choice is a choice unmade. The study data came mainly from UK Biobank, following a proposal by Brendan Zietsch titled “Direct test whether genetic factors predisposing to homosexuality increase mating success in heterosexuals” Footnote 14. The original research plan frames “homosexuality” as a condition that heterosexuals can be “predisposed” to; since this condition has not been eliminated through evolution, the scientists hypothesize that whatever genetic variation predisposes an individual to homosexuality may also function to increase the individual’s reproductive capacity. Despite using this evolutionary explanation as the theoretical basis for obtaining the data from the UK Biobank, the authors use the terms evolution/evolutionary only three times in the article, whereas the concept of “mating success” is entirely absent. Contrary to the expectation in the research proposal, the authors observe a lower number of offspring for individuals reporting same-sex sexual behaviour, and they conclude briefly: “This reproductive deficit raises questions about the evolutionary maintenance of the trait, but we do not address these here” (Ganna et al. 2019b, p. 2). In other words, the hypothesis that allowed the scientists to acquire the UK Biobank data becomes irrelevant for the researchers when they report their findings.

In this section, we have analysed how data choices are made at different steps of the research and hinted at how these choices reflect certain understandings of how society functions. These are evident in the ways sexual behaviour is represented and categorized according to quantitative data, and in the considerations of whether certain samples are contemporary enough (UK Biobank) or too self-selecting (the proportion reporting same-sex sexual behaviour being too high in 23andMe). The study, however, does not problematize how the percentage of individuals reporting same-sex sexual behaviour steadily increases with year of birth, at least tripling for males and increasing more than five-fold for females between 1940 and 1970 (for UK Biobank). Such details are among the data that the authors display as descriptive statistics in Fig. 1 (Ganna et al. 2019b, p. 2); however, they do not attract the discussion that the genomic data receive. The study itself starts from the idea that genetic markers associated with same-sex sexual behaviour could have an evolutionary advantage and ends by saying the behaviour is complex. Critics claim the “approach [of the study] implies that it is acceptable to issue claims of genetic drivers of behaviours and then lay the burden of proof on social scientists to perform post-hoc socio-cultural analysis” (Richardson et al. 2019, p. 1461).

In this paper, we have ‘moved back to the future’, taking stock of the present-day accelerated impact of big data and of its potential and real consequences. Using the sexual orientation GWAS as a point of reference, we have shown that claims to be working under the premise of a ‘pure science’ of genomics are untenable, as the social is present by default: within the methodological choices made by the researchers, in the impact on and of the social imaginary, and in the epigenetic context.

By focusing on the contingency of knowledge production on social categories that are themselves reflections of the social in data practices, we have highlighted the relational processes at the root of knowledge production. We are experiencing a period in which the repertoire of what gets quantified continuously, and possibly exponentially, increases; however, this does not necessarily mean that our understanding of complexity increases at the same rate. Rather, it may lead to unintended simplification, where meaningful levels of understanding of causality are lost in the “triumph of correlations” in big data (Mayer-Schönberger and Cukier 2013; cited in Leonelli 2014). While sociology has much to offer through its qualitative roots, we think it should do more than critique, especially considering that culturally and temporally specific understandings of the social are also linked to socio-material consequences.

We want to highlight that now is the time to think about the broader developments in science and society, not merely from an external perspective, but within a new framework. Clearly, our discussion of a single case cannot sustain suggestions for a comprehensive framework applicable to any study; however, we can flag the urgency of the need for one. We have shown that, in the context of the rapid developments within big data-driven and socio-genomic research, it is necessary to renew the argument for bringing the social, and its interrelatedness with the biological, clearly back into focus. We strongly believe that re-emphasizing this argument is essential to underline the analytical strength of the social science perspective and to avoid losing sight of the complexity of social phenomena, which risk being oversimplified in mainly statistical, data-driven science.

We can also identify three interrelated dimensions of scientific practice that the framework would valorize: (1) recognition of the contingency of choices made within the research process, and sensitivity to their consequent impact within the social context; (2) ethical responsibilities that move beyond procedural, contractual requirements towards sustaining a process rooted in a clear understanding of societal environments; and (3) interdisciplinarity in analytical practice that potentiates the impact of each perspectival lens.

Such a framework would facilitate moving out of the disciplinary or institutionalized silos of ELSI, STS, sociology, genetics, or even emerging social genomics. Rather than competing for authority on ‘the social’, the aim should be to critically complement each other and refract the produced knowledge with a multiplicity of lenses. Zooming ‘back to the future’ within the field of socio-biomedical science, we would flag the necessity of re-calibrating to a multi-perspectival endeavour—one that does justice to the complex interplay of social and biological processes within which knowledge is produced.

The GWAS primarily uses the term “same-sex sexual behaviour” as one facet of “sexual orientation”, where the former becomes the component that is directly associable with the genes and the latter the broader phenomenon of interest. Thus, while the article refers to “same-sex sexual behaviour” in its title, it is editorially presented in the same Science issue under the Human Genetics heading with the subheading “The genetics of sexual orientation” (p. 880) (see Funk 2019). Furthermore, the request for data from UK Biobank by the corresponding author Brendan P. Zietsch (see footnote 14) refers only to sexual orientation and homosexuality, not to same-sex sexual behaviour. Therefore, we follow the same interchangeable use in this article.

Source: https://osf.io/xwfe8 (04.03.2020).

Source: https://www.wsj.com/articles/research-finds-genetic-links-to-same-sex-behavior-11567101661 (04.03.2020).

Source: https://geneticsexbehavior.info (04.03.2020).

In addition to footnotes 10 and 11, for a discussion please see: https://www.nytimes.com/2019/08/29/science/gay-gene-sex.html (04.03.2020).

Later “122 Shades of Grey”: https://www.geneplaza.com/app-store/72/preview (04.03.2020).

Source: https://www.youtube.com/watch?v=th0vnOmFltc (04.03.2020).

Source: http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2159 (04.03.2020).

Source: https://geneticsexbehavior.info/ (04.03.2020).

Source: https://www.broadinstitute.org/blog/opinion-big-data-scientists-must-be-ethicists-too (04.03.2020).

Source: https://medium.com/@cecilejanssens/study-finds-no-gay-gene-was-there-one-to-find-ce5321c87005 (03.03.2020).

Source: https://videos.files.wordpress.com/2AVNyj7B/gosb_subt-4_dvd.mp4 (04.03.2020).

Source: https://geneticsexbehavior.info/what-we-found/ (04.03.2020).

Source: https://www.ukbiobank.ac.uk/2017/04/direct-test-whether-genetic-factors-predisposing-to-homosexuality-increase-mating-success-in-heterosexuals/ (04.03.2020).

Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR (2019) Genetic correlates of social stratification in Great Britain. Nat Hum Behav 1–21. https://doi.org/10.1038/s41562-019-0757-5

Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired https://www.wired.com/2008/06/pb-theory/ . Accessed 31 Mar 2020

Bliss C (2015) Defining health justice in the postgenomic era. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 174–191

Bourdieu P (2001) Masculine domination. Stanford University Press, Stanford

Bourdieu P (2010) Distinction: a social critique of the judgement of taste. Routledge, London/New York

Bowker GC, Star SL (2000) Sorting things out: classification and its consequences. MIT Press, Cambridge/London

Boysen GA, Vogel DL (2007) Biased assimilation and attitude polarization in response to learning about biological explanations of homosexuality. Sex Roles 57(9–10):755–762. https://doi.org/10.1007/s11199-007-9256-7

Butler J (1990) Gender trouble. Feminism and the subversion of identity. Routledge, New York

Clarke AE, Shim JK, Shostak S, Nelson A (2013) Biomedicalising genetic health, diseases and identities. In: Atkinson P, Glasner P, Lock M (eds) Handbook of genetics and society: mapping the new genomic era. Routledge, Oxon, pp. 21–40

Conley D (2009) The promise and challenges of incorporating genetic data into longitudinal social science surveys and research. Biodemogr Soc Biol 55(2):238–251. https://doi.org/10.1080/19485560903415807

Conley D, Fletcher J (2018) The genome factor: what the social genomics revolution reveals about ourselves, our history, and the future. Princeton University Press, Princeton/Oxford

Connell RW (2005) Masculinities. Polity, Cambridge

Conrad P (1999) A mirage of genes. Sociol Health Illn 21(2):228–241. https://doi.org/10.1111/1467-9566.00151

Conrad P, Markens S (2001) Constructing the ‘gay gene’ in the news: optimism and skepticism in the US and British press. Health 5(3):373–400. https://doi.org/10.1177/136345930100500306

Crenshaw K (1989) Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, vol 1989(8). University of Chicago Legal Forum. http://chicagounbound.uchicago.edu/uclf/vol1989/iss1/8 . Accessed 1 Apr 2020

Cyranoski D (2019) Russian ‘CRISPR-baby’ scientist has started editing genes in human eggs with goal of altering deaf gene. Nature 574(7779):465–466. https://doi.org/10.1038/d41586-019-03018-0

Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol Bull 137(5):800–818. https://doi.org/10.1037/a0021860

Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet 50(10):1412–1425. https://doi.org/10.1038/s41588-018-0205-x

Fausto-Sterling A (2007) Frameworks of desire. Daedalus 136(2):47–57. https://doi.org/10.1162/daed.2007.136.2.47

Fletcher JM (2012) Why have tobacco control policies stalled? Using genetic moderation to examine policy impacts. PLoS ONE 7(12):e50576. https://doi.org/10.1371/journal.pone.0050576

Foucault M (1998) The history of sexuality 1: the will to knowledge. Penguin Books, London

Foucault M (2003) The birth of the clinic. Routledge, London/New York

Foucault M (2005) The order of things. Routledge, London/New York

Fox Keller E (2014) From gene action to reactive genomes. J Physiol 592(11):2423–2429. https://doi.org/10.1113/jphysiol.2014.270991

Fox Keller E (2015) The postgenomic genome. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 9–31

Funk M (2019) The genetics of sexual orientation. Science 365(6456):878–880. https://doi.org/10.1126/science.365.6456.878-k

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019a) Genome studies must account for history—response. Science 366(6472):1461–1462. https://doi.org/10.1126/science.aaz8941

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019b) Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365(6456):eaat7693. https://doi.org/10.1126/science.aat7693

Garfinkel H (1967) Studies in ethnomethodology. Polity Press, Cambridge

Gray J (2009) Jim Gray on eScience: a transformed scientific method. In: Hey T, Tansley S, Tolle KM (eds) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, pp. xvii–xxxi

Griffiths DA (2016) Queer genes: realism, sexuality and science. J Crit Realism 15(5):511–529. https://doi.org/10.1080/14767430.2016.1210872

Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AM (1993) A linkage between DNA markers on the X chromosome and male sexual orientation. Science 261(5119):321–327. https://doi.org/10.1126/science.8332896

Haraway D (1988) Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem Stud 14(3):575–599

Holm S, Ploug T (2019) Genome studies reveal flaws in broad consent. Science 366(6472):1460–1461. https://doi.org/10.1126/science.aaz3797

Howard HC, van El CG, Forzano F, Radojkovic D, Rial-Sebbag E, de Wert G, Borry P, Cornel MC (2018) One small edit for humans, one giant edit for humankind? Points and questions to consider for a responsible way forward for gene editing in humans. Eur J Hum Genet 26(1):1. https://doi.org/10.1038/s41431-017-0024-z

Jansen PR, Watanabe K, Stringer S, Skene N, Bryois J, Hammerschlag AR, de Leeuw CA, Benjamins JS, Muñoz-Manchado AB, Nagel M, Savage JE, Tiemeier H, White T, Agee M, Alipanahi B, Auton A, Bell RK, Bryc K, Elson SL, Fontanillas P, Furlotte NA, Hinds DA, Huber KE, Kleinman A, Litterman NK, McCreight JC, McIntyre MH, Mountain JL, Noblin ES, Northover CAM, Pitts SJ, Sathirapongsasuti JF, Sazonova OV, Shelton JF, Shringarpure S, Tian C, Wilson CH, Tung JY, Hinds DA, Vacic V, Wang X, Sullivan PF, van der Sluis S, Polderman TJC, Smit AB, Hjerling-Leffler J, Van Someren EJW, Posthuma D, the 23andMe Research Team (2019) Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat Genet 51(3):394–403. https://doi.org/10.1038/s41588-018-0333-3

Jasanoff S (2004) The idiom of co-production. In: Jasanoff S (ed.) States of knowledge: the co-production of science and social order. Routledge, London, p 1–12

Jasanoff S, Hurlbut JB (2018) A global observatory for gene editing. Nature 555:435–437. https://doi.org/10.1038/d41586-018-03270-w

Kessler SJ, McKenna W (1978) Gender: an ethnomethodological approach. John Wiley & Sons, New York

Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data Soc. https://doi.org/10.1177/2053951714528481

Kitchin R (2014b) The data revolution. Big data, open data, data infrastructures and their consequences. Sage, London

Landecker H (2016) The social as signal in the body of chromatin. Sociol Rev 64(1_suppl):79–99. https://doi.org/10.1111/2059-7932.12014

Landecker H, Panofsky A (2013) From social structure to gene regulation, and back: a critical introduction to environmental epigenetics for sociology. Annu Rev Sociol 39:333–357. https://doi.org/10.1146/annurev-soc-071312-145707

Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK (2018) Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nat Genet 50(8):1112. https://doi.org/10.1038/s41588-018-0147-3

Lehrer SF, Ding W (2019) Can social scientists use molecular genetic data to explain individual differences and inform public policy? In: Foster G (ed.) Biophysical measurement in experimental social science research. Academic Press, London/San Diego/Cambridge/Oxford, pp. 225–265

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data Soc. https://doi.org/10.1177/2053951714534395

Lorber J (1994) Paradoxes of gender. Yale University Press, New Haven

Martin P, Morrison M, Turkmendag I, Nerlich B, McMahon A, de Saille S, Bartlett A (2020) Genome editing: the dynamics of continuity, convergence, and change in the engineering of life. New Genet Soc 39(2):219–242. https://doi.org/10.1080/14636778.2020.1730166

Maxmen A (2019) Controversial ‘gay gene’ app provokes fears of a genetic Wild West. Nature 574(7780):609–610. https://doi.org/10.1038/d41586-019-03282-0

Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, Boston/New York

Meloni M (2014) Biology without biologism: social theory in a postgenomic age. Sociology 48(4):731–746. https://doi.org/10.1177/0038038513501944

Meloni M (2016) Political biology: Science and social values in human heredity from eugenics to epigenetics. Palgrave Macmillan, n.p.p

Mitchell RW, Dezarn L (2014) Does knowing why someone is gay influence tolerance? Genetic, environmental, choice, and “reparative” explanations. Sex Cult 18(4):994–1009. https://doi.org/10.1007/s12119-014-9233-6

Morrison M, de Saille S (2019) CRISPR in context: towards a socially responsible debate on embryo editing. Palgrave Commun 5(1):1–9. https://doi.org/10.1057/s41599-019-0319-5

Nelkin D, Lindee MS (2004) The DNA mystique: the gene as a cultural icon. University of Michigan Press, Ann Arbor

Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc 29(4):485–513. https://doi.org/10.1080/03085140050174750

O’Riordan K (2012) The life of the gay gene: from hypothetical genetic marker to social reality. J Sex Res 49(4):362–368. https://doi.org/10.1080/00224499.2012.663420

Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW (2016) Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533(7604):539–542. https://doi.org/10.1038/nature17671

Parry B, Greenhough B (2017) Bioinformation. Polity Press, Cambridge

Parsons T (1951) The social system. Free Press, New York

Prainsack B (2015) Is personalized medicine different? (Reinscription: the sequel) A response to Troy Duster. Br J Sociol 66(1):28–35. https://doi.org/10.1111/1468-4446.12117

Reardon J (2017) The postgenomic condition: ethics, justice, and knowledge after the genome. University of Chicago Press, Chicago/London

Rehmann-Sutter C, Mahr D (2016) The lived genome. In: Whitehead A, Woods A (eds) Edinburgh companion to the critical medical humanities. Edinburgh University Press, Edinburgh, pp. 87–103

Richardson SS, Borsa A, Boulicault M, Galka J, Ghosh N, Gompers A, Noll NE, Perret M, Reiches MW, Sandoval JCB (2019) Genome studies must account for history. Science 366(6472):1461. https://doi.org/10.1126/science.aaz6594

Ruckenstein M, Schüll ND (2017) The datafication of health. Annu Rev Anthropol 46:261–278. https://doi.org/10.1146/annurev-anthro-102116-041244

Saukko P (2017) Shifting metaphors in direct-to-consumer genetic testing: from genes as information to genes as big data. New Genet Soc 36(3):296–313. https://doi.org/10.1080/14636778.2017.1354691

Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41(5):885–899. https://doi.org/10.1177/0038038507080443

Shostak S, Conrad P, Horwitz AV (2008) Sequencing and its consequences: path dependence and the relationships between genetics and medicalization. Am J Sociol 114(S1):S287–S316. https://doi.org/10.1086/595570

Thomas WJ, Thomas DS (1929) The child in America. Behavior problems and programs. Knopf, New York

West C, Zimmerman DH (1991) Doing gender. In: Lorber J, Farrell SA (eds) The social construction of gender. Sage, Newbury Park/London, pp. 13–37

Acknowledgements

Open access funding provided by University of Vienna. The authors thank Brígida Riso for contributing to a previous version of this article.

Author information

These authors contributed equally: Melanie Goisauf, Kaya Akyüz, Gillian M. Martin.

Authors and Affiliations

Department of Science and Technology Studies, University of Vienna, Vienna, Austria

Melanie Goisauf & Kaya Akyüz

BBMRI-ERIC, Graz, Austria

Department of Sociology, University of Malta, Msida, Malta

Gillian M. Martin

Corresponding authors

Correspondence to Melanie Goisauf or Kaya Akyüz .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Goisauf, M., Akyüz, K. & Martin, G.M. Moving back to the future of big data-driven research: reflecting on the social in genomics. Humanit Soc Sci Commun 7 , 55 (2020). https://doi.org/10.1057/s41599-020-00544-5

Received : 15 November 2019

Accepted : 09 July 2020

Published : 04 August 2020

DOI : https://doi.org/10.1057/s41599-020-00544-5

  • Survey Paper
  • Open access
  • Published: 26 January 2021

A survey on data‐efficient algorithms in big data era

  • Amina Adadi   ORCID: orcid.org/0000-0002-9697-666X 1  

Journal of Big Data, volume 8, Article number: 24 (2021)

The leading approaches in Machine Learning are notoriously data-hungry. Unfortunately, many application domains do not have access to big data, because acquiring data involves a process that is expensive or time-consuming. This has triggered a serious debate in both the industrial and academic communities calling for more data-efficient models that harness the power of artificial learners while achieving good results with less training data and, in particular, less human supervision. In light of this debate, this work investigates the issue of algorithms’ data hungriness. First, it surveys the issue from different perspectives. Then, it presents a comprehensive review of existing data-efficient methods and systematizes them into four categories. Specifically, the survey covers solution strategies that handle data-efficiency by (i) using non-supervised algorithms that are, by nature, more data-efficient, (ii) creating artificially more data, (iii) transferring knowledge from rich-data domains into poor-data domains, or (iv) altering data-hungry algorithms to reduce their dependency on the amount of samples, so that they can perform well in the small-sample regime. Each strategy is extensively reviewed and discussed. In addition, the emphasis is put on how the four strategies interplay with each other in order to motivate the exploration of more robust and data-efficient algorithms. Finally, the survey delineates the limitations, discusses research challenges, and suggests future opportunities to advance research on data-efficiency in machine learning.

Introduction

Building machines that learn and think like humans is one of the core ambitions of Artificial Intelligence (AI) and of Machine Learning (ML) in particular. In the quest for this goal, artificial learners have made groundbreaking accomplishments in many domains spanning object recognition, image processing, speech recognition, medical information processing, robotics and control, bioinformatics, natural language processing (NLP), cybersecurity, and many others. Their success has captured attention beyond academia. In industry, many companies such as Google and Facebook have devoted active research efforts to exploring these technologies.

Ultimately, AI has managed to accelerate its pace towards human-level performance and even defeats humans in some fields. AlphaGo [ 1 ] defeats human champions in the ancient game of Go. The deep network ResNet [ 2 ] obtains better classification performance than humans on ImageNet. And, recently, Google launched Meena [ 3 ], a human-like AI chatbot that can conduct sensible and specific conversations like humans. However, there is another side to this coin: the impressive results achieved with modern ML (in particular by deep learning) are made possible largely by the use of huge datasets. For instance, DeepMind’s AlphaGo used more than 38 million positions to train its algorithm to play Go. The ImageNet database used by ResNet contains about 1.2 million labeled examples. And Meena has been trained on a massive 341 GB corpus, the equivalent of roughly 341,000 books, far more than most people read in a lifetime. This is obviously far from human-like learning. One thing that makes human learners so efficient is that we are active, strategic information-seekers. We learn continually and we make use of our previous experiences. So far, it is not clear how to replicate such abilities in artificial learners. In the big data era, algorithms continue to become more data-hungry, while in reality many application domains can often only use a few data points, because acquiring them involves a process that is expensive or time-consuming.

Consequently, many researchers and engineers have begun to recognize that the progress of ML is highly dependent on the availability of large numbers of input samples (generally with annotations). Without massive data, ML success is uncertain. Hence, key ML researchers have sounded a cautionary note regarding the data-hungry behavior of algorithms. In his controversial work “Deep Learning: A Critical Appraisal” [ 4 ], Marcus listed ten concerns about deep learning research, with data hungriness at the top of the list. He noted that “in problems where data are limited, deep learning often is not an ideal solution”. Data-hungriness was also included among the unsolved problems in AI research described in the book “Architects of Intelligence” by Martin Ford [ 5 ]. Most of the experts interviewed in this book call for more data-efficient algorithms. For instance, Oren Etzioni is quoted in the book as saying that “stepping stone [towards AGI] is that it’s very important that [AI] systems be a lot more data-efficient. So, how many examples do you need to learn from?” [ 5 , p. 502].

As such, our work extends the recent call for more research on data-efficient algorithms. In fact, we view these concerns as an opportunity to examine in depth what it means for a machine to learn efficiently like humans, what efforts have been deployed to alleviate data-hungriness, and what research avenues remain to be explored. Studying the data hungriness of ML algorithms is unfortunately a topic that has not yet received sufficient attention in the academic research community; nonetheless, it is of great importance and impact. Accordingly, the main aim of this survey is to stimulate research on this topic by providing interested researchers with a clear picture of the current research landscape.

Prior to this paper, we know of few works that attempted to investigate the issue of data-hungriness. Shu et al. [ 6 ] proposed a survey that covers learning methods for the small-sample regime, focusing on concept learning and experience learning. Wang et al. [ 7 ] surveyed few-shot learning, highlighting methods operating at the level of data, models, and algorithms. Qi et al. [ 8 ] discussed small-data challenges from unsupervised and semi-supervised learning perspectives and presented an up-to-date review of the progress in these two paradigms. As a matter of fact, existing surveys are limited in the way they approach the problem and in the scope they cover. In contrast, our work seeks comprehensiveness: we tackle the issue from an interdisciplinary perspective and discuss potential solutions from different backgrounds and horizons. In doing so, we bring under one roof different concepts that have never been discussed together and try to draw connections between them. Furthermore, as all AI players are concerned by the issue, while elaborating the survey we deliberately tried to make it accessible to the non-theoretician while still providing precise arguments for the specialist. In this respect, we make three main contributions:

We propose a comprehensive background regarding the causes, manifestations, and implications of data-hungry algorithms. For a good understanding of the issue, particular attention is given to the nature and the evolution of the data/algorithm relationship.

Based on an analysis of the literature, we provide an organized overview of the proposed techniques capable of alleviating data hunger. Through this overview, readers will understand how to expand limited datasets to take advantage of the capabilities of big data.

In our discussions, we identify many directions to accompany previous and potential future research. We hope it inspires more researchers to engage in the various topics discussed in the paper.

Accordingly, the remainder of the survey is organized as follows. “ Background ” section presents a preliminary background. “ Review ” section surveys existing solutions and organizes surveyed approaches according to four research strategies. “ Discussion ” section discusses research directions and open problems that we gathered and distilled from the literature review. Finally, “ Conclusion ” section concludes this survey.

Data and algorithms: a match made in heaven

Communicating, working, entertaining, traveling and other daily life activities perfectly exemplify the fruit of combining algorithms and data. Nevertheless, in order to dig into the complex nature of the link between these two concepts, we must look back to understand its germination. In this section, we provide an exhaustive investigation of the data/algorithm link. The link is developed in light of four perspectives: (i) by examining the historical trajectory followed by the two concepts, (ii) by drawing insights from biology, as the two concepts can be observed both in nature and in the digital realm, (iii) by studying the related technical and theoretical background, and (iv) by identifying the business motivations that feed this link.

Historical perspective

Because the historical evolution in form and content of the concepts of “Data” and “Algorithm” has a significant bearing on the issue, it is worth exploring the intellectual history of these two concepts in order to illustrate how they are intertwined. A curious fact about our data-obsessed era is that little is known about the origin and the evolution of this vital concept. Indeed, it is common to think of “big data”, “machine learning” and related technologies as relatively modern. Yet the roots of these domains, which gave rise to the spectacular advances we are witnessing today, are often not well known and have never been assembled in a single work to be studied or analysed.

Presumably, the word “data” began to be used around the 17th century [ 9 ]; it is derived from the Latin meaning “facts given or granted”. The concept of data as a given has been criticized by many social scientists [ 9 , 10 ], who claimed that the concept rather symbolizes “the facts taken or observed from nature by the scientist” and should instead be characterized as “capta”, which means taken and constructed. It was mainly the invention of Gutenberg’s printing press during the Renaissance that helped the rapid transmission of ideas and expanded access to knowledge. People then started to use “data” to refer to “facts given as the basis for calculation in mathematical problems”. The Age of Enlightenment ushered in many fields and disciplines, such as economics, biology, and political science, which accentuated the proliferation of data. Centuries of data sharing and scientific methodology led to the emergence of a new field, statistics, and data came to refer to “numerical facts collected for future reference”.

Meanwhile, the word “algorithm” derives directly from the name of the 9th-century mathematician al-Khwarizmi, the author of the oldest work of algebra, “al-Mukhtasar fi Hisâb al-jabr wa l-Muqabala” [ 11 ]. Beyond the etymological roots of the term, the concept of “algorithm” can be traced back to the third millennium BCE: the first examples of algorithms can be found in Babylonian tablets and in Egyptian scrolls [ 12 ]. In the 12th century, the book of al-Khwarizmi was translated from Arabic into Latin, and so the denary numeration system began to spread throughout medieval Europe under the name of “algorismus”. Around the 18th century the term “algorismus” became the modern “algorithm”. The use of the word has also evolved to include arithmetic procedures for solving problems or performing tasks [ 13 ].

A major series of achievements came during the industrial revolution (IR), with intense investment in technological innovations. This was a turning point for “data” and “algorithms”, which would henceforth be used synergistically to produce tools intended to substitute for human labour: machines [ 14 ]. Basically, data were fed to machines and algorithms guided their actions. During this period, the practice of capturing and recording data was developed through Jacquard’s loom and Hollerith’s tabulating machines. The punched cards used at the time were the primary data entry medium. Furthermore, a myriad of algorithms was developed to automate all sorts of human actions. The first algorithm meant to be executed on a machine was created by Ada Lovelace in 1843. In 1847, George Boole invented Boolean algebra, the basis for modern computer code. In 1888, Giuseppe Peano established the axiomatization of mathematics in a symbolic language.

The 20th century was the digital era par excellence. It marks the modern evolution of the concepts of “data” and “algorithm”. Driven by advances in electronics, the concept of “algorithm” was formalized in 1936 through Alan Turing’s Turing machines and Alonzo Church’s lambda calculus, which in turn formed the foundation of computer science [ 15 ]. Data took a digital form, referring to collections of binary elements, transmittable and storable, on which computer operations (implemented algorithms) are performed. This opened up the evolution of data modelling for databases [ 16 ] and of programming languages and paradigms [ 17 ], and, as foreseen by Moore [ 18 ], computing power exploded, resulting in sophisticated tools and methods to store and process data, and in complex algorithms with a higher level of computational prowess. The 1990s brought a real communication revolution with the arrival of the Web and the expansion of the Internet, which was as disruptive as Gutenberg’s printing press 600 years earlier but with a much larger impact and scale. Indeed, as more and more social networks started appearing and Web 2.0 took flight, the volume and speed of data creation increased with a velocity that had never been experienced before; data were literally everywhere. Starting from the early 2000s, these very large datasets came to be called big data [ 19 ].

Data by themselves are useless; to be useful, they must be processed [ 20 ]. While the developed infrastructures managed to store and retrieve massive data, traditional business intelligence tools started to show their limitations in terms of processing data with high velocity, volume, and variety. Thus, sophisticated algorithms with advanced capabilities for extracting knowledge from large amounts of data were needed. Artificial Intelligence offered a very interesting solution in this regard: machine learning algorithms.

Right from the beginning, the ultimate goal of machines was to be as intelligent as humans. Back in the 1950s, Alan Turing’s famous question “Can machines think?” [ 21 ] was a landmark that set the foundations of AI. It was followed by heavy investments in AI research and projects. Experts predicted that it would take only a few years to reach strong AI at the level of an average human being. Yet, unfortunately, these predictions turned out to be wildly off the mark. In the 1970s, after the Lighthill report [ 22 ] stated that AI had failed to live up to its promises and that it was too brittle to be used in practice, AI research took the back seat and remained an area of relative scientific obscurity and limited practical interest for decades. This period is known as the AI winter. During this time, ML research, as a subfield of AI, struggled too, with slow and modest achievements. It mainly focused on Artificial Neural Networks (ANN), which, despite their theoretical potential, had very limited practical feasibility due to the lack of available data and computational power. Indeed, in contrast to traditional algorithms, ML algorithms are not purpose-built programs for specific data. Instead, they are fed observation data that enable them to gradually learn how to solve problems by induction. So, they need large data to make accurate predictions. In other words, they were waiting for the big data era to flourish again. This indeed was the case: with the rise of big data and improvements in computing power, ANN made a comeback in the form of deep learning [ 23 ]. At the present time, in a quite mutualistic relationship, ML and big data are producing unprecedented and stunning success stories in diverse domains, and more are still ahead of us. Figure 1 illustrates the remarkable recent interest in the two fields using Google Trends.

In hindsight, it seems clear by now that the concepts of “algorithms” and “data” are deeply rooted in history, not just a short-lived phenomenon. Figure 2 traces the growing interest in the two concepts over time, starting from the 18th century, using the Google Books Ngram Viewer. It seems also clear that these two concepts share a tangled causal chain of events. Surprisingly, few historical studies exist on the particular interaction between algorithms and data. This aspect is often neglected by contemporary scientists and researchers, who are more concerned with advances in the modern age. Here, we have attempted to connect the history of the two concepts, believing that the historical background can help us put facts in context and understand the source of the voracious appetite of contemporary algorithmic practice for massive data. As summarized in Fig. 3, during the prehistoric period, foundations and theories around the two concepts were established. Starting from the IR period, the fields related to the two concepts started to converge with the emergence of machines. In the course of events, both the data and the algorithm concepts have evolved: data in nature and size, and algorithms in complexity and intelligence. What brings them together is a common goal: reproducing human intelligence.

Figure 1: “Big Data” and “Machine Learning” search trends, 2004–2020. Data source: Google Trends (https://www.google.com/trends)

Figure 2: Interest evolution of the terms “Data” and “Algorithm” between the 18th and 20th centuries. Data source: Google Books Ngram Viewer (https://books.google.com/ngrams)

Figure 3: “Data” and “Algorithms” evolution with key milestones.

Biological perspective

From a biological standpoint, data and algorithms have been around since the beginning of time and existed well before a special word was coined to describe them. After all, data is a physical concept; as Landauer noted in his paper [ 24 ], “Information is not a disembodied abstract entity; it is always tied to a physical representation. It is represented by engraving on a stone tablet, a spin, a charge, a hole in a punched card, a mark on paper, or some other equivalent”. Following this line of thinking, many theorists support the idea of data being the essential unit of the physical universe [ 25 – 27 ]. This was famously encapsulated by the physicist John Wheeler in his pithy slogan “It from Bit” [ 27 ], meaning that every aspect of a particle can be expressed as data and put into binary code, which makes, in Wheeler’s view, “everything is data (information)” [ 27 ]. Indeed, many examples in nature depict the world as an entity capable of encoding data. The DNA molecule encodes biological data about all known organisms. The retina encodes visual data seen through the eyes. And fingerprints encode biometric data that uniquely identify a natural person.

The same line of research advanced the so-called “computational theory of the universe”. Lloyd [ 28 ] argued that “the computational paradigm for the universe supplements the ordinary mechanistic paradigm: the universe is not just a machine; it is a machine that processes information. The universe computes”. Referring back to Wheeler’s ‘it from bit’ view [ 27 ], every process in the universe can be reduced to interactions between particles that produce binary answers: yes or no. That means nature, at its most fundamental level, is simply the flipping of binary digits or bits, just like algorithms do. The idea of the universe being a computer might seem to be only a metaphor. Metaphors usually reflect the most advanced thinking of the era that spawned them, and computers are the defining machines of our era, so it seems natural to draw a parallel between the universe and the computer. Lloyd [ 28 ], however, argued in defense of the theory that the computing universe stems from facts of mathematics and physics: Maxwell, Boltzmann, and Gibbs showed that all atoms register and process information, long before computers arrived. Aristotle also discussed the physics of the computing universe and its implications thousands of years ago in his “beyond the physical” book [ 29 ]. Assuming that the universe is a computational entity, the processing and interpretation of bits (data) naturally give rise to all sorts of complex order and structure seen in nature, which makes the laws of physics essentially algorithms that calculate and handle data [ 24 ]. How planets move in the solar system is an algorithm, how a spider spins its webs is an algorithm, and how a baby recognizes his mother’s face is also an algorithm.

In light of this, we can conclude that data, algorithms, and hence their interaction are shaping every biological organism and physical phenomenon in the world. They are concepts given by nature, not created by humans. In fact, what we are attempting to do is use these biologically inspired paradigms to create ever more intelligent technology. Nature teaches us that all tools and machines invented throughout human history are simply reverse-engineering the data processes that underlie biology, including those of our brain. Indeed, the most powerful information-processing system known has inspired much research that tries to mimic its functioning [ 30 , 31 , 32 , 33 ]. The most obvious example is ANNs which, as the name implies, try to learn tasks (to solve problems) by mimicking networks of biological neurons. Other natural and biological organisms have also been a source of inspiration for many algorithms [ 34 , 35 , 36 ]. Taking animals as an example, a large variety of bio-inspired algorithms that simulate biological processes or mimic the collective behavior of animals has been reported in the literature [ 34 ]: ant algorithms mimic the foraging behavior of social ants, bee-inspired algorithms the foraging behavior of honey bees, the bat algorithm the echolocation behavior of microbats, and the genetic algorithm biological evolution, to name a few (a minimal genetic-algorithm sketch follows below). In spite of the popularity and success of nature- and bio-inspired computation, researchers in the field warn against the growing gap between the original biological models and the man-made models [ 34 , 37 ]. Indeed, as reported by Molina et al. [ 34 ], a poor relationship is often found between the natural inspiration of an algorithm and its behavior. We are interested in one particular behavior: data consumption. Taking the previous example of ANNs, this family of learning algorithms is known for being exceptionally data-hungry; it needs many examples and experiences to learn. However, this is not the case for its biological counterpart. For instance, a self-driving car powered by a deep learning algorithm may need to crash into a tree thousands of times in virtual simulations before it figures out how not to do that, while a person can learn to drive a car in only 15 hours of training without crashing into anything. Visibly, nature is much less demanding in terms of data to learn. Therefore, the question here is why (learning) algorithms do not inherit the capacity to learn from a few experiences like their biological inspirations.
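
To make one of the named techniques concrete, the following is a minimal sketch of a genetic algorithm, written by us as an illustration rather than taken from the survey: it evolves bitstrings to maximize the number of 1s (the classic OneMax toy problem), and the population size, mutation rate and other parameters are arbitrary choices.

    import random

    # Illustrative only: a tiny genetic algorithm for the OneMax toy problem
    # (maximize the number of 1s in a bitstring). Parameter values are arbitrary.
    POP_SIZE, GENOME_LEN, GENERATIONS, MUTATION_RATE = 30, 20, 50, 0.02

    def fitness(genome):
        return sum(genome)  # count of 1s; higher is fitter

    def select(population):
        a, b = random.sample(population, 2)        # binary tournament selection
        return a if fitness(a) >= fitness(b) else b

    def crossover(p1, p2):
        cut = random.randint(1, GENOME_LEN - 1)    # single-point crossover
        return p1[:cut] + p2[cut:]

    def mutate(genome):
        return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in genome]

    population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population = [mutate(crossover(select(population), select(population)))
                      for _ in range(POP_SIZE)]
    best = max(population, key=fitness)
    print("best fitness:", fitness(best), "out of", GENOME_LEN)

Even this toy version shows the pattern alluded to above: the algorithm “learns” by generating and evaluating a large number of candidate solutions, i.e. by consuming many evaluations rather than learning from a handful of experiences.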

Technical perspective

In the introduction of his book “Machine Learning” [ 38 ], Tom Mitchell provides a short yet useful formalism for learning algorithms: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”.

This formalism is broad enough to include most tasks that we would conventionally call "learning" tasks. It puts emphasis on three core features that define a learning problem: (i) the learner's output (T), (ii) the measure of success (P), and (iii) the training data (E). Although considered vital in the learning process, Mitchell's definition of experience or training data is not concrete enough to explore the data-requirement aspect in a precise formal setting. Fortunately, learnability has been extensively theorized in a dedicated field called Computational Learning Theory (CLT) [39]. In a broad sense, CLT formally studies the fundamental principles of learning as a computational process; it seeks to formalize, at a precise mathematical level, learning efficiency in terms of both data usage and processing time. CLT proposes the Probably Approximately Correct (PAC) model of learning [40] to formally define the efficiency of learning algorithms (referred to as learners). More precisely, the PAC model allows us to analyze whether, and under what conditions, a learner L will probably output an approximately correct classifier.

Formally, the PAC setting is given by [40]:

Input data X .

Output values Y = {−1,+1} .

Training data \(Data = \{\langle x_i, c(x_i) = y_i\rangle\}_{i=1}^{m}\), with \(x_i \in X\), \(y_i \in Y\).

A set C of target concepts, where each c ∈ C is a function c : X → {0, 1}.

Instances are generated at random from X according to some probability distribution D. In general, D may be any distribution, and it is unknown to the learner; D is assumed to be stationary, i.e. it does not change over time.

A set H of possible hypotheses.

A learner L outputs some hypothesis h from H as a model of c .

Definition 1

The true error \(error_D(h)\) of a hypothesis h with respect to the target concept c and the probability distribution D is the probability that h misclassifies an instance drawn at random according to D: \(error_D(h) \equiv \Pr_{x \sim D}[c(x) \neq h(x)]\).

Definition 2

Consider a concept class C defined over a set of instances X of length n (n is the size of the instances, i.e. the size of their representation) and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, all ε with \(0 < \epsilon < \frac{1}{2}\), and all δ with \(0 < \delta < \frac{1}{2}\), learner L will, with probability at least 1 − δ (confidence), output a hypothesis h ∈ H such that \(error_D(h) \leq \epsilon\), in time that is polynomial in \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), and n.

The definition of PAC learnability contains two approximation parameters. The accuracy parameter ε determines how far the output classifier may be from the optimal one (this corresponds to the "approximately correct" part of "PAC"), while the confidence parameter δ indicates how likely the classifier is to meet that accuracy requirement (this corresponds to the "probably" part). In short, the goal of a PAC learner is to build, with high probability (1 − δ), a hypothesis that is approximately correct (error rate less than ε). Knowing that a concept class C is PAC-learnable allows us to bound the sample size necessary to probably learn an approximately correct classifier. Indeed, one of the fundamental questions in CLT is sample complexity, i.e. how much training data is required to achieve an arbitrarily small error with high probability. Valiant [40] proposed the following bound (Formula 1), for use with finite concept classes, which gives an upper bound on the required data as a function of the accuracy (ε) and confidence (δ) parameters:
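\(m \geq \frac{1}{\epsilon}\left(\ln\lvert H \rvert + \ln\frac{1}{\delta}\right)\)     (1)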

m is the amount of data needed to assure that any consistent hypothesis will be probably (with probability 1 − δ) approximately (within error ε) correct. We note that m grows linearly in 1/ε and logarithmically in 1/δ and |H|. This means that as ε gets smaller (i.e., as we want a more accurate hypothesis), we need more data; as there are more hypotheses in our hypothesis space, we also need more data; and likewise as the desired probability of obtaining an approximately correct learner grows. More plainly, as we consider more possible classifiers, or desire a lower error or a higher probability of correctness, we need more data. However, it is worth noting that there is only a logarithmic dependency on 1/δ, which means we can learn with an exponentially small probability of failure using only a polynomial number of training examples. There is also a logarithmic dependence on the number of hypotheses |H|, which means that even if there is an exponential number of hypotheses in our hypothesis space, we can still learn with a polynomial amount of data.
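As a quick illustration of how the bound scales, the short Python sketch below (the helper function and the chosen parameter values are ours, not from [40]) evaluates Formula 1 for a concrete setting: one million candidate hypotheses, a 5% target error, and 99% confidence.

import math

def pac_sample_bound(h_size, eps, delta):
    # Number of examples sufficient, by the bound above, for any hypothesis
    # consistent with the data to have true error <= eps with probability >= 1 - delta.
    return math.ceil((1.0 / eps) * (math.log(h_size) + math.log(1.0 / delta)))

print(pac_sample_bound(h_size=1_000_000, eps=0.05, delta=0.01))  # about 369 examples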

The theoretical role and influence of data in the learning process emphasized above has been supported by a large number of empirical studies [41, 42, 43] which, based on practical observations and experiments, have confirmed the premise that using more training data is necessary to improve performance. Figure 4 shows examples of learning curves from some of these works.

Figure 4. Examples of learning curves. a Learning curve of medical image classification using differing sample sizes [42]. b Number of misclassified whole-body medical images with an increasing number of training data sets [42]. c Algorithm performance with increasing training sample sizes and differing algorithms [41]. d Learning curve in the tumor versus non-tumor classification constructed using differing sample sizes [43]

As noted by Gibbons et al. [41] (Fig. 4c), although the performance of most learning algorithms grows as the data expand, not all algorithms are equally hungry. Indeed, some algorithms are hungrier than others, and there is often a direct correlation between complexity and hungriness: the most complex algorithms are usually the most demanding in terms of data (e.g. deep learning), while less complex algorithms usually do not require massive data to learn (e.g. decision trees) [44]. Another observation worth pointing out is that there is a saturation point which, as depicted in Fig. 4a, marks a certain stability in performance and beyond which more data do not improve the overall model.

Business perspective

If ML algorithms were products we bought in the supermarket, the packaging would surely be labeled: "Just add data and mix". From a business perspective, rather than being inherent, this dependency seems to derive naturally from the data-driven strategy empowering the modern economy. The declining costs of computing elements and the ever-rising amount of accessible data have brought us past a point of inflection that marks the transition to a new economic era, that of the data-driven economy (DDE) [45]. In the DDE, data is considered a strategic asset, and the success or failure of a firm now depends on how much data it has. An ever-expanding body of evidence points to the crucial role of data in economic growth. According to a Research and Markets report [46], the global big data market size will grow from USD 138.9 billion in 2020 to USD 229.4 billion by 2025, at a compound annual growth rate of 10.6% during the forecast period. Of the Fortune 1000 firms represented in the latest NewVantage Partners executive survey [47], 98.8% confirm an active investment in big data, and 64.8% of these leading firms are investing more than USD 50 million in big data and AI initiatives. The business benefits of data-driven strategies can be grouped into three broad axes:

Innovating better: Data is regarded as a relevant determinant both of the likelihood that a firm becomes a product innovator and of the market success of product innovations [48]. Insights obtained from data can be used to generate new innovative products, services, and processes, leading to improved innovation performance and thus to improved firm performance overall.

Understanding customers better: Data is revolutionizing the matching function between firms and customers. Data is generated primarily by customers and can be used to get closer to them, understand their behavior, and reflect their value co-creating actions. This benefits firms with regard to precision marketing, new product development, and realigning business strategy to maintain a sustainable competitive advantage [49].

Managing better: Another business benefit of data is realized through significant changes in management practices, most notably the emergence of data-driven decision making [50]. Data have led many managers to change how they make decisions, relying less on intuition and more on data. As Jim Barksdale, the former CEO of Netscape, quipped, "If we have data, let's look at data. If all we have are opinions, let's go with mine" [50]. According to scholars [50, 51], embracing data-driven decision making creates opportunities to make better decisions that lead to commercial growth, evolution, and an increased bottom line.

All these uses of data generate more business value, but also more data-hungry applications and practices. Digitally native firms whose business models naturally revolve around data (e.g. Google and Amazon) are the most innovative in this area. However, not all firms or businesses can afford to generate or collect massive data; such businesses need viable alternatives to compete in the AI race.

In the end, this multi-perspective study has served to clearly articulate the issue. Each perspective has brought its own valuable insights into the complementary and closely intertwined nature of the data/algorithms relationship. The main conclusion we can draw so far is that data and algorithms share a long history, and a tight relationship exists between the two concepts in nature; however, in stark contrast to their artificial counterparts, biological algorithms do not require massive data to learn. Formally, learning theory shows that the amount of data a learning algorithm needs depends on the targeted performance: if we want more performance, we need more data. Last, seeking more business value from data has resulted in a data-driven economy that lacks alternative algorithms able to learn even when only small data is available.

Data-hungry algorithms: an inconvenient truth

Against the presented backdrop, resolving the hungriness issue might appear straightforward. After all, we live in the data era: if algorithms need more data, we should feed them more data. While there might be value in this approach, whether it is the right one is debatable. Two main reasons make such an approach problematic: data scarcity and sustainable AI progress, which reflect, respectively, the industrial need for cost-effective learning and the academic ambition of Artificial General Intelligence (AGI).

The curse of scarcity

While it would be natural to think that almost every business or market is snowed under by an avalanche of data, this is far from the truth. Data are available to only a subset of companies; in many cases data are a scarce resource. In fact, in real-world settings, data are hard to acquire, and when they do openly exist, they often have questionable quality. A recent survey by Dimensional Research shows that 96% of enterprises encounter data quality and labeling challenges in ML projects [54]. Obtaining voluminous and accurately labeled data is challenging for several reasons. First, (i) the high cost of data collection and annotation: learning algorithms do not only crave massive samples, the data also have to be manually annotated beforehand, involving tasks as complex as making human-like judgments about images or videos, which implies significant cost, time, and effort. Crowdsourcing can be used to harness the crowd to annotate data directly and thus reduce human labor costs; the result, however, inevitably contains a large amount of low-quality annotations [55]. Second, (ii) limited domain expertise: to label general images with trivial categories such as "cats" and "dogs", we only need to understand the difference between these two animals, which is common sense; but to label medical images as "cancer" or "not cancer", we need deep medical expertise, and it is often hard to find domain experts who can credibly identify and label specific types of data such as tumors or chemical formulas [56]. Finally, (iii) access limitation is also a pain point for acquiring more data, especially in domains involving sensitive data, where the amount of data can be limited due to privacy, safety, or ethical issues. For example, the collection and labeling of DICOM medical image scans is challenging for privacy reasons [47]. Recently, compliance and regulatory issues have become pressing concerns for enterprises dealing with data, especially since the GDPR entered into force [58]. Even companies that used to have access to large amounts of data might face increasing difficulties.

Moreover, besides being laborious and expensive, in some cases having more data is simply impossible. In some domains, and for rare events, sufficient data might not be available, which may hinder ML adoption in such applications. The most striking scenarios include: (i) studying rare phenomena such as earthquakes, epidemics, and floods; (ii) aggregate modeling of states, countries, rare animal species, or any situation where the population itself is limited; and (iii) time series forecasting, which often lacks historical or seasonal data for a target variable. These scenarios can be found in many domains such as production, marketing, government, military, and education. Perhaps the domains most cursed by the scarcity of data are the following. (a) Robotics: robots are expected to act like humans or animals, often have to operate in ever-changing, uncontrolled real-world environments, and will inevitably encounter instances of classes, scenarios, textures, or environmental conditions that were not covered by the training data [59]; this is why operating reliably in irregular scenarios, such as strange weather occurrences or other vehicles' unpredictable driving patterns, is still problematic for autonomous vehicles. (b) Medicine is also deeply cursed: medical data require a ground truth normally provided by an expert physician, ending up with only a small set of annotated data; microarray and RNA-Seq data are typical of this type of small-sample problem [310, 311]. In addition, new diseases consistently occur with little historical data, and rare diseases frequently occur with few cases, so only scarce training data with accurate labels can be obtained [56]. This is why it is still challenging for computer-aided diagnosis to detect rare tumors such as bladder cancer, for which there are only a few medical records [60].

AI’s next frontier

The power of data to increase algorithms' performance has sparked the so-called "unreasonable effectiveness of data" mindset [61], which holds that even very complex problems may be solved by simple statistical models trained on massive datasets. Google's director of research, Peter Norvig, puts it this way: "We don't have better algorithms. We just have more data" [52]. This leads us to a fundamental question about the future of the ML field: will continually increasing amounts of training data be sufficient to drive continued progress in ML, absent the development of more advanced and sophisticated algorithms?

While many early works answered this question affirmatively [61, 62], other recent studies attempt to prove the opposite. As stated by Domingos [53], "Data alone is not enough, no matter how much of it you have". Basing his statement on the "no free lunch" theorem, Domingos argued that learners need to embody knowledge beyond the data they are given in order to generalize beyond it. Zhu et al. [63] highlighted with evidence a surprising observation: at some point, off-the-shelf implementations often decrease in performance with additional data, and at that point, improving the algorithm is required to sustain performance. Greco et al. [64] provided a qualified defense of the value of less data, claiming that, seen through the lens of cognitively inspired AI, the future of the field is about less data, not more.

Furthermore, the move toward data-efficient AI is also a necessity in order to make learners more human-like. In fact, in spite of their biological inspiration and performance achievements, in their current implementations learners differ from human intelligence in crucial ways [66]. As previously discussed, humans, and even animals, can quickly learn new skills or adapt to changing circumstances based on a few experiences. A child, for instance, can learn to recognize a new kind of object or animal from just a short exposure to a single example; getting burned once will teach him to be careful with fire; he may recognize a face that he has seen only briefly; and he can recognize a lullaby that he has heard only a few times before [67]. Similarly, soon after birth and without the benefit of massive data sets, animal babies start to figure out how to solve problems of feeding, fighting, fleeing, and mating: a squirrel can jump from tree to tree within months of birth, a colt can walk within hours, and spiders are born ready to hunt [68]. The main reason why young animals (including humans) learn faster, better, and with less data is that they rely heavily on innate mechanisms [68] and make use of prior knowledge. Such an innate learning process is yet to be reproduced in artificial learners, hence the need to rekindle the old "nature versus nurture" debate [69] in the AI context, if the goal is to achieve AGI, a human-level machine intelligence capable of learning the way we do.

A final and more general motivation for developing data-efficient algorithms stems from the aim of achieving a more robust AI. As described by Marcus [70], the next level of AI is not necessarily superhuman, but it has to solve problems in a systematic and reliable way; it should not be a "pointillistic" intelligence that works in many cases but fails in many others, but rather a "solutionism" intelligence that solves any problem encountered, under all conditions. Thus, it is our contention that, to reach the next level of AI, algorithms should work in both data-driven and non-data-driven settings. Hence, a research agenda for robust AI should include solutions that improve learners' performance on problems with small datasets.

As a result, at the frontiers of AI, efforts should be underway to develop improved forms of ML that are more robust and more human-like: enhanced algorithms that will allow autonomous vehicles to drive us around both usual and unpredictable places, and that will work as well for rare diseases as for common ones, without being excessively demanding in their requirements for data. This new generation of learners will reshape our understanding of AI and disrupt the business landscape in profound ways.

This part reports the findings of our extensive literature review, conducted by examining relevant works dealing with learners' data-efficiency issue. Different perceptions of the problem lead to different ways of solving it. Based on the study of the related body of research, we distilled four main strategies to alleviate algorithms' data hungriness. Each spans its own spectrum, and together they shape the advances in this research landscape. Figure 5 categorizes existing techniques into a unified taxonomy and organizes them under the umbrella of each strategy. We devote a section to each strategy. First, we point out research exploring learning algorithms that go beyond the realm of supervised learning. Second, we review relevant techniques to artificially enlarge the training dataset. Third, we present an overview of the different forms that learning from previous experiences can take. Finally, we introduce a new research direction that aims to conceive innovative hybrid systems that combine high prediction performance, explainability, and data efficiency.

Figure 5. Taxonomy of data-efficient methods; while many methods lie at the intersection of multiple categories, they are grouped by where they are reviewed in this survey

Non-supervised learning paradigms

When talking about data hungriness in ML, we mostly refer to supervised learning algorithms; it is this type of learning that has the most voracious appetite for data. Supervised methods need labelled data to build classification and regression models, and the performance of these models relies heavily on the size of the labelled training data available. One straightforward strategy to alleviate this data dependency would then be to use other learning paradigms: paradigms that do not require pre-existing data and can generate their own by interacting with their environment (i.e. reinforcement learning), paradigms that need only a small set of labelled data (i.e. semi-supervised learning), or paradigms that learn from raw unlabelled data (i.e. unsupervised learning). In this section, we scan recent methods in the literature that involve these non-supervised learning paradigms.

Semi‐supervised learning methods

The wide availability of unlabeled data in several real-world scenarios, combined with the lack of labeled data, has naturally resulted in the development of semi-supervised learning (SSL) [71]. SSL is an extension of supervised learning that uses unlabeled data in conjunction with labeled data for better learning; it can also be viewed as unsupervised learning with some additional labeled data. Accordingly, SSL may refer either to semi-supervised classification [71], where unlabeled data are used for regularization purposes under particular distributional assumptions to enhance supervised classification, or to semi-supervised clustering [72], where labeled data are used to define constraints that yield better-defined clusters than those obtained from unlabeled data alone. In the literature, most attention has been paid to the methods of these two groups; relatively few studies deal with other supervised/unsupervised problems such as semi-supervised regression [73] and semi-supervised dimensionality reduction [74]. Depending on the nature of the training function, SSL methods are commonly divided into two settings: inductive and transductive. Given a training dataset, inductive SSL attempts to predict the labels of unseen future data, while transductive SSL attempts to predict the labels of the unlabeled instances taken from the training set [71]. A broad variety of SSL methods have been proposed in both settings. These methods differ in how they make use of unlabeled data and in how they relate to supervised algorithms. Next, we review the three most dominant families of methods, namely: (i) self-labeled methods, (ii) graph-based methods, and (iii) extended supervised methods.

Self-labeled methods

These techniques are used to solve classification tasks; they aim to obtain enlarged labeled data by assigning labels to unlabeled data using their own predictions [75]. As a general pattern, one or more supervised base learners are iteratively trained with the original labeled data as well as previously unlabeled data augmented with predictions from earlier iterations of the learners; the latter is commonly referred to as pseudo-labeled data. The main advantage of this iterative SSL approach is that it can be "wrapped" around any supervised learner.

The basic iterative schema for self-labeled techniques is self-training [75]: a single supervised classifier is iteratively trained on both labeled data and data that has been pseudo-labeled in previous iterations of the algorithm. Tanha et al. [76] discussed the choice of the base learner and stated that, to be successful, the learner must above all correctly estimate the confidence of its predictions; they experimentally showed that an ensemble learner as a base learner gives an extra improvement over basic decision tree learners. Livieris et al. [77] proposed an algorithm that dynamically selects the most promising base learner from a pool of classifiers based on the number of most confident predictions on unlabeled data. Li and Zhou [78] addressed the issue of erroneous initial predictions that can lead to the generation of incorrectly labeled data; they presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples.
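To make the generic self-training schema concrete, the following is a minimal sketch in Python, assuming a scikit-learn-style base learner that exposes predict_proba; the confidence threshold and iteration budget are illustrative choices of ours, not prescriptions from the works cited above.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    # Generic self-training loop: at each round, the base learner pseudo-labels
    # the unlabeled points it is most confident about and is then retrained on
    # the enlarged labeled set.
    X_train, y_train = X_lab.copy(), y_lab.copy()
    X_pool = X_unlab.copy()
    model = LogisticRegression(max_iter=1000)   # any learner with predict_proba works
    for _ in range(max_iter):
        model.fit(X_train, y_train)
        if len(X_pool) == 0:
            break
        proba = model.predict_proba(X_pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                               # nothing confident enough; stop growing
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, pseudo])
        X_pool = X_pool[~confident]
    return model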

Co-training is a variant of the self-training schema that uses multiple supervised classifiers [79]. Considered a special case of multi-view learning [80], the co-training schema assumes that the feature space can be divided into two separate views, each of which is used in turn to predict labels for the unlabeled data more effectively. In the work of Didaci et al. [81], the relation between the performance of co-training and the size of the labeled training set was examined, and their results showed that high performance was achieved even when the algorithm was provided with very few instances per class. Jiang et al. [82] introduced a hybrid method that combines the predictions of two different types of classifier, a generative one (Naive Bayes) and a discriminative one (Support Vector Machine), to take advantage of both; the final prediction is controlled by a parameter that weights the two classifiers. Their experimental results showed that the method performs much better when the amount of labeled data is small. Qiao et al. [83] proposed a deep co-training method that trains multiple deep neural networks (DNNs) as the different views and exploits adversarial examples to encourage view difference, in order to prevent the networks from collapsing into each other; as a result, the co-trained networks provide different and complementary information about the data.

Graph-based methods

Transductive methods typically define a graph over all data points, both labeled and unlabeled: the nodes of the graph are the labeled and unlabeled samples, whereas the edges encode the similarities among them [84]. Common graph-based SSL methods follow a two-stage process: (i) constructing a graph from the samples and then (ii) propagating the partial labels over the graph to infer the unknown labels [71]. Initial research on graph-based methods focused on the inference phase. Pang and Lee [85] approached the inference from a min-cut perspective and used the min-cut approach for classification in the context of sentiment analysis. Other works approached the graph-based inference phase from the perspective of Markov random fields [86] and Gaussian random fields [87]. On the other hand, the construction of the graph itself involves two stages: the initial phase builds the graph adjacency matrix, and the second phase calculates the graph weights. Blum and Chawla [88] experimented with graph construction using k-nearest neighbors and ε-nearest neighbors; the latter simply connects each node to all nodes at distance at most ε. The most commonly used functions for calculating graph weights are the Gaussian similarity function and the inverse Euclidean distance [71]. We note that although graph-based methods are typically transductive, inductive graph-based methods also exist in the literature; this line of work encompasses approaches that utilize the intrinsic relationships of both labeled and unlabeled samples to construct the graph used to estimate a function [89]. However, it is generally acknowledged that transductive graph methods usually perform better than inductive ones [84]. Another line of work that has recently received much attention is scalable graph-based SSL. A commonly used approach to cope with this issue is anchor graph regularization [90], which builds a regularization framework by exploring the underlying structure of the whole dataset with both data points and anchors. Liu et al. [90] provided a complete overview of approaches for making graph-based methods more scalable.
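A minimal sketch of the two-stage schema described above, assuming NumPy arrays and a dense Gaussian-weighted graph, is given below; it iteratively propagates label distributions while clamping the labeled nodes, in the spirit of the Gaussian random field methods cited above, and is meant as an illustration rather than any specific published algorithm.

import numpy as np

def label_propagation(X, y, labeled_mask, sigma=1.0, n_iter=100):
    # Stage 1: build a dense Gaussian-similarity graph over all points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)       # row-normalized transition matrix

    # Stage 2: propagate label distributions, clamping the labeled nodes.
    classes = np.unique(y[labeled_mask])
    F = np.zeros((len(X), len(classes)))
    F[labeled_mask, np.searchsorted(classes, y[labeled_mask])] = 1.0
    F_labeled = F[labeled_mask].copy()
    for _ in range(n_iter):
        F = P @ F                              # diffuse labels along the graph edges
        F[labeled_mask] = F_labeled            # keep known labels fixed
    return classes[F.argmax(axis=1)]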

Extended supervised methods

These methods are direct extensions of traditional supervised learning methods to the semi-supervised setting. The most prominent examples of this class of methods are (i) semi-supervised support vector machines and (ii) semi-supervised neural networks.

Mainstream models of semi-supervised SVMs include many variants such as S3VM [91], TSVM [92], LapSVM [93], meanSVM [94], and S3VM based on cluster kernels [95]. The related literature presents S3VM and TSVM as the two most popular variants. The goal of S3VM is to build a classifier using both labeled and unlabeled data. Similar to the idea of the standard SVM, S3VM requires the maximum margin to separate the labeled and unlabeled data, with the constraint that the new optimal classification boundary yields the smallest generalization error on the original unlabeled data. TSVM exploits specific iterative algorithms that gradually search for a reliable separating hyperplane (in the kernel space) through a transductive process incorporating both labeled and unlabeled samples. Since their introduction, semi-supervised SVM models have evolved in different directions, and various approaches have been proposed to improve existing variants or to create new ones [96].

Recently, numerous research efforts have been made to build effective classification models using semi-supervised neural network (SSNN) methods. The hierarchical nature of representations in DNNs makes them a viable candidate for semi-supervised approaches: if deeper layers in the network express increasingly abstract representations of the input sample, one can argue that unlabeled data could be used to guide the network towards more informative abstract representations. A common strategy in this line of research is to train the DNN by simultaneously optimizing a standard supervised classification loss on labeled samples along with an additional unsupervised loss term imposed on either unlabeled data or both labeled and unlabeled data [97]. A typical structure for this strategy is the Ladder Network [98], an autoencoder structure with skip connections from the encoder to the decoder. Proposed by Rasmus et al. [98], this model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Prémont-Schwarz et al. [99] extended the Ladder Network architecture to the recurrent setting by adding connections between the encoders and decoders of successive instances of the network. A related group of SSNN methods is known as teacher-student models [71], where a single teacher model, or an ensemble of teacher models, is trained to predict on unlabeled data and the predicted labels are used to supervise the training of a student model; the teacher thus guides the student to approximate its performance under perturbations in the form of noise applied to the input and hidden layers of the models. The teacher in the teacher-student structure can be summarized as being generated by an exponential moving average (EMA) of the student model, and various ways of applying the EMA lead to a variety of methods in this category. In the VAT model [100] and the Π model [101], the teacher shares the same weights as the student, which is equivalent to setting the averaging coefficient to zero. The Temporal Ensembling model [101] is similar to the Π model except that it also applies an EMA to accumulate the historical predictions. The Mean Teacher [102] applies an EMA to the student's weights to obtain an ensemble teacher. There are also SSNN methods based on generative models [130]; since their primary goal is to model the process that generated the data, this technique is reviewed in the "Data augmentation" section.
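As a hedged illustration of the teacher-student idea, the PyTorch-style sketch below combines a supervised loss on a labeled batch with a consistency loss on an unlabeled batch and updates the teacher as an EMA of the student, in the spirit of the Mean Teacher method [102]; the loss form, coefficients, and function name are placeholders of ours, and the teacher is assumed to start as a copy of the student.

import torch
import torch.nn.functional as F

def semi_supervised_step(student, teacher, optimizer,
                         x_lab, y_lab, x_unlab, alpha=0.99, lam=1.0):
    # Supervised cross-entropy on the labeled batch
    sup_loss = F.cross_entropy(student(x_lab), y_lab)

    # Consistency term: push the student towards the (frozen) teacher's
    # predictions on the unlabeled batch
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x_unlab), dim=1)
    student_probs = F.softmax(student(x_unlab), dim=1)
    cons_loss = F.mse_loss(student_probs, teacher_probs)

    loss = sup_loss + lam * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher's weights track the student as an exponential moving average
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.data.mul_(alpha).add_(s_p.data, alpha=1.0 - alpha)
    return loss.item()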

Unsupervised representation learning methods

The limited performance of data-hungry models when only a small amount of labeled data is available for training has led to increasing interest in learning feature representations in an unsupervised fashion to solve learning tasks with insufficient labeled data. Unsupervised representation learning [300] encompasses a group of methods that make use of unlabeled data to learn a representation function f such that replacing a data point x by the feature vector f(x) in new classification tasks reduces the requirement for labeled data. Such learners seek representations that are sufficiently generalizable to adapt to various future learning tasks. In this case, the representations learned by unsupervised methods are usually assessed by the performance of downstream classification tasks built on top of them. Thus, the focus here is not on clustering or dimensionality reduction, but rather on learning unsupervised representations. Accordingly, in this subsection we review the recent progress and the most representative efforts in unsupervised representation learning. Generally, three groups of research fall under the umbrella of methods for training unsupervised representations, namely: (i) transformation-equivariant representations, (ii) self-supervised methods, and (iii) generative models.

Transformation-equivariant representations

The learning of Transformation-Equivariant Representations (TERs) was introduced by Hinton et al. [103] as the key idea behind training capsule nets and has played a critical role in the success of Convolutional Neural Networks (CNNs); it has since been formalized in various ways. Basically, TER learning seeks to model representations that equivary to various transformations of images by encoding their intrinsic visual structures; subsequent problems of recognizing unseen visual concepts can then be addressed on top of the trained TER in an unsupervised fashion. Along this line of research, Group-Equivariant Convolutions (GEC) [104] were proposed by directly training feature maps as functions of different transformation groups; the resultant feature maps are proved to equivary exactly with the designated transformations. However, GEC have a restricted form of feature maps as functions of the considered transformation group, which limits the flexibility of the representation in many applications. Recently, Zhang et al. [105] proposed Auto-Encoding Transformations (AET); this form of TER offers more flexibility in enforcing transformation equivariance by maximizing the dependency between the resultant representations and the chosen transformations. Qi et al. [106] later proposed an alternative Auto-encoding Variational Transformation (AVT) model that reveals the connection between the transformations and the representations by maximizing their mutual information.

Self-supervised methods

Self-supervision is a form of unsupervised learning where the data itself provides the supervision. Broadly speaking, self-supervised learning converts an unsupervised learning problem into a supervised one by creating surrogate labels from the unlabeled dataset, potentially greatly reducing the number of labeled examples required [107]. Currently, there are several techniques to achieve this, including autoregressive models such as PixelRNN [108], PixelCNN [109], and the Transformer [110]. These methods are trained by predicting context, missing parts, or future data; they can produce useful unsupervised representations because the contexts from which the unseen parts of the data are predicted often depend on the same shared latent representations. Generative models can also be considered self-supervised, but with a different goal: generative models focus on creating diverse and realistic data, while self-supervised representation learning cares about producing good features that are generally helpful for many tasks.
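To make the surrogate-label idea concrete, one widely used pretext task (given here only as an illustration, not as a method prescribed by the works cited above) is rotation prediction: every unlabeled image is rotated by 0, 90, 180, or 270 degrees, and a network is trained to predict which rotation was applied; the features it learns can then be reused downstream. A minimal sketch of the label construction, assuming square images stored as NumPy arrays:

import numpy as np

def make_rotation_task(images):
    # Build a surrogate supervised task from unlabeled images: inputs are rotated
    # copies, targets are the rotation index (0: 0 deg, 1: 90, 2: 180, 3: 270).
    # Assumes square images so that all rotations keep the same shape.
    rotated, targets = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k))
            targets.append(k)
    return np.stack(rotated), np.array(targets)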

Generative models

As with SSL, Auto-Encoders [141], Generative Adversarial Nets (GANs) [130], and many other generative models have been widely studied for unsupervised learning problems, from which compact representations can be learned to characterize the generative process of unlabeled data. In an unsupervised fashion, such models aim essentially at generating more data; this is why, as mentioned before, generative models are reviewed under the "Data augmentation" strategy.

Reinforcement learning

Another learning paradigm that has driven impressive advances in recent years without the need for gobs of real-world data is Reinforcement Learning (RL) [ 111 ].

RL is one step more data-efficient than supervised learning. In supervised learning, the learner learns from a labeled dataset with guidance, whereas an RL agent interacts with its environment, performs actions, and learns by a self-guided trial-and-error method [301]. In other words, in the absence of a training dataset, the RL agent is bound to learn from its own experience. Seen from this perspective, RL algorithms can be viewed as a data-optimized alternative to supervised learning algorithms, since their sample complexity does not depend on pre-existing data, but rather on the actions the agent takes and the dynamics of the environment [302].

One of the remarkable achievements of this learning paradigm is AlphaGo Zero [1], which was given absolutely no prior data other than the game's rules. With no other input, simply by playing against itself, AlphaGo Zero learned the game of Go better than any human or machine ever had. Another example is PILCO (Probabilistic Inference for Learning Control) [303], a model-based policy search method that propagates uncertainty through time for long-term planning and learns the parameters of a feedback policy by means of gradient-based policy search. It achieved unprecedented data efficiency for learning control policies from scratch (it requires only about 20 trials, i.e. about 30 s of experience) and is directly applicable to physical systems, e.g. robots.

Following the taxonomy of Arulkumaran et al. [112], two main RL approaches can be distinguished: (i) methods based on value functions, which estimate the value (expected return) of being in a given state; this approach forms the foundation of the state-action-reward-state-action (SARSA) algorithm [113] and of Q-learning [114], the most commonly used RL algorithms; and (ii) methods based on policy search, which do not need to maintain a value function model but directly search for an optimal policy. There is also a hybrid, actor-critic approach, which employs both value functions and policy search. Between the two approaches, policy-based methods are known to be significantly more sample-efficient because they reuse data more effectively [304]. For instance, Guided Policy Search [305] is very data-efficient as it uses trajectory optimization to direct policy learning and avoid poor local optima.
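A minimal sketch of the tabular Q-learning update mentioned above is shown below; it assumes a small, discrete environment exposing a simplified, hypothetical interface (reset() returning an integer state id and step(action) returning (next_state, reward, done)), and the hyperparameters are illustrative.

import numpy as np

def q_learning(env, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning: learn action values from interaction alone,
    # without any pre-existing training dataset.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()                          # assumed: returns an int state id
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(Q[state].argmax())
            next_state, reward, done = env.step(action)   # assumed simplified interface
            # Q-learning update: bootstrap from the greedy value of the next state
            td_target = reward + gamma * Q[next_state].max() * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q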

From the model perspective, RL algorithms can be categorized as (i) model-based and (ii) model-free, depending on whether the agent has access to, or learns, a model of the environment [112]. Having a model in hand allows the agent to plan ahead, predicting state transitions and future rewards. Thus, if the model is correct, learning benefits greatly in terms of sample efficiency compared to model-free methods. Hence, model-based algorithms take the lead in terms of data efficiency, as they try to derive a model of the environment and use that model for training the policy instead of data from real interactions (e.g., PILCO) [304].

Contemporary deep reinforcement learning (DRL) has led to tremendous advances [115], but it has also inherited shortcomings from the current generation of deep learning techniques, turning the trial-and-error learning paradigm into a data-hungry model [116]. Indeed, the combination requires humongous experience before becoming useful; it is even claimed that DRL's hunger for data is greater than that of supervised learning. This is why, although DRL can potentially produce very complex and rich models, simpler, more data-efficient methods are sometimes preferable [112].

In fact, DRL excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, such as game settings. However, training a DRL model in limited-interaction environments such as production-scale, healthcare, or recommender systems is challenging because of the expense of interaction and the limitation of budget at deployment. A recent wave of DRL research has tried to address this issue. For instance, Botvinick et al. [117] suggested in their recent work two key DRL methods to mitigate the sample efficiency problem: episodic deep RL and meta-RL. Buckman et al. [306] proposed stochastic ensemble value expansion (STEVE) to combine deep model-free and deep model-based approaches in RL in order to achieve the high performance of model-free algorithms with the low sample complexity of model-based algorithms. To reduce the number of system interactions while simultaneously handling constraints, Kamthe et al. [307] introduced a model-based DRL framework based on probabilistic Model Predictive Control (MPC) with transition models learned using Gaussian processes; the proposed approach requires on average only six trials (18 s). Popov et al. [308] introduced two extensions to the Deep Deterministic Policy Gradient (DDPG) algorithm for data-efficient DRL; they showed that by decoupling the frequency of network updates from the environment interaction, data efficiency is substantially improved. In a recent work, Schwarzer et al. [309] proposed Self-Predictive Representations (SPR), a method that uses self-supervised techniques along with data augmentation to train DRL in a limited-interaction environment. The model achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents, according to the authors, a 55% relative improvement over the previous state of the art.

Ultimately, unlabeled data are expected to be a game-changer that lets AI move beyond supervised, data-hungry models. While introducing his most recent research, SimCLR [118], a framework for contrastive learning of visual representations that has achieved a tremendous performance leap in image recognition using unsupervised learning, AI pioneer Geoffrey Hinton remarked at the AAAI 2020 conference that "unsupervised learning is the right thing to do". Appearing on the same AAAI stage, Turing Award winner Yann LeCun agreed that unsupervised learning, semi-supervised learning, or any model training that does not require manual data labeling are vital tools for the progress of ML and its applications. The literature is flourishing with a broad variety of semi-supervised and unsupervised algorithms (Fig. 6; Table 1 summarizes the key discussed methods). As a matter of fact, both lines of research have recently focused strongly on DNNs, particularly deep generative models, which have been extensively used for self-supervision and have also been extended to the semi-supervised setting. However, despite the success of these methods, a considerable amount of empirical evidence reveals that exploiting unlabeled data can also deteriorate learning performance [71]. The potential performance degradation caused by the introduction of unlabeled data is one of the most important issues to be resolved, especially in SSL. Furthermore, we noted that the evaluation aspect has received relatively little attention in the literature: pragmatic baselines for empirically evaluating the performance of non-supervised learning methods, in order to choose an approach well suited to a given situation, are relatively rare. Recently, Oliver et al. [119] established a set of guidelines for the realistic evaluation of SSL algorithms, and Palacio-Niño et al. [120] proposed evaluation metrics for unsupervised learning algorithms. In recent works, there has been a notable shift towards automatic selection and configuration of learning algorithms for a given problem; however, while automating the ML pipeline has been successfully applied to supervised learning [121], this technique is yet to be extended to the non-supervised settings.

Figure 6. Categorization of non-supervised learning methods

Data augmentation

To fight the data scarcity problem and to increase generalization, the literature suggests the use of Data Augmentation (DA) techniques. DA comprises a set of methods that apply mutations to the original training data to synthetically create new samples [122]. It is routinely used in classification problems to reduce the overfitting caused by limited training data [123]: when a model is trained with a small training set, it tends to overfit the samples in the training set, resulting in poor generalization, and DA acts as a regularizer to combat this. Increasingly considered a vital and ubiquitous data processing step in modern ML pipelines, DA has become a subject of great interest in both academic and industrial settings. Contributions in this field are actively growing, and new DA techniques emerge on a regular basis. Being unable to cover all existing techniques, we instead propose, based on the studied literature, a classification of existing augmentation strategies hinging on four aspects: (i) whether the mutation/transformation is handcrafted or learned, distinguishing basic from generative augmentations; (ii) whether the augmentation is performed in the data space or the feature space, distinguishing data-space from feature-space augmentations; (iii) whether the data to be augmented are acquired directly or come from another similar dataset, distinguishing in situ from borrowed augmentations; and (iv) whether the data to be augmented are labeled or unlabeled, distinguishing supervised from unsupervised augmentations. In the following, we briefly introduce the main methods and review the works that made the biggest impact in each class of augmentation.

Basic vs generative augmentations

The most popular and basic augmentation schema is traditional transformations; the aim of this class of methods is to preserve the label of the data through simple transformations that could plausibly occur in real data. For image augmentation, for example, this can be achieved by performing geometric transformations (such as random flipping, cropping, translation, or rotation), or by changing color, brightness, or contrast (Fig. 7). Intuitively, a human observer can still recognize the semantic information in the transformed image, while the learner perceives it as new data. The manipulations applied to ImageNet [123] remain the standard for this class of techniques; they have been used extensively for various purposes since their introduction, and vast amounts of research have used them to benchmark models or as a base on which to test new transformations. On the other hand, the MNIST (handwritten digits) dataset [124] is commonly augmented using elastic distortions [125], another transformation technique that mimics the variations in pen stroke caused by uncontrollable hand muscle oscillations. Yaeger et al. [126] also used the same technique for balancing class frequencies, by producing augmentations for under-represented classes. Sample pairing [127], proposed by Inoue, is another basic augmentation technique for image classification that creates a new image from an original one by overlaying another image randomly picked from the training set. Zhong et al. [128] introduced random erasing as a means to make models more robust to occlusion, by randomly erasing rectangular regions of the input image. Generally, the basic class of augmentations has proven to be a fast, reproducible, and reliable technique that is easy to implement [122, 129]. However, it relies on simple, predefined transformation functions and, in some specific cases, can result in further overfitting. This has prompted the investigation of new, more advanced, and more powerful DA techniques that include learning algorithms in the augmentation process.
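A minimal sketch of label-preserving transformations of the kind described above, using NumPy only; the specific operations, probabilities, and padding size are illustrative choices of ours.

import numpy as np

def augment(img, rng=None):
    # Simple label-preserving transformations on one H x W x C image.
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)
    if rng.random() < 0.5:                      # random horizontal flip
        out = np.fliplr(out)
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 255.0)   # brightness jitter
    pad = 4                                     # random crop after reflect-padding
    padded = np.pad(out, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    return padded[top:top + img.shape[0], left:left + img.shape[1]]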

Motivated by recent advances in generative models, especially adversarial learning, Generative Adversarial Networks (GANs) [130] have increasingly been used to generate synthetic data. In a nutshell, in GAN-based augmentation, two networks are trained to compete with each other: the generator creates new data instances (typically images) while the discriminator evaluates them for authenticity (real or fake); this co-optimized process results in realistic synthesized data (Fig. 7). The result obtained using generative models differs from the one obtained by basic transformations: the latter modify real data with predefined transformation functions, while the former create new synthetic data. The synthetic data need to be different enough from the original data so that these variations lead to better generalization capacity, in contrast to basic augmentation techniques, which are limited to minor changes so as not to damage the semantic content. This makes generative augmentation similar to imagination or dreaming; it has a creative effect that makes it known for its artistic applications, but the schema also serves as a great tool for DA.
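A minimal PyTorch-style sketch of the adversarial game just described: the discriminator is trained to separate real from generated samples, and the generator is then trained to fool it. The networks, optimizers, latent dimension, and function name are placeholders of ours rather than any specific architecture from the works cited here.

import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_batch, z_dim=100):
    # One adversarial round. The discriminator is assumed to output a
    # probability in [0, 1] (e.g. via a final sigmoid layer).
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator update: real samples -> 1, generated samples -> 0
    fake_batch = generator(torch.randn(n, z_dim)).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_batch), ones)
              + F.binary_cross_entropy(discriminator(fake_batch), zeros))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: make the discriminator label generated samples as real
    g_loss = F.binary_cross_entropy(discriminator(generator(torch.randn(n, z_dim))), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()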

Figure 7. Main augmentation techniques (images for domain adaptation are from [181])

Basic GAN architectures are unable to create high-quality new samples. This is why the main contributions in GAN-based augmentation are new architectures that modify the standard GAN framework through different network architectures, loss functions, evolutionary methods, and other mechanisms to produce higher-quality additional data. One of these variants is the conditional GAN, introduced by Odena et al. [131] in 2016 to generate data by controlling the random noise generation; many extensions of conditional GANs have been proposed since. ACGAN (Auxiliary Classifier GAN) [131] changed the GAN energy function to add the class-discrimination error of the generated and real samples; this variant demonstrates that a more complex latent code can boost the resolution of generated samples. Antoniou et al. proposed DAGAN (Data Augmentation GAN) [132], which generates synthetic data using a lower-dimensional representation of a real image; the authors train a conditional GAN on unlabeled data to generate alternative versions of a given real image. Mariani et al. proposed BAGAN (Balancing GAN) [133] as an augmentation tool to restore balance in imbalanced datasets. The use of non-conditional GANs to augment data directly has only very recently been explored. Karras et al. used PGGAN (Progressive Growing of GANs) [134], a stable architecture for training GAN models to generate large, high-quality images by incrementally increasing the size of the model during training; this approach has proven effective at generating high-quality synthetic faces that are startlingly realistic. DCGAN (Deep Convolutional GAN) [135] is one of the most successful network architectures for GANs; its main contribution is the use of convolutional layers in the GAN framework, which provides stable training in most cases and produces higher-resolution images.

Rather than generating additional samples, another class of innovative GAN variants attempts to translate data across domains; this consists of learning a mapping between data from a source domain (typically with large samples) and data from a similar target domain (with small samples), such as dogs to wolves. This helps to compensate the domain with few samples using data from other related domains. In pix2pix [136], a conditional GAN was used to learn a mapping from an input image to an output image; pix2pix learns a conditional generative model using paired images from the source and target domains. CycleGAN (Cycle-Consistent Adversarial Networks) was proposed by Zhu et al. [137] for image-to-image translation tasks in the absence of paired examples by introducing a cycle-consistency constraint. Similarly, DiscoGAN [138] and DualGAN [139] use an unsupervised learning approach for image-to-image translation based on unpaired data, but with different loss functions. CoGAN [140] also works on unpaired images, using two weight-sharing generators to generate images of two domains from one random noise vector.

Another generative technique to synthesize data using neural networks is the variational autoencoder (VAE). Originally proposed in [141], the VAE can be seen as a generative model that learns a parametric latent space of the input domain from which new samples can be generated. This has mostly been exploited for image generation [142]; recently, however, VAEs have also been used for speech enhancement [143] and for the synthesis of music sounds [144].

As reported by many scholars [122, 145], the primary problem with generative augmentations is that it is hard to generate data other than images, and even for image data it is very difficult to produce high-resolution output. Moreover, like any ANN, GANs and VAEs require a large amount of data to train, and their models can be unstable or can overfit. Thus, depending on how limited the initial dataset is, generative augmentation may not be a practical solution [145].

Data‐space vs feature‐space augmentation

The basic augmentations discussed above are applied to data in the input space; they are called "data warping" methods [146] as they generate additional samples through transformations applied in the data space. The main challenge with such augmentation schemes is that they are often tuned manually by human experts. Hence, they are application-dependent (transformations are domain-specific) and require domain expertise to validate label integrity and to ensure that the newly generated data respect valid transformations (ones that would occur naturally in that domain).

At the other end of the spectrum, we have "synthetic over-sampling" methods, which create additional samples in feature space. This class of techniques thus has the advantage of being domain-agnostic, requiring no specialized knowledge, and can therefore be applied to many different types of problems [146, 147]. The Synthetic Minority Over-sampling Technique (SMOTE) [148] is a well-known feature-space augmentation method that handles imbalanced datasets by interpolating between a sample and its k nearest neighbors to form new instances, as sketched below. Adaptive Synthetic sampling (ADASYN) [149] is similar to SMOTE and functions in the same way; by contrast, ADASYN adds a small random bias to the points after creating the samples so that they are not linearly correlated with their parents, which increases the variance in the synthetic data. The fact that image datasets are often imbalanced poses an intense challenge for DA. Like SMOTE and ADASYN, a lot of work has emerged that focuses on restoring the balance of imbalanced image datasets while creating new samples. Milidiu et al. [312] proposed Seismo Flow, a flow-based generative model that creates synthetic samples in order to address class imbalance. Shamsolmoali et al. [313] introduced a GAN variant called CapsAN that handles the class imbalance problem by coalescing two concurrent methods, GANs and capsule networks. Lee et al. [314] showed that pre-training DNNs with semi-balanced data generated through augmentation-based over-sampling improves minority-group performance.
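A minimal sketch of the SMOTE interpolation idea [148], assuming the minority-class samples are given as a NumPy array; the full technique includes further details (per-class neighbor selection, amount of over-sampling) that are omitted here.

import numpy as np

def smote(minority, n_new, k=5, rng=None):
    # Create n_new synthetic samples by interpolating each randomly chosen
    # minority point towards one of its k nearest minority-class neighbors.
    rng = rng or np.random.default_rng()
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per point

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbors[i, rng.integers(k)]
        gap = rng.random()                      # interpolation coefficient in [0, 1]
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.stack(synthetic)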

Furthermore, by manipulating the vector representation of data within a learned feature space, a dataset can be augmented in a number of ways. DeVries and Taylor [147] discussed adding noise, interpolating, and extrapolating as useful forms of feature-space augmentation, while Kumar et al. [150] studied six feature-space DA methods to improve classification, including upsampling, random perturbation, conditional variational autoencoders, linear delta, extrapolation, and the delta-encoder.

In situ augmentations vs borrowed augmentations

The common augmentation techniques described so far are self-sufficient, that is, they make use of the available small data to generate a larger dataset without the need for any external data. For this reason, we can consider them "in situ augmentations". However, they only work under the assumption that some initial data are available in the first place. In scenarios where no primary data are available, the previously discussed techniques are not applicable. A very human-like way to tackle this issue is to ask someone to lend you what you are missing (such as borrowing salt or pepper from a neighbor or a dress from a friend). Similarly, instead of being limited to the available training data, a "borrowed augmentation" schema, if we may call it that, augments data by aggregating and adapting input-output pairs from similar but larger datasets. A typical application of this method is autonomous driving, where training data can be translated along a night-to-day, winter-to-summer, or rainy-to-sunny scale (Fig. 7). Basically, transforming samples from one dataset to another aims at learning the joint distribution of the two domains and finding transformations between them. This line of research addresses the problem of domain shift [151], known as the dataset bias problem, i.e., a mismatch of the joint distribution of inputs between source and target domains. An early work [152] that addressed the problem proposed to learn a regularized transformation, using information-theoretic metric learning, that maps data in the source domain to the target domain. This is considered one of the first studies of domain adaptation [153] in the context of object recognition. However, this approach requires labeled data from the target domain, as the input consists of paired similar and dissimilar points between the source and the target domain. In contrast, Gopalan et al. [154] proposed a domain adaptation technique for an unsupervised setting, where data from the target domain are unlabeled. The domain shift, in this case, is handled by generating intermediate subspaces between the source and target domain, and projecting both the source and target domain data onto these subspaces for recognition. Unsupervised domain adaptation has been largely investigated afterward [155, 156, 157, 158]. Recently, it was shown that a GAN objective function can be used to learn target features indistinguishable from the source ones. Hence, most recent works on transporting data across domains are based on generative models. For instance, the aforementioned technique of image-to-image translation based on GANs is a successful example of such a schema; other similar techniques include neural style transfer (translating images from one style to another) [159], text-to-image translation [160], audio-to-image generation [161], text-to-speech synthesis [162], etc. Relying on GANs, other recent works made use of this model to boost performance. Wang et al. [163] proposed Transferring GANs (TGANs), which incorporate a fine-tuning technique into the GAN so that it can be trained with low-volume target data. Yamaguchi et al. [164] import data contained in an outer dataset to a target model by using a multi-domain learning GAN. Huang et al. [165] proposed AugGAN, a cross-domain adaptation network that directly benefits object detection by translating existing RGB detection data from its original domain to other scenarios.
As one may note, while most works address cross-domain data transfer for image generation, the challenge remains only modestly explored in other domains [166].
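
To illustrate the idea of learning target features that are indistinguishable from source features, the sketch below shows a domain-adversarial (DANN-style) training step with a gradient-reversal layer; this is one representative technique in this line of work, and the network sizes, loss weighting, and the absence of a full training loop are our own simplifications.

```python
# Sketch of adversarial feature alignment (DANN-style): a domain classifier learns to
# tell source from target features, while the feature extractor is trained to fool it
# via gradient reversal. Architecture sizes are illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

feature_extractor = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
label_classifier = nn.Linear(128, 10)     # trained on labeled source data only
domain_classifier = nn.Linear(128, 2)     # source vs target discriminator

def dann_step(x_src, y_src, x_tgt, lam=0.1):
    ce = nn.CrossEntropyLoss()
    f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)
    # Task loss on the labeled source domain.
    task_loss = ce(label_classifier(f_src), y_src)
    # Domain loss; the reversed gradient pushes features to become domain-invariant.
    feats = torch.cat([grad_reverse(f_src, lam), grad_reverse(f_tgt, lam)])
    domains = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    domain_loss = ce(domain_classifier(feats), domains)
    return task_loss + domain_loss
```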

Supervised vs unsupervised data augmentation

Augmentation schemas are class-preserving transformations; they rely on labeled data (supervised augmentation). However, if getting more data is hard, getting more labeled data is harder. While collecting unlabeled data is easier and cheaper, as no human effort is needed for labeling, a major issue is how to augment data without labels. Typically, the SSL and unsupervised methods discussed previously are the best candidates to address this issue. Remarkably, the challenge of using unlabeled data has been the subject of relatively few works in the literature in comparison with supervised augmentation methods. In recent work, Xie et al. [167] showed that data augmentation can be performed on unlabeled data to significantly improve semi-supervised learning. Their model relies on a small amount of labeled examples to make correct predictions for some unlabeled data, from which the label information is propagated to augmented counterparts through a consistency loss. Aside from consistency regularization, the commonly used approach for augmenting smaller labeled datasets using larger unlabeled datasets is self-training or, more generally, co-training [75]; as discussed in the previous strategy, this type of training relies on an iterative process that uses pseudo-labels on unsupervised data to augment supervised training. With the same goal of leveraging a large amount of unlabeled data and a much smaller amount of labeled data for training, other methods have been proposed in the literature, such as Temporal Ensembling [101], Mean Teacher [102], self-paced learning [168], and data programming [169].
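
A schematic training step in the spirit of such consistency-based semi-supervised augmentation is sketched below: the model is trained normally on the small labeled batch, while on unlabeled data its confident predictions on clean inputs serve as targets for augmented copies. The `model`, the `augment` function, the confidence threshold, and the loss weight are placeholders for illustration, not the exact formulation of the cited works.

```python
# Consistency-based semi-supervised step (in the spirit of UDA / self-training).
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_lab, y_lab, x_unlab, augment, threshold=0.8, w=1.0):
    # Supervised term on the small labeled batch.
    sup_loss = F.cross_entropy(model(x_lab), y_lab)

    # Predictions on clean unlabeled data serve as (soft) targets...
    with torch.no_grad():
        probs = F.softmax(model(x_unlab), dim=1)
        mask = probs.max(dim=1).values > threshold   # keep only confident predictions

    # ...and the model is trained to give consistent outputs on augmented copies.
    logits_aug = model(augment(x_unlab))
    per_sample_kl = F.kl_div(F.log_softmax(logits_aug, dim=1), probs,
                             reduction="none").sum(dim=1)
    unsup_loss = (per_sample_kl * mask).mean()

    return sup_loss + w * unsup_loss
```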

To sum up, there is no single best augmentation schema; the choice of technique depends on the application scenario. When no data are available, borrowed augmentations should be considered. When a large amount of unlabeled samples exists, unsupervised augmentations are the best choice. Fig. 7 and Table 2 depict the main reviewed DA techniques. However, it is noteworthy that very few studies in the literature compare the performance of the different augmentations empirically. Wong et al. [146] compared data-space and feature-space augmentation and found that it is better to perform data augmentation in data-space, as long as label-preserving transforms are known. Shijie et al. [170] compared generative methods with some basic transformations and found that combining the two types of augmentation yields better performance. Indeed, combining augmentation techniques can result in massively inflated dataset sizes, but this is not guaranteed to be advantageous: especially in very limited data settings, it can result in further overfitting [122]. Furthermore, the classes of techniques described in this section are neither mutually exclusive nor exhaustive. Depending on the complexity, the space, the domain, and the data annotability on which the augmentation operates, techniques can belong to different classes. For example, generative augmentations such as CycleGAN are used to implement image-to-image translation, which is a type of borrowed augmentation. GANs have also been exploited in the context of unsupervised augmentation; for instance, Wang et al. [171] proposed a variant of CycleGAN (DicycleGAN) that performs an unsupervised borrowed augmentation based on a generative model.

Regardless of their number and capabilities, current DA implementations remain manually designed. A key research question is therefore how to automatically find an effective DA schema for a given dataset by searching a large space of candidate transformations. State-of-the-art approaches to this problem include TANDA, a framework proposed by Ratner et al. [172] that learns augmentations based on a GAN architecture, and AutoAugment [173], which demonstrated state-of-the-art performance using a reinforcement learning algorithm to search for an optimal augmentation policy among a constrained set of transformations with miscellaneous levels of distortion. Several subsequent works, including RandAugment [174] and Adversarial AutoAugment [175], have been proposed to reduce the computational cost of AutoAugment, establishing new state-of-the-art performance on image classification benchmarks.
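
As a toy illustration of searching over augmentation policies, the sketch below performs a simple random search: it samples small pipelines of torchvision transforms with random magnitudes and keeps the policy that yields the best validation accuracy. The candidate operations are arbitrary assumptions, and `train_and_evaluate` is a hypothetical user-supplied callable; real systems such as AutoAugment or RandAugment use far more sophisticated search and proxy evaluation.

```python
# Toy random search over augmentation policies (a much-simplified stand-in for
# AutoAugment-style policy search).
import random
from torchvision import transforms

# Candidate operations, each parameterized by a magnitude m in (0, 1].
CANDIDATE_OPS = [
    lambda m: transforms.RandomRotation(degrees=30 * m),
    lambda m: transforms.ColorJitter(brightness=0.8 * m),
    lambda m: transforms.RandomAffine(degrees=0, translate=(0.2 * m, 0.2 * m)),
    lambda m: transforms.RandomHorizontalFlip(p=m),
]

def sample_policy(n_ops=2):
    """A policy = a few randomly chosen ops, each with a random magnitude."""
    return transforms.Compose(
        [random.choice(CANDIDATE_OPS)(random.uniform(0.1, 1.0)) for _ in range(n_ops)])

def random_policy_search(train_and_evaluate, n_trials=20):
    """train_and_evaluate: hypothetical callable that trains a model with the given
    augmentation pipeline and returns its validation accuracy."""
    best_acc, best_policy = 0.0, None
    for _ in range(n_trials):
        policy = sample_policy()
        acc = train_and_evaluate(policy)
        if acc > best_acc:
            best_acc, best_policy = acc, policy
    return best_policy, best_acc
```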

As noted several times before, DA has been used to achieve nearly all state-of-the-art results for image data, particularly in medical imaging analysis. In this domain, where high-quality supervised samples are generally scarce and fraught with legal concerns regarding patient privacy, image augmentation is considered a de facto technique [176, 177, 178]. Medical data also suffer from the so-called "p large, n small" problem (where p is the number of features and n is the number of samples); hence, some works [310] attempted to fight the curse of data dimensionality along with the curse of data scarcity by proposing a dimensionality-reduction-based method that can be used for data augmentation. Unfortunately, dataset augmentation is not as straightforward to apply in other domains as it is for images. Current efforts to explore DA in non-image domains mainly include sound, speech, and text augmentation. In this vein, Schluter and Grill [179] investigated a variety of DA techniques for singing voice detection, and Wei et al. [180] proposed a text augmentation technique for improving NLP application performance.

Knowledge sharing

A common assumption in most ML algorithms states that the training and future (unknown) data must be drawn from the same data space and follow the same distribution [182] (as stressed before, following the PAC-learnability criteria, the distribution D must be stationary; see the "Background" section). This implies that when the task to be learned or its domain changes, the model needs to be rebuilt from scratch using newly collected training data. This paradigm is called single-task learning or isolated learning. The fundamental problem with this way of learning is that it does not consider any other related information or previously learned knowledge to alleviate the need for training data for a given task. This is in sharp contrast to how we humans learn. As discussed in the "Background" section, human learning is very knowledge-driven: we accumulate and maintain the knowledge learned from previous tasks and use it seamlessly when learning new tasks and solving new problems with little data and effort. Towards the ultimate goal of building machines that learn like humans, some research areas have attempted to break the exclusive dependency on training data by exploring the idea of using prior knowledge as an additional input for ML models, apart from standard training data. We characterize this family of approaches as the knowledge sharing strategy. Depending on how, when, and to what extent knowledge is shared, the research is conducted under different guises; however, all approaches share the same spirit: reusing knowledge instead of relying solely on the tasks' training data. Next, we investigate the four main ways of sharing knowledge found in the literature, namely (A) Transfer Learning, (B) Multi-Task Learning, (C) Lifelong Learning, and (D) Meta-Learning.

Transfer Learning

Inspired by human beings' capability to transfer knowledge across tasks, Transfer Learning (TL) aims to improve learning and minimize the amount of labeled samples required in a target task by leveraging knowledge from a source task. Following the definition of Pan et al. [182]: given a source domain D_S and a learning task T_S, and a target domain D_T and a learning task T_T, TL aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T. Accordingly, TL allows the tasks and distributions used in training (source) and testing (target) to be different. When the target and source domains are the same, i.e., D_S = D_T, and their learning tasks are the same, i.e., T_S = T_T, the learning problem becomes a traditional ML problem.

Surveys [182] and [183] proposed and discussed a taxonomy of TL that has been widely accepted and used. Depending on the availability of labeled data in the source and/or target data, they distinguish between [182]: (i) inductive TL, (ii) transductive TL, and (iii) unsupervised TL, which correspond respectively to the case of having labeled target domain data, the case of having labeled source but no labeled target domain data, and the case of having neither labeled source nor labeled target domain data. Domain adaptation, the technique discussed before in the context of data augmentation, is a type of transductive TL in which the source task and the target task are the same but their domains are different. Furthermore, regardless of the availability of labeled and unlabeled data, TL problems can generally be categorized into two main classes [183]: homogeneous and heterogeneous transfer learning. The former focuses on generalization across the same domain representations, meaning that the samples in the source domain and those in the target domain share the same representation structure but follow different probability distributions; the majority of TL approaches belong to this category. In the latter category, the feature spaces of the source and target are nonequivalent and generally non-overlapping. This case is more challenging, as knowledge is available from source data but is represented differently from the target; it thus requires feature and/or label space transformations to bridge the gap for knowledge transfer, as well as handling the cross-domain differences in data distribution.

The effectiveness of any transfer method depends on the source task and how it is related to the target task. A transfer method produces positive transfer between appropriately related tasks, while negative transfer occurs when the source task is not sufficiently related to the target task or when the relationship is not well leveraged by the transfer method [184]. Increasing positive transfer while avoiding negative transfer is one of the major challenges in developing transfer methods.

TL methods in the literature share the same function: leveraging the knowledge in the source domain. Three classes of TL methods can be defined based on the type of shared knowledge: instance, feature, or model (parameter). Accordingly, we can distinguish between: (i) instance-based TL approaches, which reuse labeled data from the source domain by re-weighting or resampling instances to help train a more precise model for the target learning task [185]; (ii) feature-based TL approaches, in which the transfer operates in an abstracted "feature space" instead of the raw input space, the aim being to minimize domain divergence and reduce error rates by identifying good feature representations that can be utilized from the source to the target domain [186]; and (iii) model-based TL, also known as parameter-based TL, where the transferred knowledge is encoded into model parameters, priors, or model architectures, so that the goal is to discover what part of the model learned in the source domain can help the learning of the model for the target domain [187]. Model-based TL is arguably the most frequently used method. Additionally, we also identified relational-based TL for cases where data are not independent and identically distributed. The three main TL approaches implicitly assume that data instances are independent and identically distributed; however, real-world scenarios often contain structure among the data instances, leading to relational domains such as social networks. A family of approaches called relational-based TL attempts to handle this issue by building a mapping of the relational knowledge between the source relational domain and the target relational domain [188].

In the studied literature, TL methods are used for the classic learning tasks, including classification, regression, and clustering; relatively fewer but impactful works have also handled TL for reinforcement learning [189]. Successful applications of TL include computer vision [190], NLP [191], and urban computing [192]. Emerging and promising research lines in TL include: (i) hybrid-based approaches, TL solutions that focus on transferring knowledge through the combination of different TL methods, for instance by using both instances and shared parameters; this is a relatively new approach and a lot of interesting research is emerging [193]. (ii) Deep transfer learning: as deep learning becomes a ubiquitous technique, researchers have begun to endow deep models with TL capabilities. The powerful expressive ability of deep learning has also been leveraged to extract and transfer knowledge such as the relationships among categories. Fine-tuning [194] is a glaring example of a popular and effective technique for knowledge transfer in terms of model parameters based on pre-trained models. The knowledge distillation technique [195], which involves a teacher network and a student network, is also a good example of this line of work. (iii) Transitive TL [196], a new type of TL problem where the source and target domains have very few common factors, making most TL solutions invalid; again following the human learning model, which can conduct transitive inference and learning, novel TL solutions have been proposed to connect the source and target domains through one or more intermediate domains via some shared factors. (iv) AutoTL, which addresses the issue of learning to transfer automatically [197]. Wei et al. [198] proposed a transfer learning framework, L2T, that automatically explores the space of TL method candidates to discover and apply the optimal TL method that maximally improves the learning performance in the target domain.
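
As a concrete illustration of model-based transfer via fine-tuning, the sketch below reuses a torchvision ResNet-18 pre-trained on ImageNet (an assumed choice of backbone and weights), freezes the transferred parameters, and retrains only a new classification head on the small target task; the number of target classes and the learning rate are illustrative.

```python
# Model-based transfer sketch: freeze a pre-trained backbone and retrain a new head.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # source knowledge: pre-trained weights
for param in model.parameters():
    param.requires_grad = False                    # freeze the transferred backbone

num_target_classes = 5                             # assumed small target task
model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # new trainable head

optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
# ...then train only model.fc on the (few) labeled target examples; optionally
# unfreeze the last blocks later for a second pass with a lower learning rate.
```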

Multi-task learning

If a TL method aims to improve the performance of the source task and the target task simultaneously, we are dealing with a multi-task learning (MTL) problem [199]. MTL shares the general goal of leveraging knowledge across different tasks. However, unlike TL, there is no distinction between source and target tasks: multiple related tasks, each of which has insufficient labeled data to train a model independently, are learned jointly using a shared representation. The training data from the extra tasks then serve as an inductive bias, acting in effect as constraints on the others, improving overall accuracy and the speed of learning. As a result, the performance of all tasks is improved at the same time, with no task prioritized. MTL is clearly close to TL; in some literature it is even considered a type of inductive TL [182], which is why it is generally acknowledged that an MTL problem can be approached with TL methods, whereas the reverse is not possible [200]. Some works investigated hybrid scenarios where a new task arrives after multiple tasks have already been learned jointly by some MTL method. This can be seen as an MTL problem for the old tasks and a TL problem for leveraging knowledge from the old tasks to the new task. Such a setting is called asymmetric multi-task learning [201].

A variety of methods has been used for MTL; basically, each type of learning task corresponds to a different MTL setting [202]. Accordingly, (i) multi-task supervised learning is based on training labeled data for each task. As for TL, research in this area has been conducted in three categories: (a) feature-based multi-task supervised learning, specifically the problems of feature selection [203] and feature transformation [204]; (b) model-based multi-task supervised learning, notably the low-rank approach [205], the task clustering approach [206], and the task relation learning approach [207]; and (c) instance-based multi-task supervised learning [208], to which only modest contributions have been made. (ii) In multi-task unsupervised learning, each task deals with discovering useful patterns in data. (iii) In multi-task semi-supervised learning, tasks base their predictions on labeled as well as unlabeled data. (iv) In multi-task active learning, each task selects representative unlabeled data to query an oracle, with the hope of reducing the labeling cost as much as possible. (v) In multi-task reinforcement learning, each task aims to maximize the cumulative reward by choosing actions. (vi) In multi-task multi-view learning, each task exploits multi-view data. Recent years have witnessed extensive studies on streaming data, known as online multi-task learning [209]; this class of methods is used when training data in multiple tasks arrive sequentially, hence (vii) in multi-task online learning, each task processes sequential data.

In settings where MTL consists of tasks of different types, including supervised learning, unsupervised learning, reinforcement learning, etc., the MTL is characterized as heterogeneous, in contrast to homogeneous MTL, which consists of tasks of the same type. Unless explicitly stated otherwise, the default MTL setting is the homogeneous one [202].

Given the nature of its process, MTL has also been studied in decentralized settings where each machine learns a separate, but related, task. In this vein, multiple parallel and distributed MTL models have been introduced in the recent literature [209, 210, 211]. Recently, research on MTL using DNNs has produced a wide spectrum of approaches that have yielded impressive results on tasks and applications such as image processing [212], NLP [213], and biomedicine [214]. Conversely, there have been exciting results from using MTL methods in DNNs. Generally, there are two commonly used approaches to carrying out MTL in deep learning: hard and soft parameter sharing [215]. Hard parameter sharing implies sharing the hidden layers between all tasks while keeping the output layers task-specific. Soft parameter sharing gives each task its own model with its own parameters, where these model parameters are kept close through a regularized distance to facilitate the sharing of learning.
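
The hard parameter sharing pattern can be sketched as follows: a single shared trunk feeds one output head per task, and the task losses are summed so that each task acts as an inductive bias regularizing the shared representation. Layer sizes, task definitions, and loss weights are illustrative assumptions.

```python
# Hard parameter sharing sketch: one shared trunk, one head per task, summed losses.
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes_a=10, n_classes_b=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, n_classes_a)   # task A: e.g. a classification task
        self.head_b = nn.Linear(hidden, n_classes_b)   # task B: a related auxiliary task

    def forward(self, x):
        h = self.shared(x)              # shared hidden layers used by all tasks
        return self.head_a(h), self.head_b(h)

model = HardSharingMTL()
ce = nn.CrossEntropyLoss()

def mtl_loss(x, y_a, y_b, w_a=1.0, w_b=1.0):
    out_a, out_b = model(x)
    return w_a * ce(out_a, y_a) + w_b * ce(out_b, y_b)
```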

Lifelong learning

One of the long-standing challenges for both biological systems and computational models (especially ANNs) is the stability-plasticity dilemma [216]. The basic idea is that a learner requires plasticity for the integration of new knowledge, but also stability in order to prevent the forgetting of previous knowledge. The dilemma is that, while both are desirable properties, the requirements of stability and plasticity are in conflict: stability depends on preserving the structure of representations, plasticity on altering it, and an appropriate balance is difficult to achieve. Generally, ANN models often tend to have excessive plasticity, a problem that is dramatically referred to as "catastrophic forgetting" (or "catastrophic interference") [216], which basically means the loss or disruption of previously learned knowledge when a new task is learned. Recently, a number of approaches have been proposed to mitigate catastrophic forgetting. They aim to design models that are sensitive to, but not disrupted by, new data. These approaches are categorized as lifelong/continual learning (LL) approaches. LL embodies a knowledge sharing process, as it makes use of prior knowledge from past observed tasks to help continuously learn new/future tasks. Hence, LL studies scenarios where a large number of tasks come over time. To deal with this continuous stream of information, LL approaches include essentially two elements: (a) a retention strategy to sequentially retain previously learned knowledge and (b) a transfer mechanism to selectively transfer that knowledge when learning a new task. Most of the research effort in LL has focused primarily on how to retain knowledge; in doing so, the focus has shifted to countering catastrophic forgetting. Various approaches have been proposed in this sense, including (i) architectural methods, (ii) regularization methods, and (iii) rehearsal methods [217] (a minimal rehearsal sketch is given after the four paradigms below). A high-level analysis of the LL literature shows that, since its introduction 25 years ago in [218], the LL concept has mainly evolved along the four learning paradigms:

Lifelong supervised learning: Early contributions in this area were based on memory systems and neural networks. Thrun [219] proposed two memory-based learning methods: k-nearest neighbors and Shepard's method. Although they are still used today, memory-based systems suffer from the drawback of large working memory requirements, as they require explicit storage of old information [216]. At the neural network level, Thrun and Mitchell initially worked [220] on an LL approach called explanation-based neural networks (EBNN). Since then, Silver et al. have worked extensively on extending and improving neural network approaches through many works [221, 222, 223]. Furthermore, a lifelong naive Bayesian classification technique was proposed by Chen et al. [224] and applied to a sentiment analysis task. Ruvolo and Eaton [225] proposed an efficient lifelong learning algorithm (ELLA) that turns an MTL method into an LL method. Clingerman and Eaton [226] proposed GP-ELLA to support Gaussian processes in ELLA.

Lifelong unsupervised learning: Works in this area are mainly about lifelong topic modeling and lifelong information extraction. Lifelong topic modeling approaches extract knowledge from the topic modeling results of many previous tasks and utilize that knowledge to generate coherent topics in the new task; related works in this vein include [227, 228]. As the process of information extraction is by nature continuous and cumulative, information extraction represents an evident area for applying LL. Significant works in this line of research include [229, 230].

Lifelong semi-supervised learning: The most well-known and impactful work in this area is NELL, which stands for Never-Ending Language Learner [230, 231]. NELL is a lifelong semi-supervised learning system that has been reading the Web continuously for information extraction since January 2010 and has accumulated millions of entities and relations.

Lifelong reinforcement learning: Thrun and Mitchell [218] first studied lifelong reinforcement learning for robot learning. Recently, many works have been proposed in this area, driven by the recent surge of RL research following its high-profile successes. Bou Ammar et al. [232] presented an efficient policy-gradient lifelong reinforcement learning algorithm. Tessler et al. [233] proposed a lifelong learning system that transfers reusable skills to solve tasks in a video game. Rolnick et al. [234] introduced CLEAR, a replay-based method that greatly reduces catastrophic forgetting in multi-task reinforcement learning.
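
As mentioned above, a minimal rehearsal-style sketch is shown below: a small replay memory of past examples is mixed into each training step on the current task so that old knowledge keeps contributing to the loss. The buffer size, sampling scheme, and unweighted loss sum are illustrative simplifications of the rehearsal methods cited earlier.

```python
# Minimal rehearsal sketch for continual learning: replay stored past examples
# alongside the current task's batch to limit catastrophic forgetting.
import random
import torch
import torch.nn.functional as F

class ReplayBuffer:
    def __init__(self, capacity=500):
        self.capacity, self.data = capacity, []

    def add(self, x, y):
        for xi, yi in zip(x, y):
            if len(self.data) < self.capacity:
                self.data.append((xi, yi))
            else:                                   # reservoir-style replacement
                self.data[random.randrange(self.capacity)] = (xi, yi)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def continual_step(model, buffer, x_new, y_new, replay_size=32):
    loss = F.cross_entropy(model(x_new), y_new)     # learn the new task...
    if buffer.data:
        x_old, y_old = buffer.sample(replay_size)
        loss = loss + F.cross_entropy(model(x_old), y_old)  # ...while rehearsing old ones
    buffer.add(x_new, y_new)
    return loss
```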

An analysis of the LL literature shows that, despite the first pioneering attempts and early speculations, research in this field was never carried out extensively until recent years. In their book, Chen et al. [235] emphasized some reasons behind this slow advancement. The main reason, according to them, is that ML research in the past 20 years focused only on statistical and algorithmic approaches. Moreover, much of past ML research and its applications focused on supervised learning using structured data, which is not easy for LL because there is little to be shared across tasks or domains. They also underlined the fact that many effective ML methods, such as SVMs and deep learning, cannot easily use prior knowledge even when such knowledge exists. Recently, however, as most of the limits caused by these factors have been overcome, LL is increasingly becoming a rich area of scientific contribution and new approaches have emerged, notably continual learning in DNNs [216] and lifelong interactive knowledge learning for chatbots [236]. Still, we believe that the existing LL literature does not sufficiently cover the evaluation aspect, that is, what makes an LL system successful, how to compare existing LL algorithms, and which metrics are most useful to report. Hence, much more effort is expected in this research area in the years to come.

Meta‐learning

Meta-learning, or learning to learn (LTL), improves the learning of a new task by using meta-knowledge extracted across tasks [237]. In a nutshell, LTL treats learning tasks as learning examples: it aims to improve the learning algorithm itself, given the experience of multiple learning episodes. Basically, in a meta-learning system, we distinguish the meta-learner, which is the model that learns across episodes, and the inner-learner, which is instantiated and trained inside an episode by the meta-learner. More specifically, the inner-learner model, typically a CNN classifier, is initialized and then trained on the support set (e.g., the base training set). The algorithm used to train the inner-learner is defined by the meta-learner model. The latter updates the inner-learner so that it improves while solving a task in the classic way (base learning) with only a very small set of training examples. At the end of the episode, the meta-learner's parameters are updated from the loss resulting from the task learning error [238]. Thus, meta-learning is tightly linked to the process of collecting and exploiting meta-knowledge. Meta-knowledge collection is performed by extracting algorithm configurations such as hyperparameter settings, pipeline compositions and/or network architectures, the resulting model evaluations, the learned model parameters, as well as measurable properties of the task itself, also known as meta-features. The meta-knowledge is then transferred to guide the search for optimal models for new tasks [239].

From our perspective, we consider LTL a tool for knowledge sharing more than an approach to reusing knowledge per se. Indeed, in the surveyed literature, LTL is usually introduced as a method for solving other knowledge-sharing scenarios. In particular, LTL is commonly described as the de facto method for solving few-shot learning (FSL) problems [7], a regime where only a few experiences are available. Therefore, in the following we review LTL methods with respect to the three approaches discussed above, namely TL, MTL, and LL, while shedding light on FSL, the most popular instantiation of LTL in the field of supervised learning.

Meta-learning-based methods for FSL

As the name implies, FSL refers to the problem of learning a new concept or task with only a few training examples or no pre-labeled learning example [7]. FSL is not a knowledge sharing approach itself; it is an umbrella term encompassing techniques that make use of prior knowledge to deal with data scarcity scenarios. There are three main variants of FSL: (i) zero-shot learning [240], which deals with learning a task that has no associated labeled training samples; (ii) one-shot learning [241], where tasks are learned from a single example; and (iii) low-shot learning, which assumes that a handful (typically 2–5) of labeled examples exist for the target/novel classes. Recently, FSL has flourished, with several successful applications in the literature including few-shot classification [242], few-shot object detection [243], semantic segmentation [244], and landmark prediction [245]. Generally, existing FSL models fall into two main groups: (i) hallucination-based methods (practically, data augmentation), which deal directly with data scarcity by "learning to augment"; however, DA may alleviate the issue but does not solve it. In this section, we focus on the second group, (ii) meta-learning-based methods, which tackle the FSL problem by "learning to learn". The majority of methods in this class can be labeled as either metric learning algorithms or gradient-based meta-learners.

Metric learning algorithms: These methods address the FSL problem by "learning to compare". The basic idea of metric learning is to learn a distance function between data points (such as images). It has proven very useful for solving the FSL problem in classification tasks: instead of having to fine-tune on the support set (the few labeled images), metric learning algorithms classify query images by comparing them to the labeled images. Koch et al. [246] proposed Siamese Neural Networks to solve few-shot image classification; their model learns a siamese network with metric-learning losses on source data and reuses the network's features for the target one-shot learning task. Vinyals et al. [247] proposed Matching Networks, which use an episodic training mechanism. Snell et al. [248] introduced Prototypical Networks, which learn a metric space in which classification can be performed by computing distances to prototype representations of each class. Sung et al. [249] proposed the Relation Network, which uses a CNN-based relation module as a distance metric. Li et al. [250] designed a model named Covariance Metric Networks (CovaMNet) to exploit both the covariance representation and a covariance metric based on distribution consistency for few-shot classification tasks. Wertheimer et al. [251] localized objects using a bounding box. Garcia et al. [242] used a Graph Neural Network-based model. Despite the rich contributions in this line of research, relation measurement, that is, how to robustly measure the relationship between a concept and a query image, remains a key issue in this class of FSL methods.
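
The prototypical idea can be sketched in a few lines: support examples are embedded, averaged into one prototype per class, and query examples are classified by their distance to these prototypes. The embedding function is assumed to be any trained feature extractor; the Euclidean distance and softmax scoring follow the general prototypical-network recipe rather than any specific implementation.

```python
# Prototypical-network-style classification sketch: one mean embedding per class,
# queries labeled by (negative) distance to the nearest prototype.
import torch

def prototypical_predict(embed, support_x, support_y, query_x, n_classes):
    """embed: any feature extractor; support_y holds labels 0..n_classes-1."""
    z_support, z_query = embed(support_x), embed(query_x)
    prototypes = torch.stack([z_support[support_y == c].mean(dim=0)
                              for c in range(n_classes)])   # class prototypes
    dists = torch.cdist(z_query, prototypes)                # Euclidean distances
    return (-dists).softmax(dim=1)                          # closer prototype = higher prob
```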

Gradient-based meta-learning: These methods address the FSL problem by "learning to optimize". They embed gradient-based optimization into the meta-learner. More specifically, in such models there is an inner-loop optimization process that is partially or fully parameterized with fully differentiable modules. The methods of this class differ according to their choice of meta-learner algorithm. The most famous meta-learners in the literature are perhaps (i) Meta-LSTM, introduced by Ravi and Larochelle [252], a meta-learner that uses a Long Short-Term Memory network to replace the stochastic gradient descent optimizer and the weight-update mechanism, and (ii) Model-Agnostic Meta-Learning (MAML) [253], currently one of the most elegant and promising LTL algorithms. MAML provides a good initialization of a model's parameters so as to achieve fast learning on a new task with only a small number of gradient steps. This method is compatible with any model trained with gradient descent (model-agnostic), and has been shown to be effective in many classification and reinforcement learning applications. Following this line of work, many recent studies [315, 316, 317, 318] focused on learning better initializations by adaptively learning task-dependent modifications. In these works, the inner-loop optimization is generally based on first-order optimizer algorithms such as SGD and Adam. A few recent studies propose optimizer-centric approaches [319, 320, 321], models that focus not only on adjusting the optimizer algorithm but on learning the inner optimizer itself.
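
To give a flavor of gradient-based meta-learning without the second-order machinery of full MAML, the sketch below follows a Reptile-style first-order scheme: for each sampled task, a copy of the model is adapted with a few SGD steps, and the meta-parameters are then nudged towards the adapted parameters. The task sampler, step counts, and learning rates are hypothetical, and this is a simplified relative of the cited methods rather than MAML itself.

```python
# Reptile-style first-order meta-learning step (a simplified relative of MAML).
import copy
import torch
import torch.nn.functional as F

def reptile_outer_step(model, sample_task, inner_steps=5, inner_lr=0.01, meta_lr=0.1):
    task_model = copy.deepcopy(model)                      # inner-learner for this episode
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)

    for _ in range(inner_steps):                           # inner loop on the support set
        x, y = sample_task()                               # hypothetical task/batch sampler
        opt.zero_grad()
        F.cross_entropy(task_model(x), y).backward()
        opt.step()

    # Outer update: interpolate meta-parameters towards the task-adapted parameters.
    with torch.no_grad():
        for p_meta, p_task in zip(model.parameters(), task_model.parameters()):
            p_meta += meta_lr * (p_task - p_meta)
```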

Meta-learning in TL setting

Many works in the literature have combined TL with LTL. Aiolli [254] proposed an approach to transfer learning based on meta kernel learning. Eshratifar et al. [255] proposed a joint training approach that combines TL and meta-learning loss functions into one learning algorithm. Sun et al. [256] proposed a novel FSL method called meta-transfer learning, which learns to adapt a DNN to FSL tasks; later, the authors introduced the hard-task meta-batch scheme as a learning curriculum that further boosts the learning efficiency of the proposed meta-transfer learning [257]. Li et al. [258] brought forward a novel meta-transfer feature method (MetaTrans) for measuring the transferability among domains. Recent applications of meta-transfer learning include learning to disentangle causal mechanisms [259], zero-shot super-resolution [260], code-switched speech recognition [261], and adaptive vehicle tracking in UAV videos [262].

Meta-learning in MTL setting

LTL has recently emerged as an important direction for developing MTL algorithms. Indeed, meta-learning can be brought in to benefit MTL, notably by learning the relatedness between tasks or how to prioritize among multiple tasks. In this vein, Lin et al. [263] proposed an adaptive auxiliary task weighting to speed up training for reinforcement learning. Franceschi et al. [264] proposed forward and reverse gradient-based hyperparameter optimization for learning task interactions. Epstein et al. [265] proposed a meta-learning framework for extracting shared features among multiple tasks that are learned simultaneously. Chen et al. [266] used a shared meta-network to capture the meta-knowledge of semantic composition and generate the parameters of the task-specific semantic composition models in an MTL setting. Amit et al. [267] proposed a PAC-Bayes meta-learning method designed for multi-task learning.

Meta-learning in LL setting

LL can also be realized through LTL. Riemer et al. [268] proposed a framework called Meta-Experience Replay (MER) that integrates meta-learning and experience replay for continual learning. Javed et al. [269] proposed OML, a meta-learning objective that directly minimizes catastrophic interference by learning representations that accelerate future learning and are robust to forgetting under online updates in continual learning. He et al. [270] proposed a task-agnostic continual learning framework based on meta-learning, implemented by differentiating task-specific parameters from task-agnostic parameters, where the latter are optimized in a continual meta-learning fashion without access to multiple tasks at the same time. Munkhdalai et al. [271] introduced a meta-learning model called MetaNet that supports meta-level LL by allowing an ANN to learn and generalize a new task or concept from a single example on the fly. Vuiro et al. [272] proposed a meta-training scheme that optimizes an algorithm for mitigating catastrophic forgetting. Xu et al. [273] described an LTL method to improve word embeddings for a lifelong domain without a large corpus.

In this section, by knowledge sharing we referred to all types of learning based on prior experience with other tasks. When, how, and what knowledge is shared determines the class of methods (Table 3 summarizes the reviewed classes of methods). Nevertheless, throughout the literature we noted a number of terminology inconsistencies: phrases such as "transfer learning" and "multi-task learning", or "few-shot learning" and "meta-learning", are sometimes used interchangeably. This is often a source of confusion, as the studied concepts are closely related and the boundaries between them are not always clear. Certainly, the reviewed approaches are similar in their common goal of knowledge reuse; however, they differ in their specific ways of handling knowledge transfer (Fig. 8 highlights the transfer mechanism of each approach). TL improves the learning of a target task through the transfer of knowledge from a related source task that has already been learned. MTL considers how to learn multiple tasks in parallel, at the same time, and exploits their intrinsic relationships so that they help each other to be learned better. LL is sequential learning that continually learns over time by accommodating new knowledge while retaining previously learned experiences. Meta-learning transfers meta-knowledge across tasks; it can thus be considered a meta-solution for transferring knowledge in TL, MTL, and LL. FSL is a problem, not a solution: it studies learning tasks with few experiences, and the reviewed knowledge sharing solutions, particularly meta-learning approaches, can be used to solve it. Among the five concepts, TL is probably the broadest, as all reviewed approaches involve, at some level, transfer-related operations. However, it is important to note that TL is unidirectional: its goal is to improve the learning of only the target task, while the learning of the source task(s) is irrelevant and not considered. Similarly, LL (in its vanilla version) only transfers knowledge forward to help future learning and does not go back to improve the models of previous tasks, whereas in MTL all tasks and data are provided together, allowing the model to be trained on, and to improve, all tasks at the same time, but at a potentially high computational cost. Recently, backward or reverse knowledge transfer has been increasingly studied in the context of LL [274]. Furthermore, TL and MTL typically need only a few similar tasks and do not require the retention of explicit knowledge. LL, on the other hand, needs significantly more previous tasks in order to learn and to accumulate a large amount of explicit knowledge, so that the new learning task can select the suitable knowledge to help the new learning. Hence, the growth of the number of tasks and knowledge retention are key characteristics of LL, which is why many optimization efforts regarding these two aspects have been observed in the presented literature. On another note, meta-learning trains a meta-model from a large number of tasks to quickly adapt to a new task with only a few examples. It can be useful for better knowledge retention through metric learning, for measuring relatedness between tasks, or for selecting the useful knowledge to be transferred. However, one key assumption made by most meta-learning techniques is that the training tasks and the test/new tasks come from the same distribution, an assumption the other approaches do not make. This is a major weakness that can limit the scope of LTL application and that has to be seriously addressed in future LTL research.

Despite the underlined differences, knowledge sharing approaches are clearly closely related: they share many challenging issues that are expected to preoccupy the future literature in this field, as well as key characteristics that allow them to work collaboratively and synergistically. For example, if we continuously apply TL in a learning system, we can obtain a lifelong machine learning system; conversely, we can view TL as an LL system in the particular case where the number of tasks is two. On the other hand, LL could also be considered as online MTL, where we deal with multiple tasks and data points arrive in sequential order. Another special case of LL worth mentioning at this level is curriculum learning [275]. Similarly to MTL, in this case all tasks and data are made available, but the problem is to identify the optimal order in which to train on the data for the most efficient and effective learning. An intuitive type of curriculum is to learn tasks from "easy" to "hard" (similar to the way humans often learn new concepts). Another common characteristic is the regularization effect: knowledge sharing approaches, especially those dealing with multiple tasks, benefit from the regularization induced by parameter sharing and from the diversity of the resulting shared representation. They also, in some sense, implicitly augment data (e.g., domain adaptation).

On the other end of the spectrum, knowledge sharing approaches also share the same concerns. Notably, the effectiveness of all reviewed approaches depends on task relatedness, and defining task similarity is a key overarching challenge. As mentioned before, considerably less attention has been given to rigorous evaluation for comparing methods of the same approach or approaches of different natures. Also, dealing with knowledge implies answering some important questions, such as what forms of knowledge are important, how to represent them, and what kinds of reasoning capabilities are useful, since reasoning allows the system to infer new knowledge from existing knowledge, which can be used when learning the new task. So far, however, little research has been done to address these questions in the knowledge sharing literature. Hence, we believe that research in knowledgeable systems needs more engagement and wider attention from academic researchers; more effort is expected in order to bring this field to maturity and make it able to compete with classical learning paradigms.

Figure 8: Knowledge sharing methods

Hybrid learners

Data hungriness is mainly associated with DNNs when they are used in a supervised fashion; these models represent a branch of learning called connectionism. Another potential strategy to cure data hungriness would then be to go outside the box and look for other branches of learning that are more data-efficient. In his recent book, Domingos [276] drew borders between five schools of thought in ML, namely symbolists, connectionists, evolutionaries, bayesians, and analogizers. Driven by the same goal of building learning machines, each type of learner makes different assumptions about data. Evolutionaries take root in evolutionary biology; they use genetic algorithms to deal with structure discovery problems. Being essentially search and optimization algorithms, learners of this family require relatively little data; they are mainly used to optimize other hungry learners [277, 278], but they are known to be costly. Bayesians find their origins in statistics; they use probabilistic inference to cope with uncertainty. Algorithms of this family are mostly supervised and accordingly require a large amount of data. Similarly, analogizers also need data about the solution of a known situation to transfer it to a new situation, mainly using kernel machines; recommender systems are the most famous application of analogy-based learning. Generally, all three families obey the rule of "more data, better learning". However, connectionists, represented by ANNs, are without a doubt the most data-driven tribe: inspired by neuroscience, this branch produces learning algorithms that find the connection weights enabling a neural network to accomplish some intelligent task. Connectionism is generally associated with an empiricist position that considers all of mind to be the result of learning and experience during life. According to connectionists, experiences/data are the only sources of learning: the more data we have, the more we can learn [276]. On the other end of the spectrum, symbolists are arguably the most data-efficient tribe. Symbolists view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic. They presume that the world can be understood in terms of structured representations and assume that intelligence can be achieved by the manipulation of symbols, through rules and logic operating on those symbols to encode knowledge [279]. "Symbolic" AI is considered the classic AI and is sometimes referred to as GOFAI (Good Old-Fashioned AI). It was largely developed in an era with vastly less data and computational power than we have now. Symbolic AI bases its conclusions and decisions on memorized facts and rules rather than on massive raw data. However, it suffers from several drawbacks regarding generalization and adaptation to change which, interestingly, are the strengths of connectionist models. The right move would then be to integrate connectionist models, which excel at perceptual classification, with symbolic systems, which excel at inference and abstraction. This movement is known in the literature as Neural-Symbolic Computing (NSC) [280].

NSC aims at integrating robust connectionist learning and sound symbolic reasoning. The idea is to build a strong hybrid AI model that combines the reasoning power of rule-based software with the learning capabilities of neural networks. In a typical neural-symbolic system, knowledge is represented in symbolic form, whereas learning and reasoning are computed by a neural network. Hence, the symbolic component takes advantage of the neural network's ability to process and analyze unstructured data; meanwhile, the neural network benefits from the reasoning power of the rule-based AI system, which enables it to learn new things with much less data. It is believed that this fusion would help build a new class of hybrid AI systems, with a non-zero-sum game conception, that are much more powerful than the sum of their parts [280]. It is also claimed that this way of perceiving intelligence is much more analogous to the brain, which uses mechanisms operating in both fashions [4]. In that sense, NSC is expected to bring scientists closer to achieving truly human-like artificial intelligence.

The integration of the symbolic and connectionist paradigms has been pursued by a relatively small research community over the last two decades. Recently, with the strong penetration of DNNs and the rise of complaints regarding the explainability and data hungriness of these models, NSC has yielded several significant results that offer powerful alternatives to opaque, data-hungry DNNs. Yi et al. [281] proposed NS-VQA, a neural-symbolic visual question answering approach that disentangles reasoning from visual perception and language understanding. The model uses DNNs for inverse graphics and inverse language modeling, and a symbolic program executor to reason and answer questions. According to the authors, incorporating symbolic structure as prior knowledge offers three advantages: (i) robustness, (ii) interpretability, and (iii) data efficiency. They verified that the system performs well after learning from only a small amount of training data. In the same vein, Vedantam et al. [282] also demonstrated that their neural-symbolic VQA model performs effectively in a low-data regime. Evans et al. [283] proposed a differentiable inductive logic framework, a reimplementation of traditional Inductive Logic Programming (ILP) in an end-to-end differentiable architecture. The framework attempts to combine the advantages of ILP with those of neural network-based systems: a data-efficient induction system that is robust to noisy and ambiguous data, and that does not deteriorate when applied to small data.
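
To make the neural-symbolic pattern tangible, the toy sketch below mimics the perception/reasoning split described above: a hypothetical (here hard-coded) neural perception module would output a symbolic scene description, and a small rule-based executor answers queries by reasoning over those symbols. Everything in it is schematic and illustrative, not the cited systems' actual interfaces.

```python
# Toy neural-symbolic pattern: sub-symbolic perception -> symbols -> rule-based reasoning.
def neural_perception(image):
    # Placeholder for a trained detector/classifier; it would return a symbolic
    # scene description such as a list of objects with attributes.
    return [{"shape": "cube", "color": "red", "size": "large"},
            {"shape": "sphere", "color": "blue", "size": "small"}]

def symbolic_executor(scene, program):
    # A tiny rule-based interpreter: each step filters or aggregates the scene.
    result = scene
    for op, arg in program:
        if op == "filter_color":
            result = [o for o in result if o["color"] == arg]
        elif op == "filter_shape":
            result = [o for o in result if o["shape"] == arg]
        elif op == "count":
            result = len(result)
    return result

scene = neural_perception(image=None)     # perception step (hard-coded here)
answer = symbolic_executor(scene, [("filter_color", "red"), ("count", None)])
print(answer)   # -> 1
```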

Furthermore, the idea of neural-symbolic integration has also tempted the knowledge transfer community. The idea is to extract symbolic knowledge from a related domain and transfer it to improve learning in another domain, starting from a network that does not necessarily have to be instilled with background knowledge [284]. In this vein, Silver [285] discussed the link between NSC and LL and presented an integrated framework for neural-symbolic integration and lifelong machine learning in which the symbolic component helps to retain and/or consolidate existing knowledge. Hu et al. [286] proposed a self-transfer approach with symbolic-knowledge distillation. They developed an iterative distillation method that transfers the structured information of logic rules into the weights of neural networks; the transfer is done via a teacher network constructed using the posterior regularization principle. The proposed framework is applicable to various types of neural architectures, including CNNs for sentiment analysis and RNNs for named entity recognition.

All in all, there is no general standard solution to curing data hungriness; many perceptions exist, but none of them can be asserted to be an absolute solution. Beyond research laboratories, results produced in real-world conditions indicate that existing techniques are yet to be industrialized. More importantly, in the absence of rational metrics to evaluate and compare techniques, we cannot objectively justify the choice of one technique over another. That being said, we believe that research on this issue is just in its infancy. Without a doubt, considering the facts from the industrial and academic worlds, it is a golden time for data-efficient algorithms to rise. However, considering what has been done in the literature so far, improvements are expected from the community working on the issue in order to advance research in this area. In this section, we discuss some research directions and open challenges distilled from the surveyed works; we propose to group them into four themes, namely: (i) Hybridization, (ii) Evaluation, (iii) Automation, and (iv) Humanization.

Hybridization. The last strategy discussed in the review advocates the use of hybrid systems in order to benefit from the strengths of each component and achieve more powerful systems. This perception is an interesting avenue for future research, in the sense that further value-added combinations can be investigated.

In the literature, we have seen how some techniques from the same strategy can be used to complement each other, like generative augmentations and basic transformations in DA, or meta-learning and TL in knowledgeable systems. However, works that study this kind of composition are still limited in both variety and depth. Furthermore, hybridization of techniques from different strategies is restrictively steered, in some way, towards almost one direction: combining DA with TL for DNNs as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks with limited datasets. Much research has been devoted to this vein [287, 288, 289]; the aim is to develop practical software tools for the systematic integration of DA and TL into deep learning workflows, helping engineers utilize the performance power of these techniques much faster and more easily. It is indeed the best we can hope for to empower DNNs and mitigate their limitations. However, we believe it is also healthy to explore the potential of other innovative combinations, similar to neural-symbolic systems, that not only integrate the reviewed techniques but also call upon other domains such as evolutionary approaches, statistical models, and cognitive reasoning. In this sense, multi-disciplinary studies like this paper are needed to build links between backgrounds and domains that are studied separately and to bring closer bodies of research that are moving in different directions. Here, we intuitively and seamlessly bridged the different strategies by considering, for instance, that the FSL problem can be viewed as a semi-supervised learning problem with few available labeled data, whose aim is to transfer the knowledge of learning (e.g., meta-learning) from the source tasks to the target ones. Domain adaptation, which is a particular form of transfer learning, is also a useful data augmentation technique. We believe that making connections and enabling hybridization is a rich, under-explored area for future research that could help to converge towards one unified solution: a general, adaptable system, resistant to data scarcity, that performs well in domains where ample data are available but also in data-scarce domains. It is far from obvious how to combine all the pieces, and to explore others, to conceive such custom systems that work in both settings, but researchers have to shift their attention towards this goal in order to fill the gap in thinking about how to build robust AI.

Evaluation. There are very few studies in the literature that empirically compare the performance of techniques of the same strategy, and even fewer that compare techniques of different strategies.

Semi-supervised and unsupervised methods are often evaluated based on their performance on downstream tasks using datasets such as CIFAR-10, ImageNet, Places, and Pascal VOC. CIFAR-10 and SVHN are popular choices for evaluating the performance of semi-supervised models by training them with all unlabeled data and various amounts of labeled examples. To provide a realistic evaluation, it is important to establish more high-quality baselines that allow a proper assessment of the added value of unlabeled data. Researchers should thus evaluate their algorithms on a diverse suite of datasets with different quantities of unlabeled data and report how performance varies with the amount of unlabeled data. Oliver et al. [119] compared several SSNNs on two image classification problems. They reported substantial performance improvements for most of the algorithms and observed that error rates declined as more unlabeled data points were added. These results are interesting in the sense that they indicate that, in image classification tasks, unlabeled data used by ANNs can drive consistent improvements in performance. Likewise, it would be interesting to explore more empirical evaluations to draw further promising results that will guide research towards better learners based on unlabeled data.

As for DA and knowledgeable systems, more theory and formalisms are needed in order to accurately compare and fairly evaluate techniques from these strategies. Indeed, despite the rapid progress of practical DA techniques, precisely understanding their benefits remains elusive. There is no common theoretical understanding of how training on augmented data affects the learning process, the parameters, and the overall performance. This is exacerbated by the fact that DA is performed in diverse ways in modern ML pipelines, for different tasks and domains, thus precluding a general theoretical framework. Hence, more theoretical insights are needed to characterize and understand the effect of the various data augmentations used in practice, in order to be able to evaluate their benefits. On the other hand, the knowledgeable systems research community still does not have a good understanding of what knowledge is in general, how to represent knowledge, and how to use knowledge effectively in learning. A unified theory of knowledge and the related issues is urgently needed in order to compare knowledgeable systems and to measure how they optimize data requirements.

Certainly, enriching the evaluation baselines of each strategy is an important research avenue to pursue. However, the ultimate goal would be to develop approaches to evaluation at a more abstract level, that is, to evaluate an altered data-hungry system by measuring how the alteration techniques, abstracting from their nature, have reduced the need for data, and by verifying that performance remains resilient to changes in data availability.

Automation. A common research question discussed across the reviewed strategies is automated design. Automatically generating a DA schema for a given dataset, or automatically learning a transfer algorithm for a given domain or task, are instances of the general concept of Automated Machine Learning (AutoML) [ 290 ]. AutoML has recently emerged as the idea of automating the entire pipeline of learner design by using ML to generate better ML. AutoML is advertised as a means to democratize ML by allowing firms with limited data science expertise to easily build production-ready models automatically, which accelerates processes, reduces errors and costs, and provides more accurate results by enabling businesses to select the best-performing algorithm. Practically, AutoML automates some or all steps of a standard ML pipeline, including data preparation, feature engineering, model generation, and model evaluation [ 290 ]. Hence, one of the missions of AutoML is to automatically manage data quality and quantity in the first step of the pipeline. Currently, AutoML services rely only on data searching [ 291 ] and data simulators [ 292 ] to acquire data effectively; we expect, however, that advances in AutoML will deeply transform the way data needs are handled in the ML pipeline.
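As a toy illustration of automating the data-preparation step, the sketch below runs a simple random search over a small augmentation-policy space and keeps the policy with the best (stubbed) validation score. The search space, budget, and `evaluate` stub are assumptions made for illustration and do not correspond to any specific AutoML service cited above.

```python
# Random-search sketch over augmentation policies (illustrative AutoML-style
# automation of the data preparation step; the evaluation is a stub).
import random
from torchvision import transforms

SEARCH_SPACE = {
    "flip_p":  [0.0, 0.25, 0.5],
    "rot_deg": [0, 10, 20, 30],
    "jitter":  [0.0, 0.2, 0.4],
}

def sample_policy(rng):
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def build_transform(policy):
    return transforms.Compose([
        transforms.RandomHorizontalFlip(p=policy["flip_p"]),
        transforms.RandomRotation(policy["rot_deg"]),
        transforms.ColorJitter(brightness=policy["jitter"]),
        transforms.ToTensor(),
    ])

def evaluate(policy):
    # Placeholder: in practice, train a small model with build_transform(policy)
    # and return its validation accuracy.
    return random.random()

rng = random.Random(0)
best = max((sample_policy(rng) for _ in range(20)), key=evaluate)
print("selected augmentation policy:", best)
```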

Furthermore, it is worth highlighting the strong interaction between AutoML and the reviewed techniques. As discussed before, AutoML as a general concept can also be instantiated for DA [ 172 , 173 , 174 , 175 ] and TL [ 197 , 198 ], yielding solutions that can be packaged as end-to-end automatic processes. Conversely, DA, TL, and other techniques are very useful for AutoML tools. In the data preparation step, DA can be regarded as a tool for data collection and as a regularizer to avoid overfitting. In the model generation step, as AutoML has become most popular for the design of deep learning architectures, neural architecture search (NAS) techniques [ 293 ], which search for deep network architectures suited to the learning problem, are commonly used. However, NAS has a high computational cost; to address this, TL can use knowledge from prior tasks to speed up network design. In this vein, Wong et al. [ 294 ] proposed an approach that reduces the computational cost of neural AutoML by using transfer learning, showing a large reduction in convergence time across many datasets. Existing AutoML algorithms focus only on solving a specific task on some fixed datasets, whereas a truly high-quality AutoML system should be capable of lifelong learning. Pasunuru et al. [ 295 ] introduced a continual architecture search (CAS) approach that enables lifelong learning. In addition, since the core idea of AutoML is to learn to learn, it is natural to find a growing body of research that combines meta-learning and AutoML, particularly for improving NAS [ 296 , 297 ]. AutoML has also been studied in few-shot learning scenarios; for instance, Elsken et al. [ 297 ] applied NAS to few-shot learning to overcome data scarcity, searching only for the most promising architecture and optimizing it to work on multiple few-shot learning tasks. Recently, the idea of unsupervised AutoML has begun to be explored: Liu et al. [ 298 ] proposed a general problem setup, namely unsupervised neural architecture search (UnNAS), to explore whether labels are necessary for NAS, and experimentally demonstrated that architectures searched without labels are competitive with those searched with labels.
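The idea of reusing knowledge from prior tasks to cut the cost of architecture search can be sketched in a few lines: candidate architectures from a previous task seed (warm-start) an otherwise random search on the new task. This is only an illustration of the principle behind transfer-accelerated neural AutoML; the search space, the scoring stub, and the hard-coded prior-task architectures are assumptions.

```python
# Warm-started architecture search sketch (illustrative only; scoring is a stub).
import random

SPACE = {"depth": [2, 4, 6, 8], "width": [32, 64, 128], "kernel": [3, 5, 7]}

def random_arch(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def score_on_target_task(arch):
    # Placeholder for training and validating `arch` on the target task.
    return random.random()

# Architectures that performed well on a previous (source) task; in practice
# these would come from an earlier search rather than being hard-coded.
prior_task_top = [
    {"depth": 6, "width": 64, "kernel": 3},
    {"depth": 4, "width": 128, "kernel": 3},
]

rng = random.Random(0)
candidates = prior_task_top + [random_arch(rng) for _ in range(10)]  # warm start
best = max(candidates, key=score_on_target_task)
print("best architecture for the target task:", best)
```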

Humanization. At the root of every intelligent system is the dream of building machines that learn and think like people. Naturally, all attempts to cure the data-hungry behavior of ML models stem from mimicking the mechanisms of human learning. Currently, humans still retain a clear advantage in terms of sample efficiency. Hence, an obvious research path to keep pursuing is to explore more human-inspired theories and human-like techniques.

We contend that the quest for non-data-hungry learning may profit from the rich heritage of problem descriptions, theories, and experimental tools developed by cognitive psychologists. Cognitive psychology promotes a picture of learning that highlights the importance of early inductive biases, including core concepts such as number, space, and objects, as well as powerful learning algorithms that rely on prior knowledge to extract knowledge from small amounts of training data. Studies and insights drawn from cognitive science and psychology can therefore help to examine and understand the mechanisms underlying human learning strengths. After all, FSL was modeled after children's remarkable cognitive ability to generalize a new concept from a small number of examples. According to developmental psychologists, humans' fast learning relies heavily on cognitive biases; Shinohara et al. [ 299 ] suggested symmetric bias and mutually exclusive bias as the two most promising cognitive biases that can be effectively employed in ML tasks. Following this line of thought, many advances might come from exploring other cognitive abilities. An interesting avenue might be the study of commonsense knowledge: how it develops, how it is represented, how it is accumulated, and how it is used in learning. A related direction would be to explore intuitive theories of the physical and social domains. Children at an early age have primitive knowledge of physics and social rules, whether learned or innate; it is an intriguing area of research to investigate the prospects for embedding or acquiring this kind of intuitive knowledge in machines, and to study how it could help capture more human-like learning-to-learn dynamics that enable much stronger transfer to new tasks and problems, and thus accelerate the learning of new tasks from very limited amounts of experience and data.

Conclusions

This paper provided a comprehensive survey of current progress on data efficiency in ML, a promising area of AI that has been attracting prominent research attention in recent years. Understanding the data-efficiency issue from different perspectives helped to categorize typical methods along four lines of research, according to how they address the issue, namely by using non-supervised algorithms, data augmentation, shared knowledge, or hybrid systems. In each category, advances and challenges were thoroughly discussed and summaries and insights were presented.

The key findings motivate the need for more value-adding synergy between existing data-efficient methods in order to build more robust systems. Automated ML design was also identified as an important avenue for optimizing the way AI uses massive data in the ML pipeline. Furthermore, the results suggest the necessity of drawing more insight from cognitive science and behavioral studies to achieve data-efficient, human-like learning.

Availability of data and materials

Not applicable.

Silver D, Huang A, Maddison C, Guez A, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529(7587):484.

He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 770–778.

Adiwardana D, Luong M, David R, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977 (2020).

Marcus G. Deep learning: a critical appraisal. arXiv preprint arXiv:1801.00631 , 2018.

Ford M. Architects of Intelligence: the Truth About AI From the People Building It. Kindle. Birmingham: Packt Publishing; 2018.

Shu J, Xu Z, Meng D. Small sample learning in big data era. arXiv preprint arXiv:1808.04572 , 2018.

Wang Y, Yao Q, Kwok JT, Ni LM. Generalizing from a few examples: A survey on few-shot learning. ACM Comput Surv. 2020;53(3):1–34.

Qi G, Luo J. Small data challenges in big data era: A survey of recent progress on unsupervised and semi-supervised methods. arXiv preprint arXiv:1903.11260 . 2019.

Kitchin R. The data revolution: big data, open data, data infrastructures and their consequences. Thousand Oaks: SAGE Publications Ltd; 2014.

Drucker J. Humanities approaches to graphical display. Digital Humanities Quarterly. http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html . 2011.

Thomas W. Algorithms: From Al-Khwarizmi to Turing and Beyond. In Turing’s Revolution. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-22156-4_2 . 2015.

Knuth DE. Ancient Babylonian algorithms. Commun ACM. 1972;15(7):671–7.

Chabert J. A History of Algorithms: From the Pebble to the Microchip. Berlin: Springer; 1999.

Paz EB, Ceccarelli M, Otero JE, Sanz JLM. Machinery during the industrial revolution. Dordrecht: Springer; 2009.

Asperti A, Ricciotti W. Formalizing Turing Machines. Logic, Language, Information and Computation. WoLLIC 2012. Lecture Notes in Computer Science, Vol. 7456. Springer, Berlin. 2012.

Navathe SB. Evolution of data modeling for databases. Commun ACM. 1992;35(9):112–23.

Mitchell JC. Concepts in programming languages. Cambridge: Cambridge University Press; 2002.

Waldrop MM. The chips are down for Moore’s law. Nature. 2016;530:7589. p. 144–7.

Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137–44.

Batra S. Big data analytics and its reflections on DIKW hierarchy. Rev Manag. 2014;4(1):5–17.

Turing AM. Computing machinery and intelligence. Mind. 1950;59(236):433–60.

Lighthill J. Artificial intelligence: A general survey. Artificial intelligence: A Paper Symposium. Science Research Council. 1973.

Krizhevsky A, Sutskever I, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NIPS 2012). 2012.p. 1097–1105.

Landauer R. The physical nature of information. Phys Lett A. 1996;217:188–93.

Glattfelder JB. A Universe Built of Information. The Frontiers Collection. Springer. 2019. p. 473–514.

Davies P. Universe from Bit. In Information and the Nature of Reality: From Physics to Metaphysics. Cambridge: Cambridge University Press; 2014. p. 83–117.

Wheeler J, Information, Physics, Quantum: The Search for Links. In Proceedings III International Symposium on Foundations of Quantum Mechanics. 1990. p. 354–358.

Lloyd S. The computational universe. In Information and the Nature of Reality: from Physics to Metaphysics. Cambridge: Cambridge University Press; 2014. p. 118–33.

Cohen S. M. Aristotle’s metaphysics. Stanford Encyclopedia of Philosophy. 2000.

Tang H, Huang W. Brain Inspired Cognitive System for Learning and Memory. In Neural Information Processing (ICONIP 2011). Lecture Notes in Computer Science, vol. 7062. Springer; 2011. p. 477–484.

Kurzweil R. How to Create a Mind: The Secret of Human Thought. Viking. ISBN 978-067002529-9. 2012.

Wang Y, Lu J, Gavrilova M, Fiorini R, Kacprzyk J. 2018. Brain-Inspired Systems (BIS): Cognitive Foundations and Applications. IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2018. p. 995–1000.

Chen J, Chen J. Zhang R, Hu X. Towards Brain-inspired System: Deep Recurrent Reinforcement Learning for Simulated Self-driving Agent. arXiv preprint arXiv:1903.12517 . 2019.

Molina D, Poyatos J, Del Ser J, Garc S, Hussain A, Herrera F. Comprehensive Taxonomies of Nature- and Bio-inspired Optimization: Inspiration Versus Algorithmic Behavior, Critical Analysis Recommendations. Cognitive Computation. 2020. p. 1–43.

Del Ser J, Osaba E, et al. Bio-inspired computation: Where we stand and what’s next. Swarm Evolutionary Computation. 2019;48:220–50.

Zang H, Zhang S, Hapeshi K. A Review of Nature-Inspired Algorithms. J Bionic Eng. 2010;7:232–7.

Sorensen K. Metaheuristics - the Metaphor Exposed.International Transactions in Operational Research. 2013;22:3.p. 3–18.

Mitchell TM. Machine Learning. McGraw-Hill. ISBN 978-007115467-3. 1997.

Kearns MJ, Vazirani U. An introduction to computational learning theory. MIT Press. ISBN 978-026211193-5. 1994.

Valiant LG. A theory of the learnable. Commun ACM. 1984;27(11):1134–42.

Gibbons C, Richards S, Valderas JM, Campbell J. Supervised Machine Learning Algorithms Can Classify Open-Text Feedback of Doctor Performance With Human-Level Accuracy. J Med Internet Res. 2017;19:3. e65.

Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348 (2016).

Mukherjee S, Tamayo P, Rogers S. Estimating Dataset Size Requirements for Classifying DNA Microarray Data. J Comput Biol. 2003;10:2. p. 119–142.

Forman G, Cohen I. Learning from little: Comparison of classifiers given little training. Knowledge Discovery in Databases PKDD 2004. Lecture Notes in Computer Science Springer. 2004; 3202. p. 161–172.

Ciuriak D. The Economics of Data: Implications for the Data-Driven Economy. Chapter 2 in “Data Governance in the Digital Age”. Centre for International Governance Innovation. 2018.

Research and Markets. Big Data Market by Component, Deployment Mode, Organization Size, Business Function (Operations, Finance, and Marketing and Sales), Industry Vertical (BFSI, Manufacturing, and Healthcare and Life Sciences), and Region - Global Forecast to 2025. https://www.marketsandmarkets.com/Market-Reports/big-data-market-1068.html.2020 .

NewVantage Partners. Data-Driven Business Transformation Connecting Data/AI Investment to Business Outcomes. http://newvantage.com/wp-content/uploads/2020/01/NewVantage-Partners-Big-Data-and-AI-Executive-Survey-2020-1.pdf . 2020.

Niebel T, Rasel F, Viete S. BIG data – BIG gains? Understanding the link between big data analytics and innovation. Econ Innov New Technol. 2019;28(3):296–316.

Xie K, Wu Y, Xiao J, Hu Q. Value co-creation between firms and customers: The role of big data-based cooperative assets. Inf Manag. 2016;53(8):1038–48.

Brynjolfsson E, McElheran K. The Rapid Adoption of Data-Driven Decision-Making. American Economic Review. 2016;106(9):39–133.

Brynjolfsson E, Hitt LM, Kim HH. Strength in numbers: how does data-driven decision-making affect firm performance. MIT Sloan Working Paper, Cambridge. Available at SSRN: https://ssrn.com/abstract=1819486 .

Andrew M, Brynjolfsson E. Big data: the management revolution. Harvard Bus Rev. 2012;90(10):60–8.

Domingos P. A few useful things to know about machine learning. Commun ACM. 2012;55(10):77–87.

Dimensional Research. Artificial Intelligence and Machine Learning Projects Are Obstructed by Data Issues. https://cdn2.hubspot.net/hubfs/3971219/Survey%20Assets%201905/Dimensional%20Research%20Machine%20Learning%20PPT%20Report%20FINAL.pdf . 2019.

Zhou ZH. A brief introduction to weakly supervised learning. Natl Sci Rev. 2018;5:1.

Willemink MJ, Koszek WA, Hardell C, et al. Preparing medical imaging data for machine learning. Radiology. 2020;295(1):4–15.

Grünberg K, Jakab A, Langs G, et al. Ethical and Privacy Aspects of Using Medical Image Data. In Cloud-Based Benchmarking of Medical Image Analysis. Springer.2017. p. 33–43.

Zarsky T. Incompatible. The GDPR in the Age of Big Data. Seton Hall Law Review. 2017;47:4.

Mouret JB. Micro-Data Learning: The Other End of the Spectrum. arXiv preprint arXiv:1610.00946 (2016).

Ruparel NH, Shahane NM, Bhamare DP Learning from Small Data Set to Build Classification Model: A Survey. Proceedings on International Conference on Recent Trends in Engineering and Technology. 2013.

Halevy A, Norvig P, Pereira F. The Unreasonable Effectiveness of Data. IEEE Intell Syst. 2009;24(2):8–12.

Banko M, Brill E. Scaling to very very large corpora for natural language disambiguation. Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. 2001. p. 26–33.

Zhu X, Vondrick C, Ramanan D, Fowlkes C. Do We Need More Training Data or Better Models for Object Detection? In the British Machine Vision Conference, BMVC 2016.

Greco C, Polonioli A, Tagliabue J. Less (Data) Is More: Why Small Data Holds the Key to the Future of Artificial Intelligence. arXiv preprint arXiv:1907.10424 (2019).

Liebert W, Schmidt JC. Collingridge’s dilemma and technoscience. Poiesis Prax. 2010;7:55–71.

Mastorakis G. Human-like machine learning: limitations and suggestions. arXiv preprint arXiv:1811.06052 (2018).

Wolff JG. The SP Theory of Intelligence: Distinctive Features. IEEE Access. 2015; 4. p. 216–246.

Zador AM. A critique of pure learning and what artificial neural networks can learn from animal brains. Nat Commun. 2019;10(3770):1–7.

Marcus G. Innateness, AlphaZero, and Artificial Intelligence. arXiv preprint arXiv:1801.05667 . 2018.

Marcus G. The next decade in AI: four steps towards robust artificial intelligence. arXiv arXiv:2002.06177 . 2020.

van Engelen JE, Hoos HH. A survey on semi-supervised learning. Mach Learn. 2020;109(2):373–440.

Qin Y, Ding S, Wang L, Wang Y. Cognitive Computation. 2019;11(5):599–612.

Kostopoulos G, Karlos S, Kotsiantis S, Ragos O. Semi-supervised regression: A recent review. J Intell Fuzzy Syst. 2018;35:2. p. 1483–1500.

Kim K. An improved semi-supervised dimensionality reduction using feature weighting: Application to sentiment analysis. Expert Systems with Applications. 2018;109:49–65.

Triguero I, Garcia S, Herrera F. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge Information systems. 2015;42:2. p. 245–284.

Tanha J, van Someren M, Afsarmanesh H. Semi-supervised self-training for decision tree classifiers. Int J Mach Learn Cybern. 2017;8:1. p. 355–370.

Livieris IE, Kanavos A, Tampakas V, Pintelas P. An auto-adjustable semi-supervised self-training algorithm. Algorithm. 2018;11:9.

Li M, Zhou ZH Self-training with editing. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2005. 611–621.

Zhu X, Goldberg A. Introduction to semi-supervised learning. Synth Lect Artif Intell Mach Learn. 2009;3(1):1–30.

Xu C, Tao D, Xu C. A survey on multi-view learning. arXiv preprint arXiv:1304.5634 (2013).

Didaci L, Fumera G, Roli F. Analysis of co-training algorithm with very small training sets. In Structural, Syntactic, and Statistical Pattern Recognition, vol. 7626 of Lecture Notes in Computer Science. 2012. P.719–726.

Jiang Z, Zhang S, Zeng J. A hybrid generative/discriminative method for semi-supervised classification. Knowl Based Syst. 2013;37:137–45.

Qiao S, Shen W, Zhang Z, Wang B, Yuille A. Deep co-training for semi-supervised image recognition. In Computer Vision – ECCV 2018. Lecture Notes in Computer Science. 2018; 11219.

Chonga Y, Dinga Y, Yanb Q, Pana S. Graph-based semi-supervised learning: A review. Neurocomputing. 2020;408:216–30.

Pang B, Lee L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In the 42nd annual meeting on association for computational linguistics, association for computational linguistics. 2004.

Getz G, Shental N, Domany E. Semi-supervised learning—A statistical physics approach. In Proceedings of the 22nd ICML workshop on learning with partially classified training data. 2005.

Wu X, Li Z, So AM, Wright J, Chang S. Learning with partially absorbing randomwalks. In Advances in neural information processing systems. 2012. p. 3077–3085.

Blum A, Chawla S. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th international conference on machine learning. 2001.p. 19–26.

Dornaika F, Dahbi R, Bosaghzadeh A, Ruichek Y. Efficient dynamic graph construction for inductive semi-supervised learning. Neural Netw. 2017;94:192–203.

Liu W, Wang J, Chang SF Robust and scalable graph-based semi-supervised learning. Proceedings of the IEEE, 2012; 100:9, p. 2624–2638.

Bennett KP, Demiriz A. Semi-supervised support vector machines. In: Advances in neural information processing systems. 1999; 11.p. 368–374.

Joachims T. Transductive inference for text classification using support vector machines. In: Proceedings of the sixteenth international conference. 1999; 99, p. 200–209.

Melacci S, Belkin M. Laplacian support vector machines trained in the primal. J Mach Learn Res. 2011;12:1149–84.

Li Y, Kwok JT, Zhou Z. Semi-supervised learning using label mean. In the 26th international conference on machine learning (ICML 2009). 2009.p. 633–640.

Li T, Wang XL. Semi-supervised SVM classification method based on cluster kernel. Appl Res Comput. 2013;30:1.p. 42–45.

Ding S, Zhu Z, Zhang X. An overview on semi-supervised support vector machine. Neural Comput Appl. 2015;28:5. p. 969–978.

Ouali Y, Hudelot C, Tami M. An Overview of Deep Semi-Supervised Learning. arXiv preprint arXiv:2006.05278 (2020).

Rasmus A, Berglund M, Honkala M, Valpola H, Raiko T. Semi-supervised learning with ladder networks. In Advances in neural information processing systems. 2015.p. 3546–3554.

Prémont-Schwarz I, Ilin A, Ha TH, Rasmus A, Boney R, Valpola H. Recurrent ladder networks. In Advances in neural information processing systems. 2017.p. 6009–6019.

Miyato T, Maeda S, Ishii S, Koyama M. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2018;41:8. p. 1979–1993.

Laine S, Aila T. Temporal ensembling for semi-supervised learning. In 5th International Conference on Learning Representations (ICLR 2017). 2017.

Tarvainen A, Valpola H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NIPS 2017). 2017.

Hinton GE, Krizhevsky A, Wang SD Transforming autoencoders. in International Conference on Artificial Neural Networks. Springer. 2011.p. 44–51.

Cohen T. Welling M. Group equivariant convolutional networks. In International conference on machine learning. 2016. p. 2990–2999.

Zhang L, Qi JG, Wang L, Luo J. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. arXiv preprint arXiv:1901.04596(2019).

Qi GJ, Zhang L, Chen CW, Tian Q. AVT: Unsupervised learning of transformation equivariant representations by autoencoding variational transformations. arXiv preprint arXiv: 1903.10863(2019).

Jing L, Tian Y. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162 (2019).

Oord Avd, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016).

Oord Avd, Kalchbrenner N, Espeholt L, Vinyals O, Graves A, et al. Conditional image generation with pixelCNN Decoders. in Advances in Neural Information Processing Systems (NIPS 2016).2016. p. 4790–4798.

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A. N, Kaiser Ł, Polosukhin I. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017). 2017.p. 5998–6008.

Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge: MIT Press; 2018.

Arulkumaran K, Deisenroth MP, Brundage M, Bharath A. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866 (2017).

Rummery GA, Niranjan M. On-line Q-learning using Connectionist Systems. Cambridge: University of Cambridge, Department of Engineering; 1994.

Watkins CJCH, Dayan P. Q-Learning. Machine Learning. 1992;8(3):279–92.

Henderson P, Islam R, Bachman P, Pineau J, et al. 2018. Deep reinforcement learning that matters. In the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).

Li Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274 (2017).

Botvinick M, Ritter S, Wang JX, Kurth-Nelson Z. Reinforcement learning, fast and slow. Trends Cogn Sci. 2017;23:5.

Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020).

Oliver A, Odena A, Raffel C, Cubuk ED, Goodfellow I. J. Realistic evaluation of deep semi-supervised learning algorithms. arXiv preprint arXiv:1804.09170 (2018).

Palacio-Nino JO, Berzal F. Evaluation metrics for unsupervised learning algorithms. arXiv preprint arXiv:1905.05667 (2019).

Thornton C, Hutter F, Hoos HH, Leyton-Brown K. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In the 19th ACM international conference on knowledge discovery and data mining (KDD ‘13). 2013.p. 847–855.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6:48–60.

Krizhevsky A, Sutskever I, Hinton G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012).

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998; 86:11. p. 2278–2324.

Ciresan DC, Meier U, Gambardella LM, Schmidhuber J. Deep big simple neural nets excel on digit recognition. Neural Comput. 2010;22(12):3207–20.

Yaeger LS, Lyon RF, Webb BJ. Effective training of a neural network character classifier for word recognition. In Advances in Neural Information Processing Systems (NIPS 1997). 1997.p. 807–816.

Inoue H. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929 (2018).

Zhong Z, Zheng L, Kang G, Li S, Yang Y. Random Erasing Data Augmentation. arXiv preprint arXiv:1708.04896 (2017).

Mikołajczyk A, Grochowski M. Data augmentation for improving deep learning in image classification problem. International Interdisciplinary PhD Workshop (IIPhDW 2018). 2018. p. 117–122.

Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR 2015).

Odena A, Olah C, Shlens J. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585 (2016).

Antoniou A, Storkey A, Edwards H. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340 (2017).

Mariani G, Scheidegger F, Istrate R, Bekas C, Malossi C. BAGAN: Data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655 (2018).

Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).

Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. In the 4th International Conference on Learning Representations (ICLR 2016).

Isola P, Zhu J-Y, Zhou T, Efros AA Image-to-image translation with conditional adversarial networks. In the IEEE conference on computer vision and pattern recognition. 2017. p. 1125–1134.

Zhu J-Y, Park T, Isola P, Efros A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In the IEEE International Conference on Computer Vision. 2017.p. 2223–2232.

Kim T, Cha M, Kim H, Lee JK, Kim J. Learning to discover cross-domain relations with generative adversarial networks. In the 34th International Conference on Machine Learning (JMLR 2017). 2017.p. 1857–1865.

Yi Z, Zhang H, Tan P, Gong M. DUALGAN: Unsupervised dual learning for image-to-image translation. In the IEEE International Conference on Computer Vision. 2017. p. 2849–2857.

Liu MY,Tuzel O.Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS2016). 2016.

Kingma D. P,Welling M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2014).

Cai L, Hongyang G, Ji S. Multi-stage variational auto-encoders for coarse-to-fine image generation. In the 2019 SIAM International Conference on Data Mining. 2019.

Leglaive S, Girin L, Horaud R. A variance modeling framework based on variational autoencoders for speech enhancement. In IEEE International Workshop on Machine Learning for Signal Process. 2018.

Esling P, Chemla-Romeu-Santos A, Bitton A. Generative timbre spaces with variational audio synthesis. In the Int. Conf. on Digital Audio Effects. 2018.

Salimans T, Goodfellow I, et al. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498 (2016).

Wong SC, Gatt A, Stamatescu V, McDonnell M. D. Understanding data augmentation for classification: when to warp? In International Conference on Digital Image Computing: Techniques and Applications (DICTA). 2016.

DeVries T, Taylor G. W. Dataset augmentation in feature space. In the international conference on machine learning (ICML 2017).

Chawla NV, Bowyer KW, Hall L, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intellig Res. 2002;16:321–57.

He H, Bai Y, Garcia EA, Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. IEEE International Joint Conference on Neural Networks. 2008.p. 1322–1328.

Kumar V, Glaude H, de Lichy C, Campbell W. A Closer Look At Feature Space Data Augmentation For Few-Shot Intent Classification. arXiv preprint arXiv:1910.04176 (2019).

Quinonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning. Cambridge: MIT Press; 2009. p. 131–60.

Saenko K, Kulis B, Fritz M, Darrell T. Adapting visual category models to new domains. In the european conference on Computer Vision (ECCV2010).

Csurka G. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374 (2017).

Gopalan R, Li R, Chellappa R. Domain adaptation for object recognition: An unsupervised approach. In International Conference on Computer Vision (ICCV 2011). 2011.p. 999–1006.

Ganin Y, Lempitsky V. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014).

Ghifary M, Kleijn WB, Zhang M, Balduzzi D, Li W. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision. 2016. p. 597–613.

Bousmalis K, Silberman N, Dohan D, Erhan D, Krishnan D. Unsupervised pixel-level domain adaptation with generative adversarial networks. In the IEEE conference on computer vision and pattern recognition. 2017.p. 3722–3731.

Sun Y, Tzeng E, Darrell T, Efros AA. Unsupervised Domain Adaptation through Self-Supervision. arXiv preprint arXiv:1909.11825 (2019).

Gatys LA, Ecker AS, Bethge M. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 2016.p. 2414–2423.

Hao W, Zhang Z, Guan H. CMCGAN: A uniform framework for cross-modal visual-audio mutual generation. In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X. 2018. ATTNGAN: Fine-grained text to image generation with attentional generative adversarial networks. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018).

Gibiansky A, Arik S, Diamos G, et al. Deep voice 2: Multi-speaker neural text-to-speech. In Conference on Neural Information Processing Systems (NIPS 2017). 2017.p. 2966–2974.

Wang Y, Wu C, Herranz L, et al. Transferring GANs: generating images from limited data. In the European Conference on Computer Vision (ECCV 2018). 2018.p. 220–236.

Yamaguchi S, Kanai S, Eda T. Effective Data Augmentation with Multi-Domain Learning GANs. arXiv preprint arXiv:1912.11597 (2019).

Huang S, Lin A, Chen SP, et al. Aug-GAN: Cross domain adaptation with GAN-based data. In the European Conference on Computer Vision (ECCV 2018). 2018.p. 731–744.

Raille G, Djambazovska S, Musat C. Fast Cross-domain Data Augmentation through Neural Sentence Editing. arXiv preprint arXiv: 2003.10254 (2020).

Xie Q, Dai Z, Hovy E, Luong M, Le Q. V. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848 (2019).

Lin L, Wang K, Meng D, Zuo W, Zhang L. Active self-paced learning for cost-effective and progressive face identification. IEEE Trans Pattern Anal Mach Intell. 2018;40(1):7–19.

Ratner A, Bach SH, Ehrenberg H, et al. Snorkel: Rapid training data creation with weak supervision. VLDB J. 2017;11(3):709–30.

Shijie J, Ping W, Peiyi J, Siping H. Research on data augmentation for image classification based on convolution neural networks. In 2017 Chinese automation congress (CAC). 2017.p. 4165–70.

Wang C, Macnaught G, Papanastasiou G, et al. Unsupervised Learning for Cross-Domain Medical Image Synthesis Using Deformation Invariant Cycle Consistency Networks. In international Workshop on Simulation and Synthesis in Medical Imaging (SASHIMI 2018). 2018.p. 52–60.

Ratner AJ, Ehrenberg HR, Hussain Z, et al. Learning to Compose Domain-Specific Transformations for Data Augmentation. arXiv preprint arXiv:1709.01643 (2017).

Cubuk ED, Zoph B, Mane D, et al. AutoAugment: Learning Augmentation Policies from Data. arXiv preprint arXiv:1805.09501 (2019).

Cubuk ED, Zoph B, Shlens J, Le QV. RandAugment: Practical automated data augmentation with a reduced search space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2020.

Zhang X, Wang Q, Zhang J, Zhong Z. Adversarial AutoAugment. arXiv preprint arXiv:1912.11188 (2019).

Eaton-Rosen Z, Bragman F, Ourselin S, Cardoso M. J. Improving data augmentation for medical image segmentation. In International Conference on Medical Imaging with Deep Learning. 2018.

Frid-Adar M, Diamant I, Klang E, et al. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing. 2018;321:321–31.

Armanious K, Jiang C, Fischer M. MedGAN: Medical image translation using GANs. Comput Med Imaging Graph. 2020;79:101684.

Schluter J, Grill T. Exploring data augmentation for improved singing voice detection with neural networks. In International Society for Music Information Retrieval Conference (ISMIR). 2015.

Wei JW, Zou K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019).

Wulfmeier M, Bewley A, Posner I. Addressing Appearance Change in Outdoor Robotics with Adversarial Domain Adaptation. In IEEE International Conference on Intelligent Robots and Systems. 2017.

Pan SJ, Yang Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering. 2010;22(10):1345–1359.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big Data. 2016;3(1):1–40.

Rosenstein M, Marx Z, Kaelbling L. To transfer or not to transfer. In NIPS’05 Workshop, Inductive Transfer: 10 Years Later. 2005.

Liu B, Xiao Y, Hao Z. A Selective Multiple Instance Transfer Learning Method for Text Categorization Problems. Knowl-Based Syst. 2018;141:178–87.

Chen YS, Hsu CS, Lo CL. An Entire-and-Partial Feature Transfer Learning Approach for Detecting the Frequency of Pest Occurrence. IEEE Access. 2020; 8.p. 92490–92502.

Furfaro R, Linares R, Reddy V. Space objects classification via light-curve measurements: deep convolutional neural networks and model-based transfer learning. Advanced Maui Optical and Space Surveillance Technologies Conference (AMOS). 2018.

Yang Z, Zhao J, Dhingra B, et al. Glomo: Unsupervisedly learned relational graphs as transferable representations. arXiv preprint arXiv:1806.05662(2018).

Yang Q, Zhang Y, Dai W, Pan S. Transfer Learning in Reinforcement Learning. In Transfer Learning (pp. 105–125). Cambridge: Cambridge University Press; 2020. doi: https://doi.org/10.1017/9781139061773.010 .

Lia X, Grandvalet Y, Davoine F, et al. 2020. Transfer learning in computer vision tasks: Remember where you come from. Image Vision Comput. 2020; 93.

Malte A, Ratadiya P. Evolution of transfer learning in natural language processing. arXiv preprint arXiv:1910.07370 (2019).

Wang L, Guo B, Yang Q. Smart City Development With Urban Transfer Learning. Computer. 2018;51(12):32–41.

Asgarian A, Sobhani P, Zhang JC. A hybrid instance-based transfer learning method. arXiv preprint arXiv:1812.01063 (2018).

Li H, Chaudhari P, Yang H. Rethinking the Hyperparameters for Fine-tuning. arXiv preprint arXiv:2002.11770 (2020).

Yim J, Joo D, Bae J, Kim J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). 2017.

Liang H, Fu W, Yi F. A Survey of Recent Advances in Transfer Learning. In IEEE 19th International Conference on Communication Technology (ICCT). 2576–7828. 2019.

Yang Q, Zhang Y, Dai W, Pan S. AutoTL: Learning to Transfer Automatically. In Transfer Learning (pp. 168–176). Cambridge: Cambridge University Press; 2020. doi: https://doi.org/10.1017/9781139061773.014 .

Wei Y, Zhang Y, Yang Q. Learning to Transfer. arXiv preprint arXiv:1708.05629 (2017).

Caruana R. Multitask learning. Mach Learn. 1997;28:1. p. 41–75.

Olivas ES, Guerrero JDM, Martinez-Sober M, et al. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques: Algorithms, Methods, and Techniques. IGI Global. ISBN 9781605667669. 2009.

Lee HB, Yang E, Hwang S. J. Deep asymmetric multi-task feature learning. arXiv preprint arXiv:1708.00260 (2017).

Zhang Y, Yang Q. A survey on multi-task learning. arXiv preprint arXiv:1707.08114 (2017).

Zhang J. Multi-task feature selection with sparse regularization to extract common and task-specific features. Neurocomputing. 2019;340:76–89.

Liu P, Qiu X, Huang X. Adversarial multi-task learning for text classification. In the 55th Annual Meeting of the Association for Computational Linguistics (ACL). 2017.

Su Y, Li J, Qi H, Gamba P, Plaza A, Plaza J. Multi-Task Learning with Low-Rank Matrix Factorization for Hyperspectral Nonlinear Unmixing. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2019).

Barzilai A, Crammer K. Convex multi-task learning by clustering. In the 18th International Conference on Artificial Intelligence and Statistics (AISTATS). 2015.

Long M, Cao Z, Wang J, Yu P. S. Learning multiple tasks with multilinear relationship networks. In Conference on Neural Information Processing Systems (NIPS 2017).

Bickel S, Bogojeska J, Lengauer T, Scheffer T. Multi-task learning for HIV therapy screening. In the 25th international conference on Machine learning. 2008, p. 56–63.

Yang P, Li P. Distributed Primal-Dual Optimization for Online Multi-Task Learning. In the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). 2020.

Zhou Q, Chen Y, Pan SJ. Communication-efficient distributed multi-task learning with matrix sparsity regularization. Mach Learn. 2020;109:569–601.

Zhang C, Zhao P, Hao S, et al. Distributed multi-task classification: a decentralized online learning approach. Mach Learn. 2018;107:727–47.

Zhao Y, Tang F, Dong W, Huang F, Zhang X. Joint face alignment and segmentation via deep multi-task learning. Multimedia Tools Appl. 2019;78:13131–48.

Akhtar MS, Chauhan DS, Ekbal A. A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis. ACM Trans Knowl Discovery Data. 2020;14:3.p. 1–27.

Benton A, Mitchell M, Hovy D. Multitask learning for mental health conditions with limited social media data. In the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017.

Ruder S. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017).

Parisi GI, Kemker R, Part JL, et al. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71.

Maltoni D, Lomonaco V. Continuous learning in single-incremental-task scenarios. Neural Netw. 2019;116:56–73.

Thrun S, Mitchell TM. Lifelong Robot Learning. In the Biology and Technology of Intelligent Autonomous Agents. 1995;144.

Thrun S. Is learning the n-th thing any easier than learning the first? In Conference on Neural Information Processing Systems (NIPS1996). 1996.p. 640–646.

Thrun S. Explanation-based Neural Network Learning: A Lifelong Learning Approach. The Kluwer International Series in Engineering and Computer Science book series (SECS). 1996; 357.

Silver DL, Mercer RE. The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connect Sci. 1996;8:277–94.

Silver DL, Poirier R, Currie D. Inductive transfer with context-sensitive neural networks. Mach Learn. 2008;73(3):313–36.

Silver DL, Mason G, Eljabu L. Consolidation using sweep task rehearsal: Overcoming the stability-plasticity problem. In Advances in Artificial Intelligence, 2015; 9091. p. 307–322.

Chen Z, Ma N, Liu B. Lifelong learning for sentiment classification. In the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). 2015.p. 750–756.

Ruvolo P, Eaton E. ELLA: an efficient lifelong learning algorithm. In the International Conference on Machine Learning. 2013. P.507–515.

Clingerman C, Eaton E. Lifelong learning with Gaussian processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2017.p. 690–704.

Chen Z, Liu B. Topic modeling using topics from many domains, lifelong learning and big data. In the 31st International Conference on Machine Learning (ICML 2014). 2014.p. 703–711.

Wang S, Chen Z, Liu B. Mining aspect-specific opinion using a holistic lifelong topic model. In the 25th International Conference on World Wide Web (WWW ‘16). 2016.p. 167–176.

Liu Q, Liu B, Zhang YL, Kim D, Gao Z. Improving opinion aspect extraction using semantic similarity and aspect associations. In the 30th AAAI Conference on Artificial Intelligence. 2016.

Mitchell T, Cohen W, Hruschka E, et al. Never-ending learning. Commun ACM. 2018;61(5):103–15.

Carlson A, Betteridge J, Wang RC, et al. Coupled semi-supervised learning for information extraction. In the third ACM international conference on Web search and data mining (WSDM ‘10). 2010.p. 101–110.

Bou Ammar H, Eaton E, Ruvolo P, Taylor M. Online multi-task learning for policy gradient methods. In: the 31st International Conference on Machine Learning. 2014.p. 1206–1214.

Tessler C, Givony S, Zahavy T, et al. A deep hierarchical approach to lifelong learning in minecraft. In the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). 2017; 3. p. 1553–1561.

Rolnick D, Ahuja A, Schwarz J, Lillicrap T. Experience replay for continual learning. In advances in Neural Information Processing Systems (NIPS 2019). 2019.

Chen Z, Liu B. Lifelong Machine Learning. Morgan & Claypool Publishers. ISBN 978-1627055017. 2018.

Mazumder S, Ma N, Liu B. Towards a continuous knowledge learning engine for chatbots. arXiv preprint arXiv: 1802.06024 (2018).

Hospedales T, Antoniou A, Micaelli P. Meta-learning in neural networks: A survey. arXiv preprint arXiv: 2004.05439 (2020).

Mohammadi FG, Amini MH, Arabnia HR. An Introduction to Advanced Machine Learning: Meta-Learning Algorithms, Applications, and Promises. Optimization, Learning, and Control for Interdependent Complex Networks. 129–144. 2020.

Vanschoren J. Meta-learning: A survey. arXiv preprint arXiv:1810.03548 (2018).

Xian Y, Lampert CH, Schiele B, Akata Z. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. 2018;41(9):2251–65.

Bertinetto L, Henriques JF, Valmadre J, Torr P. A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems (NIPS 2016). 2016. P.523–531.

Garcia V, Bruna J. Few-Shot Learning With Graph Neural Networks. arXiv preprint arXiv:1711.04043 (2018).

Kang B, Liu Z, Wang X, Yu F, Feng J, Darrell T. Few-shot Object Detection Via Feature Reweighting. In IEEE/CVF International Conference on Computer Vision (ICCV). 2019.

Dong N, Xing EP Few-Shot Semantic Segmentation with Prototype Learning. In the 29th British Machine Vision Conference (BMVC 2018).

Gui LY, Wang YX, Ramanan D, Moura J. Few-Shot Human Motion Prediction Via Meta-learning. In 15th European Conference Computer Vision (ECCV 2018). Lecture Notes in Computer Science. Springer International Publishing. ISBN 978-3-030-01236-6. 2018.

Kosh G, Zemel R, Salakhutdinov R. Siamese Neural Net-works For One-shot Image Recognition. In the 32nd International Conference on Machine Learning (ICML 2015). 2015; 37.

Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. Matching Networks For One Shot Learning. In Conference on Neural Information Processing Systems (NIPS 2016).

Snell J, Swersky K, Zemel R. 2017. Prototypical networks for few-shot learning. in Advances in Neural Information Processing Systems (NIPS 2017).

Sung F, Yang Y, Zhang L, Xiang T, Torr PH, S TM Hospedales. Learning To Compare: Relation Network For Few-Shot Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018.

Li W, Xu J, Huo J, Wang L, Gao Y, Luo J. Distribution consistency-based covariance metric networks for few-shot learning. In the 33th AAAI Conference on Artificial Intelligence (AAAI-19).

Wertheimer D, Hariharan B. Few-shot learning with localization in realistic settings. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

Ravi S, Larochelle H. Optimization as a model for few-shot learning. In Proceedings of 5th International Conference on Learning Representations (ICLR 2017).

Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In the 34nd International Conference on Machine Learning (ICML 2017). 2017. p. 1126–1135.

Aiolli F. Transfer learning by kernel meta-learning. Workshop on Unsupervised and Transfer Learning. JMLR: Workshop and Conference Proceedings. 2012; 27. p. 81–95.

Eshratifar AE, Abrishami MS, et al. A meta-learning approach for custom model training. In the 33th AAAI Conference on Artificial Intelligence (AAAI-19). 2019.

Sun Q, Liu Y, Chua TS, Schiele B. Meta-transfer learning for few-shot learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

Sun Q, Liu Y, Chen Z, et al. 2019. Meta-Transfer Learning through Hard Tasks. arXiv preprint arXiv:1910.03648 (2019).

Li XC, Zhan DC, Yang JQ, Shi Y, et al. Towards Understanding Transfer Learning Algorithms Using Meta Transfer Features. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2020). 2020.p. 855–866.

Bengio Y, Deleu T, Rahaman N. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1905.05667 (2019).

Woong Soh J, Cho S, Ik Cho N. Meta-Transfer Learning for Zero-Shot Super-Resolution. arXiv preprint arXiv: 2002.12213 (2020).

Indra Winata G, Cahyawijaya S, Lin Z. Meta-transfer learning for code-switched speech recognition. arXiv preprint arXiv:2004.14228 (2020).

Song W, Li S, Guo Y, et al. Meta Transfer Learning for Adaptive Vehicle Tracking in UAV Videos. In the international Conference on Multimedia Modeling. 2020.

Lin X, Baweja H, Kantor G, Held D. Adaptive Auxiliary Task Weighting For Reinforcement Learning. in Advances in Neural Information Processing Systems (NIPS 2019).

Franceschi L, Donini M, Frasconi P, Pontil M. Forward And Reverse Gradient-Based Hyperparameter Optimization, In the 34nd International Conference on Machine Learning (ICML 2014).

Epstein B, Meir R, Michaeli T. Joint autoencoders: a flexible meta-learning framework. In the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (PKDD 2018).

Chen J, Qiu X, Liu P, Huang X. Meta multi-task learning for sequence modeling. In 32nd AAAI Conference on Artificial Intelligence (AAAI-18). 2018.

Amit R, Meir R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In the international Conference on Machine Learning. 2018.p. 205–214.

Riemer M, Cases I, Ajemian R. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910 (2018).

Javed K, White M. Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems (NIPS 2019).

He X, Sygnowski J, Galashov A, et al. 2019. Task agnostic continual learning via meta learning. arXiv preprint arXiv:1906.05201 (2019).

Munkhdalai T, Yu H. Meta Networks. arXiv preprint arXiv:1703.00837 (2017).

Vuorio R, Cho DY, Kim D, Kim J. Meta continual learning. arXiv preprint arXiv:1806.06928 (2018).

Xu H, Liu B, Shu L, Yu PS. Lifelong domain word embedding via meta-learning. arXiv preprint arXiv:1805.09991 (2018).

Wang H, Liu B, Wang S. Forward and Backward Knowledge Transfer for Sentiment Classification. arXiv preprint arXiv: 1906.03506 (2019).

Portelas R, Colas C, Weng L, et al. 2020. Automatic Curriculum Learning For Deep RL: A Short Survey. arXiv preprint arXiv: 2003.04664 (2020).

Domingos P. The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books. ISBN 978-046506570-7. 2015.

Han J, Choi D, Park S, et al. Hyperparameter optimization using a genetic algorithm considering verification time in a convolutional neural network. J Electr Eng Technol. 2020;15:721–6.

Choudhury SD, Pandey S, Mehrotra K. Deep Genetic Network. arXiv preprint arXiv:1811.01845 (2018).

Garnelo M, Shanahan M. Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Curr Opin Behav Sci. 2019;29:17–23.

Garcez AA, Gori M, Lamb LC. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. arXiv preprint arXiv:1905.06088 (2019).

Yi K, Wu J, Gan C. Neural-symbolic VQA. Disentangling reasoning from vision and language understanding. arXiv preprint arXiv: 1810.02338(2018).

Vedantam R, Desai K, Lee S. Probabilistic neural-symbolic models for interpretable visual question answering. arXiv preprint arXiv: 1902.07864 (2019).

Evans R, Grefenstette E. Learning explanatory rules from noisy data. J Artif Intell Res. 2018;61:1–64.

Tran S, Garcez A. Deep logic networks: Inserting and extracting knowledge from deep belief networks. IEEE T. Neur. Net. Learning Syst. 2018; 29.p. 246–258.

Silver DL. On Common Ground: Neural-Symbolic Integration and Lifelong Machine Learning. In the 9th Workshop on Neural-Symbolic Learning and Reasoning. 2013.

Hu Z, Ma X, Liu Z, Hovy E, Xing E. Harnessing deep neural networks with logic rules. In the 54th Annual Meeting of the Association for Computational Linguistics. 2018. p. 2410–2420.

Wolfe CR, Lundgaard K. T. Data Augmentation for Deep Transfer Learning, arXiv preprint arXiv:1912.00772 (2019).

Han D, Liu Q, Fan W. A new image classification method using CNN transfer learning and web data augmentation. Expert Syst Appl. 2018;95:43–56.

Milicevic M, Obradovic I, Zubrinic K. Data augmentation and transfer learning for limited dataset ship classification. WSEAS Trans Syst Control. 2018;13:460–5.

He X, Zhao K, Chu X. AutoML: A Survey of the State-of-the-Art. arXiv preprint arXiv:1908.00709 (2019).

Yang J, Sun X, Lai YK, Zheng L, Cheng MM. Recognition from web data: a progressive Filtering approach. IEEE Trans Image Process. 2018;27(11):5303–15.

Ruiz N, Schulter S, Chandraker M.Learning to simulate. arXiv preprint arXiv:1810.02513 (2019).

Pham H, Guan MY, Zoph B.Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018).

Wong C, Houlsby N, Lu Y, Gesmundo A. Transfer learning with neural AutoML. In Advances in Neural Information Processing Systems (NeurIPS 2018). 2018.

Pasunuru R, Bansal M. Continual and multi-task architecture search, arXiv preprint arXiv:1906.05226 (2019).

Kim J, Lee S, Kim S. Automated gradient based meta learner search, arXiv preprint arXiv:1806.06927 (2018).

Elsken T, Staffer B, Metzen JH, Hutter F. Meta-learning of neural architectures for few-shot learning. In EEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2020). 2020.

Liu C, Dollár P, He K. Are labels necessary for neural architecture search?. arXiv preprint arXiv:2003.12056 (2020).

Shinohara S, Taguchi R, Katsurada K, Nitta T. A model of belief formation based on causality and application to n-armed bandit problem. T Jpn Soc AI. 2007;22:58–68.

Saunshi N, Plevrakis O, Arora S, Khodak M, Khandeparkar H, A Theoretical Analysis of Contrastive Unsupervised Representation Learning, Proceedings of the 36th International Conference on Machine Learning. 2019. p. 5628–5637.

Si J, Barto AG, Powell WB, Wunsch D. Reinforcement Learning and Its Relationship to Supervised Learning. in Handbook of Learning and Approximate Dynamic Programming, IEEE, 2004, p. 45–63, doi: https://doi.org/10.1109/9780470544785.ch2 .

Kakade S, On the Sample Complexity of Reinforcement Learning, Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London. 2003.

Deisenroth MP, Rasmussen CE PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning, 2011.

Nguyen H, La H, Review of Deep Reinforcement Learning for Robot Manipulation. 2019 Third IEEE International Conference on Robotic Computing. (IRC), Naples, Italy, 2019, p. 590–595, doi: https://doi.org/10.1109/IRC.2019.00120 .

Levine S, Koltun V. Guided policy search, in Intern. Conf. on Machine Learning, 2013, p. 1–9.

Buckman J, Hafner D, Tucker G, Brevdo E, Lee H.Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion, Advances in Neural Information Processing Systems (NeurIPS 2018). 2018;31. p. 8224–8234.

Kamthe S, Deisenroth M. Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR. 2018; 84. p. 1701–1710.

Popov I, Heess N, Lillicrap T, et al, Data-efficient Deep Reinforcement Learning for Dexterous Manipulation. arXiv preprint arXiv:1704.03073 (2017).

Schwarzer M, Anand A, Goel R. Data-Efficient Reinforcement Learning with Self-Predictive Representations. arXiv preprint arXiv:2007.05929 (2020).

Arowolo MO, Adebiyi MO, Adebiyi AA, et al. A hybrid heuristic dimensionality reduction methods for classifying malaria vector gene expression data. IEEE Access. 2020;8:182422–30.

Arowolo MO, Isiaka RM, Abdulsalam SO, et al. A comparative analysis of feature extraction methods for classifying colon cancer microarray data. EAI endorsed transactions on scalable information systems. 2017;4:14.

Milidiú RL, Müller LF. SeismoFlow -- Data augmentation for the class imbalance problem, arXiv:2007.12229 (2020).

Shamsolmoali P, Zareapoor M, Shen L, et al., Imbalanced data learning by minority class augmentation using capsule adversarial networks, Neurocomputing, 2020.

Lee H, Park M, Kim J. Plankton classification on imbalanced large scale database via convolutional neural networks with transfer learning. In: 2016 IEEE international conference on image processing (ICIP). 2016. p. 3713–7. https://doi.org/10.1109/ICIP.2016.7533053 .

Finn C, Xu K, Levine S. Probabilistic model-agnostic meta-learning. Advances in Neural Information Processing Systems (NeurIPS 2018). 2018;31. p. 9516–9527.

Grant E, Finn C, Levine S, Darrell T, Griffiths T. Recasting gradient-based meta-learning as hierarchical bayes. InICLR, 2018.

Rusu AA, Rao D, Sygnowski J, Vinyals O, Pascanu R, Osindero S, Hadsell R. Meta-learning with latent embedding optimization. InICLR, 2019.

Vuorio R, Sun SH, Hu H, Lim JJ. Multimodal model-agnostic meta-learning via task-aware modulation. Advances in Neural Information Processing Systems (NeurIPS 2019). 2019;32. p. 1–12.

Andrychowicz M, Denil M, Colmenarejo SG, et al. Learning To Learn By Gradient Descent By Gradient Descent. Advances in Neural Information Processing Systems (NeurIPS 2016). 2016;29.

Ravi S, Larochelle H. Optimization As A Model For Few-Shot Learning. inICLR, 2016.

Wichrowska O, Maheswaranathan N, Hoffman M. W, et al. Learned Optimizers That Scale And Generalize. inICML, 2017.

Acknowledgements

Author information

Authors and affiliations

ISIC Research Team, L2MI Laboratory, Moulay Ismail University, Meknes, Morocco

Amina Adadi


Contributions

The author conducted the study and prepared the manuscript. The author read and approved the final manuscript.

Corresponding author

Correspondence to Amina Adadi .

Ethics declarations

Competing interests

The author declares that she has no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Adadi, A. A survey on data‐efficient algorithms in big data era. J Big Data 8, 24 (2021). https://doi.org/10.1186/s40537-021-00419-9


Received: 12 November 2020

Accepted: 16 January 2021

Published: 26 January 2021

DOI: https://doi.org/10.1186/s40537-021-00419-9


Keywords

  • Data hungry algorithms
  • Data‐efficiency
  • Small sample learning
  • Transfer learning




Johan Fourie is Associate Professor of Economics, Stellenbosch University. He is the author of “The Remarkable Wealth of the Dutch Cape Colony: Measurements from Eighteenth-Century Probate Inventories,” Economic History Review , LXVI (2013), 419–448; with Erik Green, “The Missing People: Accounting for the Productivity of Indigenous Populations in Cape Colonial History,” Journal of African History, LVI (2015), 195–215.


Johan Fourie; The Data Revolution in African Economic History. The Journal of Interdisciplinary History 2016; 47 (2): 193–212. doi: https://doi.org/10.1162/JINH_a_00977


The recent surge in computing power and access to data-processing software and online resources enables historians to capture historical statistics on a much larger scale than before. The data revolution—encompassing unprecedented advances in data transcription, augmentation, and collaboration—is especially valuable for studying the history of regions where written records are scarce, such as sub-Saharan Africa. The digitization of large numbers of colonial and postcolonial records offers much promise for a greater understanding of Africa’s economic past.

Improvements in computing power, increased connectivity, and more advanced analytical techniques herald the era of Big Data in fields as diverse as astronomy, economics, biology, and management. Yet, in the history profession, this data revolution has gained only limited traction. One obvious reason is the lack of Big Data from the distant past. As Eric Schmidt, the former CEO of Google, remarked in 2010, from the dawn of civilization through 2003, five exabytes of information were created. Only seven years later, that much information was being created every two days. As the saying goes, however, not all of these data are useful, and not everything that is useful is captured in these data. Indeed, historians have to distinguish what is meaningful from what is not. Too much information can be as problematical as too little. 1

The purpose of this Research Note is to show the extent to which the data revolution is consequential for the field of history. The surge in computing power and access to data-processing software and online resources have enabled historians during the past two decades to capture historical statistics on a much larger scale than before. The statistical records stored in the archives of the imperial powers, as well as in those of the former colonies, are now available for analysis on an unprecedented scale. The data revolution is especially valuable when applied to regions where written records are scarce, such as sub-Saharan Africa. This Research Note documents how a new generation of economists, geographers, and historians is rewriting African history using archival records (colonial sources written for purposes unrelated to current research questions) in conjunction with geographical, climatic, and demographic studies of the distant past to gather surprising new information. 2

African history has always suffered from a limited number of written records. Not until the 1960s, when most African countries had gained independence, did historians in Africa and elsewhere begin to borrow methods from archeology, linguistics, and oral history to investigate the continent’s rich and varied past. Information about Africa’s economic history at this point was in high demand. Africa’s rapid economic growth rates during the late colonial and early postcolonial periods and the euphoria immediately following independence raised new questions about the effects of imperialism and colonialism and the possibilities of future prosperity. Economic-history departments were established in universities across the continent. Debates fostered within these newly established departments could be fierce, informed by the ideological battles between Marxist and liberal economics. Evidence in support of the various positions came most often from qualitative sources that originated during the colonial era, although the abundant statistical records were not completely neglected. Many pioneers of African data collection, such as Johnson, Szereszewski, Hopkins, Manning, Eltis, and Austen, began to assemble their data sets during the 1960s and 1970s. 3

By the 1980s, however, three forces had combined to push African economic history to the intellectual periphery. First, African economies were in decline. The oil shocks of the 1970s and the consequent debt crises and IMF (International Monetary Fund) structural-adjustment programs had diminished the resources available to African universities and deflated interest in explaining Africa’s past fortunes. Second, as postmodernism became fashionable, historians concentrated more on cultural and social history than on economic development. Third, the cliometrics movement of the 1970s, which emphasized the mathematical and statistical aspects of economics, heightened the methodological barrier between economics and history. As a result of these trends, African economic history, both inside and outside Africa, went into a decline from which it did not begin to emerge until the 2000s. Led by prominent economists, the renaissance of African economic history coincided with the rise of Africa’s economic prospects. A methodological and technological revolution within the economics profession certainly had something to do with the change. The new millennium also marked the dawn of the data revolution in African history. 4

Acemoglu, Johnson, and Robinson are widely credited with the resurgence of African economic history through their provocative claim that disparities in economic development were largely due to the institutional legacies of European settlers. Their main contribution, however, was not so much their large new data set as it was the causal inference that they drew from it using a novel econometric technique. Their claim about the persistence of institutions attracted the interest of a younger generation of economists, eager to tackle the issues of African poverty and underdevelopment. For instance, Nunn based his extensive study of the economic consequences of the slave trade on Curtin’s Big Data project in African history from the 1960s, which culminated, following much debate and controversy, in Eltis and Richardson’s Transatlantic Slave Trade Database. Although the various versions of the slave database that appeared during the past few decades have enriched a vast scholarship about this issue, Nunn’s demonstration of a causal link between the number of slaves exported and the level of African GDP suggested a new direction for African economic history. Although the article later came under severe criticism, the consensus is that it rekindled economists’ interest in African economic history.

Nunn’s later work (with Wantchekon) identified mistrust in African societies as a legacy of the slave trade and (with Puga) showed how the slave trade pushed Africans into locations that still hamper their ability to trade. These contributions laid the foundation for a new trend in African economic history—the attempt to discover causal mechanisms in the past to explain conditions in the present. 5

Given that few effects are more persistent than geographical ones, environmental information is crucial for linking historical events to current circumstances. Nunn and Puga posited rugged terrain as an explanatory variable. Alsan used temperature and humidity to assess the influence of the tsetse fly on development in Africa. Bhattacharyya advanced the prevalence of malaria as the most important factor in explaining African poverty, at least from a statistical perspective. Fenske and Kala’s annual panel of African temperatures and slave exports indicated how past environmental shocks affected the slave trade. Papaioannou’s research into court cases and the records of prisoners and homicides revealed how deviations from the long-term rainfall pattern increased conflict in colonial Nigeria. Papaioannou and de Haas extended this analysis to include other parts of Africa with similar results. Climatic and other environmental information geocoded at the micro-level enabled scholars to test causal persistence with a high degree of statistical accuracy. 6

The new variables gleaned from the data revolution not only measure past events; they also serve effectively as contemporary outcome variables. In the absence of data about regional African economies, Michalopoulos and Papaioannou employed light density at night, obtained from satellite imaging, to test, on the one hand, the effect of precolonial ethnic institutions and, on the other, the effect of the borders created during the European powers’ scramble for Africa on current economic performance. Notwithstanding the strong criticism that Michalopoulos and Papaioannou’s data sources and estimation techniques have attracted, light density has become a popular outcome variable. 7

Contemporary survey data, now with geocoded observations, also frequently become outcome variables. A case in point is the “Afrobarometer survey” used by Nunn and Wantchekon to measure levels of trust. Abel’s work regarding forced removals during apartheid also relies on the Afrobarometer survey. Michalopoulos, Putterman, and Weil, using Demographic and Health Survey (DHS) data in combination with information about the respondents’ ancestral ethnicity, discovered that descendants of agriculturalists are wealthier and better-educated than the descendants of pastoralists. 8

Not everyone agrees that such broad-brush studies, which link a historical episode to contemporary data, are always edifying. The main complaint about such an approach is that it “compresses history,” oversimplifying a more elaborate causality. In a recent book, Jerven criticizes what he calls the irresponsible manner in which economists have deployed econometrics to explain African development. In reviewing this book, de Waal calls for African economic history to be “liberated from the tyranny of econometricians.” 9

African economic historians have proposed an alternative methodology that involves unearthing and digitizing colonial-era archival records. For the last decade, scholars have scanned colonial blue books (a compendium of records about the civil establishment, revenue and expenditure, and other statistical particulars of the colonies issued by the Colonial Office), tax censuses, voter rolls, marriage registers, etc., to study the population size, wages, incomes, education, fiscal systems, and transport networks of African societies. Rather than a lengthy survey of every contribution within this context, a single type of archival document—the military attestation form, which includes, among other things, soldiers’ heights—can represent the nature of the data revolution in African economic history. These individual-level observations produce a valuable proxy for living standards in the absence of other evidence. 10

The use of human heights or stature as a proxy for living standards is more than three decades old. Social scientists agree that height accurately reflects an individual’s environmental conditions early in life, including access to nutrition and exposure to disease, and that changes in average height are reflective of changes in a society’s living standards. Although heights might be expected to correlate positively with incomes, economic historians analyzing heights in Western Europe and North America discovered a surprisingly negative correlation between height and income during the early phase of industrialization in England and elsewhere. This apparent anomaly became known as the “Early Industrial Growth Puzzle” in Europe and the “Antebellum Puzzle” in the United States; for most of the last two decades of the twentieth century, scholars on both sides of the Atlantic have attempted to explain it. 11

The height of nineteenth-century Africans was the subject of two articles by Eltis in 1982 and 1990, but the turn to heights to document the evolution in the living standards of African peoples during an era of unreliable data had to await Moradi and Baten’s pioneering investigation of height differences between historical birth cohorts constructed from DHS data. But DHS data were not available until the colonial era ended in the 1950s. Another source was needed to investigate the colonial and the precolonial eras. Moradi found a sample of 1,046 Ghanaian recruits from World War I and World War II and 730 Kenyan recruits from World War II, adding information from surveys, to measure the effect of colonial policies on African living standards. In a follow-up work about the same question, his sample more than doubled. 12

Moradi’s groundbreaking work stimulated interest in military records elsewhere in Africa, often with startling results. Austin, Baten, and van Leeuwen found that in nineteenth-century West Africa, Ghanaian and Burkinabe recruits were notably shorter than northwestern Europeans but not shorter than southern Europeans. In twentieth-century West Africa, Cogneau and Rouanet found that the rate of increase in the heights of those born in Côte d’Ivoire and Ghana during the late colonial period, 1925 to 1960, was almost as fast as the rate observed in France and Great Britain during the period from 1875 to 1975. Inwood and Masakure found that colored South Africans were, on average, six cm shorter than white South Africans at the start of the twentieth century, a significantly smaller difference than today’s eight cm. The findings from these studies now take their place within a larger literature about the living standards of indigenous populations across the world. 13

Information in attestations is important not only for measuring living standards over time but also for tracing the effects of different colonial policies. Cogneau and Moradi constructed a much larger sample, 11,940 recruits from Ghana and Togo, to evaluate how the partition of German Togoland after World War I influenced educational outcomes. The part of Togo that fell under French mandate had lower levels of literacy after the partition than did the parts that were ruled by the British. According to Cogneau and Moradi, the French authorities were hostile to missionary schools. Jedwab and Moradi’s analysis of colonial railways vis-à-vis various economic outcomes, including heights, found that railways increased the heights of those born closest to them and generally improved income, boosting the production of cash crops like cocoa. 14

Mariotti’s investigation of the heights of black mineworkers in South Africa confirmed the positive effect of income improvement on height. A 1974 plane crash that killed seventy-three mineworkers provoked the Malawian government to ban the migration of mineworkers to South Africa, forcing South Africa to hire workers from the Transkei “homeland” in South Africa. Mariotti showed that the resulting improvement in household incomes for the newly recruited Transkeian mineworkers increased the heights of children born during or immediately after 1974, and only in those districts from which mineworkers came. Mariotti and Dinkelman later turned their attention to the effects of the plane crash and the sudden prevention of labor migration on Malawian households. 15

Attestation records provide a snapshot of living standards in the absence of the other individual-level records that colonial authorities were often less inclined to collect for indigenous populations than for Europeans. The demographic information that is available typically comes with a colonial-era bias that is difficult to exclude. Although height information collected for military purposes is unlikely to demonstrate such systematic bias, attestation forms can have a selection bias of their own. Not everyone qualified for military service. The minimum-height requirement meant that attestation heights were at times truncated to the left. Truncation is amenable to a statistical check (by using truncated regression models, for example), but another unobservable selection could always be lurking. Indeed, Bodenhorn, Guinnane, and Mroz have recently addressed this very possibility. They claim that the declining heights observed during the Industrial Revolution—the “Early Industrial Growth Puzzle”—reflect not a decline in living standards but a change in the military’s selection process. When returns in the private sector were small (at the start of the Industrial Revolution), many of the stronger (or taller) men probably chose the military as a career. As the economy grew, however, more of them would have returned to private-sector employment, with no concomitant increase in the military population, leaving the weaker (or shorter) men to join the army. Moreover, Inwood, Mariotti, and Fourie used data about South African recruits in World War I to show that this type of unobservable selection may also happen because of changes in military technology. 16
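
To make the truncation point concrete, the sketch below (not drawn from any of the studies cited here) simulates a minimum-height requirement and then recovers the mean of the underlying height distribution by maximizing a left-truncated normal likelihood; the cutoff, sample size, and parameter values are invented for illustration.

```python
# A minimal sketch of the truncation check mentioned above: recover the mean and
# standard deviation of the underlying height distribution when recruits below a
# minimum-height requirement never appear in the attestation records.
# All numbers are illustrative, not taken from the article.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
true_mu, true_sigma, cutoff = 168.0, 6.5, 160.0   # heights in cm; hypothetical values

# Simulate a recruit sample: draw from the full population, keep only those tall enough.
population = rng.normal(true_mu, true_sigma, 20_000)
observed = population[population >= cutoff]

def neg_log_likelihood(params, x, c):
    """Left-truncated normal log-likelihood: density renormalised by P(X >= c)."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    log_density = stats.norm.logpdf(x, mu, sigma)
    log_tail = stats.norm.logsf(c, mu, sigma)     # log P(X >= cutoff)
    return -(log_density - log_tail).sum()

fit = optimize.minimize(neg_log_likelihood,
                        x0=[observed.mean(), observed.std()],
                        args=(observed, cutoff), method="Nelder-Mead")
mu_hat, sigma_hat = fit.x
print(f"naive sample mean:          {observed.mean():.1f} cm")
print(f"truncation-corrected mean:  {mu_hat:.1f} cm (true value {true_mu})")
```

Under these assumptions the naive sample mean overstates the population mean, while the truncation-corrected estimate sits close to the true value; the same logic underlies the truncated regression models mentioned above.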

Selection-bias concerns, however, cannot prevent heights from being a useful proxy for individual-level living standards in the absence of other measures. A comparison of the results derived from military attestation forms together with those derived from survey data, for example, suggests that selection into the military may not have been strong enough to undermine our confidence in earlier findings. The studies that have been done on heights exemplify the contribution of the data revolution in African history to new insights about how colonial-era policies affected the welfare of subjects who remained outside the remit of recordkeeping.

Besides heights, attestation forms offer a wealth of data that can contribute to fields outside economic history, in particular social history. The attestation forms of the South African Constabulary list several characteristics of each recruit, including the ability to speak an African language, the color of hair and eyes, vaccination history, the ability to ride a horse or swim, and the presence of any tattoos. A simple quantitative analysis can bring to light unsuspected details. More than one in five Boer recruits from the two Boer Republics had at least one tattoo, and the number of Boer recruits who were redheads was nearly twice that of the Scottish or Irish recruits. 17

The use of large archival data sets is only one aspect of the data revolution. New tools for data transcription, augmentation, and collaboration allow for faster and less-expensive digitization, more rigorous investigation, and better replicability and comparability of results. The data revolution comprises not only new material but also a new method—the use of computational power and statistical techniques to expand the scale and scope of research questions.

Data transcription—the manual copying of archival sources, often into a digital format such as Excel—is laborious and expensive. As an example, Figure 1 shows the attestation of Diederick Alfred Joseph Yates, a Briton who enlisted in the South African Constabulary in 1902. Although the form is standardized and the text clearly legible to anyone familiar with late nineteenth-century handwriting, it does not allow for automated transcription. Optical character recognition (OCR) software is not yet able to capture the information contained in such an attestation with a high degree of reliability. Historians must therefore spend countless hours transcribing these attestations into analyzable text or train a research team to do it. Both options are costly. Researchers can also cut corners, as Inwood and Masakure did when investigating South African soldiers who served in World War I and World War II. Although several hundred thousand attestations are available in the Defense Force archives, time and funding constraints forced Inwood and Masakure to limit their transcriptions to only 10,000 attestations for each of the wars. Such sampling introduces additional biases that researchers must take into account when analyzing the results. It also does not fully conform to the methodology now popular in the era of Big Data, which advocates the use of all available data.

Figure 1. An Attestation Form from the South African Constabulary

There is reason to believe, however, that improved OCR software will soon make the transcription of archival sources, such as attestation forms, less costly. This rapidly developing technology—the most spectacular use of which is Google’s project to scan every book title in existence (estimated in 2010 to be about 130 million)—already has wide applicability in history, for example in the digitization of old newspapers. But printed texts are relatively easy for the OCR technology to digitize and transcribe. Eventually an OCR algorithm will be able to read a natural image—like the attestation in Figure 1—that contains written text. This technological breakthrough in computational linguistics could have far-reaching consequences for the field of economic history. 18
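
As a purely illustrative aside, the few lines below show what machine transcription of a printed page might look like with off-the-shelf tools; they assume that the open-source Tesseract engine and the pytesseract and Pillow packages are installed, and the file name is hypothetical. Handwritten attestations, as noted above, remain beyond this kind of routine treatment.

```python
# A minimal sketch, assuming Tesseract plus the pytesseract/Pillow wrappers are
# installed; the scanned page name is hypothetical. Printed text such as a blue
# book page is handled reasonably well; handwritten forms generally are not.
from PIL import Image
import pytesseract

page = Image.open("blue_book_1902_p17.png").convert("L")   # greyscale often improves results
text = pytesseract.image_to_string(page, lang="eng")
print(text[:500])
```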

Data augmentation, the second step in the new methodology, signifies the addition of information from internal or external sources to the original data. Although supplementing existing data with new information has always been possible, researchers usually had to do it manually. Computerization and the development of matching algorithms have made this process much easier. In the case of the attestation forms, scholars can now match recruits’ names and birth dates to, say, genealogical records. Not only can they learn more about their subjects; they can also solve previously intractable problems, such as calculating intergenerational mobility using height as an indicator of living standards. Data augmentation has countless new avenues to explore. 19
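
The following sketch illustrates, with invented records and standard-library string matching only, the kind of name-and-birth-year linkage that such augmentation relies on; real projects use more refined matching algorithms and far larger registers, so the records, thresholds, and field names here are purely hypothetical.

```python
# A rough, self-contained sketch of matching a recruit to a genealogical register,
# tolerating spelling variants and a one-year discrepancy in the recorded birth year.
# The records and the 0.7 threshold are illustrative, not from the article.
from difflib import SequenceMatcher

recruit = {"name": "Diederick Alfred Joseph Yates", "birth_year": 1880}
genealogy = [
    {"name": "Diederik A. J. Yates", "birth_year": 1881},
    {"name": "Daniel Yale", "birth_year": 1880},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real projects would use better name metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(person, register, min_score=0.7, year_tolerance=1):
    candidates = [r for r in register
                  if abs(r["birth_year"] - person["birth_year"]) <= year_tolerance]
    scored = [(similarity(person["name"], r["name"]), r) for r in candidates]
    scored = [(s, r) for s, r in scored if s >= min_score]
    return max(scored, key=lambda pair: pair[0], default=(None, None))

score, match = best_match(recruit, genealogy)
print(score, match)   # the Yates spelling variant should clear the threshold; "Daniel Yale" should not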

The third step is collaborating across research networks. Data sharing has already resulted in several projects to compare historical measures. At a macroeconomic level, series such as the Penn World Tables and the Maddison project have provided scholars with cross-country comparisons in history. Manning’s Big Data in History project aims to “create a world-historical archive that will trace the last four centuries of historical dynamics and change.” The Global Collaboratory on the History of Labour Relations, based at the International Institute of Social History in Amsterdam, has assembled a team to collect quantitative benchmarks for a global labor history—including population size and composition, workforce size and composition, occupations, and types of labor—setting the years 1500, 1650, 1800, 1900, and 2000 as benchmarks (1950 is included for Africa). Van Zanden has undertaken the Clio Infra project, which aims to establish “a set of interconnected databases … containing worldwide data on social, economic, and institutional indicators for the past five centuries, with special attention to the past 200 years.” This project has now evolved into the CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) project, consisting of a consortium of more than forty partners who store and share all types of data related to the arts and humanities in a standardized and user-friendly format. 20

In African history, Frankema and his team have taken the lead in digitizing and transcribing the colonial blue books, facilitating research about colonial education, public finance, and population. In an award-winning study, Frankema and van Waijenburg demonstrated that African real wages were significantly above subsistence levels and rising for most of the colonial period. In some places, wages were much higher than comparable figures for Asia. That study suggests that the idea of sub-Saharan Africa as always having been poor is not supported by the evidence. 21

Wages and prices can reveal much more than simply comparative levels of development. Real wages also help to examine changes in the degree of inequality between the rich and the poor (or between the colonizer and the colonized), a topic that has attracted much attention following the publication of Piketty’s Capital in the Twenty-First Century . Because historical wage and price series are often not readily available, several scholars are turning to such innovative sources as slave-ship provisions, Dutch East India Company records, or agricultural journals. 22

On a broader scale, Jerven has tracked macroeconomic trends backward into the past. By reconstructing African GDP measures, he has found considerable discrepancies between World Bank, IMF, and Penn World Table estimates. In two highly acclaimed books, Jerven emphasizes that Africa’s image as the “hopeless continent”—famously encapsulated by the May 13, 2000, cover of The Economist—is largely a construction of the 1980s and 1990s. Bolt and Gardner are compiling fiscal revenue and expenditure data at the local level for the late colonial period. They pair these new records with evidence about precolonial states to assess the extent to which the fiscal capacity of local-government units reflects precolonial state centralization. Moreover, in an attempt to understand the evolution of inequality during the colonial era (in the absence of information about income at the individual level), Bolt and Hillbom are compiling colonial-era records of occupations and labor-market structures. 23

These reinterpretations of the African past are the result of digitizing and transcribing the vast amounts of data available in colonial archives—projects that are ongoing. Meier zu Selhausen is digitizing and transcribing vast quantities of demographic records kept in missionary-station archives. These records are not without problems. Selection into formal Christian mission stations may carry hidden biases, and gaining the trust of bishops and others anxious that the data should be put to proper use is often as time-consuming for researchers as is the analysis. But the effort is worth the cost because these records provide a glimpse of African demographic changes and living standards unrecorded in the colonial blue books. 24

The records available for the European settlers and their activities are more detailed. Green, von Fintel, and Fourie are constructing an annual panel data set of several thousand settler farmers for more than 140 years. Once complete, this data set will offer a wide-ranging and informative account of eighteenth- and nineteenth-century life in colonial South Africa and allow economists to investigate the economic consequences of such events as the abolition of slavery and smallpox epidemics. Even though these statistical records were collected for the European colonial administration, they contain detailed demographic information about the Khoisan, a population group almost completely neglected in other sources. The scale of this project—several thousand individuals for each year—does not allow for matching individuals manually across years. Instead, an algorithm assigns each unique individual a distinct ID that permits them to be observed throughout their entire lifetime. By adding genealogical records, we can also expand the algorithm to link families across generations. Such an intergenerational panel would not have been feasible without the advent of enhanced computing and powerful statistical techniques. 25
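
A stylized example of the idea, using invented census rows and deliberately simple normalization rules rather than the project’s actual algorithm, is sketched below: rows whose normalized name and district agree across years inherit the same persistent identifier.

```python
# A stylised sketch of assigning a persistent identifier across annual records.
# It assumes exact matches after simple name normalisation; the real panel project
# presumably uses far more sophisticated (and fuzzier) linkage rules, and the
# field names and rows here are hypothetical.
from itertools import count

rows = [
    {"year": 1820, "name": "Jacobus v. d. Merwe", "district": "Stellenbosch"},
    {"year": 1821, "name": "Jacobus van der Merwe", "district": "Stellenbosch"},
    {"year": 1821, "name": "Anna Smit", "district": "Swellendam"},
]

def link_key(row):
    """Normalise common spelling quirks so the same person keys identically across years."""
    name = row["name"].lower().replace("v. d.", "van der").replace(".", "")
    return (" ".join(name.split()), row["district"].lower())

next_id = count(1)
person_ids = {}                      # link_key -> persistent person ID
for row in sorted(rows, key=lambda r: r["year"]):
    k = link_key(row)
    if k not in person_ids:
        person_ids[k] = next(next_id)
    row["person_id"] = person_ids[k]

for row in rows:
    print(row["person_id"], row["year"], row["name"])
# Both Jacobus entries receive the same ID; Anna Smit receives a new one.
```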

Has the data revolution produced a better understanding of the African past and the manner in which historical factors still shape African destinies? It may be too early to pass judgment. However, the new approach has generated interest far beyond the narrowly defined disciplinary borders of economic history. Econometricians, geographers, evolutionary biologists, linguists, demographers, sociologists, and computer scientists are beginning to exploit the rich quantitative history of Africa using their own methods. Economic historians should welcome this trend, not only because it generates an audience for their work but also because it provides new ways to test conjectures and hypotheses. The risk, however, is that some economic historians may be left behind, anchored to methods created at a time when quantitative data were less available. Many of the scholars who lack the statistical skills that the data revolution requires are based in Africa, geographically isolated from the interdisciplinary opportunities to acquire them. Of the more than fifty authors cited in the section above on the new economic history of Africa, fewer than ten are from Africa. 26

Addressing this dearth of African scholars who are equipped to participate in the field is necessary for reasons of both fairness and intellectual diversity. As the new African economic history has demonstrated, the negative effects of colonialism still linger into the present; redressing these injustices by helping Africans to engage in, and contribute to, their own history is the first step to a more just intellectual discourse. But drawing African scholars into the field is likely to create more than just equality. It will also encourage the development of new questions, sources, approaches, and answers that will broaden and deepen the scope of African economic history.

A debate in a recent issue of Economic History of Developing Regions summarizes the factors that contribute to the current intellectual segregation. Green and Nyambara averred that economic history research at African universities “is not only strong, but remained vibrant even when African economic history was on the decline at universities elsewhere. The lack of visible output in major economic history journals is thus not a sign of weakness. Instead it is an effect of the increased methodological specialization of economic history in the Western world.” Green and Nyambara thus promote the engagement of Western economic historians with African scholars to prevent “regional isolation.” In response, Austin suggested that “resource constraints,” along with “institutional constraints and intellectual priorities,” are responsible for the slow adoption of quantitative techniques by African scholars: “The overwhelming priority that economics departments in Africa rightly give to the study of current problems does not seem to be combined with an awareness of the uses of history in fulfilling this mission. Meanwhile, the institutionalization of the humanities/social sciences divide in many universities has made it less likely that history graduates will be equipped to combine qualitative and quantitative techniques, let alone focus on the latter.” 27

Although online access may bring African students into closer contact with new methodological tools, the existing disparities may well become more entrenched. It is not entirely clear which incentives will minimize the likelihood of a continued isolation of African scholars, or who will offer them. One approach endorsed by Green and Nyambara, as well as by Austin, is to encourage collaborative work through joint funding applications and research programs. The danger of this strategy, however, is that the non-African partners will tend to drive the research agendas. Another, more promising, approach is for European and U.S. universities to recruit and subsidize Ph.D. students from Africa who could begin the slow but sustainable process of training their African colleagues and students in the new methods when they return. 28

African economic history has already gained much from the data revolution of the past two decades. Although the major historiographical debates persist, we now know, for example, that the slave trade has made Africans wary of both their neighbors and their political leaders, that real wages during the early twentieth century were higher in many African countries than in Asian countries, and that colonial railways boosted production and altered the spatial distribution of cities. The continuing projects to transcribe and digitize large numbers of colonial and postcolonial records are likely to advance our knowledge about Africa’s economic past. 29

African economic history is, however, not just about the past. A major issue for economic historians concerns how to explain Africa’s apparent failure to sustain growth and development. The renaissance of African economic history at the start of the twenty-first century has coincided with a rise in African economies. But for a variety of external and internal reasons, the momentum seems to be slowing as pessimism about Africa’s prospects increases. Can the tools of the data revolution bring a new understanding to Africa’s past and promote policies for a better future? The answer is likely to be negative if the African beneficiaries of this would-be prosperous future cannot take part in the conversation and if non-African scholars in Europe or the United States fail to establish long-term partnerships with African universities and their faculties.

“The past is never dead. It isn’t even past,” William Faulkner once said. This statement is nowhere more apt than in Africa, where the deleterious effects of colonialism and slavery persist. Equipping African scholars with the tools to take part in the data revolution is not only necessary to redress the inequalities of the past but also vital to build a thriving interdisciplinary academic discourse.

Eric D. Feigelson and G. Jogesh Babu, “Big Data in Astronomy,” Significance , IX (2012), 22–25; Liran Einav and Jonathan Levin, “The Data Revolution and Economic Analysis,” NBER Working Paper No. 19035 (2013); Vivien Marx, “Biology: The Big Challenges of Big Data,” Nature , CDXCVIII (2013), 255–260; A.D. Howe et al., “Big Data: The Future of Biocuration,” ibid. , CDLV (2008), 47–50; Andrew McAfee and Erik Brynjolfsson. “Big Data: The Management Revolution,” Harvard Business Review , XC (2012), 60–66. Schmidt’s figures have been disputed ( http://readwrite.com/2011/02/07/are-we-really-creating-as-much ), but the point is that the quantity of data is increasing exponentially.

Patrick Manning, Big Data in History (Basingstoke, 2013).

Marion Johnson, “The Cowrie Currencies of West Africa Part I,” Journal of African History , XI (1970), 17–49; idem , “Cloth as Money: The Cloth Strip Currencies of Africa,” Textile History , XI (1980), 193–202; Robert Szereszewski, “The Process of Growth in Ghana, 1891–1911,” Journal of Development Studies , I (1965), 123–141; Anthony G. Hopkins, “Economic Imperialism in West Africa: Lagos, 1880–1921,” Economic History Review , XXI (1968), 580–606; Manning, “Slaves, Palm Oil, and Political Power on the West African Coast,” African Historical Studies , II (1969), 279–288; David Eltis, “The Export of Slaves from Africa, 1821–1843,” Journal of Economic History , XXVII (1977), 409–433; Ralph A. Austen, “Slavery among Coastal Middlemen: The Duala of Cameroon,” in Suzanne Miers and Igor Kopytoff (eds.), Slavery in Africa: Historical and Anthropological Perspectives (Madison, 1977), 305–333.

Erik Green and Pius Nyambara, “The Internationalization of Economic History: Perspectives from the African Frontier,” Economic History of Developing Regions , XXX (2015), 68–78 (doi:10.1080/20780389.2015.1025744)—argue that African economic history declined only in Western institutions; Gareth Austin—“African Economic History in Africa,” ibid. , 79–94 (doi:10.1080/20780389.2015.1033686)—suggests that it was also in decline in Africa. See also Morten Jerven, “African Growth Recurring: An Economic History Perspective on African Growth Episodes, 1690–2010,” ibid. , XXV (2010), 127–154; Austin and Stephen Broadberry, “Introduction: The Renaissance of African Economic History,” Economic History Review , LXVII (2014), 893–906; Fourie and Leigh Gardner, “The Internationalization of Economic History: A Puzzle,” Economic History of Developing Regions , XXIX (2014), 1–14.

Daron Acemoglu, Simon Johnson, and James A. Robinson, “The Colonial Origins of Comparative Development: An Empirical Investigation,” American Economic Review , XCI (2001), 1369–1401; Nathan Nunn, “The Long-Term Effects of Africa’s Slave Trades,” Quarterly Journal of Economics , CXXIII (2008), 139–176; Philip D. Curtin, The Atlantic Slave Trade: A Census (Madison, 1972); Joseph E. Inikori, “Measuring the Atlantic Slave Trade: An Assessment of Curtin and Anstey,” Journal of African History , XVII (1976), 197–223; Paul E. Lovejoy, “The Impact of the Atlantic Slave Trade on Africa: A Review of the Literature,” Journal of African History , XXX (1989), 365–394; Eltis and David Richardson (eds.), Extending the Frontiers: Essays on the New Transatlantic Slave Trade Database (New Haven, 2008); Manning, Slavery and African Life: Occidental, Oriental, and African Slave Trades (New York, 1990), LXVII; Lovejoy, The African Diaspora: Revisionist Interpretations of Ethnicity, Culture and Religion under Slavery (Boston, 1997); Ewout Frankema and Marlous van Waijenburg, “Structural Impediments to African Growth? New Evidence from Real Wages in British Africa, 1880–1965,” Journal of Economic History , LXXII (2012), 895–926. The reason for the criticism is more explicit in the working paper version, available at http://eh.net/eha/wp-content/uploads/2013/11/FrankemaWaijenburg_0.pdf . Nunn and Leonard Wantchekon, “The Slave Trade and the Origins of Mistrust in Africa,” American Economic Review , CI (2011), 3221–3252; Nunn and Diego Puga, “Ruggedness: The Blessing of Bad Geography in Africa,” Review of Economics and Statistics , XCIV (2012), 20–36.

Marcella Alsan, “The Effect of the TseTse Fly on African Development,” American Economic Review , CV (2015), 382–410; Sambit Bhattacharyya, “Root Causes of African Underdevelopment,” Journal of African Economies , XVIII (2009), 745–780; James Fenske and Namrata Kala, “Climate and the Slave Trade,” Journal of Development Economics , CXII (2015), 19–32; Kostadis J. Papaioannou, “Climate Shocks and Conflict: Evidence from Colonial Nigeria,” African Economic History Network Working Paper Series, 17 (2014); Papaioannou and Michiel de Haas, “Climate Shocks, Cash Crops and Resilience: Evidence from Colonial Tropical Africa,” paper presented at the World Economic History Congress, August 3–8, 2015, Kyoto.

Stelios Michalopoulos and Papaioannou, “Pre‐Colonial Ethnic Institutions and Contemporary African Development,” Econometrica , LXXXI (2013), 113–152; idem , “Further Evidence on the Link between Pre-colonial Political Centralization and Comparative Economic Development in Africa,” Economics Letters , CXXVI (2015), 57–62; idem , “National Institutions and Subnational Development in Africa,” Quarterly Journal of Economics , CXXIX (2014) 151–213; Timothy Besley and Marta Reynal-Querol, “The Legacy of Historical Conflict: Evidence from Africa,” American Political Science Review , CVIII (2014), 319–336; Nonso Obikili, “An Examination of Subnational Growth in Nigeria: 1999–2012,” South African Journal of Economics , LXXXXIII (2015), 335–357; Denis Cogneau and Yannick Dupraz, “Questionable Inference on the Power of Pre-Colonial Institutions in Africa,” PSE Working Papers, No. 2014–25 (2014).

The Afrobarometer survey series measures public attitudes about economic, political, and social matters in several sub-Saharan African countries. More information is available at: http://www.afrobarometer.org/ . Martin Abel, “Long-Run Effects of Forced Removal under Apartheid on Social Capital,” paper presented at African Economic History meetings, London School of Economics and Political Science, October 25–26, 2014; Michalopoulos, Louis Putterman, and David N. Weil, “The Influence of Ancestral Lifeways on Individual Economic Outcomes in Sub-Saharan Africa,” paper discussed at NBER Reporter Number 4: Program and Working Group Meetings (2014).

Austin, “The ‘Reversal of Fortune’ Thesis and the Compression of History: Perspectives from African and Comparative Economic History,” Journal of International Development , XX (2008), 996–1027; Jerven, Africa: Why Economists Get it Wrong (London, 2015); idem and Deborah Johnston, “Statistical Tragedy in Africa? Evaluating the Data Base for African Economic Development,” Journal of Development Studies , LI (2015), 111–115; Alex de Waal, “Liberating African Economic History from the Tyranny of Econometrics,” review of Jerven, Africa: Why Economists Get it Wrong , available at: http://africanarguments.org/2015/06/24/liberating-african-economic-history-from-the-tyranny-of-econometrics-by-alex-de-waal/ .

Manning, “Historical Datasets on Africa and the African Atlantic,” Journal of Comparative Economics , XL (2012), 604–607; Frankema and Jerven, “Writing History Backwards or Sideways: Towards a Consensus on African Population, 1850–2010,” Economic History Review , LXVII (2014), 907–931; Fourie and Erik Green, “The Missing People: Accounting for the Productivity of Indigenous Populations in Cape Colonial History,” Journal of African History , LVI (2015), 195–215; Frankema and van Waijenburg, “Structural Impediments”; Rönnbäck Klas, “Living Standards on the Pre-Colonial Gold Coast: A Quantitative Estimate of African Laborers’ Welfare Ratios,” European Review of Economic History , XVIII (2014), 185–202; Sophia Du Plessis and Stan du Plessis, “Happy in the Service of the Company: The Purchasing Power of VOC Salaries at the Cape in the 18 th Century,” Economic History of Developing Regions , XXVII (2012), 125–149; Jerven, “A West African Experiment: Constructing a GDP Series for Colonial Ghana, 1891–1950,” Economic History Review , LXVII (2014), 964–992; Leandro Prados de la Escosura, “Output per Head in Pre-Independence Africa: Quantitative Conjectures,” Economic History of Developing Regions , XXVII (2012), 1–36; Fourie and Jan Luiten van Zanden, “GDP in the Dutch Cape Colony: The National Accounts of a Slave‐Based Society,” South African Journal of Economics , LXXXI (2013), 467–490; Wantchekon, Marko Klašnja, and Natalija Novta, “Education and Human Capital Externalities: Evidence from Colonial Benin,” Quarterly Journal of Economics , CXXX (2015), 703–757; Fourie and Dieter von Fintel, “Settler Skills and Colonial Development: The Huguenot Wine‐Makers in Eighteenth‐Century Dutch South Africa,” Economic History Review , LXVII (2014), 932–963; Nonso, Obikili, “Social Capital and Human Capital in the Colonies: A Study of Cocoa Farmers in Western Nigeria,” Economic History of Developing Regions , XXX (2015), 1–22 (doi: 10.1080/20780389.2015.1012712); Jörg Baten, and Fourie, “Numeracy of Africans, Asians, and Europeans during the Early Modern Period: New Evidence from Cape Colony Court Registers,” Economic History Review , LXVIII (2015), 632–656; Gardner, Taxing Colonial Africa: The Political Economy of British Imperialism (New York, 2012); Fourie, Ada Jansen, and Krige Siebrits, “Public Finances and Private Company Rule: The Dutch Cape Colony (1652–1795),” New Contree , 68 (December 2013), 1–22; Frankema and van Waijenburg, “Metropolitan Blueprints of Colonial Taxation? Lessons from Fiscal Capacity Building in British and French Africa, c. 1880–1940,” Journal of African History , LV (2014), 371–400; Frankema, “Colonial Taxation and Government Spending in British Africa, 1880–1940: Maximizing Revenue or Minimizing Effort?” Explorations in Economic History , XLVIII (2011), 136–149; Papaioannou and Angus Edwin Dalrymple-Smith, “Political Instability and Discontinuity in Nigeria: The Pre-Colonial Past and Public Goods Provision under Colonial and Post-Colonial Political Orders,” Economics of Peace and Security Journal , X (2015), 40–53; Remi Jedwab and Alexander Moradi, “The Permanent Effects of Transportation Revolutions in Poor Countries: Evidence from Africa,” Review of Economics and Statistics (forthcoming).

Robert W. Fogel et al., “Secular Changes in American and British Stature and Nutrition,” Journal of Interdisciplinary History , XIV (1983), 445–481; idem , Stanley L. Engerman, and James Trussell, “Exploring the Uses of Data on Height: The Analysis of Long-Term Trends in Nutrition, Labor Welfare, and Labor Productivity,” Social Science History , VI (1982), 401–421; Richard H. Steckel, “Height and Per Capita Income,” Historical Methods: A Journal of Quantitative and Interdisciplinary History , XVI (1983), 1–7; Steckel, “Stature and the Standard of Living,” Journal of Economic Literature , XXXIII (1995), 1903–1940; George Alter, “Height, Frailty, and the Standard of Living: Modeling the Effects of Diet and Disease on Declining Mortality and Increasing Height,” Population Studies , LVIII (2004), 265–279; Angus Deaton, “Height, Health, and Inequality: The Distribution of Adult Heights in India,” American Economic Review , XCVIII (2008), 468; John Komlos, “Shrinking in a Growing Economy? The Mystery of Physical Stature during the Industrial Revolution,” Journal of Economic History , LVIII (1998), 779–802; idem , “Anomalies in Economic History: Toward a Resolution of the ‘Antebellum Puzzle,’” ibid. , LVI (1996), 202–214; idem and Bjorn Alecke, “The Economics of Antebellum Slave Heights Reconsidered,” Journal of Interdisciplinary History , XXVI (1996), 437–457.

Eltis, “Nutritional Trends in Africa and the Americas: Heights of Africans, 1819–1839,” Journal of Interdisciplinary History , XII (1982), 453–475; idem , “Welfare Trends among the Yoruba in the Early Nineteenth Century: The Anthropometric Evidence,” Journal of Economic History , L (1990), 521–540; Alexander Moradi and Baten, “Inequality in Sub-Saharan Africa: New Data and New Insights from Anthropometric Estimates,” World Development , XXXIII (2005), 1233–1265. Other pioneering work included David E. Sahn and David C. Stifel, “Urban-Rural Inequality in Living Standards in Africa,” Journal of African Economies , XII (2003), 564–597. Moradi, “Confronting Colonial Legacies: Lessons from Human Development in Ghana and Kenya, 1880–2000,” Journal of International Development , XX (2008), 1107–1121, idem , “Towards an Objective Account of Nutrition and Health in Colonial Kenya: A Study of Stature in African Army Recruits and Civilians, 1880–1980,” Journal of Economic History , LXIX (2009), 719–754.

Austin, Baten, and Bas Van Leeuwen, “The Biological Standard of Living in Early Nineteenth‐Century West Africa: New Anthropometric Evidence for Northern Ghana and Burkina Faso,” Economic History Review , LXV (2012), 1280–1302; Denis Cogneau and Léa Rouanet. “Living Conditions in Côte d’Ivoire and Ghana, 1925–1985: What Do Survey Data on Height Stature Tell Us?” Economic History of Developing Regions , XXVI (2011), 55–82; Kris Inwood and Oliver Masakure, “Poverty and Physical Well-Being among the Coloured Population in South Africa,” Economic History of Developing Regions , XXVIII (2013), 56–82; Joseph M. Prince and Steckel, “Nutritional Success on the Great Plains: Nineteenth-Century Equestrian Nomads,” Journal of Interdisciplinary History , XXXIII (2003), 353–384; Howard Bodenhorn, “The Mulatto Advantage: The Biological Consequences of Complexion in Rural Antebellum Virginia,” ibid. (2002), 21–46; Aravinda Meera Guntupalli and Baten, “The Development and Inequality of Heights in North, West, and East India 1915–1944,” Explorations in Economic History , XLIII (2006), 578–608; Inwood, Les Oxley, and Evan Roberts, “Physical Growth and Ethnic Inequality in New Zealand Prisons, 1840–1975,” History of the Family , XX (2015), 249–269 (doi:10.1080/1081602X.2015.1006653); Baten, Ines Pelger, and Linda Twrdek, “The Anthropometric History of Argentina, Brazil and Peru during the 19 th and early 20 th Century,” Economics Human Biology , VII (2009), 319–333; Baten, Mojgan Stegl, and Pierre van der Eng, “The Biological Standard of Living and Body Height in Colonial and Post-Colonial Indonesia, 1770–2000,” Journal of Bioeconomics , XV (2013), 103–122; Baten and Matthias Blum, “Growing Tall but Unequal: New Findings and New Background Evidence on Anthropometric Welfare in 156 Countries, 1810–1989,” Economic History of Developing Regions , XXVII (2012), S66–S85.

Cogneau and Moradi, “Borders that Divide: Education and Religion in Ghana and Togo since Colonial Times,” Journal of Economic History , LXXIV (2014), 694–729; Jedwab and Moradi, “Permanent Effects.”

Martine Mariotti, “Fathers’ Employment and Sons’ Stature: The Long-Run Effects of a Positive Regional Employment Shock in South Africa’s Mining Industry,” Economic Development and Cultural Change , LXIII (2015), 485–514; Taryn Dinkelman and Mariotti, “Does Labor Migration Affect Human Capital in the Long Run? Evidence from Malawi,” unpub. paper (Dartmouth College, 2014).

See for example, Bodenhorn, Timothy W. Guinnane, and Thomas A. Mroz, “Sample-Selection Biases and the ‘Industrialization Puzzle,’” NBER Working Paper 21249 (2015); idem , “Caveat Lector: Sample Selection in Historical Heights and the Interpretation of Early Industrializing Economies,” NBER Working Paper 19955 (2014). Fourie, Inwood, and Mariotti, “Can Historical Changes in Military Technology Explain the Industrial Growth Puzzle?” unpub. paper (London School of Economics, 2014).

Fourie, Inwood, and Mariotti, “‘Poor South Africa! Will No Nice English People Ever Come Out Here?’—The South African Constabulary of the Second South African War,” unpub. paper (Stellenbosch University, 2015).

Maya R. Gupta, Nathaniel P. Jacobson, and Eric K. Garcia, “OCR Binarization and Image Pre-Processing for Searching Historical Documents,” Pattern Recognition , XL (2007), 389–397; Sergey Milyaev et al., “Image Binarization for End-to-End Text Understanding in Natural Images,” paper presented at the Twelfth International Conference on Document Analysis and Recognition (ICDAR), August 25–28, 2013, Washington, D.C.

See, for example, James J. Feigenbaum, “Automated Census Record Linking,” unpub. paper (Harvard University, 2015), available at: http://scholar.harvard.edu/files/jfeigenbaum/files/feigenbaum-censuslink.pdf .

Robert Feenstra, Robert Inklaar, and Marcel Timmer, “The Next Generation of the Penn World Table,” NBER Working Paper 19255 (2013); Jutta Bolt and van Zanden, “The Maddison Project: Collaborative Research on Historical National Accounts,” Economic History Review , LXVII (2014), 627–651; Manning, Big Data . For the Global Collaboratory, see https://collab.iisg.nl/web/labourrelations . Van Zanden et al. (eds.), How Was Life? Global Well-Being since 1820 (Paris, 2014).

Frankema and Jerven, “Writing History Backwards or Sideways”; Frankema, “The Origins of Formal Education in Sub-Saharan Africa: Was British Rule More Benign?” European Review of Economic History , XVI (2012), 335–355; idem , “Colonial Taxation and Government Spending in British Africa, 1880–1940: Maximizing Revenue or Minimizing Effort?” Explorations in Economic History , XLVIII (2011), 136–149. Idem and van Waijenburg, “Structural Impediments,” won the 2013 Arthur H. Cole prize for the best paper published in the Journal of Economic History .

Thomas Piketty (trans. Arthur Goldhammer), Capital in the Twenty-First Century (Cambridge, Mass., 2014); Piketty and Emmanuel Saez, “Inequality in the Long Run,” Science , CCCXLIV (2014), 838–843. For innovative sources, see, for example, Inikori’s attempt to extract African market-price data from the slave trade records to measure precolonial African economies at https://www.wilsoncenter.org/person/joseph-inikori ; Dalrymple-Smith and Frankema’s collection of prices from slave-ship provisions at http://www.wageningenur.nl/upload_mm/7/8/8/aa9b6927-3229-4323-b6f4-0a343dea34b4_Dalrymple-Smith%20Frankema%20-%20Paper.pdf ; Sophia du Plessis and Stan du Plessis, “Happy in the Service of the Company”; Willem Boshoff and Fourie, “When did Globalization Begin in South Africa?” Stellenbosch Working Paper Series WP10/2015.

Jerven, “African Growth Recurring”; idem , “For Richer, For Poorer: GDP Revisions and Africa’s Statistical Tragedy,” African Affairs , CXII (2012), 138–147; idem , Poor Numbers: How we are Misled by African Development Statistics and What to Do about It (Ithaca, 2013); idem , Africa: Why Economists Get It Wrong ; Bolt and Gardner, “De-compressing History? Pre-colonial Institutions and Local Government Finance in British Colonial Africa,” paper presented at the Economic History Association meeting, September 11, 2015, Nashville; Bolt and Ellen Hillbom, “Potential for Diversification? The Role of the Formal Sector in Bechuanaland Protectorate’s Economy, 1900–65,” Economic History of Developing Regions , XXX (2015), 1–30.

Felix Meier zu Selhausen, “Missionaries and Female Empowerment in Colonial Uganda: New Evidence from Protestant Marriage Registers, 1880–1945,” Economic History of Developing Regions , XXIX (2014), 74–112; idem and Jacob Weisdorf, “A Colonial Legacy of African Gender Inequality? Evidence from Christian Kampala, 1895–2011,” Economic History Review , LXIX (2016), 229–257.

Fourie and Green, “Missing People”; Baten and Fourie, “Numeracy of Africans”; Jeanne Cilliers and Fourie, “New Estimates of Settler Life Span and Other Demographic Trends in South Africa, 1652–1948,” Economic History of Developing Regions , XXVII (2012), 61–86.

For evidence of Africans’ lack of the technical skills to keep up with advances in their own economic history, see the list of contributors to the special issue “The Economics of Apartheid,” published in Economic History of Developing Regions , XXIX (2014). For the introduction, see Mariotti and Fourie, “The Economics of Apartheid: An Introduction,” Economic History of Developing Regions , XXIX (2014), 113–125. Gardner and Fourie discuss this problem for other developing regions in “The Internationalization of Economic History: A Puzzle,” Economic History of Developing Regions , XXIX (2014), 1–14.

Green and Nyambara, “Internationalization of Economic History,” 68; Austin, “African Economic History in Africa,” 91.

An ongoing project to write a freely downloadable textbook about African economic history recognizes that technology can leapfrog many of the traditional constraints to accessing higher education. See http://www.aehnetwork.org/textbook/ .

Nunn and Wantchekon, “Slave Trade and the Origins of Mistrust”; Frankema and van Waijenburg, “Structural Impediments”; Jedwab and Moradi, “Permanent Effects.”


