
Defining Research Data

One definition of research data is "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings" (OMB Circular A-110).

Research data covers a broad range of types of information (see examples below), and digital data can be structured and stored in a variety of file formats.

Note that properly managing data (and records) does not necessarily equate to sharing or publishing that data.

Examples of Research Data

Some examples of research data:

  • Documents (text, Word), spreadsheets
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audiotapes, videotapes
  • Photographs, films
  • Protein or genetic sequences
  • Test responses
  • Slides, artifacts, specimens, samples
  • Collection of digital objects acquired and generated during the process of research
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows
  • Standard operating procedures and protocols

Exclusions from Sharing

In addition to the other records to manage (below), some kinds of data may not be sharable due to the nature of the records themselves, or to ethical and privacy concerns. As defined by the OMB, research data do not include:

  • preliminary analyses,
  • drafts of scientific papers,
  • plans for future research,
  • peer reviews, or
  • communications with colleagues

Research data also do not include:

  • Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
  • Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

Some types of data, particularly software, may require a special license to share. In those cases, contact the Office of Technology Transfer to review considerations for software generated in your research.

Other Records to Manage

Although they might not be addressed in an NSF data management plan, the following research records may also be important to manage during and beyond the life of a project.

  • Correspondence (electronic mail and paper-based correspondence)
  • Project files
  • Grant applications
  • Ethics applications
  • Technical reports
  • Research reports
  • Signed consent forms

Adapted from Defining Research Data by the University of Oregon Libraries.

University of Leeds


Research data management explained

What is research data?

Research data is any information that has been collected, observed, generated or created to validate original research findings.

Although usually digital, research data also includes non-digital formats such as laboratory notebooks and diaries.

Types of research data

Research data can take many forms. It might be:

  • documents, spreadsheets
  • laboratory notebooks, field notebooks, diaries
  • questionnaires, transcripts, codebooks
  • audiotapes, videotapes
  • photographs, films
  • test responses
  • slides, artefacts, specimens, samples
  • collections of digital outputs
  • database contents (video, audio, text, images)
  • models, algorithms, scripts
  • contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • methodologies and workflows
  • standard operating procedures and protocols

Non-digital data

Non-digital data such as laboratory notebooks, ice-core samples and sketchbooks are often unique. You should assess the long-term value of any non-digital data and plan how you will describe and retain them.

You could digitise the materials, but this may not be possible for all types of data.

The University of Leeds research data repository (Research Data Leeds) describes digital materials and can also be used to create records for physical artefacts.

Please contact the team if you would like to discuss requirements for non-digital data.

Sources of research data

Research data can be generated for different purposes and through different processes.

  • Observational data is captured in real-time, and is usually irreplaceable, for example sensor data, survey data, sample data, and neuro-images.
  • Experimental data is captured from lab equipment. It is often reproducible, but this can be expensive. Examples of experimental data are gene sequences, chromatograms, and toroid magnetic field data.
  • Simulation data is generated from test models where model and metadata are more important than output data. For example, climate models and economic models.
  • Derived or compiled data has been transformed from pre-existing data points. It is reproducible if lost, but this would be expensive. Examples are data mining, compiled databases, and 3D models.
  • Reference or canonical data is a static or organic conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated. For example, gene sequence databanks, chemical structures, or spatial data portals.



What is research data?

Research data  is any information that has been collected, observed, generated or created to validate original research findings. Research data may be arranged or formatted in such a way as to make it suitable for communication, interpretation and processing. Data comes in many formats, both digital and physical.

Diagram illustrating the research cycle

More information:

  • What is research data? A useful definition from the University of Leeds
  • What is research data? - PDF (3422 KB) Guide prepared by the Australian Research Data Commons (ARDC)


Common formats include:

  • documents, spreadsheets
  • laboratory notebooks, field notebooks, diaries
  • questionnaires, transcripts, codebooks
  • audiotapes, videotapes
  • photographs, films
  • test responses
  • slides, artefacts, specimens, samples
  • models, algorithms, scripts
  • contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • methodologies and workflows
  • standard operating procedures and protocols

  • URL: https://guides.library.unisa.edu.au/ResearchDataManagement


Research Data Management


There are multiple definitions of research data. The one adopted by the Swiss National Science Foundation (SNSF) is as follows:

“Research data are the evidence that underpins the answer to the research question, and can be used to validate findings regardless of its form (e.g. print, digital, or physical).”  Concordat on Open Research Data , published on 28 July 2016

You might notice that the form of the data matters less than what it is used for. If something – anything – underpins research findings, it becomes research data.

Of course, saying that financial data and statistics are data is easy, but in the social sciences data can be both qualitative and quantitative. A historian or an anthropologist might, for example, base their research on texts, photographs, films, sounds, or interviews – all of this is research data. Surveys are data too – both the questions and the answers. Protocols, lab notes, field notes, transcriptions, and codebooks all fit the definition, and your funding organisation expects them to be stored, documented, and shared appropriately.

Research Data FAQ

What about non-digital data?

Objects, physical samples and artefacts are generally excluded for practical reasons, but their description and recorded characteristics are considered data within that context. Your handwritten lab notes or sketchbooks can, and often should, be digitised for sharing.

What about software and code?

Did the researcher write code or prepare software to answer their research question? If so, the code qualifies as research data. Software that was not developed for the project is only considered a tool – Stata and R are no more research data than a voice recorder is for an anthropologist – but a custom package should probably be preserved and shared.

What about data I got from someone else?

The question of secondary data usage and copyright is a long and separate matter, which will be covered here soon. Secondary data are research data, but licensing and copyright mean they will need specific treatment, and they might not be shareable depending on their origin. You might even have transformed the data so much to fit your needs that they become original – really, each case will be different.

What about things that are just related to my research?

Just because something happened during the research process doesn't mean outputs were based on it. Early musings, drafts, peer reviews, plans for future research or e-mails to colleagues are not research data: they are not necessary to support your findings.

What about ethics, privacy, security, or confidentiality requirements?

Don't worry: the SNSF doesn't expect you to share any and all data used in your research. Funding institutions simply want you to manage and conserve it – or sometimes anonymise or destroy it – following best practices, including those intended to protect research subjects from any harm or inconvenience.

What if I just have NO data?

In most cases, researchers do use or produce data. There are exceptions in specific disciplines, though: many researchers in international law, for example, base their research on legal texts and theory rather than on empirical evidence, and little beyond their bibliography would qualify as research data.

Don't worry: this does not mean a project will not be accepted. It just means that a data management plan will not be necessary for a grant application. But what is a DMP, you ask? Read further to learn more!

Would You Like to Know More?

Join one of our Research Data Management Workshops ! These are usually 45-minute sessions on a specific aspect of research data management, with opportunities for discussions and questions.

  • URL: https://libguides.graduateinstitute.ch/rdm


Research data defined

'the evidence that underpins the answer to the research question' Concordat on Open Research Data

'recorded factual material commonly retained by and accepted in the scientific [research] community as necessary to validate research findings' EPSRC Policy Framework on Research Data

Research data are the raw materials collected, processed and studied in the undertaking of research. They are the evidential basis that substantiates published research findings.

They may be primary data generated or collected by the researcher, or secondary data collected from existing sources and processed as part of the research activity.

In addition to the 'raw' data, research data include information about the means necessary to generate data or replicate results, such as computer code, experimental methods and instruments used, and essential interpretive and contextual information, e.g. specifications of variables.

The raw data of research may exist in digital and non-digital formats, and may be broadly divided into five classes:

Observational

Facts recorded directly in real time from the physical and social environment, e.g. measurements collected by weather sensors, species abundance surveys, archaeological samples, brain scan images, experience and opinion surveys in the social sciences. These data are often unique to time and place and by definition cannot be reproduced.

Experimental

Data collected as the outputs of field or laboratory experiments and complex analytical processes, e.g. clinical trial data, chemical analyses of physical samples, DNA sequencing of organic material, field trial results. These data are generally in principle reproducible, assuming the experimental conditions can be replicated.

Simulation

Data generated by means of computational 'virtual experiments', often used to model complex systems and processes, e.g. climate and weather simulations, models of market processes. These data are usually reproducible, given information about the model, the code and computing environment used to execute the model, and any input conditions. This information may in fact be more important than the output data.

Derived or compiled

Datasets produced by processing or combining source data, e.g. databases compiled by extraction of information from multiple secondary sources, collections of digitized materials, corpora collected by means of text mining.

Reference

Published and curated data, usually existing as part of managed collections, e.g. national statistics archives, crystallographic databases, gene banks.

Robert Darby , Research Data Manager


0118 378 6161


Handbuch Methoden der Politikwissenschaft, pp. 1–18

Understanding and Implementing Research Data Management

  • Sebastian Netscher & Alexia Katsanidou
  • Living reference work entry
  • First Online: 01 May 2018


Part of the book series: Springer Reference Sozialwissenschaften (SRS)

Research Data Management is an integral part of good scientific practice. Its relevance increases as funders, journals, and other scientific outlets make it a requirement in the context of sharing data and concepts on data harmonization. Consequently, it has become imperative that researchers know what research data management is, how it supports their research project, and how it helps satisfy funders' requirements on sharing. This chapter introduces research data management, describes activities of data handling, and highlights legal issues. Based on this introduction, we discuss different objectives of research data management, such as ensuring replicability or sharing of research data generated in a research project, as well as sharing concepts for data harmonization and merging in secondary analysis.


A computer science method that automatically links available data over the internet.

Although most research institutes nowadays have automated back-up processes, researchers should be aware of how these procedures work and how to access back-ups if needed.

It is important to note that licenses can be irrevocable and may not be suitable for personal data requiring confidentiality.

The Consortium of European Social Science Data Archives (CESSDA) is a pan-European Research Infrastructure (ERIC) that brings together social science data archives across Europe to provide large-scale, integrated, and sustainable data services to the social sciences.

At the very least, researchers should not preclude data sharing through restrictive formulations in the consent form, such as 'data will only be used within the research project' (Doorn 2010).

In some cases, data anonymization might not be appropriate due to the loss of information it entails. In such cases, researchers might share their data weakly anonymized or even non-anonymized, controlling data access and re-use through a restrictive license or by sharing data via a so-called secure data center.

BDSG. 1990. Bundesdatenschutzgesetz. https://www.gesetze-im-internet.de/bdsg_1990/ . Accessed 22 Feb 2018.

Beagrie, Charles. 2011. User guide for keeping research data safe. Assessing Costs/Benefits of Research Data Management, Preservation and Re-Use. Version 2.0. https://beagrie.com/KeepingResearchDataSafe_UserGuide_v2.pdf . Accessed 22 Feb 2018.

CESSDA ERIC. 2017. Consortium of European Social Science Data Archives. https://www.cessda.eu/ .

Corti, Louise, Veerle van den Eynden, Libby Bishop, and Matthew Woollard. 2014. Managing and sharing research data. A guide to good practice . London: Sage publications.


DataCite Metadata Working Group. 2016. DataCite Metadata schema documentation for the publication and citation of research data. Version 4.0. DataCite e.V. https://doi.org/10.5438/0012 .

DDI Alliance. 2017. Data Documentation Initiative. http://www.ddialliance.org/ . Accessed 22 Feb 2018.

DFG-Deutsche Forschungsgemeinschaft. 2013. Sicherung guter wissenschaftlicher Praxis. Denkschrift, Empfehlungen der Kommission „Selbstkontrolle in der Wissenschaft“ . Weinheim: WILEY-VCH Verlag. http://www.dfg.de/foerderung/grundlagen_rahmenbedingungen/gwp/ .


Doorn, Peter K. 2010. Preparing data for sharing. Guide to social science data archiving . DANS Data Guide 8. Amsterdam: Pallas Publications/Amsterdam University Press.

Elliot, Mark, Elaine Mackey, Kieron O’Hara, and Caroline Tudor. 2016. The anonymization decision-making framework . Manchester: University of Manchester.

Force11. 2017. The FAIR data principles. https://www.force11.org/group/fairgroup/fairprinciples . Accessed 22 Feb 2018.

GDPR. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data (General Data Protection Regulation). http://eur-lex.europa.eu/legal-content/DE/ALL/?uri=CELEX%3A32016R0679 . Accessed 22 Feb 2018.

GESIS. 2017. Replikationsserver. http://www.gesis.org/replikationsserver/home/ .

Horizon 2020 Programme. 2016. Guidelines on FAIR Data Management in Horizon 2020. Version 3.0. http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf . Accessed 22 Feb 2018.

Huschka, Denis, Claudia Oellers, Notburga Ott, and Gert G. Wagner. 2011. Datenmanagement und Data Sharing- Erfahrungen in den Sozial- und Wirtschaftswissenschaften. In Handbuch Forschungsdatenmanagement , eds. S. Büttner, H.-C. Hobohm, and L. Müller, 35–48. Bad Honnef: Bock+Herchen Verlag.

ISCED. 2011. International standard classification of education. http://uis.unesco.org/en/documents . Accessed 22 Feb 2018.

Jensen, Uwe, Alexia Katsanidou, and Wolfgang Zenk-Möltgen. 2011. Metadaten und Standards. In Handbuch Forschungsdatenmanagement , eds. S. Büttner, H.-C. Hobohm, and L. Müller, 83–100. Bad Honnef: Bock+Herchen Verlag.

Kimpton, Michele, and Carol Minton Morris. 2014. Managing and archiving research data: Local repositories and cloud-based practices. In Research data management. Practical strategies for information professionals , ed. J. M. Ray, 223–238. West Lafayette: Purdue University Press.

Recker, Astrid, Stefan Müller, Jessica Trixa, and Natascha Schumann. 2015. Paving the way for data-centric, open science: An example from the social sciences. Journal of Librarianship and Scholarly Communication 3(2). https://doi.org/10.7710/2162-3309.1227 .

Wallace, David A. 2001. Archiving metadata forum: Report from the recordkeeping metadata working meeting, June 2000. Archival Science 1: 253–269.


Winters, Kristi, and Sebastian Netscher. 2016. Proposed standards for variable harmonization documentation and referencing: A case study using QuickCharmStats 1.1. PLoS ONE 11(2). https://doi.org/10.1371/journal.pone.0147795 .


Author information

Authors and affiliations.

Abteilung Datenarchiv für Sozialwissenschaften, GESIS - Leibniz Institut für Sozialwissenschaften, Köln, Deutschland

Sebastian Netscher & Alexia Katsanidou


Corresponding author

Correspondence to Alexia Katsanidou .

Editor information

Editors and affiliations.

Methodenzentrum Sozialwissenschaft, Goethe-Universität Frankfurt, Frankfurt, Hessen, Germany

Claudius Wagemann

Institute for Political Sciences, University of Duisburg-Essen, Duisburg, Germany

Achim Goerres

Institut für Politikwissenschaft, Goethe-Universität Frankfurt am Main, Frankfurt, Hessen, Germany

Markus Siewert


Copyright information

© 2018 Springer Fachmedien Wiesbaden GmbH, ein Teil von Springer Nature

About this entry

Cite this entry.

Netscher, S., Katsanidou, A. (2018). Understanding and Implementing Research Data Management. In: Wagemann, C., Goerres, A., Siewert, M. (eds) Handbuch Methoden der Politikwissenschaft. Springer Reference Sozialwissenschaften. Springer VS, Wiesbaden. https://doi.org/10.1007/978-3-658-16937-4_4-1


DOI: https://doi.org/10.1007/978-3-658-16937-4_4-1

Received: 22 February 2018

Accepted: 22 February 2018

Published: 01 May 2018

Publisher Name: Springer VS, Wiesbaden

Print ISBN: 978-3-658-16937-4

Online ISBN: 978-3-658-16937-4

eBook Packages: Springer Referenz Sozialwissenschaften und Recht


Data Analysis


What is Data Analysis?

According to the federal government, data analysis is "the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data" ( Responsible Conduct in Data Management ). Important components of data analysis include searching for patterns, remaining unbiased in drawing inference from data, practicing responsible  data management , and maintaining "honest and accurate analysis" ( Responsible Conduct in Data Management ). 

In order to understand data analysis further, it can be helpful to take a step back and ask "What is data?". Many of us associate data with spreadsheets of numbers and values; however, data can encompass much more than that. According to the federal government, data is "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings" (OMB Circular A-110). This broad definition can include information in many formats.

Some examples of types of data are as follows:

  • Photographs 
  • Hand-written notes from field observation
  • Machine learning training data sets
  • Ethnographic interview transcripts
  • Sheet music
  • Scripts for plays and musicals 
  • Observations from laboratory experiments ( CMU Data 101 )

Thus, data analysis includes the processing and manipulation of these data sources in order to gain additional insight from data, answer a research question, or confirm a research hypothesis. 

Data analysis falls within the larger research data lifecycle (diagram: University of Virginia).

Why Analyze Data?

Through data analysis, a researcher can gain additional insight from data and draw conclusions to address the research question or hypothesis. Use of data analysis tools helps researchers understand and interpret data. 

What are the Types of Data Analysis?

Data analysis can be quantitative, qualitative, or mixed methods. 

Quantitative research typically involves numbers and "close-ended questions and responses" ( Creswell & Creswell, 2018 , p. 3). Quantitative research tests variables against objective theories, usually measured and collected on instruments and analyzed using statistical procedures ( Creswell & Creswell, 2018 , p. 4). Quantitative analysis usually uses deductive reasoning. 

Qualitative  research typically involves words and "open-ended questions and responses" ( Creswell & Creswell, 2018 , p. 3). According to Creswell & Creswell, "qualitative research is an approach for exploring and understanding the meaning individuals or groups ascribe to a social or human problem" ( 2018 , p. 4). Thus, qualitative analysis usually invokes inductive reasoning. 

Mixed methods  research uses methods from both quantitative and qualitative research approaches. Mixed methods research works under the "core assumption... that the integration of qualitative and quantitative data yields additional insight beyond the information provided by either the quantitative or qualitative data alone" ( Creswell & Creswell, 2018 , p. 4). 

  • URL: https://guides.library.georgetown.edu/data-analysis



Research data articles from across Nature Portfolio

Research data comprises research observations or findings, such as facts, images, measurements, records and files in various formats, and can be stored in databases. Data publication and archiving is important for the reuse of research data and the reproducibility of scientific research.


Accessible computing platforms democratize neuroimaging data analysis

Several research groups are making it easier for other neuroscientists to analyze large datasets by providing tools that can be accessed and used from anywhere in the world.

  • Lucina Q. Uddin

Related Subjects

Latest research and reviews.


FastMRI Prostate: A public, biparametric MRI dataset to advance machine learning for prostate cancer imaging

  • Radhika Tibrewala
  • Patricia M. Johnson


The advent of Alzheimer treatments will change the trajectory of human aging

The advent of plaque-clearing antibodies to the amyloid-β as the first disease-modifying treatment for Alzheimer’s disease will change the course of this disease, the most common type of dementia. Related progress will gradually alter the trajectory of human aging.

  • Dennis J. Selkoe


A construction waste landfill dataset of two districts in Beijing, China from high resolution satellite images


brainlife.io: a decentralized and open-source cloud platform to support neuroscience research

brainlife.io is a one-stop cloud platform for data management, visualization and analysis in human neuroscience. It is web-based and provides access to a variety of tools in a reproducible and reliable manner.

  • Soichi Hayashi
  • Bradley A. Caron
  • Franco Pestilli


High-resolution meteorology with climate change impacts from global climate model data using generative machine learning

Global climate models are challenging to integrate in energy system models because their output data resolution is too coarse. Buster et al. generate high-resolution meteorological data with climate change impacts from global climate model datasets using generative machine learning.

  • Grant Buster
  • Brandon N. Benton
  • Ryan N. King


Standardised Versioning of Datasets: a FAIR–compliant Proposal

  • Alba González–Cebrián
  • Michael Bradford
  • Horacio González–Vélez


News and Comment

Adopt universal standards for study adaptation to boost health, education and social-science research.

  • Dragos Iliescu
  • Samuel Greiff

Understanding the provenance and quality of methods is essential for responsible reuse of FAIR data

  • Tracey L. Weissgerber
  • Małgorzata Anna Gazda
  • Malcolm R. Macleod

Cosmic recipe book updated

On the ethics of informed consent in genetic data collected before 1997.

  • Martin Zieger
  • Maria Eugenia D’Amato


AI and the democratization of knowledge

The solution of the longstanding “protein folding problem” in 2021 showcased the transformative capabilities of AI in advancing the biomedical sciences. AI was characterized as successfully learning from protein structure data , which then spurred a more general call for AI-ready datasets to drive forward medical research. Here, we argue that it is the broad availability of knowledge , not just data, that is required to fuel further advances in AI in the scientific domain. This represents a quantum leap in a trend toward knowledge democratization that had already been developing in the biomedical sciences: knowledge is no longer primarily applied by specialists in a sub-field of biomedicine, but rather multidisciplinary teams, diverse biomedical research programs, and now machine learning. The development and application of explicit knowledge representations underpinning democratization is becoming a core scientific activity, and more investment in this activity is required if we are to achieve the promise of AI.

  • Christophe Dessimoz
  • Paul D. Thomas



Data Module #1: What is Research Data?


Types of Research Data


Data may be grouped into four main types based on the method of collection: observational, experimental, simulation, and derived. The type of research data you collect may affect the way you manage that data. For example, data that are hard or impossible to replace (e.g. the recording of an event at a specific time and place) require extra backup procedures to reduce the risk of data loss. Or, if you will need to combine data points from different sources, you will need to follow best practices to prevent data corruption.


Observational Data

Observational data are captured through observation of a behavior or activity. They are collected using methods such as human observation, open-ended surveys, or the use of an instrument or sensor to monitor and record information -- such as the use of sensors to observe noise levels at the Minneapolis–St. Paul airport. Because observational data are captured in real time, they would be very difficult or impossible to re-create if lost.


Experimental Data

Experimental data are collected through active intervention by the researcher to produce and measure change, or to create difference when a variable is altered. Experimental data typically allow the researcher to determine a causal relationship and are typically projectable to a larger population. These data are often reproducible, but reproducing them can be expensive.


Simulation Data

Simulation data are generated by imitating the operation of a real-world process or system over time using computer test models, for example to predict weather conditions, economic outcomes, chemical reactions, or seismic activity. This method is used to try to determine what would, or could, happen under certain conditions. The test model used is often as important as, or even more important than, the data generated from the simulation.


Derived / Compiled Data

Derived data involves using existing data points, often from different data sources, to create new data through some sort of transformation, such as an arithmetic formula or aggregation. For example, combining area and population data from the Twin Cities metro area to create population density data. While this type of data can usually be replaced if lost, it may be very time-consuming (and possibly expensive) to do so.  
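As an illustration of this kind of transformation, here is a minimal Python sketch that derives population density from separate population and area values, in the spirit of the Twin Cities example above. The figures and field names are invented for illustration, not actual metro-area statistics.

    # Minimal sketch: deriving new data (population density) from existing
    # data points (population and land area). Figures are placeholders only.
    source_data = [
        {"city": "City A", "population": 430_000, "area_km2": 140.0},
        {"city": "City B", "population": 310_000, "area_km2": 135.0},
    ]

    derived_data = [
        {"city": row["city"], "density_per_km2": row["population"] / row["area_km2"]}
        for row in source_data
    ]

    for row in derived_data:
        print(f"{row['city']}: {row['density_per_km2']:.1f} people per km2")

If the source datasets survive, a derived dataset like this can always be rebuilt by re-running the transformation, which is why derived data are usually replaceable, though sometimes at considerable cost.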

  • URL: https://libguides.macalester.edu/data1


Research Data Management


Defining Research Data


An Introduction to the Basics of Research Data

An animated video that explains the basics of research data.

(Source: Created by Louise Patterton of  the Council for Scientific and Industrial Research  in South Africa)

Giving an authoritative definition to “research data” is challenging. Different disciplines usually develop their own preferred definition which is appropriate and suitable for their domain.

Some higher education institutions around the world have developed their definition in developing their policy on research data.

Definition by Nanyang Technological University

“Research data are data in whatever formats or form collected, observed, generated, created and obtained during the entire course of a research project. This would include numerical, descriptive, aural, visual or physical forms recorded by the researcher, generated by equipment and derived from models, simulations.”

(Source: NTU Research Data Policy )

Definition by National Medical Research Council 

The National Medical Research Council (NMRC) of the Ministry of Health (MOH) in Singapore defined it in its 'Research Data Governance and Sharing' briefing slides, presented at data sharing roadshows in March 2015, as:

"Recorded factual material commonly accepted in the scientific community as necessary to document and support research findings."

(Source : National Medical Research Council )

Further readings:

  • What is Research Data?   (Source: Yale University)
  • Service Standards and Policies/EPSRC Policy Framework on Research Data/ Scope and Benefits  (Source: Engineering and Physical Sciences Research Council [EPSRC])
  • Defining Research Data  (Source: University of Oregon)
  • Management of Research Data and Records Policy  (Source: University of Melbourne)
  • Research Data Management: Staff, Adjuncts and Visitors Procedures  (Source: Monash University)
  • Management of Research Data  (Source: Queensland University of Technology)
  • Research Data in Context  (Source: MANTRA)
  • URL: https://libguides.ntu.edu.sg/rdm


Researchers and their data: A study based on the use of the word data in scholarly articles

Handling Editor: Ludo Waltman


Frédérique Bordignon , Marion Maisonobe; Researchers and their data: A study based on the use of the word data in scholarly articles. Quantitative Science Studies 2022; 3 (4): 1156–1178. doi: https://doi.org/10.1162/qss_a_00220


Data is one of the most used terms in scientific vocabulary. This article focuses on the relationship between data and research by analyzing the contexts of occurrence of the word data in a corpus of 72,471 research articles (1980–2012) from two distinct fields (Social sciences, Physical sciences). The aim is to shed light on the issues raised by research on data, namely the difficulty of defining what is considered as data, the transformations that data undergo during the research process, and how they gain value for researchers who hold them. Relying on the distribution of occurrences throughout the texts and over time, it demonstrates that the word data mostly occurs at the beginning and end of research articles. Adjectives and verbs accompanying the noun data turn out to be even more important than data itself in specifying data. The increase in the use of possessive pronouns at the end of the articles reveals that authors tend to claim ownership of their data at the very end of the research process. Our research demonstrates that even if data-handling operations are increasingly frequent, they are still described with imprecise verbs that do not reflect the complexity of these transformations.

https://publons.com/publon/10.1162/qss_a_00220

While data are an essential component of the scientific method, it is commonly accepted that they play a growing role in our contemporary societies. This evolution is related to a change of scale in our access and means to process data ( Chen & Zhang, 2014 ). As this change of scale also concerns research data, one may wonder if it has an effect on the way researchers use and refer to data. Given that some consider the changes under way as capable of transforming how research is done across disciplines, the investigation we provide is timely. The aim of this paper is indeed to contribute to the science studies that focus on what Borgman (2015) calls “data scholarship,” that is, the complex arrangement of the relationship between data and research. To this end, we propose to explore the traces of this relationship in scientific articles.

Most scientometrics studies dealing with the advent of the “Big Data era” document the development of research areas specifically related to data handling ( Huang, Schuehle et al., 2015 ; Maisonobe, 2022 ), but no comprehensive, longitudinal and transversal study has been performed on the content of scholarly articles’ full text to monitor the way academic scholars refer to data. However, Coxhead (2000) has evidenced that the word data is among the most used terms of academic vocabulary, as did Hyland and Tse (2007) who also insisted on field-to-field variations.

According to Rosenberg (2013, 2018), the word data progressively gained popularity among scholars during the 19th century and throughout the 20th century, when it started to be more often used with the meaning it is attributed today. A quick glance at the Web of Science Core Collection suggests the continuation of this positive dynamic. Between 1991 and 2019 the share of article abstracts containing the word data increased from 13.6% to 18.6%. In 2020, this share exceeds 20%, which means that the word data is now present in the abstracts of one out of five academic papers.

The widespread use of the word and its frequent appearance in the abstracts of scientific articles, however, give us only limited insight into its use by researchers within their articles and over time. Moreover, as many analysts have pointed out in reaction to the sometimes excessive enthusiasm generated by the Big Data movement ( Boyd & Crawford, 2012 ; Tenopir, 2016 ), the term data is so polysemous that it seems difficult to understand it independently of its context. In particular, most specialists share the idea that data does not have a stable meaning, its significance and utility varying throughout the research process ( Hilgartner & Brandt-Rauf, 1994 ; Latour & Woolgar, 1986 ; Leonelli, 2020 ; Longino, 2020 ).

To account for this diversity of use and meaning, and to better understand how data are mobilized by researchers, we offer to investigate how the rhetorical function of data emerges in academic writing by analyzing in which contexts data occurs in scientific articles, which are among the written outputs that researchers use to formulate their argument and reasoning. To carry out this investigation, we use a computational method of discourse analysis, starting from the pivot term data and looking for different lexicogrammatical realizations in our corpus of full texts. We undertake the textometric analysis of a corpus of 72,471 scientific articles with 896,159 occurrences of the word data retrieved from scholarly articles available in the ISTEX database (Excellence Initiative for Scientific and Technical Information; see below for details). We consider the evolution of the use of the term and its context of use (its neighborhood) over the course of the articles and over a 30-year span (1980s to 2010s). To echo the work revealing the differences in practices between disciplinary fields, we provide a comparative study between physical sciences and social sciences.

Our main research question is therefore the following: What does the linguistic environment of the word data tell us about the very nature of data in two different disciplinary fields, and over a 30-year period? This study is indeed a corpus-based lexical study that aims to shed light on the different issues raised by research on data, namely the difficulty of defining what is considered as data, the transformations that data undergo during the research process, and how they gain value for the researchers who hold them. After the state of research, the article presents the data and methods used to carry out this study and the results obtained by focusing on the occurrence of data throughout the article, and in different contexts of use, including the combination with adjectives, verbs, and possessive pronouns. In addition to a tailored categorization of the types of adjectives and verbs pertaining to the word data , our research brings out interesting stylized facts about the evolution of the use of the term across articles and interesting distinctions between scientific domains.

2.1. Drawing on Scientific Articles

Each research field develops its own inscriptions ( Latour, 1999 ) to record, describe, and represent what it considers to be data. Some of those traces can be found in scientific publications, and in particular in the articles that we intend to analyze here. While the articles are of course not a faithful account of what happened in the laboratory, according to Knorr and Knorr (1978) , they provide the selected measurement inscriptions of the laboratory with “contexture.”

In discourse epistemetrics, a specialty at the interface between quantitative science studies and applied linguistics ( Demarest & Sugimoto, 2015 ), it is agreed that the style, composition units, rhetorical moves, and registers found in peer-reviewed research articles demonstrate regularities and specificities allowing us to distinguish them from other kinds of documents and to submit them to specific bibliometric and textometric analyses. Also, from a diachronic linguistic observation, it is possible to derive a sociohistorical interpretation, in line with those who consider that the linguistic study of scientific discourses can inform the research process itself ( Bazerman, 1981 ; Mullins, Snizek, & Oehler, 1988 ). We follow the footsteps of works that take the scholarly article as an object and as a reflection of research practices, social context, and “discourse community” ( Swales, 1988 ), in which researchers are involved.

By studying the use of the word data in research articles, we assume that it can inform us about the way researchers use and value data.

2.2. The Polysemy of the Term Data

The heterogeneity of the objects falling under the term data has been highlighted by many studies on the topic, which then hardly succeed in providing a consensual and accurate definition of what this term covers ( Borgman, 2015 ; Zins, 2007 ). Others have focused on the word data itself. Rosenberg (2013) relies on datum , the Latin etymology of the word data , which refers to something given in an argument, something taken for granted. According to him the term data is thus used to refer to claims accepted for the sake of argument and has no intrinsic formal characteristics. Rosenberg concludes that the use of the word data provides a rhetorical basis.

In sociology of science, the use of the word data has been criticized when it refers to objects that are rather constructed ( Latour, 1999 ; Terrier, 2011 ). Latour (1999) suggests stopping speaking of data and using sublata instead, meaning achievements, and illustrating the cumulative nature of knowledge ( Perret & Le Deuff, 2019 ). In the same line of thought, but in a reflection on digital humanities, Drucker (2011) also suggests reconsidering data and proposes to speak of capta : “data are capta, taken not given, constructed as an interpretation of the phenomenal world, not inherent in it.” Even if neither capta nor sublata have passed into common usage, there is a consensus that data are not indeed already there ( Borgman, 2015 ).

Drawing on the fact that adjectives qualify nouns, by their very nature, we propose to explore the meaning of the word data by taking advantage of the semantic content conveyed by the adjectives it combines with.

2.3. Data Transformation, Value, and Ownership

Data are the outcome of a number of operations, manipulations, and translations so that they can be given meaning and direction for action by those who process them ( Hilgartner & Brandt-Rauf, 1994 ; Martin-Scholz, 2017 ). Many studies have explored the complexity of these operations, such as Walford (2013) , drawing on an ethnographic study of scientific research carried out in the Amazon, Denis and Goëta (2017) and Plantin (2019) , who have highlighted several dimensions of the process of cleaning for opening data (and make them suitable for reuse), or Ruppert and Scheel (2021) who study the “data practices” within the field of official population statistics. They argue that these numerous operations are the central features of the transactional process through which data are both obtained for some and given for others, just as Hilgartner and Brandt-Rauf (1994) propose to use the term data in a broad manner that encompasses both inputs and outputs. Accordingly, data should be conceptualized as part of an evolving “data stream” ( Hilgartner & Brandt-Rauf, 1994 ; Smolczewska Tona, 2021 ) or “data journey” ( Leonelli, 2020 ). This is what leads Gitelman (2013) to consider raw data as an oxymoron, data being always “cooked” and never entirely “raw”; there are phenomena of the world filtered by our means of perception ( Longino, 2020 ). Nevertheless, raw data is a common term for data that has not yet been cleaned; they are indeed awaiting transformation and do not yet have the status of certified data ( Walford, 2013 ). We will consider that an article constitutes the authors’ narrative of a part of the “data journey,” the one they wished to share with the reader.

It is in this transformation that data become more valuable and ownership is shaped. The data have become a symbol of “what the researcher seeks and more importantly, needs” ( Markham, 2013 ).

Whether observations, traces or other entities become data depends on who collects them and how, with different practices in different disciplines ( Borgman, 2015 ; Leonelli, 2020 ; Ruppert & Scheel, 2021 ; Zins, 2007 ).

This body of literature prompted us to analyze verbs (and the actions they refer to) and to investigate the use of the possessive pronouns my and our .

Despite contradictory recommendations and still much controversy about whether to use the plural or the collective singular to refer to data, usage allows both forms ( Rosenberg, 2013 ) and we opt for the plural, which seems more natural to us as native speakers of French.

This study is a use case of the ISTEX database, an archive of scientific publications. It offers French researchers online access to a retrospective collection of more than 23 million scientific publications in all disciplines and from many different publishers including Elsevier, Wiley, Springer, Oxford University Press, British Medical Journal, IOP Publishing, Nature, Royal Society of Chemistry, De Gruyter, Ecco Press, Emerald, and Brill. We therefore take advantage of this very extensive database, which offers a legal framework for text mining and saves us the tedious task of (manually) downloading full texts from the various publishers' websites. Finally, ISTEX offers technical services to export plain texts in different formats expected by text-mining tools, and also enriches the metadata provided by publishers with several discipline classifications (Dazy, 2014).

It seems essential to us that researchers of quantitative science studies take advantage of this type of database, partly designed for them, as they can access and process them to answer their research questions without the advanced skills of computer scientists.

We used the ISTEX API to build our corpus. We simply searched for the word data everywhere in the documents. After trials and visual checks, we chose to filter out documents whose quality index was under 5 (out of 10; the score is calculated by ISTEX according to checks on OCR quality and the integrity of texts and metadata) to avoid possible textual residues impossible to interpret.

data AND host.genre.raw:(“journal”) AND genre.raw:(“research-article”) AND publicationDate:[ year TO year+1 ] AND language.raw:(“eng”) AND qualityIndicators.score:[5 TO 10] AND categories.scopus.raw:“1 - Social sciences”

This query was repeated for each 2-year set since 1980 to extract the publications in Social sciences and those in Physical sciences (changing: “1 - Social sciences” to “2 - Physical sciences”). Each time, we exported the corpus files in TXT (for the plain text) and JSON formats (for the metadata). We then took 2012 as the upper limit, because from 2013 onwards, the ISTEX database is unbalanced with fewer and fewer articles available in Social sciences.
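A rough sketch of how one such 2-year extraction might be scripted against the ISTEX API is shown below. The endpoint, parameter names, and response handling are assumptions made for illustration; they are not the authors' actual pipeline, and the ISTEX documentation should be consulted for the real interface.

    # Hypothetical sketch of querying the ISTEX API for one 2-year slice.
    # Endpoint and parameters are assumptions for illustration only.
    import requests

    BASE_URL = "https://api.istex.fr/document/"  # assumed endpoint

    def build_query(year: int, field: str) -> str:
        # Reproduces the query pattern quoted above for a single 2-year window.
        return (
            f'data AND host.genre.raw:("journal") AND genre.raw:("research-article") '
            f'AND publicationDate:[{year} TO {year + 1}] AND language.raw:("eng") '
            f'AND qualityIndicators.score:[5 TO 10] '
            f'AND categories.scopus.raw:"{field}"'
        )

    def fetch_first_page(year: int, field: str = "1 - Social sciences") -> list:
        params = {"q": build_query(year, field), "size": 100}
        response = requests.get(BASE_URL, params=params, timeout=30)
        response.raise_for_status()
        return response.json().get("hits", [])

    if __name__ == "__main__":
        for start_year in range(1980, 2012, 2):
            hits = fetch_first_page(start_year)
            print(start_year, len(hits), "documents on the first result page")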

We therefore have a homogeneous corpus whose text files are of good quality and which allows us to carry out a comparative analysis over time and within two major scientific fields.

As these files are copyright protected, we cannot make them available, but this methodology allows authorized users to access the database and reproduce the operations. In any case, the data we have extracted from texts are available for download ( Bordignon & Maisonobe, 2022 ).

We used the software TXM ( Heiden, Magué, & Pincemin, 2010 ) to index this corpus and, in particular, to annotate it with TreeTagger. TreeTagger is a tool that processes words from a text and labels them with a part-of-speech tag. Part-of-speech tagging is commonly used in corpus linguistics to identify word categories (e.g., verb, noun, adjective, …) and also grammatical features (verb tense, plural/singular, …). For example, it helps in distinguishing the verb leaves from the noun leaves thanks to the context of the sequence of words they occur in.

Then TXM is used to query the corpus and retrieve the contexts where data occurs and its position in the text (i.e., the rank of the sentence in the whole document). The resulting data can then be exported in CSV and processed in Tableau Software.
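To illustrate what part-of-speech tagging contributes here, the short sketch below uses NLTK's default English tagger as a stand-in (NLTK is not the tool used in the study, and its tagset differs slightly from TreeTagger's) to show how the noun and verb readings of leaves, and of data itself, receive different tags.

    # Part-of-speech tagging illustrated with NLTK as a stand-in for TreeTagger.
    # Requires: pip install nltk
    # and the 'punkt' and 'averaged_perceptron_tagger' resources downloaded.
    from nltk import pos_tag, word_tokenize

    sentences = [
        "The tree leaves turn red in autumn.",   # 'leaves' tagged as a plural noun
        "She leaves the laboratory at noon.",    # 'leaves' tagged as a verb
        "We compare the uncorrected experimental data with the reference data.",
    ]

    for sentence in sentences:
        tagged = pos_tag(word_tokenize(sentence))
        print(" ".join(f"{word}_{tag}" for word, tag in tagged))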

3.3. Queries

Queries for TXM must be constructed according to the Corpus Query Language (CQL) based on the combination of regular expressions and the parts of speech previously tagged. These queries were elaborated through a long iterative process, including quality controls and numerous tests on our corpus.

For the purposes of this study, we develop queries able to extract adjectives combined with data, to extract the verbs and past participles that authors use to express their actions on data, and to assess the use of the possessive pronouns my and our combined with data.

Example sentence from the corpus: (…) in Fig. 8, we compare the uncorrected experimental data with the LA 150 data (…)

CQL query: [word="I|we" & enpos="PP"] []{0,3} [!word="not|.*n't|cannot"] [word="data"%c] | [word="data"%c][word ="that"]* [word="I|we" & enpos="PP"] []{0,3} [!word="not|.*n't|cannot"] [enpos="V.*"]

Tagged output: we_PP compare_VVP the_DT uncorrected_JJ experimental_JJ data_NNS
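To make the logic of such a pattern concrete, here is a rough Python analogue of the first branch of the query, applied to the tagged sentence above. It is an illustrative simplification (it ignores the negation filter and assumes TreeTagger-style tags), not the CQL engine that TXM actually runs.

    # Simplified analogue of the CQL pattern: a first-person pronoun, up to a few
    # intervening non-verb tokens, a lexical verb, and then the noun 'data'.
    import re

    tagged = "we_PP compare_VVP the_DT uncorrected_JJ experimental_JJ data_NNS"

    pattern = re.compile(
        r"\b(?:I|we)_PP"              # first-person pronoun
        r"(?:\s+\w+_(?!VV)\w+){0,3}"  # up to three intervening non-verb tokens
        r"\s+(\w+)_VV\w*"             # the lexical verb acting on the data
        r"(?:\s+\w+_\w+){0,3}"        # optional tokens before the object
        r"\s+data_NN\w*",             # the word 'data' tagged as a noun
        re.IGNORECASE,
    )

    match = pattern.search(tagged)
    if match:
        print("verb acting on data:", match.group(1))  # prints: compare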

For each occurrence retrieved by these queries, we also recorded the position in the text with the sentence reference (i.e., rank) provided by TXM; this means that each occurrence position in the text is between 1 and 100%, the value 1% representing the beginning of the text and the value 100% corresponding to the very end of the article. Our aim here was to resonate with existing studies that consider data as mobile entities and to check whether this is apparent in the course of the text.

We also built a special query not intended to identify the word data in any context, but to identify the position of the bibliography section in each document (i.e., the rank of the section title). This query served to populate a variable that we used to exclude all occurrences of data that are beyond this limit (i.e., occurring in the titles of publications or sources present in the articles’ reference section). Our objective here is to prevent the results from being biased by occurrences of the word data in the full reference of a cited document, with, for example, data occurring in the title of a cited document.
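The two positional operations just described can be sketched as follows: each occurrence's sentence rank is converted to a percentage of the document length, and occurrences falling at or beyond the rank of the reference-section heading are discarded. The record layout below is invented for illustration; it is not the actual TXM export format.

    # Minimal sketch of the positional steps described above. The data layout
    # is invented for illustration, not the actual TXM/CSV export format.
    occurrences = [
        {"doc_id": "a1", "sentence_rank": 3},
        {"doc_id": "a1", "sentence_rank": 180},
        {"doc_id": "a1", "sentence_rank": 205},  # falls inside the reference list
    ]
    doc_lengths = {"a1": 210}          # total number of sentences per document
    bibliography_start = {"a1": 200}   # rank of the reference-section heading

    kept = []
    for occ in occurrences:
        doc = occ["doc_id"]
        if occ["sentence_rank"] >= bibliography_start[doc]:
            continue  # occurrence sits in a cited title, not in the article body
        position_pct = 100 * occ["sentence_rank"] / doc_lengths[doc]
        kept.append({**occ, "position_pct": round(position_pct, 1)})

    print(kept)  # two occurrences kept, at roughly 1.4% and 85.7% of the text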

By removing all publications where the term data appears only in the reference section, we finally obtained a corpus of 896,159 occurrences of data in the main text, provided by 72,471 research articles ( Table 1 ).

Table 1. Corpus overview: distribution of articles and occurrences of the term data by year and subject area.

4.1. Data Occurrences Throughout the Article

The following figures present the distribution of occurrences of the word data throughout the text, following the examples of Bertin, Atanassova et al. (2016) and Hsiao and Schneider (2021) while they study the location of in-text citations. Thus, in Figure 1 , the x -axis refers to the progression of the text. The y -axis indicates the share of occurrences related to each 10% interval of the text. The figure shows that it is at the beginning and the end of the article that the authors use the word data the most, with slightly different curves for Social sciences and Physical sciences. However, we cannot conclude that less data is mobilized in between. On the contrary, we think that this is indicative of the authors specifying the nature of the data throughout the text by naming them differently (i.e., without using the word data , but by naming what can instantiate the word data , such as surveys, measurements, subjects, materials, numbers, and photos).

Figure 1. Data occurrences throughout the article.

The fact that the word data occurs more often at the beginning and end of articles might come from the fact that the introduction and conclusion are the sections that contain more general statements. The end of articles is also often dedicated to discussion. It gives an opportunity to authors to step back and theorize the results obtained.

Figure 2 shows that this distribution has not changed much over time, reflecting the regularity of the structure of research articles over the period under study.

Figure 2. Evolution of the distribution of data occurrences throughout the article.

4.2. On the Contribution of Adjectives to the Definition of Data

Given how difficult the literature shows it is to define data, and given the consensus on their evolving nature, it becomes clear that it is relevant to study the adjectives combined with data : the role of the adjective is to modify the noun by bringing further details, which is precisely what authors need in order to specify the nature of the data they are talking about.

The word data can also be preceded by a noun (an adjunct or attributive noun) which specifies its meaning in the manner of an adjective. However, numerous tests on our corpus led us to prefer the analysis of adjectives and past participles. Indeed, the combination of these attributive nouns and data mainly generates types of data (e.g., population data , morbidity data , pollen data , citation data ), leaving aside some of the nuances that authors wish to specify concerning their data and that only adjectives can convey.

Kinds of data. We borrow the term kind from Morgan (2020) , who uses it to refer to what is produced by different kinds of instruments, to be taken in a broad sense as it can also refer to strategies developed in the Social sciences. These kinds of data are revealed by the properties conveyed by the adjectives. These properties might be general ( simulated , statistical , isotopic , global , random , theoretical , exploratory , archival , relational , raw , digital , analytical , synthetic ), or specific to science and technique generally speaking ( experimental , observational , empirical , structural , numerical , electronic ), or specific to a particular field ( thermodynamic , kinetic , nuclear , molecular , crystallographic , astrometric , photometric , seismic , atomic , meteorological , optical ).

Dimension. These adjectives can be used to define the extent of the data ( few , extensive , small , full , comprehensive ), but in a large majority of cases ( big , large , entire ), this dimension applies to a set designated by the names bank , base , set , package , for instance.

Spatiotemporal properties. These adjectives provide temporal information about the data ( new , actual , preliminary , initial , recent , current , present , existing , first , final , primary , secondary , historical ) or specify their spatial extent ( local , national , international , Brazilian , geocoded , western ).

Value. The value of data may be subjective, or at least result from the researcher’s analysis. We only include in this category adjectives that convey a value (or lack of value) inherently, independently of any context ( accurate , correct , false , falsified , inaccurate , reliable , sufficient , limited , original , basic , relevant , high-quality , noisy , useful ).

Availability. These adjectives provide information on the availability of data ( available , observed , additional , measured , corresponding , internal , published ), or, on the contrary, on the fact that they are missing or unavailable to the researcher ( lossy , incomplete , inaccessible , insufficient , complete , missing , unpublished ).

Data to data comparison. These adjectives qualify the data by comparing them to each other ( alternative , other (it might be considered as a determiner as well), same , conflicting , different ) or by placing them in relation to other data of the same group ( joint , various , mixed , homogeneous , combined , balanced ). In both cases, these same adjectives can be used for a comparison with a completely different data set.

Alpha-numerical adjectives. We also make a special class of adjectives that have an alpha-numeric form. We are unable to classify them with certainty because doing so requires both context and more detailed knowledge of the scientific field in question (e.g., 1o-2 data , 1s data , 4dvar data , 18-winter data ).

We manually tagged into one of these seven classes all the adjectives for which we identified at least 20 occurrences in the whole corpus (i.e., 3,556 adjectives, accounting for 228,988 occurrences involving the word data ) and occurring before the bibliography section. The distribution is detailed in Table 2 and the most frequent adjectives by class and domain are presented in Figure 3 .
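Once each adjective has been assigned to a class, computing the distribution reduces to a lookup and a count. The sketch below is an illustration only (the mapping shown is a tiny excerpt, not the full 3,556-adjective table used in the study):

    from collections import Counter

    ADJECTIVE_CLASS = {  # illustrative excerpt only
        "experimental": "kind", "seismic": "kind",
        "large": "dimension", "extensive": "dimension",
        "recent": "spatiotemporal", "national": "spatiotemporal",
        "accurate": "value", "noisy": "value",
        "available": "availability", "missing": "availability",
        "same": "data to data", "combined": "data to data",
        "4dvar": "alpha-numerical",
    }

    def class_distribution(adjective_occurrences):
        """adjective_occurrences: adjectives found next to 'data' before the bibliography."""
        counts = Counter(ADJECTIVE_CLASS.get(adj, "untagged") for adj in adjective_occurrences)
        total = sum(counts.values())
        return {cls: round(100 * n / total, 1) for cls, n in counts.items()}

    print(class_distribution(["experimental", "available", "experimental", "large"]))
    # -> {'kind': 50.0, 'availability': 25.0, 'dimension': 25.0}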

Table 2. Distribution of data across the classes of adjectives it combines with.

Figure 3. Percentage of occurrences of the 20 most frequent adjectives combined with data , by adjective class and subject area (see full list and reusable version in the available data set ( Bordignon & Maisonobe, 2022 )).

The identification of adjectives, and the fact that 3,556 different adjectives are used more than 20 times in the corpus, show that it is by combining data with an adjective that the word data takes on its meaning. Adjectives provide the semantic precision that the word data alone lacks, giving rise to the abundant considerations in the literature attempting to provide a definition. To confirm this, we can see that adjectives expressing a kind of data are the most frequent, with more than half of the total number of uses. This proportion is even higher in Physical sciences (59.2%, against 44.5% in Social sciences), but this difference is offset by a higher proportion of adjectives expressing a spatiotemporal property in Social sciences (20.9% against 11.3% in Physical sciences). Secondly, adjectives related to the availability of data are used in more than 11% of the occurrences of data . The access to the data and their intrinsic characteristics thus seem to be discussed more than their size. Interestingly, while the issue of “big data” and its use in science is very important, this aspect does not emerge clearly in our corpus. Adjectives expressing the dimension of data sets constitute barely more than 3% of occurrences. Moreover, the phrase big data is almost absent from our corpus (around 0.001% of occurrences).

Figure 3 shows the 20 most frequent adjectives of each class by subject area; the percentage is calculated on the whole field.

The most frequent adjective in Physical sciences is, by far, experimental , with 20.9% of all occurrences of adjectives in the domain (vs. 2% in Social sciences). The second most frequent adjective in Physical sciences, which is also the most frequent in Social sciences, is available , with around 4% of the occurrences in both disciplines.

The distribution of adjectives qualifying the word data throughout the text generally follows that of the word data itself ( Figure 4 ). Nevertheless, it is interesting to note that the adjectives expressing relations between data have a flatter curve in Physical sciences, denoting the fact that the authors continuously assemble their data or balance them with other teams’ data to progress in their demonstration, up to the results and conclusions. In contrast, in Social sciences, it is at the beginning and especially at the very end that these relationships are rather settled. This can be put in relation to the adjectives of availability, the use of which drops off rapidly at the beginning of the articles in Physical sciences, whereas it rebounds and lasts for at least a third of the text in Social sciences. Finally, it should be noted that the kinds of data are also expressed during the first third of the text in Social sciences, whereas these details seem to be dealt with more quickly in Physical sciences.

Figure 4. Distribution of data throughout the text, represented by the adjective classes it combines with and the subject areas.

Our corpus allows us to generate Figure 5 , which shows the evolution of the distribution of each type of adjective over time in each domain.

Figure 5. Evolution of the distribution of data throughout the text, according to the adjective classes it combines with and the subject areas.

Even if we can see some oscillations (which may be due to the very nature of the ISTEX corpus), we cannot say that there is a clear shift over this 30-year period. However, if we look very closely, we can see that the curve for data to data adjectives in Social sciences also tends to flatten out, with a less sharp decrease in data to data adjectives over the first two-thirds of the articles. This trend is very subtle indeed and requires further investigation; it may be indicative of a more intensive use of quantitative approaches in Social sciences.

4.3. Data Transformation Through Verb Usage

As many studies have shown that data undergo transformations, we assume that it is by identifying verbs whose subjects are the authors of the articles (using we or I ) that we will be able to get a better picture of how data are used and how transformations are performed. We build a query for this purpose, taking care to avoid retrieving negative forms that reflect the opposite of what we are looking for. Our query also allows us to identify more complex inverted forms such as “ the data we analyzed .” We also search for past participles, which are somehow halfway between adjectives, from which they take the function of qualification (notably after the verb to be ), and the verbs from which they are lexically derived. To process the results accurately, we homogenize equivalent forms whose spellings differ (e.g., analyse and analyze ).

Assemblage. Data assemblage verbs and participles are those allowing authors to explain how they carried out the delineation and curation of their data. We borrow the term from Law (2004) , who presents the concept of method assemblage as “the process of enacting or crafting bundles of ramifying relations that condense presence and (therefore also) generate absence by shaping, mediating and separating these.” Data assemblage is displayed through verbs related to data collection ( collect , gather , accumulate ), to their selection ( filter , remove , truncate ), and to their arrangement ( merge , aggregate , associate ). Some rare verbs within this class express the fact of possessing these collected data ( own , have , hold ). At this stage, data are inputs and are interrelated, echoing the data to data adjectives.

Analysis. With data analysis verbs and the use of the passive voice, authors explain the analysis carried out with the data and the associated methodology. The data are seen from afar: They are studied (or examined , compared , interpreted ) or evaluated (or assessed , checked , verified ) thus reminding us of value adjectives.

Inscription. In this third class, verbs and participles denote inscriptions, in Latourian terms ( Latour, 1999 ): They refer to “the transformations through which data become materialized into a sign” ( list , normalize , plot , code , compute ). They can also refer to the presentation of these inscriptions ( present , show , report , publish ). The data are then outputs.

We unified verbs and past participles under their corresponding lemma (e.g., obtain and obtained have obtain as lemma). We left out the phrases with the term set , because the sequence data set (supposed to be equivalent to dataset in one word) was problematic when tagged in parts of speech ( set then being often wrongly considered as a participle; e.g., “ the results were used as a learning data set for a predictor of peptide detectability ”). Finally, we excluded the different variants of do and make , which do not bring any information on the action performed on the data itself.
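A simplified sketch of this post-processing (an illustration only; the spelling and lemma tables shown are tiny excerpts, not those used in the study) could be:

    SPELLING_VARIANTS = {"analyse": "analyze", "analysed": "analyzed", "normalise": "normalize"}
    LEMMAS = {"obtained": "obtain", "analyzed": "analyze", "collected": "collect",
              "plotted": "plot", "shown": "show"}          # illustrative excerpt only
    EXCLUDED = {"do", "make", "set"}                       # uninformative or problematic lemmas

    def normalize(form):
        """Harmonize spelling, collapse to a lemma, and drop excluded lemmas."""
        form = SPELLING_VARIANTS.get(form.lower(), form.lower())
        lemma = LEMMAS.get(form, form)
        return None if lemma in EXCLUDED else lemma

    print([normalize(f) for f in ["analysed", "obtained", "set", "collected"]])
    # -> ['analyze', 'obtain', None, 'collect']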

We classified all verbs and participles with more than 20 occurrences (in the whole corpus) into one of the three categories (i.e., 457 terms for 146,627 occurrences involving the word data , occurring before the bibliography section). Past participles account for more than 90% of the occurrences ( Table 3 ), consistent with the use of the passive voice to present the researchers’ actions that the data undergo. The distribution by subject area across classes is detailed in Table 4 and the most frequent associated forms are presented in Figure 6 by class and domain.

Table 3. Distribution of data according to verbs and past participles it combines with in the two subject areas.

Table 4. Distribution by percentage of occurrences of data across the classes of verbs/past participles in the two subject areas.

Figure 6. Percentage of occurrences of the 20 most frequent verbs/past participles combined with data , by class and subject area (see full list and reusable version in the available data set ( Bordignon & Maisonobe, 2022 )).

Figure 6 displays the 20 most frequent forms of each class by field; the percentage is calculated on the whole field.

The verbs obtain and collect , expressing the task of data collection, are by far the most used in both subject areas. Next are the verbs use and analyze , as well as present and show (especially in Physical sciences). But these frequently used verbs lack specificity. To specify their actions, authors then resort to other, more precise verbs. Still, the class of assemblage verbs is the smallest, with hardly more than 100 variants, which means that the data assemblage phase is underdescribed. It is in the class of inscriptions that we find the most variants, more than 200, describing the inscriptions generated.

Figure 7 reveals an important difference between the two domains: The curves in Social sciences follow those of the distribution of data in the text, whereas they differ in Physical sciences. Indeed, we can see that even if the actions of assemblage are more frequent at the beginning and the end of the text, the curve tends to flatten, with occurrences distributed across the whole article. The homogenization of the distribution is even more striking for the inscriptions, with a nearly flat curve. This reveals that physical scientists make permanent and continuous use of data, generate new inscriptions, and need to rearrange their data to carry out their research and thus their demonstration in the text.

Figure 7. Distribution of the occurrences of data throughout the text, represented by the verb/past participle classes it combines with and the subject areas.

This result is consistent with the findings in Figure 4 with a similar distribution for the data to data adjectives; this means that assemblage operations and inscriptions are recurrent throughout the text. Figure 8 shows that these practices have not changed much over time.

Figure 8. Evolution of the distribution of data throughout the text, according to the verb/past participle classes it combines with and the subject areas.

4.4. Data Ownership and Possessive Pronouns

The literature review revealed that the work around data, from collection to interpretation, is important and value-creating. Bearing in mind that researchers gain recognition with their data, we looked for expressions of ownership of data in texts through possessive pronouns by isolating the use of our / my data .

We found 23,102 occurrences of possessive pronouns used before the word data and recorded before the bibliography. Of these, 15,481 occurrences (67%) involve the pronouns my or our .

Figure 9 shows very clearly that the authors appropriate the data throughout the article, as if after having used them, they finally allow themselves to assert that they are the owners and that it is with this status that they mention them in the conclusion of their article. This is probably also a way of “sharing” them (at least the associated results) within the community by insisting on who owns them, and the value added by the analysis.

Figure 9. Distribution of my / our data throughout the text in the two subject areas.

Authors anticipate the future credit that readers would have to give them by citing the conclusions of their research, and sometimes the data themselves. But at the same time, if my / our data appears as the subject (typically in “ our data suggest …” or “ our data show …”), it helps authors distance themselves, shift responsibility to the data (as Salager-Meyer (1999) states it) with a typical turn of phrase where some nonhuman entity such as data actually speaks for itself. Therefore, they anticipate criticism as they progress in presenting their results. As the article reaches its conclusion, authors have an interest in gaining confidence and may seek to emphasize the personal and specific contribution to the literature that the article and its results represent.

Figure 10 shows the evolution over the three decades we are studying. Without showing major changes, it nevertheless reveals that in the most recent period, the occurrences distribution curve in Social sciences increasingly looks like that of Physical sciences with a less marked growth at the end of the text.

Figure 10. Evolution of the distribution of my / our data throughout the text in the two subject areas.

As our literature review at the beginning of this article demonstrates, we set our study in a theoretical framework that takes scientific writings as research objects and considers that their linguistic analysis can inform the research process itself, even if only partially. Our approach is intended to complement sociological or ethnographic studies that point out the difficulty of defining what data are and which place them in a transformation process. Therefore, by choosing to work on adjectives and verbs, we are able to provide results that confirm both the polysemous nature of the word data and the difficulty of defining it. But above all, our study draws on a large corpus of scientific articles overlapping two disciplinary fields and thus sheds new light on this polysemy by providing a large number of examples of lexical combinations, which we present with typologies of adjectives and verbs/past participles to help us understand the nature of data.

Our findings confirm the transformation process but also that the task of assemblage is poorly rewarded, as shown by previous studies whose scope is wider because they deal with data curation, mainly in archives, whereas our focus here is rather on what researchers decide to mention in the articles they publish. We believe, however, that the large number of words we have extracted and which highlight these transformation processes can contribute to the works related to the description of data curation and data reuse. Indeed, while many instances of data use that we have identified in the articles may be instances of primary data use (i.e., researchers describe collecting, manipulating, and analyzing their own data), researchers may also reuse data created by others that have been made available through data repositories. The lists of terms we provide help to fuel studies on these topics and to enrich, with a greater level of detail, the Data Practices and Curation Vocabulary (a shared vocabulary ( Chao, Cragin, & Palmer, 2015 ), limited to 187 terms, and intended for use by data curators and other persons involved in the curation process (including preservation, sharing or management)).

In line with the principles of epistemetrics, we have built up a corpus that allows for a comparative analysis between fields, a diachronic analysis and an analysis along the text, which we suggest reflects the "data journey" agreed upon by previous studies. Despite the increasing use of the word data in titles and abstracts, our research shows a significant regularity over the period studied in the way researchers refer to data within scholarly articles. In particular, authors prefer to use this word at the beginning and the end of articles. This predominance is consistent with other findings suggesting ( Master, 1991 ) that it is in the opening and concluding sections of research articles that we find phrases involving more abstract and general concepts, as data can be.

Beyond this general pattern, our results point to differences between the two subject areas:

  • data and their characteristics are discussed longer at the beginning of Social sciences articles;
  • in Physical sciences only, assemblage operations and inscriptions are recurrent throughout the text; and
  • there is more description of the spatiotemporal properties of data and more discussion of their availability in Social sciences, and these features persist over time.

In both subject areas, we observe the same tendency to take more ownership of the data throughout the narrative, which is illustrated by the more frequent use of the possessives my or our at the end of the articles, actually reaching the end of the “data journey.” We have seen that the trend in Social sciences to appropriate data at the very end of the text has changed in the most recent period, with a redistribution of some of the uses of my / our earlier in the text. This is a discourse pattern similar to the one we have in Physical sciences. It is difficult to draw conclusions about a possible change in the behavior of social scientists towards their data, and it would be worth organizing interviews to clarify this. Our work opens up here an opportunity for further research.

The verb analysis confirms the seminal research of Knorr and Knorr (1978) on the topic: The complex and painful task of data acquisition is somewhat absent from articles, even if assemblage verbs are the most frequent. They refer to assemblage operations but with rather imprecise verbs, such as obtain and collect . It is in the class of inscriptions that we find the most variants, meaning authors prefer to focus on their results, to produce their own new inscriptions and thus new data, and to claim ownership of them at the end of the process. This is what leads us to think that it would be interesting to do the same analysis on another kind of scientific output, the data paper, to see if other trends emerge.

Regarding the limits of this study, it would have been interesting to investigate words belonging to the data family, such as database and dataset . This is something we are considering for future work, but it made sense to start by investigating the word data to grasp its polysemy, before studying derived words such as database or dataset . Moreover, the work of indexing and tagging terms in such a large corpus is very time-consuming and the many necessary controls are tedious. As for the ISTEX database, it should be noted that it does not cover recent publications and that its coverage depends on agreements with publishers, so the types of analyses that can be derived from it are limited. Given the evolution of the coverage, we also need to be cautious with the interpretation of the longitudinal observations. Nevertheless, by using a random sampling method, we believe that we have somehow managed to address this coverage pitfall. Consequently, we found that ISTEX is a resource that can be used if care is taken to extract a fairly large corpus and to ensure its homogeneity, as we have done. On the other hand, getting down to a finer level of detail, such as the subfield, is probably riskier because the inclusion of a single journal, for example, can unbalance the content’s homogeneity.

Because we worked on equal-sized sets of publications for each subperiod studied, our study makes it difficult to perceive the changes and the enthusiasm of the 2000s for data and more precisely for big data . As this phrase hardly emerges from our corpus, our analysis has the advantage of showing that, when it comes to looking at research as a whole, the use of the word and its context of occurrence have not radically changed over the period under study. Data scholarship is an intrinsic part of research activity, both in Social sciences and Physical sciences, and this regularity as well as the frequency of occurrences of the word in the two subject areas reflect this clearly.

To conclude, the results of our corpus-based study demonstrate that there is no point in trying to define the word data , as researchers do it themselves in the course of their writing, taking Firth’s idea to its fullest extent: “You shall know a word by the company it keeps” ( Firth, 1957 ). The word data serves as a rhetorical base and draws on the context for its meaning, relying on the properties conveyed by adjectives and verbs associated to it. Adjectives and verbs accompanying the noun data turn out to be even more important than data itself in specifying what data are at stake. And to echo Gitelman (2013) , we can say that while data can never be raw, the word data is, and it only serves as a rhetorical basis, as long as the context and mainly adjectives have not contributed to achieve its potential with the properties they convey.

Frédérique Bordignon: Data curation, Investigation, Methodology, Validation, Visualization, Writing—Original draft, Writing—Review & editing. Marion Maisonobe: Conceptualization, Investigation, Methodology, Validation, Writing—Original draft, Writing—Review & editing.

The authors have no competing interests.

The research project received no funding from third parties.

The data used in this study are available in a repository ( Bordignon & Maisonobe, 2022 ).

Analysis performed on January 5, 2022, on the online version of the Web of Science Core Collection, with a restriction to articles containing an abstract written in English (all the abstracts with at least one occurrence of the word the are considered in this analysis).

https://www.istex.fr .

It should be noted that ISTEX is a public resource supported by the main French research organization: the CNRS (Centre National de la Recherche Scientifique).



Data Analysis in Research: Types & Methods


Content Index

  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis
  • What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller, meaningful fragments.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction through summarization and categorization, which together help find patterns and themes in the data for easy identification and linking. The third is the analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and data interpretation together represent the application of deductive and inductive logic to research.

Researchers rely heavily on data as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But, what if there is no question to ask? Well! It is possible to explore data even without a problem – we call it ‘Data Mining’, which often reveals some interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience’s vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Sometimes, data analysis tells the most unforeseen yet exciting stories that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Every kind of data describes something once a specific value is assigned to it. For analysis, you need to organize these values, processed and presented in a given context, to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented has words and descriptions, we call it qualitative data . Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews , qualitative observation or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data . This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all yield this type of data. You can present such data in graphical formats or charts, or apply statistical analysis methods to it. The OMS (Outcomes Measurement Systems) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups. However, an item included in the categorical data cannot belong to more than one group. Example: a person describing their living style, marital status, smoking habit, or drinking habit in a survey provides categorical data. A chi-square test is a standard method used to analyze this data (a short example follows this list).
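For illustration, a chi-square test on categorical data might look like the following sketch (fabricated counts, using SciPy):

    from scipy.stats import chi2_contingency

    # rows: smokers / non-smokers; columns: married / single (fabricated counts)
    observed = [[25, 15],
                [30, 50]]
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2={chi2:.2f}, p={p_value:.3f}")  # a small p suggests the two categories are related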


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from the analysis of numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is a demanding process. Hence it is typically used for exploratory research and data analysis .

Although there are several ways to find patterns in textual information, a word-based method is the most relied upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual. Here the researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.
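A minimal sketch of such a word-frequency count (fabricated responses, for illustration only) could be:

    from collections import Counter
    import re

    responses = [
        "Access to food is the main problem in our village.",
        "Hunger and food prices are what worry people most.",
    ]
    words = re.findall(r"[a-z']+", " ".join(responses).lower())
    stopwords = {"to", "is", "the", "in", "our", "and", "are", "what", "most"}
    print(Counter(w for w in words if w not in stopwords).most_common(3))
    # 'food' comes out on top, flagging it for further analysis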


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’

The scrutiny-based technique is another highly recommended  text analysis  method used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique, differentiating how a specific text is similar to or different from another.

For example: To find out the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method to analyze polls with single-answer question types .

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations from the enormous amount of data.


There are several techniques to analyze the data in qualitative research, but here are some commonly used methods,

  • Content Analysis:  It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. The research questions determine when and where to use this method.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and  surveys . Most of the time, the stories or opinions shared by people are examined to find answers to the research questions.
  • Discourse Analysis:  Similar to narrative analysis, discourse analysis is used to analyze interactions with people. Nevertheless, this particular method considers the social context within which the communication between the researcher and respondent takes place. In addition, discourse analysis also takes into account the lifestyle and day-to-day environment of the respondent while deriving any conclusion.
  • Grounded Theory:  When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. Grounded theory is applied to study data about a host of similar cases occurring in different settings. When researchers use this method, they might alter explanations or produce new ones until they arrive at some conclusion.


Data analysis in quantitative research

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent has answered all the questions in an online survey, or that the interviewer asked all the questions devised in the questionnaire.

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers confirm that the provided data is free of such errors. They conduct the necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses . If a survey is completed with a sample size of 1,000, the researcher can create age brackets to distinguish the respondents based on their age. It then becomes easier to analyze small data buckets rather than deal with the massive data pile.
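For illustration, such age coding might be sketched with pandas as follows (fabricated ages and hypothetical brackets):

    import pandas as pd

    ages = pd.Series([19, 24, 37, 41, 58, 63])           # fabricated survey responses
    brackets = pd.cut(ages, bins=[0, 25, 45, 65, 120],
                      labels=["under 26", "26-45", "46-65", "over 65"])
    print(brackets.value_counts().sort_index())           # respondents per age bucket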


After the data is prepared for analysis, researchers can use different research and data analysis methods to derive meaningful insights. Statistical analysis plans are the most favored for analyzing numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. The methods fall into two groups: ‘Descriptive statistics’, used to describe data, and ‘Inferential statistics’, which help in comparing the data.

Descriptive statistics

This method is used to describe the basic features of versatile types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not go beyond describing the data; the conclusions are again based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods; a short computational sketch follows the list of measures below.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to demonstrate distribution by various points.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest scores.
  • The variance and standard deviation measure how far observed scores deviate from the mean.
  • These measures are used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is and how that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores helping researchers to identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
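A brief computational sketch of the measures listed above (fabricated scores, for illustration) could be:

    import statistics as st

    scores = [62, 70, 70, 75, 81, 88, 94]                # fabricated test scores
    print("mean:", st.mean(scores))                       # central tendency
    print("median:", st.median(scores))
    print("mode:", st.mode(scores))
    print("range:", max(scores) - min(scores))            # dispersion
    print("std dev:", round(st.stdev(scores), 2))
    print("quartiles:", st.quantiles(scores, n=4))        # position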

For quantitative research, descriptive analysis often gives absolute numbers, but such analysis alone is never sufficient to demonstrate the rationale behind those numbers. Nevertheless, it is necessary to think of the best method for research and data analysis suiting your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate the students’ average scores in schools. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided  sample  without generalizing it. For example, when you want to compare the average vote in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100-odd audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected  sample  to reason that about 80-90% of people like the movie.

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and uses them to say something about the population parameter.
  • Hypothesis testing: It’s about using sample research data to answer the survey research questions. For example, researchers might be interested to understand if the new shade of lipstick recently launched is good or not, or if the multivitamin capsules help children to perform better at games (a short example follows this list).
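For illustration, the movie-theater example might be sketched as follows (fabricated numbers; the 95% interval uses a simple normal approximation):

    import math

    n, liked = 100, 85                                    # fabricated sample: 85 of 100 viewers liked the film
    p_hat = liked / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)    # normal approximation of the 95% margin
    print(f"estimated share of the audience that likes the film: {p_hat:.0%} +/- {margin:.0%}")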

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but want to understand the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables,  cross-tabulation  is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation helps for seamless data analysis and research by showing the number of males and females in each age category.
  • Regression analysis: For understanding the relationship between two variables, researchers often turn to regression analysis, which is also a type of predictive analysis. In this method, there is an essential factor called the dependent variable and one or more independent variables, and the aim is to find out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to have been ascertained in an error-free, random manner (see the sketch after this list).
  • Frequency tables: A frequency table records how often each value (or range of values) of a variable occurs, providing a simple summary of the distribution of responses before further tests are applied.
  • Analysis of variance (ANOVA): This statistical procedure is used for testing the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation means research findings were significant. In many contexts, ANOVA testing and variance analysis are similar.
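For illustration, a minimal regression sketch (fabricated values, fitted with NumPy) might look like this:

    import numpy as np

    ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # independent variable (fabricated)
    sales    = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # dependent variable (fabricated)
    slope, intercept = np.polyfit(ad_spend, sales, 1)     # least-squares line
    print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")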
Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data, and be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Research and data analytics projects usually differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps design a survey questionnaire, select data collection methods , and choose samples.


  • The primary aim of data research and analysis is to derive ultimate insights that are unbiased. Any mistake in collecting data, selecting an analysis method, or choosing an audience sample, or any bias while doing so, will lead to a biased inference.
  • No degree of sophistication in the analysis can rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are not clear, a lack of clarity can mislead readers, so avoid this practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find a way to deal with everyday challenges like outliers, missing data, data altering, data mining , or developing graphical representations.

The sheer amount of data generated daily is frightening, especially now that data analysis has taken center stage. In 2018, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.


Research Software vs. Research Data I: Towards a Research Data definition in the Open Science context

Teresa Gomez-Diaz

1 Laboratoire d'Informatique Gaspard-Monge, CNRS, Paris-Est, France

Tomas Recio

2 Universidad Antonio de Nebrija, Madrid, Spain

Associated Data

Underlying data.

Data underlying the arguments presented in this article can be found in the references, footnotes and Box 1 .

Version Changes

Revised. Amendments from Version 1

This version considers the comments of the reviewers to better explain and illustrate some of the concepts presented in the article. In particular we have stressed the importance of the scientific production context for the RS and RD definitions. We have as well introduced new references related to the concepts of data and information, to further illustrate our view on the complexity of the data concept, and a new reference to complete the studied landscape for the proposed RD definition. As asked by the Referees, we have moved the translations of French and Spanish quotes to the main text. See our answers to the referee reports to complete the differences with the version 1 of this article.


Background: Research Software is a concept that has been only recently clarified. In this paper we address the need for a similar enlightenment concerning the Research Data concept.

Methods: Our contribution begins by reviewing the Research Software definition, which includes the analysis of software as a legal concept, followed by the study of its production in the research environment and within the Open Science framework. Then we explore the challenges of a data definition and some of the Research Data definitions proposed in the literature.

Results: We propose a Research Data concept featuring three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and should have yielded a result published or disseminated in some article or scientific contribution of any kind.

Conclusions: The analysis of this definition and the context in which it is proposed provides some answers to Borgman’s conundrum challenges, that is, which Research Data might be shared, by whom, with whom, under what conditions, why, and to what effects. They are completed with answers to the questions: how? and where?

1. Introduction

Each particle of the Universe, known or unknown by what is widely accepted as Science, is information. Different datasets can be associated to each particle to convey information, as, for example: where has this particle been discovered? By whom? At what time? Is this particle a constituent element of a rock, or a plant, or … ? Indeed, as living entities of the Earth planet, … we are all part of this Universe and every atom in our bodies came from a star that exploded … , therefore … we are all stardust … . 1

So long ago that we have never been able to give a precise date, information started to be fixed in cave paintings, figurines, and bone carvings, which have been found in caves like Altamira 2 or Lascaux 3 . That is, some human beings intentionally fixed information on a support. Much more recently, languages have been developed to deal with information, fixing and exchanging it on clay bricks, papyrus, monument walls, and paper books. Even more recently, information has been fixed in films and photographs, and has finally adopted digital formats.

Scientists study all kinds of subjects and objects: persons, animals, trees and plants and other living beings, philosophies and philosophers, artists and artworks, mathematical theories, music, languages, societies, cities, Earth and many other planets and exoplanets, clouds, weather and climate, stars and galaxies, as well as other animate or inanimate objects, molecules, particles, nanoparticles and viruses, nowadays including digital objects such as computer programs. Some of these items, like images, texts, and music etc. may have associated intellectual property rights; but others, like statistics or geographical data, may not. Yet, they may be affected by other legal contexts, such as, for example, the one given by the EU INSPIRE Directive 1 for spatial data, concerning any data with a direct or indirect reference to a specific location or geographical area.

Now, in our digital era, most of the above subjects under consideration are handled by humans using computers, through numerical data. Scientists present new theories and results built and produced with numerical simulations and through the analysis of numerical datasets. They are usually stored in databases, manipulated or produced in digital environments using existing software, either Free/Open Source Software (FLOSS) 4 or commercial, or by means of software developed by research teams to address specific problems 2 , 3 .

In this specific scientific context, the aims and developments of Open Science practices are particularly relevant. Indeed, as remarked by 4 : "We must all accept that science is data and that data are science … ". Therefore, in this article we take into consideration the following definition of Open Science, in which the open access to Research Data (RD) and to Research Software (RS) is part of the core pillars 5 :

Open Science is the political and legal framework where research outputs are shared and disseminated in order to be rendered visible, accessible and reusable.

A more transversal and global vision can be found in the UNESCO Recommendation on Open Science 5 , 6 . See also 7 for another relevant example of ongoing work on the Open Science concept. But in this paper, following the analysis and the conclusions of 5 , we focus here on this restricted framework as more suitable for our purposes.

Among the most important kinds of research outputs of any scientific work, we focus on the trio formed by articles , software and data. Actually, among all the possible duos, the couple RS and RD presents the most similarities, although a short list of differences between software and data has been mentioned in 8 and 9 . On the other hand, regarding the other duos, we think that the differences are much stronger. For instance, unlike the dissemination of published articles, usually in the hands of scientific editors, the dissemination of software and data produced in the research process is mostly in the hands of their producers, the research team. The analogies between RS and RD have already been summarily highlighted in 10 , such as those concerning the release protocols of RD and RS, which raise the same questions, at the same time, in the production context. As a direct consequence, it seems suitable to propose a similar dissemination procedure for both kinds of research outputs 11 .

Indeed, let us remark that, as mentioned in 11 , 12 , both RS and RD dissemination might involve the use of licenses to set their sharing conditions, which is a core issue. Information about RS licenses and licensing can be found at the Free Software Foundation (FSF) 6 , the Open Source Initiative (OSI) 7 , and the Software Package Data Exchange (SPDX) 8 . The SPDX license list also includes licenses that can be used for databases, like the Creative Commons licenses 9 or the Open Data Commons licenses 10 , see for example 13 .

Other similarities regarding RS and RD are related to management plans: for example, Data Management Plans are nowadays required by research funders (see for example 14 , 15 ) and, in the same mood, Software Management Plans have been recently proposed, see 16 and the references therein.

Finally, concerning evaluation, as observed in 3 , similar evaluation protocols can be proposed for both RS and RD.

Leaving aside the common issues in RS and RD for licensing and management plans, that have been already studied in the above mentioned references, the RS and RD dissemination and evaluation analogies are more closely analyzed in the article 12 that follows the present work, including FAIR related issues 17 and 5Stars Open Data 11 . On the other hand, in the current article we focus on the conceptual analogies of RS and RD, and their consequences (see Section 5 ).

As we will argue in the next sections, a definition for RD can be proposed following the main features of the RS definition given in our recent work 3 , 18 . However, we consider that formulating such proposal still remains a challenging issue that we dare to address here. In fact, although one of the most widely accepted RD definitions is the one proposed by the OECD (2007) 19 , other works have shown the difficulties to fix such a definition 20 , 21 . Indeed, establishing this concept has important and not well settled consequences, for example, concerning the context of RD sharing, as highlighted by C. Borgman in 22 :

Data sharing is thus a conundrum. […] The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.

It is the intention of our present work to bring some answers to these questions.

The plan of this article is as follows. The next section introduces the concept of RS after a summary presentation of the key points involved in the notion of software as a legal object. Section 3 is devoted to discussing the different issues involved in the challenge of reaching a precise definition of data (in the most comprehensive sense of this concept). Section 4 partially describes the landscape of existing work addressing the RD definition, enumerating, again, some of the difficulties in settling such a concept.

There we propose our RD definition, based on three characteristics: the data has been produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a research team, and has yielded a result published or disseminated in some article or scientific contribution of any kind. Comparisons with other RD definitions are also examined.

The final section concludes by proposing some specific answers to Borgman’s conundrum challenges 22 . Let us remark that these conundrum challenges also involve RD dissemination issues, which are studied in detail in the article that follows this work 12 , together with an analysis of RD evaluation and FAIR issues.

The reader of the current work should be aware that its authors are not legal experts. Thus, in order to address our goals in this article, we have analyzed (French, Spanish, European and USA) legal documents and articles written by law experts 1 , 13 , 20 , 21 , 23 – 34 , but from the scientist’s point of view. Yet, a deeper understanding of legal issues may require the intervention of legal specialists.

Following the standard scientific protocol, the authors of this work (mathematicians) have first detected a problem: the need to provide a more suitable RD definition. They have then observed the relevant landscape and studied the related literature, focused on and structured the different components of the problem, and finally proposed what they believe could be a solution for the challenge under consideration. As in any other research work, we, the authors of the present work, believe that our proposal should be examined by the scientific community in order to evaluate its correctness and to help improve it, if needed, advancing towards a better solution.

2. Research Software

In this section we bring together some of the existing definitions of software as a legal object (see references below). We also recall our definition of RS coming from 3 , 18 .

2.1. Software is a legal object

In what follows we refer to the documents 26 – 29 dealing with a definition of software as a legal object. Note that the terms computer program , software , logiciel (in French), programa de ordenador (in Spanish) are synonyms in this work. The terms source code (or código fuente in Spanish), compiled code (or code compilé , código compilado ) correspond to subsets of a computer program.

The first definition that we would like to consider comes from the Directive 2009/24/EC of the European Parliament 26 , that states:

For the purpose of this Directive, the term “computer program” shall include programs in any form, including those which are incorporated into hardware. This term also includes preparatory design work leading to the development of a computer program provided that the nature of the preparatory work is such that a computer program can result from it at a later stage.

Moreover, in the Spanish Boletín Oficial del Estado n. 97 (1996) 27 we can find 12 :

A los efectos de la presente Ley se entenderá por programa de ordenador toda secuencia de instrucciones o indicaciones destinadas a ser utilizadas, directa o indirectamente, en un sistema informático para realizar una función o una tarea o para obtener un resultado determinado, cualquiera que fuere su forma de expresión y fijación. […] comprenderá también su documentación preparatoria.
[For the purpose of this Law, a computer program shall be understood as any sequence of instructions or indications intended to be used, directly or indirectly, in a computer system to perform a function or a task or to obtain a certain result, whatever expression and fixation form it can take. […] it can also include its preparatory documentation.]

Likewise, in the French Journal officiel de la République française (1982) 29 we can read:

Logiciel : Ensemble des programmes, procédés et règles, et éventuellement de la documentation, relatifs au fonctionnement d’un ensemble de traitement de données (en anglais : software). [ Software : All programs, procedures and rules, and possibly documentation, related to the performance of some data processing (in English: software).].

And in the French Code de la propriété intellectuelle (current regulation) 28 , Article L112-2, we can find:

Les logiciels, y compris le matériel de conception préparatoire, sont considérés notamment comme œuvres de l’esprit au sens du présent code. [Software, including the preparatory material, is considered as works protected by the present code.]

We observe that, in the above mentioned documents, the concept of software or computer program, logiciel or programa de ordenador refers to the set of instructions, of any kind, that are to be used in a computer system (including hardware). It is a work protected by author rights (copyright). It can include the source code, the compiled code and, possibly, the associated documentation and the preparatory material. It can be related to some data processing or to other tasks to be implemented in a computer system.

In order to complete this legal vision of the software concept we refer to item (11) of 26 :

For the avoidance of doubt, it has to be made clear that only the expression of a computer program is protected and that ideas and principles which underlie any element of a program, including those which underlie its interfaces, are not protected by copyright under this Directive. In accordance with this principle of copyright, to the extent that logic, algorithms and programming languages comprise ideas and principles, those ideas and principles are not protected under this Directive. In accordance with the legislation and case-law of the Member States and the international copyright conventions, the expression of those ideas and principles is to be protected by copyright.

Indeed, there is a difference between the concepts of algorithm and software from the legal point of view, just as there is a difference between the mere idea for the plot of a novel and the final written work. Several persons could have the same idea for the plot, but its realization in a final document will deliver different novels by different writers, as the novel will reflect the personality of its author. Similarly, an algorithm remains on the side of ideas and, as such, it is not protected by copyright laws. On the other hand, poetry, novels and software are protected under copyright laws. Moreover, a computer program can implement several algorithms, and the same algorithm can be implemented in several programs.

Finally, note that the nature of software as a digital object underlies all the above considerations.

2.2. Software as a research output: definition of Research Software

Beyond the vision of software as a legal object, we bring here the concept of Research Software (RS) as a scientific production, as defined in 3 , 18 :

Research Software is a well identified set of code that has been written by a (again, well identified) research team. It is software that has been built and used to produce a result published or disseminated in some article or scientific contribution. Each research software encloses a set (of files) that contains the source code and the compiled code. It can also include other elements as the documentation, specifications, use cases, a test suite, examples of input data and corresponding output data, and even preparatory material.

Thus, Section 2.1 of 3 introduces several definitions regarding the notions of scientific and research software as found in the literature, as a way to support the above definition, while 18 provides a complementary analysis of this concept. Note that this definition does not take into consideration whether the RS status is “ongoing” or “finalized”, and does not regard whether the RS has been disseminated, its quality or scope, its size, or whether it is documented, maintained, used only by the development team for the production of an article, or currently used in several labs … 2 .

Different recent works on the RS concept can be found, for example, in 35 and the references therein, where the RDA FAIR for Research Software (FAIR4RS) working group 13 proposes a definition of RS full of subtleties and details, albeit, perhaps, of complex interpretation in practice.

We observe, following our proposed definition, that RS can be characterized through three main features:

  • the goal of the RS development is to do research. As stated by D. Kelly: it is developed to answer a scientific question 36 ,
  • it has been written by a research team,
  • the RS is involved in obtaining the results presented in scientific articles (as the most important means for scientific exchange are still articles published in scientific journals) or by any other kind of recognized scientific means.

Note that documentation, licenses, examples, data, tests, Software Management Plans and other related information and materials can also be part of the set of files that constitutes a specific RS. The data we refer to in this list will qualify as RD (as defined in Section 4 ) if they have been produced by a research team, which can be, but need not be, the same team that produced the RS (the role of the research team involved in the development of a RS has been thoroughly studied in Section 2.2 of 3 ). Indeed, Section 2.1 above shows that the preparatory design work and documentation are part of the software, and these are documents that can be included in the released version of a RS, at the discretion of the RS producer team. There can be other elements, such as tests, input and output files illustrating how to use the RS, licenses, etc. Including these elements in the released RS corresponds to best practices that facilitate RS reuse. In our view, the release of a RD (see Section 4 and 12 ) can follow similar practices, that is, include documentation, some usage examples, a license, a data management plan … this is to be decided by the producer team.
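As a purely illustrative sketch of such packaging practices, the following Python snippet checks which of these recommended elements are present in a release directory; the file and folder names are hypothetical conventions that each producer team would adapt:

```python
# Illustrative sketch (our own, not prescribed by the cited works): a small
# check of which recommended elements are present in a released research
# software (or research data) package.
from pathlib import Path

RECOMMENDED_ITEMS = [
    "README.md",      # documentation / usage instructions
    "LICENSE",        # sharing conditions
    "CITATION.cff",   # how to cite the output
    "tests",          # test suite
    "examples",       # input/output examples
    "data",           # accompanying data (may itself qualify as RD)
]

def release_checklist(release_dir: str) -> dict:
    """Return, for each recommended item, whether it is present in the release."""
    root = Path(release_dir)
    return {item: (root / item).exists() for item in RECOMMENDED_ITEMS}

if __name__ == "__main__":
    for item, present in release_checklist(".").items():
        print(f'{"ok     " if present else "missing"}  {item}')
```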

The initial origin of this RS definition is to be found in 2 , which contains a detailed and complete study comparing articles and software produced in a typical (French) research lab. As remarked in comments and Referee reports received on this article, this RS definition (as well as the RD definition proposed in Section 4 ) is placed in what can be considered a narrow context, emphasizing the role of the scientific production context. The relevance of such a context is widely accepted by the scientific community in the case of articles: not every article published in a newspaper qualifies as a research article, which must be released in a scientific journal and subjected to a refereeing procedure. Similarly, the importance of the production context has already been highlighted in the case of data, regarding those that qualify as cultural data 23 .

Besides, our definition does not include as RS either commercial software, existing Free/Open Source Software (FLOSS) or other software developed outside Academia; this restriction does not exclude that RS (or research articles, data...) can be produced in other contexts, such as private laboratories. Rather, it means that we are not considering here differences between private and public funding of research. As a matter of fact, a research team can use RS produced by other teams for its scientific work, as well as FLOSS or other software developed outside the scientific community, but the present work is centered on the production aspects pertinent to the proposed definition. Obviously, a RS initially developed in a research lab can evolve into commercial software or simply evolve outside its initial academic context. The above definition concerns its early, academic life.

Moreover, a RS development team may not just use software produced by other teams, but also include external software as a component inside the ongoing computer program, a procedure that can be facilitated by FLOSS licenses. We consider that such an external component qualifies as RS if it complies with the three characteristics given in the above definition. In any case, the producers of the final RS should clearly identify the included external components and their licenses. They should also highlight the used or included RS components by means of a correct citation form 3 , 8 , 11 , 37 – 39 .
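As an illustration of this recommendation, a producer team might accompany the release with a simple machine-readable list of the external components it includes, together with their origin, license and citation. The sketch below is our own; all names, versions and identifiers are placeholders:

```python
# Hypothetical sketch of how a producer team could list the external components
# included in a RS release, with their origin, license and citation.
external_components = [
    {
        "name": "numerics-lib",                  # hypothetical FLOSS library
        "version": "2.3",
        "origin": "https://example.org/numerics-lib",
        "license": "MIT",                        # SPDX identifier
        "qualifies_as_RS": False,                # developed outside a research context
    },
    {
        "name": "other-team-solver",             # hypothetical component from another lab
        "version": "1.0",
        "origin": "https://example.org/other-team-solver",
        "license": "GPL-3.0-or-later",
        "qualifies_as_RS": True,                 # meets the three characteristics above
        "citation": "Doe et al. (2020), doi:10.1234/example-solver",  # hypothetical
    },
]

for comp in external_components:
    status = "qualifies as RS" if comp["qualifies_as_RS"] else "external (non-RS) component"
    print(f'{comp["name"]} {comp["version"]} / {comp["license"]} / {status}')
```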

Furthermore, a RS may involve other software components that remain external and are not included in the RS development and release. It is then left to the users to retrieve and install them, and to assemble these external components in order to obtain a running environment. Another situation, like the one we have analyzed in 18 , concerns RS developed within a given software environment that is perhaps not fully disseminated with the RS. For example, the dissemination of the GeoGebra code developed by T. Recio and collaborators 14 does not include the whole GeoGebra software 15 , but only those parts that are relevant for their goals and that include their code.

See 2 , 3 , 18 for more discussions and references that have motivated the RS definition we have sketched in this section.

3. The challenges of a data definition

As stated in 40 :

“Data” is a difficult concept to define, as data may take many forms, both physical and digital.

Unlike software, data is much more difficult to grasp as a legal object. In fact, according to 33 , data is not a legal concept, as it does not fall under a specific legal regime. For example, data can be either mere information or une œuvre , a work with associated intellectual property, when its production involves creative choices that reflect the author’s personality 32 . The Knowledge Exchange report 21 provides guidelines that can be used to assess the legal status of research data, and mentions:

It is important to know the legal status of the data to be shared. […] not all data are protected by law, and not every use of protected research data requires the author’s consent. […] Whether data are in fact protected must be determined on a case-by-case basis.

In relation to this legal context of data sharing and reuse, a very complete framework is introduced in 23 :

Les problématiques liées à la réutilisation nécessitent une maîtrise parfaite du droit de la propriété intellectuelle, du droit à l’image, du droit des données personnelles, du respect à la vie privée et du secret de la statistique, du droit des affaires, du droit de la concurrence, du droit de la culture, du droit européen et des règles de l’économie publique. [The issues related to reuse require a perfect mastership of intellectual property rights, image rights, personal data rights, respect for private life and statistical confidentiality, business law, competition law, cultural law, European law and the rules of the public economy.]

Another list of legal issues related to data is provided by 33 , similar but not identical to the one in the previous quote. Yet, it is also necessary to consider other legal contexts concerning, for example, les données couvertes par le secret médical ou le secret industriel et commercial [data covered by medical confidentiality or by industrial and commercial secrecy] 16 . Let us remark that the section Applicable Laws and Regulations of 15 provides a broad overview of regulatory aspects that need to be taken into consideration when developing disciplinary RD management protocols in the European context. But, as stated in the introduction, it is not our intention to go deeper into these legal aspects, which should also be regarded from the perspective of many different laws.

The underlying problem is that data can refer to many different subjects or objects. We need to simplify the context to help us set a manageable concept of research data adapted to the scientific framework. For this purpose we present here two relevant data definitions found in the scientific literature on data.

The OECD data definition in its Glossary of Statistical Terms 17 states that:

DATA Definition: Characteristics or information, usually numerical, that are collected through observation. Context: Data is the physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means (Economic Commission for Europe of the United Nations (UNECE)), “Terminology on Statistical Metadata”, Conference of European Statisticians Statistical Standards and Studies, No. 53, Geneva, 2000.

Also, as a relevant precedent, let us quote here the data definition of the Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest , as mentioned in 41 :

A data set is a collection of related data and information – generally numeric, word oriented, sound, and/or image – organized to permit search and retrieval or processing and reorganizing. Many data sets are resources from which specific data points, facts, or textual information is extracted for use in building a derivative data set or data product. A derivative data set, also called a value-added or transformative data set, is built from one or more preexisting data set(s) and frequently includes extractions from multiple data sets as well as original data (Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest, 1999, p. 15).

We can notice that both definitions combine the concepts of data and information, leading, again, to a challenging situation. Thus, to better grasp the connection between the two terms, we have consulted several sources of a different nature, see Box 1 . Note that in Box 1 information appears among the synonyms of data in the Larousse dictionary, but data is not among the synonyms of information . On the other hand, Wikipedia mentions that both terms are often used interchangeably, but that they have distinct meanings.

A promenade around the data and information concepts.

I.1 Diccionario de la lengua española of the Real Academia Española

  • Definition of dato (https://dle.rae.es/dato)
    – Del latín datum ‘lo que se da’.
    – 1. m. Información sobre algo concreto que permite su conocimiento exacto o sirve para deducir las consecuencias derivadas de un hecho. A este problema le faltan datos numéricos.
    – 2. m. Documento, testimonio, fundamento.
    – 3. m. Inform. Información dispuesta de manera adecuada para su tratamiento por una computadora.
  • Definition of información (https://dle.rae.es/informaci%C3%B3n)
    – Del latín informatio, -ōnis ‘concepto’, ‘explicación de una palabra’.
    – 1. f. Acción y efecto de informar.
    – 2. f. Oficina donde se informa sobre algo.
    – 3. f. Averiguación jurídica y legal de un hecho o delito.
    – 4. f. Pruebas que se hacen de la calidad y circunstancias necesarias en una persona para un empleo u honor. U. m. en pl.
    – 5. f. Comunicación o adquisición de conocimientos que permiten ampliar o precisar los que se poseen sobre una materia determinada.
    – 6. f. Conocimientos comunicados o adquiridos mediante una información.
    – 7. f. Biol. Propiedad intrínseca de ciertos biopolímeros, como los ácidos nucleicos, originada por la secuencia de las unidades componentes.
    – 8. f. desus. Educación, instrucción.

I.2 Diccionnaire Larousse de la langue française

  • Definition of donnée (https://www.larousse.fr/dictionnaires/francais/donn%c3%a9e/26436)
    – Ce qui est connu ou admis comme tel, sur lequel on peut fonder un raisonnement, qui sert de point de départ pour une recherche (ex. Les données actuelles de la biologie).
    – Idée fondamentale qui sert de point de départ, élément essentiel sur lequel est construit un ouvrage (ex. Les données d’une comédie).
    – Renseignement qui sert de point d’appui (ex. Manquer de données pour faire une analyse approfondie).
    – Représentation conventionnelle d’une information en vue de son traitement informatique.
    – Dans un problème de mathématiques, hypothèse figurant dans l’énoncé.
    – Résultats d’observations ou d’expériences faites délibérément ou à l’occasion d’autres tâches et soumis aux méthodes statistiques.
  • Definition of information (https://www.larousse.fr/dictionnaires/francais/information/42993)
    – Action d’informer quelqu’un, un groupe, de le tenir au courant des événements : La presse est un moyen d’information.
    – Indication, renseignement, précision que l’on donne ou que l’on obtient sur quelqu’un ou quelque chose: Manquer d’informations sur les causes d’un accident. (Abréviation familière : info.)
    – Tout événement, tout fait, tout jugement porté à la connaissance d’un public plus ou moins large, sous forme d’images, de textes, de discours, de sons. (Abréviation familière : info.)
    – Nouvelle communiquée par une agence de presse, un journal, la radio, la télévision. (Abréviation familière : info.)
    – Cybernétique. Mesure de la diversité des choix dans un répertoire de messages possibles.
    – Droit. Instruction préparatoire, diligentée par le juge d’instruction en vue de rechercher et de rassembler les preuves d’une infraction, de découvrir l’auteur, de constituer à charge et à décharge le dossier du procès pénal. (Elle est close par un non-lieu ou par un renvoi devant une juridiction répressive. En matière criminelle, l’instruction est à double degré [juge d’instruction, chambre d’accusation].)
    – Informatique. Élément de connaissance susceptible d’être représenté à l’aide de conventions pour être conservé, traité ou communiqué.

I.3 Wikipedia

Extract from the Data page of Wikipedia ( https://en.wikipedia.org/wiki/Data ):

Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable. […] Although the terms “data” and “information” are often used interchangeably, these terms have distinct meanings. […] data are sometimes said to be transformed into information when they are viewed in context or in post-analysis. However, […] data are simply units of information.

Moreover, in 42 and on the ISKO web page 18 , where the concept of data is discussed in detail, an etymological and linguistic view is also the starting point and, among other sources, Wikipedia is also mentioned. The conclusion in 42 (Section 2.5) is:

Therefore, our conclusion of this Section is that Kaase’s (2001, 3251) definition seems the most fruitful one suggested thus far: Data are information on properties of units of analysis.

See also 43 – 45 , where our readers can find further reflections on the concepts of data, information, knowledge, understanding, evidence and wisdom.

Such reflections give us an eclectic panorama of the ingredients that could form a data definition and of their relation with the concept of information, attesting to the difficulties involved in this goal.

Focusing on the scientific context, we can illustrate this complexity in full by referring to the French Code de l’environnement 30 . In its Article L-124-2 19 we can appreciate the subtleties of the definition of environmental data in the following description:

Est considérée comme information relative à l’environnement au sens du présent chapitre toute information disponible, quel qu’en soit le support, qui a pour objet : 1. L’état des éléments de l’environnement, notamment l’air, l’atmosphère, l’eau, le sol, les terres, les paysages, les sites naturels, les zones côtières ou marines et la diversité biologique, ainsi que les interactions entre ces éléments ; 2. Les décisions, les activités et les facteurs, notamment les substances, l’énergie, le bruit, les rayonnements, les déchets, les émissions, les déversements et autres rejets, susceptibles d’avoir des incidences sur l’état des éléments visés au point 1 ; 3. L’état de la santé humaine, la sécurité et les conditions de vie des personnes, les constructions et le patrimoine culturel, dans la mesure où ils sont ou peuvent être altérés par des éléments de l’environnement, des décisions, des activités ou des facteurs mentionnés ci-dessus ; 4. Les analyses des coûts et avantages ainsi que les hypothèses économiques utilisées dans le cadre des décisions et activités visées au point 2 ; 5. Les rapports établis par les autorités publiques ou pour leur compte sur l’application des dispositions législatives et réglementaires relatives à l’environnement. [For the purposes of this chapter, information relating to the environment is considered to be any information available, whatever the medium, the purpose of which is: 1. The state of the elements of the environment, namely the air, atmosphere, water, soil, land, landscapes, natural sites, coastal or marine areas and biological diversity, as well as the interactions between these elements; 2. Decisions, activities and factors, namely substances, energy, noise, radiation, waste, emissions, spills and other discharges, likely to have an impact on the state of the elements concerned in point 1; 3. The state of human health, safety and living conditions of people, buildings and cultural heritage, insofar as they are or may be altered by elements of the environment, decisions, activities or the factors mentioned above; 4. The analyses of costs and advantages as well as the economic assumptions used in the context of the decisions and activities referred to in point 2; 5. Reports drawn up by public authorities or on their behalf on the application of legislative and regulatory provisions related to the environment. ] .

This is to be compared with the much easier to understand concept of geographical data, as introduced in Article L127-1 20 of the same Code de l’environnement 30 :

Donnée géographique, toute donnée faisant directement ou indirectement référence à un lieu spécifique ou une zone géographique ; [Geographic data, any data that refers directly or indirectly to a specific place or geographic area;]

Another example that we would like to mention here, showing the complexity of the representation and manipulation of data and information, corresponds to the linguistic research developed at the Laboratoire d’informatique Gaspard-Monge, where one of the authors of the present work is based; see for example the doctoral theses 46 , 47 .

An additional factor that adds complexity to the concept of scientific data has to do with the potential use(s) and sharing of these data. As remarked by the OECD Glossary of Statistical Terms 21 :

The context provides detailed background information about the definition, its relevance, and in the case of data element definitions, the appropriate use(s) of the element described.

The importance of the context is also noted in 22 :

… research data take many forms, are handled in many ways, using many approaches, and often are difficult to interpret once removed from their initial context.

This opens the door to a series of complex issues, for example the need for complementary technical information or documentation associated with a given dataset in order to facilitate its reuse. See 48 (p. 16) (and also 40 ), which highlights the difficulties raised by the concept of temperature-related data, as explained by a CENS biologist:

There are hundreds of ways to measure temperature. “The temperature is 98” is low-value compared to, “the temperature of the surface, measured by the infrared thermopile, model number XYZ, is 98.” That means it is measuring a proxy for a temperature, rather than being in contact with a probe, and it is measuring from a distance. The accuracy is plus or minus .05 of a degree. I [also] want to know that it was taken outside versus inside a controlled environment, how long it had been in place, and the last time it was calibrated, which might tell me whether it has drifted.
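The point made in this quotation can be illustrated with a minimal sketch: the same numeric value becomes far more reusable when it is recorded together with its context. The field names and values below are our own hypothetical choices, not part of the cited study:

```python
# Illustrative sketch: a measurement recorded together with the contextual
# metadata mentioned in the quotation above (instrument, method, uncertainty,
# environment, deployment and calibration dates). All values are placeholders.
from dataclasses import dataclass, asdict

@dataclass
class TemperatureReading:
    value: float
    unit: str
    instrument: str          # sensor model; "model XYZ" is a placeholder
    method: str              # proxy measurement vs. contact probe
    uncertainty: float
    environment: str         # outdoors vs. controlled environment
    deployed_since: str      # ISO date the sensor was put in place
    last_calibrated: str     # ISO date of the last calibration

reading = TemperatureReading(
    value=98.0,
    unit="degF",
    instrument="infrared thermopile, model XYZ",
    method="proxy measurement from a distance (non-contact)",
    uncertainty=0.05,
    environment="outdoors",
    deployed_since="2021-03-01",
    last_calibrated="2021-02-15",
)

print(asdict(reading))  # the bare number plus the context needed to interpret it
```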

Another instance that further illustrates the complexity of the technical information associated with a data set is the STRENDA Guidelines, which have been developed to assist authors in providing data describing their investigations of enzyme activities. 22

Other examples from the collection of complex issues associated with data use(s) and sharing conditions are:

  • 23 (p. 11) The concept of right of access , involving the meaning of public information, which requires three characteristics: the existence of a document, of an administrative nature, and in the possession of the Public Administration. The related notion of reuse is described as:
… l’utilisation d’une information publique par toute personne qui le souhaite à d’autres fins que celles de la mission de service public pour les besoins de laquelle les documents ont été élaborés ou détenus. [… the use of public information by anyone who wishes to do so, for purposes other than those of the public service mission for whose needs the documents were prepared or held.].

This notion of use finds a strong formulation for scientific data in 49 :

The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research. The public-good interests in the full and open access to and use of scientific data need to be balanced against legitimate concerns for the protection of national security, individual privacy, and intellectual property.

For more information on ‘re-use’ see, for example, 20 , 25 , 32 , 48 .

  • 23 (p. 10) The evolution from the right of access to documents of the Public Administration to the right of reuse of public information, under the following conditions:
    1. the public information is derived from a document produced or held by the Administration,
    2. there are no other intellectual property rights owners,
    3. the data do not affect personal or private matters of individuals.
  • 22 (p. 1060) The concept of data sharing in a scientific context:
For the purposes of this article, data sharing is the release of research data for use by others. Release may take many forms, from private exchange upon request to deposit in a public data collection. Posting datasets on a public website or providing them to a journal as supplementary materials also qualifies as sharing.
  • The importance of licenses to set the sharing and re-use conditions, as highlighted in 5 , 11 , 13 , 50 :
Open data are data in an open format that can be freely used, re-used and shared by anyone for any purpose.
  • 53 also provides a classification of scientific data into four types: observational, experimental, computational and reference data sets.
  • The FAIR guiding principles 17 are studied in the article that follows this work 12 .
  • The recent and relevant introduction of the term Big Data 24 , which refers to the exploitation of very large amounts of data. Such data can appear in medical research, meteorology, genomics, astronomy, demographic studies … and in everyday life, as we all live in a digital world where we generate large amounts of data every day through the use of phones and computers for work, travel, e-mail, business, shopping, etc. 42 . Big data is mainly associated with four “V” characteristics: Volume, Variety, Velocity, Veracity; others can be found, for example, in the aforementioned Wikipedia page and in the references given there. See also 54 .

Closing the conceptual loop developed in this section, let us remark, again, that legal aspects arise quite naturally in the above list of items. Among others, some aspects are related to the fact that datasets are usually organized in databases, where data is arranged in a systematic or methodical way and is individually accessible by electronic or other means 13 , 20 , 21 , 24 , 28 . Intellectual property rights can apply to the content of a database, to the arrangement of its elements and to the tools that make it work (for example, software). The sui generis database right primarily protects the producer of the database and may prohibit, for instance, the extraction and/or reuse of all or a substantial part of its content 24 .

Finally, let us quote this paragraph from the OpenAIRE project report 20 (p. 19), which highlights the difficulty of setting a research data definition in the context of legal studies:

From a legal point of view, one of the very basic questions of this study is which kind of potentially protected data we are dealing with in the context of e-infrastructures for publications and research data such as OpenAIREplus. The term “research data” in this context does not seem to be very helpful, since there is no common definition of what research data basically is. It seems rather that every author or research study in this context uses its own definition of the term. Therefore, the term “research data” will not be strictly defined, but will include any kind of data produced in the course of scientific research, such as databases of raw data, tables, graphics, pictures or whatever else.

We can remark that, although the preceding quote does not provide a strict definition of research data, it highlights the relevance of the production context, as already mentioned in Section 2.2 .

4. Data as a research output: towards a definition for Research Data

In the previous section we have exemplified the complexity of the concept of data through different approaches. In this section we focus on the research data concept, proposing here a RD definition, directly derived from the RS definition presented in Section 2.2 . To this aim we start by gathering some previous definitions that are particularly relevant for our proposal.

The first one is the White House document 34 , in particular its Intangible property section, where we can find the following definition:

Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.

Let us remark that, according to 34 this definition explicitly excludes:

(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and (B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

The above RD definition has been extended in 55 , emphasizing, among other aspects, the scientific purpose of the recorded factual material and the link with the scientific community.

A second basic inspiration for our proposal is the Directive for Open Data 25 , which states:

(Article 2 (27)) The volume of research data generated is growing exponentially and has potential for re-use beyond the scientific community. […] Research data includes statistics, results of experiments, measurements, observations resulting from fieldwork, survey results, interview recordings and images. It also includes meta-data, specifications and other digital objects. Research data is different from scientific articles reporting and commenting on findings resulting from their scientific research. […] (Article 2 (9)) ‘research data’ means documents in a digital form, other than scientific publications, which are collected or produced in the course of scientific research activities and are used as evidence in the research process, or are commonly accepted in the research community as necessary to validate research findings and results;

The third pillar that we consider essential to support our proposal is the OECD report 19 (p. 13), where we can find one of the most widely accepted and adopted definitions of RD:

Research data are defined as factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings. A research data set constitutes a systematic, partial representation of the subject being investigated. This term does not cover the following: laboratory notebooks, preliminary analyses, and drafts of scientific papers, plans for future research, peer reviews, or personal communications with colleagues or physical objects (e.g. laboratory samples, strains of bacteria and test animals such as mice). Access to all of these products or outcomes of research is governed by different considerations than those dealt with here.

Finally, let us bring here the research data definition coming from the “Concordat on Open Research Data” 25 signed by the research councils of the UK Research and Innovation (UKRI) organisation 26 :

Research data are the evidence that underpins the answer to the research question, and can be used to validate findings regardless of its form (e.g. print, digital, or physical). These might be quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, modelling, interview or other methods, or information derived from existing evidence. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set), or derived from existing sources where the rights may be held by others.

Let us observe that this last definition highlights the important role of data as a tool to find an answer to a scientific question, coinciding with the first characteristic of our RS definition, and also agreeing with 40 (p. 508): … data from scientific sensors are a means and not an end for their own research.

A remarkable “positive” aspect of these four definitions is that they separate the data from the subject under study, and establish what is, or is not, RD. This is relevant, as the legal context of the subjects under study sets up the legal (and ethical ) context of the RD.

We must say that we do not completely agree with all the terms of these definitions. For example, we disagree with the exclusion of laboratory notebooks as RD elements, as we think they can be used to generate input data for other studies (for instance, on how a laboratory works, or on the information that appears in some notebooks depending on the scientific subject). We think that this information and data can be of interest to other researchers.

Some other “negative” aspects: the role of the data producers does not appear in the above definitions, although it is more or less implicit when they refer to the connection with the scientific community. Indeed, this role is very important, as observed in 48 (p. 6):

Data creators usually have the most intimate knowledge about a given dataset, gained while designing, collecting, processing, analyzing and interpreting the data. Many individuals may participate in data creation, hence knowledge may be distributed among multiple parties over time.

Certainly, as for any research output, the producer team is the guarantor of the data quality, in particular ensuring that the data are not outdated, erroneous, falsified, irrelevant or unusable. Note that this is particularly relevant in the case of RD, as a consequence of the lack of widely accepted RD publication procedures, compared to the existing ones for articles in scientific journals, where the responsibility for the quality of the publication is somehow shared by the authors, the journal editors and the reviewers. This is also confirmed by 56 (p. 73):

The concept of data quality is determined by multiple factors. The first is trust. This factor is complex in itself. […] Giarlo (2013) also mentions trust in first place, stating that it depends on subjective judgments on authenticity, acceptability or applicability of the data. Trust is also influenced by the given subject discipline, the reputation of those responsible for the creation of the data, and the biases of the persons who are evaluating the data.

Moreover, note that, as remarked in 23 , the nature of the producing legal entity determines the cultural quality of the data in legal terms, thus yielding the qualification of cultural data.

On the other hand, in some of the above definitions the scientific purpose of RD is focused on its role in validating research findings , although RD can be reused for many other purposes in the scientific context, for instance to generate new knowledge, i.e. as primary sources for new scientific findings. Let us observe that these are two of the four rationales for data sharing examined in 22 .

Bearing all these arguments in mind, we propose the following RD definition.

Research data is a well identified set of data that has been produced (collected, processed, analyzed, shared & disseminated) by a (again, well identified) research team. The data has been collected, processed and analyzed to produce a result published or disseminated in some article or scientific contribution. Each research data encloses a set (of files) that contains the dataset, possibly organized as a database; it can also include other elements such as the documentation, specifications, use cases, and any other useful material such as provenance information, instrument information, etc. It can include the research software that has been developed to manipulate the dataset (from short scripts to research software of larger size) or give references to the software that is necessary to manipulate the data (whether or not developed in an academic context).

We can summarize the above definition in the following three main characteristics:

  • the goal of the collection and analysis is to do research, that is, to answer a scientific question (which includes the validation of research findings),
  • it has been produced by a research team,
  • the RD is involved in obtaining the results presented in scientific articles (as the most important means for scientific exchange are still articles published in scientific journals) or by any other kind of recognized scientific means.

We provide here some further considerations concerning this proposal. First, it is clear that we have closely followed the RS definition in Section 2.2 in order to formulate this RD counterpart, which involves the translation to RD of some RS features of a strictly digital nature. This does not mean that we do not consider non-digital data as possible RD; rather, we assume that the information extracted from physical samples has already been treated as digital information to be manipulated in a computer system, which simplifies the manipulation of physical data and its inclusion in the proposed RD definition.

Secondly, we emphasize that our RD definition also assumes a restricted research production context, as in the case of our RS definition. But this limited context for setting the RD definition does not mean that, e.g., public sector data cannot be used in research work. Rather, it means that the external components that have not been directly collected or produced by the research team should be well identified, indicating their origin, where the data is available, and the license that allows their reuse. It is also necessary to indicate whether the data has been reused (processed) without modification, or whether some adaptations were necessary for the analysis. External data components can have any origin, not just the public sector. As highlighted in Section 3 , the production context of the data can be very important, as data can be difficult to interpret once removed from their initial context 22 .

Third, note that, according to our definition, documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD. Moreover, as explained in Section 2.2 , a RS can also include, in its list of materials, data that could also qualify as RD. There is here a broad spectrum of possibilities, according to the size of the outputs, the importance given to them by the research team, and the strategy chosen at the dissemination stage. If the RD is small in size and considered less important than the RS, it can simply be included and disseminated as part of the software; the same holds the other way around, when the RS is considered less important than the RD, for example when the software development effort is much smaller than the time and effort invested in the data collection and analysis. It can also happen that both outputs are considered of equal value and are disseminated separately. In this case it is important that both outputs are linked, to allow other researchers to easily find the other output.
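One simple way to realize such a link is to record, in the metadata of each output, the identifier of the other one. The sketch below is our own illustration, loosely modelled on the related-identifier fields found in common repository metadata; all titles and DOIs are hypothetical:

```python
# Illustrative cross-linking of separately disseminated outputs, loosely
# modelled on related-identifier metadata used by data repositories.
# All titles and DOIs below are hypothetical.

research_data_record = {
    "title": "Survey dataset for experiment X",
    "identifier": "10.1234/example-data",
    "related_identifiers": [
        {"relation": "IsSupplementTo", "identifier": "10.1234/example-article"},
        {"relation": "Requires", "identifier": "10.1234/example-software"},
    ],
}

research_software_record = {
    "title": "Analysis scripts for experiment X",
    "identifier": "10.1234/example-software",
    "related_identifiers": [
        {"relation": "IsSupplementTo", "identifier": "10.1234/example-article"},
        {"relation": "IsRequiredBy", "identifier": "10.1234/example-data"},
    ],
}

for record in (research_data_record, research_software_record):
    links = ", ".join(f'{r["relation"]} {r["identifier"]}' for r in record["related_identifiers"])
    print(f'{record["title"]}: {links}')
```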

In a similar manner to RS, RD can include other data components, some of which may also qualify as research data. The RD producer team should explain how these components have been selected, combined and analyzed, and highlight the reuse of other RD components by means of a correct citation form; see, for example, 38 , 41 , 57 .

Moreover, software and data can have several versions and releases, and they can be manipulated in similar ways and with similar tools (forges, etc.) 37 , 58 , 59 . One of the differences that we have detected between RS and RD is that, while some research teams may decide to give access to early stages of the software development, what we observe in the consulted works is that RD is expected in its final form, ready for reuse, as mentioned in 22 :

If the rewards of the data deluge are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others.

This difference is a consequence of the distinct nature of the building process of the two objects. In the FLOSS community, we find the release early, release often principle associated with the development of the Linux kernel 60 and with Agile development. 27 This principle may not make the same sense in the building of a dataset, for which a research team collects, processes and analyzes data with a very particular research purpose, perhaps difficult to share with a large or external community in the early stages of RD production.

Yet, in this work we do not address production issues such as best software development practices or data curation, as they are out of the scope of the present article and could be the object of future work. This does not mean that we do not appreciate these important issues; they are part of the third step of the proposed CDUR evaluation protocol for RS and RD, see Sections 2.3 and 3.3 of 12 . For us, the research team decides when the research outputs have reached the right status for dissemination. Nor do we enter into the different roles (see 22 ) that may appear in the RD team, taking care of actions such as collection, cleaning, selection, documentation, analysis, curation, preservation and maintenance, or the role of Data Officer proposed in 15 .

5. Conclusion

While some authors highlight differences between software and data 8 , 9 , the present article leans toward profiting from the similarities shared by RS and RD. For example, taking into consideration the difference between the definition of software and the definition of RS has driven us to propose a RD definition that is independent of the definition of data. Likewise, throughout the above sections we have emphasized other characteristics of RD that are grounded in RS features. As a side effect of this approach, the fact that we can easily adapt elements of the RS definition to RD confirms and validates our proposed RS definition.

In the introduction we have mentioned Borgman’s conundrum challenges related to RD 22 :

The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.

In our experience, Borgman's conundrum challenges correspond to questions that appear regularly at different stages of RD production. We think that the vision developed in Section 4 can help to deal with these questions, as a first step to tackle some problems in a well determined situation. Moreover, the view proposed in Section 4 is extended and completed by the dissemination and evaluation protocols of 12 . Our experience of many years confirms the need for these protocols for RS, and we think that they will be appropriate, useful and relevant for RD as well.

As a test for the soundness of the proposed RD definition we have used the conundrum queries as a benchmark, checking if our definition allows us to provide answers to the different questions, as well as to two extra ones that we consider equally relevant, namely how and where to share RD:

Which data might be shared? Following the arguments supporting our RD definition, we think that this is a decision of the research team: just as the team decides, at some stage, to present some research work in the form of a document for dissemination as a preprint, a journal article, a conference paper, a book … the team should decide which data might be shared, in which form and when (possibly following funder or institutional Open Science requirements).

By whom? The research team that has collected, processed and analyzed the RD, and decided to share and disseminate it. That is the RD producer team, as stated in the second characteristic of our RD definition. On the other hand, data ownership issues have been discussed, for example, in 20 , 21 , 32 , 61 – 63 .

How? As observed in the preceding sections, the How? should follow some kind of dissemination procedure like the one proposed in 11 , 12 , in order to correctly identify the RD set of files, to set a title and the list of persons in the producer team (possibly completed with their different roles), to determine the important versions and associated dates, to provide documentation, to verify the legal 21 , 33 (and ethical ) context of the RD, and to give the license that settles the sharing conditions 13 , etc. This can include the publication of a data paper and decisions about the form in which, and when, the RD should be disseminated (possibly following grant funder or institutional Open Science requirements). In order to increase the return on public investments in scientific research, RD dissemination could respect principles and follow guidelines such as those described in 17 , 19 . Further analysis of RD dissemination issues can be found in 12 .
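As a purely illustrative example of the information just listed, a minimal release record could look as follows; the structure and all values are our own assumptions, not a standard schema:

```python
# Minimal, purely illustrative sketch of a RD release record: title, producer
# team and roles, versions and dates, documentation, license and link to the
# associated publication. All values are placeholders.

rd_release_record = {
    "title": "Field measurements for study Y",             # hypothetical title
    "producer_team": [
        {"name": "A. Researcher", "role": "collection"},
        {"name": "B. Researcher", "role": "analysis and curation"},
    ],
    "versions": [
        {"version": "1.0", "date": "2021-06-01", "note": "first public release"},
        {"version": "1.1", "date": "2021-09-15", "note": "documentation added"},
    ],
    "documentation": "README.md",
    "license": "CC-BY-4.0",                                 # SPDX identifier
    "associated_publication": "10.1234/example-article",    # hypothetical DOI
}

latest = max(rd_release_record["versions"], key=lambda v: v["date"])
print(f'Latest release: {latest["version"]} ({latest["date"]})')
```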

Where? There are different places to disseminate RD, including the web pages of the producer team or of the funded project, or an existing data repository. Let us remark that the Registry of Research Data Repositories 28 is a global registry of RD repositories covering different academic disciplines. It is funded by the German Research Foundation (DFG) 29 and can help to find the right repository. Note that the Science Europe report 64 provides criteria for selecting trustworthy repositories in which to deposit RD.

With whom? Each act of scholarly communication has its own target public and, initially, the RD dissemination strategy can target the same public as the one interested in the corresponding research article. But it can happen that the RD is of interdisciplinary value, possibly wider than the initial discipline associated with the scientific publication, and assessing which public is involved in this larger context can be difficult. Indeed, as observed in 22 :

An investigator may be part of multiple, overlapping communities of interest, each of which may have different notions of what are data and different data practices. The boundaries of communities of interest are neither clear nor stable.

So, it can be complex to determine the community of interest for a particular RD, but this also happens for articles: see, for example, the study on HIV/AIDS 65 , whose reference number 12 points to work on automatic reasoning in elementary geometry; it seems to us that this has never been an obstacle to sharing a publication. Thus 22 :

… the intended users may vary from researchers within a narrow specialty to the general public.

Under what conditions? As described previously, and in parallel with the case of RS, the sharing conditions are to be found in the license that accompanies the RD, such as a Creative Commons license 30 or other licenses settling the attribution, re-use, mining … conditions 13 . For example, in France, the 2016 Law for a Digital Republic sets out, in a Décret , the list of licenses that can be used for RS or RD release 31 , 32 .

Why and to what effects? There may be different reasons to release some RD, from contributing to building more solid and easier to validate science, to simply complying with the recommendations or requirements of the project funder, of the institutions supporting the research team, or of a scientific journal, including Open Science issues 5 . The works 22 , 49 give a thorough analysis of this subject. As documented there and already mentioned in Section 3 :

“The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research.”

As remarked in 5 and in the work analyzed there, the evaluation step is an important enabler for improving the adoption of Open Science best practices and increasing RD sharing and open access. Disseminating high quality RD outputs is a task that requires time, work and hands willing to verify the quality of the data, write the associated documentation, etc. Incentives are needed to motivate teams to accomplish these tasks. RD dissemination also calls for the establishment of best citation practices and for an evolution of research evaluation protocols. In particular, following the parallelism present throughout this work, the CDUR protocol 3 proposed for RS evaluation can also be proposed for RD, as developed in the article that extends the present work 12 .

Data availability

Acknowledgments.

With many thanks to the Referees, to the Departamento de Matemáticas, Estadística y Computación de la Universidad de Cantabria (Spain) for hospitality, and to Prof. T. Margoni for useful comments and references.

[version 2; peer review: 3 approved]

Funding Statement

This work is partially funded by the CNRS-International Emerging Action (IEA) PREOSI (2021-22).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

1 “We Are Star Dust” - Symphony of Science, https://www.youtube.com/watch?v=8g4d-rnhuSg

2 Cave of Altamira and Paleolithic Cave Art of Northern Spain, https://whc.unesco.org/en/list/310/

3 Prehistoric Sites and Decorated Caves of the Vézère Valley, https://whc.unesco.org/en/list/85/

4 https://en.wikipedia.org/wiki/Free_and_open-source_software

5 https://en.unesco.org/science-sustainable-future/open-science/recommendation

6 https://www.fsf.org/licensing/

7 https://opensource.org/licenses

8 https://spdx.org/licenses/

9 https://creativecommons.org/licenses/?lang=en

10 https://opendatacommons.org/licenses/

11 https://5stardata.info/en/

12 Note that the authors of this article provide their own translations. Authors prefer to keep the original text for two reasons. First, because of the legal nature of the involved quotations. Second, for French or Spanish speaking readers to enjoy it, very much in line with the Helsinki Initiative on Multilingualism in Scholarly Communication (2019), see https://doi.org/10.6084/m9.figshare.7887059 . These translations have been helped by Google Translate, https://translate.google.com/ and Linguee, https://www.linguee.fr/ .

13 https://www.rd-alliance.org/groups/fair-research-software-fair4rs-wg

14 https://matek.hu/zoltan/issac-2021.php

15 https://swmath.org/software/4203

16 See, for example, https://www.senat.fr/dossier-legislatif/pjl16-504.html

17 https://stats.oecd.org/glossary/detail.asp?ID=532

18 https://www.isko.org/cyclo/data

19 https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI000006832922/

20 https://www.legifrance.gouv.fr/codes/section_lc/LEGITEXT000006074220/LEGISCTA000022936254/

21 The entries of the glossary https://stats.oecd.org/glossary/ have several parts including Definition and Context as shown in the Data definition included in Section 3 . This quotation appears when placing the pointer over the Context part of the Data entry.

22 https://www.beilstein-institut.de/en/projects/strenda/guidelines/

23 https://en.wikipedia.org/wiki/Open_data

24 https://en.wikipedia.org/wiki/Big_data

25 https://www.ukri.org/wp-content/uploads/2020/10/UKRI-020920-ConcordatonOpenResearchData.pdf

26 https://www.ukri.org/

27 https://en.wikipedia.org/wiki/Agile_software_development

28 https://www.re3data.org/

29 http://www.dfg.de/

30 https://creativecommons.org/

Reviewer response for version 2

Joachim Schopfel

1 GERiiCO Labor, University of Lille, Lille, France

The second version is fine with me. The authors replied to all comments; they fixed some issues, and they provided complementary arguments for other issues. I do not share all their viewpoints but that is science and not a problem. The paper is interesting and relevant.

Is the work clearly and accurately presented and does it cite the current literature?

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

No source data required

Is the study design appropriate and is the work technically sound?

Are the conclusions drawn adequately supported by the results?

Are sufficient details of methods and analysis provided to allow replication by others?

Reviewer Expertise:

Information science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Remedios Melero

1 Instituto de Agroquímica y Tecnología de Alimentos, CSIC, Valencia, Spain

I do not have any further comments.

Reviewer Expertise: Open science, open research data, scholarly publications, open access policies

Tibor Koltay

1 Institute of Learning Technologies, Eszterházy Károly University, Eger, Hungary

I am satisfied with the authors' reply, and found the other two reviewers' comments intriguing and useful for the authors. I have no further comments.

Reviewer response for version 1

Research data management is a central dimension of the development of scientific research and related infrastructures. Also, any original attempt to define research data is welcome and helpful for the understanding of this field. This conceptual paper will be a valuable contribution to the discussion on research data. Yet, it should be improved, and a couple of more or less minor issues should be fixed.

  • First, all cited text should be systematically translated into English.
  • The main concepts (such as data, information, knowledge...) should be defined from the beginning and not only later (section 3). The definitions should not be based on Wikipedia, Larousse etc. but on academic works in the field of information science (e.g. ISKO). 
  • Open science is a fuzzy concept, an umbrella term or even a "boundary object" (as Samuel Moore put it). But it should be made clear that open science is more than "sharing and dissemination of research outputs" (as in the [5] citation). 
  • The former comment is important because the approach of the paper is in some way limited or reduced to the aspect of "research output". Generally, in the research process, research software and research data are not only output but also tools (software) and input (data). This needs clarification.
  • In the same context, the paper cites Wikipedia with " We must all  accept that science is data and that data are science". I have two problems with this: nobody must accept anything in science, all is a matter of discussion; and this sentence is either trivial or it makes no sense. My advice would be to avoid this kind of sentence.
  • Later on, the paper presents "analogies" between RS and RD. Analogy, even if it exists, does not mean "similarity", and I think that this comparison is somehow misleading because the underlying assumption is not entirely correct ("a definition for RD can be proposed following the main features of the RS definition"). Software and data are different objects, with different issues (IP protection, communities etc.); the analysis of RS may be helpful for a better understanding of RD but this does not mean that both are more or less similar or even "fungible".
  • In section 3, I would suggest that the paper tries to describe the relationship between RS and RD, perhaps with "use cases". 
  • I admit that the authors are not legal experts but section 3 should be more explicit (and perhaps shorter and more restrictive) about the different laws and legal frameworks. Are you speaking about French laws? Or about the EU regulation?
  • Another, related issue is the data typology. The paper is about research data but section 3 mentions (and apparently does not differentiate) environmental data, cultural data and public sector information. 
  • My suggestion would be to improve the structure of section 3 and to distinguish between concepts, typology, legal status and reuse/policy (subsections).
  • Section 4: I already mentioned it above - RD is not only output but also input, with different issues (third party rights etc). This requires clarification.
  • At the end of section 4, the paper states that "documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD". The meaning of this statement requires attention, as well as its implications. Does this mean that "RDM and other documents" are data? Or that they may be part of a RD deposit? But again (see above, comment 5), a statement that "all is data" is not helpful; it may make sense as a political catchword but not in an academic paper.
  • Last comment: I like very much Borgman's assessment of RD and her "conundrum challenges" but I have a somewhat different understanding of the meaning of this - for me, these "challenges" are questions that require attention and evaluation in a given situation, not for all RD in a general way. For me, they provide a kind of "reading grid" to analyse a specific data community, or a specific instrument or infrastructure or workflow; but they don't require or demand a comprehensive response as such provided by the paper.
  • Anyway, the paper is an interesting contribution to the academic research on RD, and I am looking forward to reading a revised and improved version. Thank you!

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

LIGM, Gustave Eiffel University & CNRS, France

Many thanks to you, Joachim Schopfel, for your interesting comments that give us the opportunity to improve this work. A new version is in preparation, but we provide here some answers to your comments.

1. [translations into English]

Translations are included as footnotes; they will be moved to the main text.

2. [information science (eg ISKO).]

Many thanks for this reference, we are looking into it.

3. [Open science is a fuzzy concept…]

As indicated in the introduction: A more transversal and global vision can be found in the ongoing work for the UNESCO Recommendation on Open Science [Reference 6]. See also [Reference 7]. We will explain this point better.

4. [the paper is in some kind limited or reduced to the aspect of "research output". Generally, in the research process, research software and research data are not only output but also tools (software) and input (data). This needs clarification.]

In our view, each "research output" is a potential input for new research work. For example, a RS can be a tool to manipulate data or an input for a new RS, whether in the form of a component or of a new version produced by the initial research team or by another one. A RD can be used by other teams (as a tool) to understand some problem, it can be modified to produce a new RD, or it can be included as part of a larger data set, which can in turn be a new RD. Understanding the production context better is not, in our view, a limitation. But you are right, this point needs clarification.

5. [cites Wikipedia with " We must all accept that science is data and that data are science ".]

Please note that this cited phrase comes from [Reference 4], and as indicated to Referee T. Koltay, we have chosen to make this reference in a slightly different manner than in Borgman's work, where we found it.

6.1. [similarity/analogy]

When consulting the Cambridge English Learner's Dictionary we find:

analogy: a comparison that shows how two things are similar

6.2. [Software and data are different objects, with different issues (IP protection, communities etc.); the analysis of RS may be helpful for a better understanding of RD but this does not mean that both are more or less similar or even "fungible".]

It is one of the intentions of the present work to show the differences between data and software from the legal point of view. While software finds a somewhat clear and simple presentation (Section 2.1), data is much more difficult to grasp, as studied in Section 3. But this is not an obstacle to presenting a unified vision of RS and RD as research outputs, as can be seen in the proposed RS and RD definitions. The fact that we can propose a similar formulation for both definitions allows us to propose similar dissemination and evaluation protocols, as you can find in the article that follows this work [Reference 13]. The fact that we can deal with RS and RD in a similar way does not mean that they are similar.

7. [describe the relationship between RS and RD, perhaps with "use cases".]

It seems to us that it is quite usual for the targeted research audience to use and/or produce RS and/or RD as part of their everyday research practices, and that this point does not require further explanation. Examples can easily be found in the literature, for example in the bibliography included at the end of this work.

8. [I admit that the authors are not legal experts but section 3 should be more explicit (and perhaps shorter and more restrictive) about the different laws and legal frameworks. Are you speaking about French laws? Or about the EU regulation?]

As indicated in the introduction, we have consulted legal texts and legal experts' work in order to understand and explain the legal context in which we place this work. We have consulted French, European and USA texts, and selected the parts that we have used to document the article. We consider that our role is restricted to this intention, due to the lack of further expertise in legal matters, which does not diminish the efforts we have put in to understand and to explain some legal issues. But we are unable to give more information on the regulations that could be taken into consideration, as this is the role of legal experts in the light of a well-defined setting.

9. [Another, related issue is the data typology. The paper is about research data but section 3 mentions (and apparently does not differentiate) environmental data, cultural data and public sector information.]

The goal of Section 3 is to show the difficulties that exist in setting a data definition from the legal point of view, which is a very different context from the one existing for software, as shown in Section 2.1. The case of cultural data is very interesting, as, legally speaking [Reference 19], the quality of the producing legal entity defines the cultural quality of the data. We can then establish a parallel: the research quality of some data set is a consequence of the research quality of the producing team. Data typology could be the object of future work.

10. [My suggestion would be to improve the structure of section 3 and to distinguish between concepts, typology, legal status and reuse/policy (subsections).]

We will consider this suggestion.

11. [Section 4: I already mentioned it above - RD is not only output but also input, with different issues (third party rights etc). This requires clarification.]

As already explained, we study here the production aspects, and other aspects are presented in [Reference 13]. But you are right, this needs a better explanation.

12. [At the end of section 4, the paper states that "documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD". ]

Section 2.1 shows that the preparatory design work and documentation are part of the software, and these are documents that can be included in the released version of a RS, following the choice of the RS producer team. There can be other elements, for example tests, input and output files to illustrate how to use the RS, licenses, etc. Including these elements in the released RS corresponds to best practices that facilitate RS reuse. In our view, releasing a RD can follow similar practices, that is, it can include documentation, some use examples, a license, a data management plan… this is to be decided by the producer team.

13. [Last comment: I like very much Borgman's assessment of RD and her "conundrum challenges" but I have a somewhat different understanding of the meaning of this - for me, these "challenges" are questions that require attention and evaluation in a given situation, not for all RD in a general way. For me, they provide a kind of "reading grid" to analyse a specific data community, or a specific instrument or infrastructure or workflow; but they don't require or demand a comprehensive response as such provided by the paper.]

In our experience, Borgman's conundrum challenges correspond to questions that appear regularly at different stages of RD production. We think that providing a vision such as the one exposed in Section 4 can help to deal with these questions and, as you said, serve as a first step to tackle some problems in a well-determined situation. Moreover, the view proposed in Section 4 is extended and completed with the dissemination and evaluation protocols proposed in [Reference 13]. Our experience of many years confirms the need for these protocols for RS, and we think that they will be appropriate, useful and relevant for RD as well.

Teresa Gomez-Diaz and Tomas Recio

The authors proposed a Research Data (RD) definition "based in three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and has yield a result published or disseminated in some article or scientific contribution of any kind." From my point of view this definition restricts RD to data that are published by a scientific team; however, what about citizen science, or data produced by non-scientist staff? What about any other data that do not deserve to be published but help to further research?

Authors say: "the RS is involved in the obtention of the results presented in scientific articles" - This is not necessarily true. RS is not always involved in the obtention of results because it can be developed for any other purpose, again the authors make a very strict definition.

Authors say: "As a matter of fact, a research team can use RS produced by other teams for their scientific work, as well as FLOSS or other software developed outside the scientific community, but the present work is centered in the making-of aspects which are pertinent for the proposed definition." - This restricts the definition of Research Software (RS) a lot by excluding all FLOSS produced by non-academic members.

The authors have missed any mention of Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information, in which RD are defined and included as part of the public sector. In fact, the authors have cited it but they have not commented on the fact that RD has a wider meaning and that, according to this Directive, research data are considered public sector information, and they need not necessarily be published in a scientific journal but shared.

Definitions given by dictionaries are not particularly relevant to the scientific context/environment. I think this part should be omitted; it only adds some definitions in the authors' own languages.

"For example, to the need for complementary, technical information associated to a given dataset in order to facilitate its reuse." - This is part of the FAIR principles which are not mentioned/linked to this comment. Obviously, a dataset without any information about how data have been produced/obtained, etc. are not valuable.

Authors write: "In here, the research outputs have reached a status in which the research team is happy enough for its dissemination." - This seems a very naïve assertion. Because the authors "do not consider production issues like best software development practices or data curation", it seems they do not care about these important issues.

Conclusions again repeat the proposal of a RD definition. Concepts like linked data, FAIR data, and open data have not been treated in the article. Their definition of RD is very strict and narrow, and they have not considered any semantic issues about data or the benefits and implications of being 5-star open data . Their definition is far from the 4th or 5th step of the stars.

In general, from my point of view, the article does not add any new ideas about RD definition and restricts it to data produced by scientific teams.

Many thanks to you, Remedios Melero, for these very interesting comments. We are preparing a new version of this article and will include several of the proposed corrections. Meanwhile, we would like to provide here some preliminary comments.

1. [this definition restricts RD to those that are published by a scientific team, however what about the citizen science, or data produced by non-scientist staff?]

[the article does not add any new ideas about RD definition and restricts it to data produced by scientific teams.]

It would be strange to consider any article published in a newspaper as a scientific publication.

On the other hand, scientists may read newspapers and many other documents, including tweets, and may use these documents as input information for a research work. As already explained in our answer to Rob Hooft's comment, yes, we have chosen a restricted definition for RD. It allows us to provide the answers to Borgman's conundrum challenges that are in the Conclusion section. As far as we know, we have not found in the consulted literature such a complete set of answers. Moreover, as the RD definition finds a formulation similar to the RS one, we can also translate RS dissemination and evaluation protocols to RD [Reference 13]. Once the restricted context is well understood, its extension can be studied, and we can then see which answers to Borgman's conundrum challenges and which dissemination and evaluation protocols can be proposed in the extended context.

The fact that we do not include e.g. public sector data as RD is different from the claim that these data cannot be used as input for a research work. As explained in section 3.2 of [Reference 13], these external data components should be correctly presented and referenced, and some can also fall into the category of RD.

2. [RS is not always involved in the obtention of results because it can be developed for any other purpose, again the authors make a very strict definition.]

[This restricts the definition of Research Software (RS) a lot by excluding all FLOSS produced by non-academic members.]

You are right, this point should be explained better. Obtaining a research result may involve the use of software (FLOSS or not), the development of software to support some work or service, and the development of RS by the research team, as explained in [References 3, 14]. Note that RS can also be disseminated as FLOSS, which is the usual practice in the work of T. Recio and in the research lab of T. Gomez-Diaz. The situation is similar for data and RD, which can be disseminated as open data, as well as for publications and research articles, as seen in the previous point.

3. [Research data defined in the  Directive (EU) 2019/1024 ]

This definition was included in the preparatory versions of the present article, and it will be included again in the new version in preparation, following your advice.

4. [Definitions given by dictionaries]

Given the difficulty of explaining easily the concepts of data and information, we ended up consulting several dictionaries, including some in English. Some of the definitions found, mainly in Spanish and French, attracted our attention and we decided to include them in Box 1. This box can be easily skipped by readers not interested in these definitions.

We prefer to leave the reading of the content of this box to the choice of readers.

5. [FAIR and "For example, to the need for complementary, technical information associated to a given dataset in order to facilitate its reuse."]

Please note that the FAIR principles appear in [Reference 55], dated 2016, while [Reference 36], which we have chosen to illustrate the need for complementary, technical information, is dated 2012. Moreover, this is also related to the importance of context, which is explained in the OECD Glossary of Statistical Terms, with PDF and WORD download versions dated 2007 [ https://stats.oecd.org/glossary/download.asp ]. On the other hand, the FAIR principles are considered in the second part of this work [Reference 13], as they are related to dissemination issues. We will also mention them in the second version of this first part.

6. ["In here, the research outputs have reached a status in which the research team is happy enough for its dissemination."]

[authors "do not consider production issues like best software development practices or data curation", it seems they do not care about these important issues.]

You are right, this point should be better explained in the new version of the article. It is not that we do not care about these important issues; they are part of the 3rd step of the proposed CDUR evaluation protocol for RS and RD, see sections 2.3 and 3.3 of [Reference 13].

7. [Concepts like linked data, FAIR data, and open data have not been treated in the article. Their definition of RD is very strict and narrow, and they have not considered any semantic issues about data and the benefits and implications of being a 5star open data . Their definition is far from the 4th or 5th step of the stars.]

Please note that FAIR data and open data are treated in [Reference 13]. We will include a mention of the 5-star open data in the second version; many thanks for this reference.

Teresa Gomez-Diaz, Tomas Recio

The content of the first two paragraphs of the paper (especially the first one) seems to be less appropriate, compared to the purpose of your paper. I would thus advise you to consider rewriting these paragraphs.

Your practice of providing the cited texts in the original language (French or Spanish) and providing the translations of these passages only in the footnotes is unusual and may not be appropriate for a readership that probably reads and writes only in English, or is not familiar with Spanish and/or French texts. As I see it, if you want to do a favour to your readers who prefer French or Spanish, the solution could be to reverse this order, i.e. to put the original texts into the footnotes.

Other remarks

I think that it would be better if the following sentence would be changed as follows:

  • “Indeed, as remarked by Hanson et al ., we must all accept that science is data and that data are science… 4 ”

This regards not only the form of citing, but content, because this remark comes from Borgman’s Conundrum, cited in your paper a couple of times.

You describe three main characteristics of RS:

  • “the goal of the RS development is to do research. As stated by D. Kelly: it is developed to answer a scientific question 32 ,
  • it has been written by a research team,
  • the RS is involved in the obtention of the results presented in scientific articles (as the most important means for scientific exchange are still articles published in scientific journals).”

In general, these three claims are correct. However, the first one of them is a little awkward. I would thus change it to something like "the goal of the RS development is to support research. As stated by Kelly, it is developed to answer a general, or a specific scientific question. Writing the software requires close involvement of someone with deep domain knowledge in the application area related to the question. 32 ". These sentences, however, may prove redundant, because you provide a more complete definition:

  • “Research Software is a well identified set of code that has been written by a (again, well identified) research team…” If we take this longer definition only, the expression “(again, well identified)” should be deleted.

You write that “Indeed, there is a difference between the concepts of algorithm and software from the legal point of view, as there is a difference between the mere idea for the plot of a novel and the final written work.” This is a brilliant idea, although I believe that it should not be restricted to the legal point of view.

In my view, it seems to be dangerous to write about copyright issues without being legal experts. Personally, I have only basic knowledge of copyright laws, so I cannot judge the correctness of all your arguments. Fortunately, what you describe is also related to different issues.

I do not see any further problems. Therefore, I will not enumerate passages that are correct and rather straightforward. My suggestion is, however, that you carefully review your text in order to reach clarity in your argument.

Many thanks to you, Tibor Koltay, for these very interesting comments. We are preparing a new version of this article and will include several of the proposed corrections. Meanwhile, we would like to provide here some preliminary comments.

1. [first two paragraphs]

We have chosen to start in a "light" manner an article that may require some effort to be understood; this is our choice as authors. It is the reader's choice to skip these first two paragraphs or to enjoy them, as this has no consequence for the understanding of the content of the article.

2. [translations to English]

We agree with you: the translations to English in the footnotes may hinder the fluent reading of this work, and we will modify the presentation.

3. [Hanson et al. Reference]

You are right, we found this reference in Borgman's work, but we have consulted the original article and have chosen to make this reference in a slightly different manner.

4. [RS definition characteristics]

We will modify the phrase to include your proposition as follows: “the goal of the RS development is to do or to support research”. Please note that the composition of a research team involved in the development of a RS has been thoroughly studied in section 2.2 of [Reference 3]. We will include this reference to clarify this point as you ask. Please also note that long developments may involve many different contributions from developers with different status. As copyright issues enter into play, it is important that the RS developers and contributors are correctly listed.

5. [Algorithms and software]

Comparisons between algorithms and software can be made in several contexts, for example in mathematics or in computer science, among others. We have highlighted the legal aspects because we regularly detect confusion between these two concepts, and [Reference 22] provides a fairly clear explanation.

6. [Copyright issues]

Please note that one of the authors has studied copyright issues in order to write [Reference 2], a work that has been validated by several experts, including legal experts. On the other hand, we are regularly in contact with and follow the work of legal experts, in such a manner as to provide us with the necessary confidence to deal with copyright issues in the way we propose in this article. The remark included at the end of the Introduction gives the necessary warning to our readers on this point.



Data Analysis – Process, Methods and Types


Data Analysis

Definition:

Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It involves applying various statistical and computational techniques to interpret and derive insights from large datasets. The ultimate aim of data analysis is to convert raw data into actionable insights that can inform business decisions, scientific research, and other endeavors.

Data Analysis Process

The following is a step-by-step guide to the data analysis process:

Define the Problem

The first step in data analysis is to clearly define the problem or question that needs to be answered. This involves identifying the purpose of the analysis, the data required, and the intended outcome.

Collect the Data

The next step is to collect the relevant data from various sources. This may involve collecting data from surveys, databases, or other sources. It is important to ensure that the data collected is accurate, complete, and relevant to the problem being analyzed.

Clean and Organize the Data

Once the data has been collected, it needs to be cleaned and organized. This involves removing any errors or inconsistencies in the data, filling in missing values, and ensuring that the data is in a format that can be easily analyzed.
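A minimal cleaning sketch in Python with pandas, assuming a hypothetical survey.csv file with age, income and city columns (the file name, columns and thresholds are illustrative only):

```python
# Minimal data-cleaning sketch (hypothetical survey.csv with 'age', 'income', 'city' columns).
import pandas as pd

df = pd.read_csv("survey.csv")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Fill missing numeric values with the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Normalize a free-text column: trim whitespace and standardize capitalization.
df["city"] = df["city"].str.strip().str.title()

# Drop implausible ages to guard against data-entry errors.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

df.to_csv("survey_clean.csv", index=False)
```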

Analyze the Data

The next step is to analyze the data using various statistical and analytical techniques. This may involve identifying patterns in the data, conducting statistical tests, or using machine learning algorithms to identify trends and insights.

Interpret the Results

After analyzing the data, the next step is to interpret the results. This involves drawing conclusions based on the analysis and identifying any significant findings or trends.

Communicate the Findings

Once the results have been interpreted, they need to be communicated to stakeholders. This may involve creating reports, visualizations, or presentations to effectively communicate the findings and recommendations.

Take Action

The final step in the data analysis process is to take action based on the findings. This may involve implementing new policies or procedures, making strategic decisions, or taking other actions based on the insights gained from the analysis.

Types of Data Analysis

Types of Data Analysis are as follows:

Descriptive Analysis

This type of analysis involves summarizing and describing the main characteristics of a dataset, such as the mean, median, mode, standard deviation, and range.
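A minimal sketch of these summary measures, using Python's standard statistics module on made-up values:

```python
# Descriptive statistics for a small illustrative sample.
import statistics

values = [12, 15, 15, 18, 21, 24, 24, 24, 30, 35]

print("mean:", statistics.mean(values))      # arithmetic average
print("median:", statistics.median(values))  # middle value
print("mode:", statistics.mode(values))      # most frequent value
print("stdev:", statistics.stdev(values))    # sample standard deviation
print("range:", max(values) - min(values))   # spread between extremes
```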

Inferential Analysis

This type of analysis involves making inferences about a population based on a sample. Inferential analysis can help determine whether a certain relationship or pattern observed in a sample is likely to be present in the entire population.
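For example, a two-sample t-test asks whether two groups plausibly share the same mean. A minimal SciPy sketch on illustrative numbers:

```python
# Two-sample t-test sketch: do the two groups differ in mean?
from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (commonly p < 0.05) suggests the observed difference
# is unlikely to arise from sampling variation alone.
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level")
```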

Diagnostic Analysis

This type of analysis involves identifying and diagnosing problems or issues within a dataset. Diagnostic analysis can help identify outliers, errors, missing data, or other anomalies in the dataset.

Predictive Analysis

This type of analysis involves using statistical models and algorithms to predict future outcomes or trends based on historical data. Predictive analysis can help businesses and organizations make informed decisions about the future.
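A minimal sketch of this idea, fitting a linear model with scikit-learn on invented monthly sales figures and forecasting the next two months:

```python
# Predictive sketch: fit a linear trend on historical data, predict future values.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])   # feature: month index
sales = np.array([100, 110, 125, 130, 150, 160])    # target: units sold (invented)

model = LinearRegression().fit(months, sales)
forecast = model.predict(np.array([[7], [8]]))       # next two months
print("Forecast for months 7-8:", forecast.round(1))
```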

Prescriptive Analysis

This type of analysis involves recommending a course of action based on the results of previous analyses. Prescriptive analysis can help organizations make data-driven decisions about how to optimize their operations, products, or services.

Exploratory Analysis

This type of analysis involves exploring the relationships and patterns within a dataset to identify new insights and trends. Exploratory analysis is often used in the early stages of research or data analysis to generate hypotheses and identify areas for further investigation.

Data Analysis Methods

Data Analysis Methods are as follows:

Statistical Analysis

This method involves the use of mathematical models and statistical tools to analyze and interpret data. It includes measures of central tendency, correlation analysis, regression analysis, hypothesis testing, and more.
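As a small illustration, the sketch below computes a Pearson correlation and a least-squares regression line with NumPy on invented data:

```python
# Correlation and simple least-squares regression (illustrative data).
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_score = np.array([52, 55, 61, 64, 70, 74, 78, 85])

# Pearson correlation: strength of the linear relationship (-1 to 1).
r = np.corrcoef(hours_studied, exam_score)[0, 1]

# Least-squares line: score ~ slope * hours + intercept.
slope, intercept = np.polyfit(hours_studied, exam_score, 1)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```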

Machine Learning

This method involves the use of algorithms to identify patterns and relationships in data. It includes supervised and unsupervised learning, classification, clustering, and predictive modeling.
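A minimal unsupervised-learning sketch, grouping invented customer records into two clusters with scikit-learn's k-means:

```python
# Clustering sketch: group customers by annual spend and visits per month.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [200, 2], [220, 3], [250, 2],      # low-spend, infrequent (invented)
    [900, 10], [950, 12], [1000, 11],  # high-spend, frequent (invented)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster labels:", kmeans.labels_)
print("Cluster centres:", kmeans.cluster_centers_)
```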

Data Mining

This method involves using statistical and machine learning techniques to extract information and insights from large and complex datasets.

Text Analysis

This method involves using natural language processing (NLP) techniques to analyze and interpret text data. It includes sentiment analysis, topic modeling, and entity recognition.
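The toy sketch below illustrates the idea behind lexicon-based sentiment scoring with a tiny hand-made word list; a real project would use an NLP library (such as NLTK or spaCy) and a validated lexicon:

```python
# Toy sentiment scoring: count positive minus negative words.
positive = {"good", "great", "excellent", "useful", "clear"}
negative = {"bad", "poor", "confusing", "slow", "broken"}

def sentiment(text: str) -> int:
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

reviews = [
    "great product, very useful and clear",
    "slow delivery and confusing instructions",
]
for review in reviews:
    print(sentiment(review), "->", review)
```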

Network Analysis

This method involves analyzing the relationships and connections between entities in a network, such as social networks or computer networks. It includes social network analysis and graph theory.
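A small sketch with the NetworkX library, computing degree centrality on an invented friendship graph:

```python
# Who is most connected in a small (invented) social network?
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Ana", "Ben"), ("Ana", "Carla"), ("Ben", "Carla"),
    ("Carla", "David"), ("David", "Elena"),
])

# Degree centrality: fraction of the other nodes each person is linked to.
centrality = nx.degree_centrality(g)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```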

Time Series Analysis

This method involves analyzing data collected over time to identify patterns and trends. It includes forecasting, decomposition, and smoothing techniques.
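A minimal smoothing sketch, computing a three-month moving average over an invented monthly series with pandas:

```python
# Rolling mean: flatten month-to-month noise to expose the trend.
import pandas as pd

index = pd.date_range("2023-01-01", periods=12, freq="MS")  # month starts
sales = pd.Series([100, 98, 105, 110, 108, 115, 120, 118, 125, 130, 128, 135],
                  index=index)

trend = sales.rolling(window=3).mean()
print(trend.tail())
```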

Spatial Analysis

This method involves analyzing geographic data to identify spatial patterns and relationships. It includes spatial statistics, spatial regression, and geospatial data visualization.

Data Visualization

This method involves using graphs, charts, and other visual representations to help communicate the findings of the analysis. It includes scatter plots, bar charts, heat maps, and interactive dashboards.
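A minimal sketch producing a bar chart with Matplotlib from invented regional totals:

```python
# Simple bar chart of totals per category.
import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
totals = [120, 95, 140, 110]

plt.bar(categories, totals, color="steelblue")
plt.title("Sales by region")
plt.xlabel("Region")
plt.ylabel("Units sold")
plt.tight_layout()
plt.savefig("sales_by_region.png")  # or plt.show() in an interactive session
```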

Qualitative Analysis

This method involves analyzing non-numeric data such as interviews, observations, and open-ended survey responses. It includes thematic analysis, content analysis, and grounded theory.

Multi-criteria Decision Analysis

This method involves analyzing multiple criteria and objectives to support decision-making. It includes techniques such as the analytical hierarchy process, TOPSIS, and ELECTRE.

Data Analysis Tools

There are various data analysis tools available that can help with different aspects of data analysis. Below is a list of some commonly used data analysis tools:

  • Microsoft Excel: A widely used spreadsheet program that allows for data organization, analysis, and visualization.
  • SQL : A programming language used to manage and manipulate relational databases.
  • R : An open-source programming language and software environment for statistical computing and graphics.
  • Python : A general-purpose programming language that is widely used in data analysis and machine learning.
  • Tableau : A data visualization software that allows for interactive and dynamic visualizations of data.
  • SAS : A statistical analysis software used for data management, analysis, and reporting.
  • SPSS : A statistical analysis software used for data analysis, reporting, and modeling.
  • Matlab : A numerical computing software that is widely used in scientific research and engineering.
  • RapidMiner : A data science platform that offers a wide range of data analysis and machine learning tools.

Applications of Data Analysis

Data analysis has numerous applications across various fields. Below are some examples of how data analysis is used in different fields:

  • Business : Data analysis is used to gain insights into customer behavior, market trends, and financial performance. This includes customer segmentation, sales forecasting, and market research.
  • Healthcare : Data analysis is used to identify patterns and trends in patient data, improve patient outcomes, and optimize healthcare operations. This includes clinical decision support, disease surveillance, and healthcare cost analysis.
  • Education : Data analysis is used to measure student performance, evaluate teaching effectiveness, and improve educational programs. This includes assessment analytics, learning analytics, and program evaluation.
  • Finance : Data analysis is used to monitor and evaluate financial performance, identify risks, and make investment decisions. This includes risk management, portfolio optimization, and fraud detection.
  • Government : Data analysis is used to inform policy-making, improve public services, and enhance public safety. This includes crime analysis, disaster response planning, and social welfare program evaluation.
  • Sports : Data analysis is used to gain insights into athlete performance, improve team strategy, and enhance fan engagement. This includes player evaluation, scouting analysis, and game strategy optimization.
  • Marketing : Data analysis is used to measure the effectiveness of marketing campaigns, understand customer behavior, and develop targeted marketing strategies. This includes customer segmentation, marketing attribution analysis, and social media analytics.
  • Environmental science : Data analysis is used to monitor and evaluate environmental conditions, assess the impact of human activities on the environment, and develop environmental policies. This includes climate modeling, ecological forecasting, and pollution monitoring.

When to Use Data Analysis

Data analysis is useful when you need to extract meaningful insights and information from large and complex datasets. It is a crucial step in the decision-making process, as it helps you understand the underlying patterns and relationships within the data, and identify potential areas for improvement or opportunities for growth.

Here are some specific scenarios where data analysis can be particularly helpful:

  • Problem-solving : When you encounter a problem or challenge, data analysis can help you identify the root cause and develop effective solutions.
  • Optimization : Data analysis can help you optimize processes, products, or services to increase efficiency, reduce costs, and improve overall performance.
  • Prediction: Data analysis can help you make predictions about future trends or outcomes, which can inform strategic planning and decision-making.
  • Performance evaluation : Data analysis can help you evaluate the performance of a process, product, or service to identify areas for improvement and potential opportunities for growth.
  • Risk assessment : Data analysis can help you assess and mitigate risks, whether it is financial, operational, or related to safety.
  • Market research : Data analysis can help you understand customer behavior and preferences, identify market trends, and develop effective marketing strategies.
  • Quality control: Data analysis can help you ensure product quality and customer satisfaction by identifying and addressing quality issues.

Purpose of Data Analysis

The primary purposes of data analysis can be summarized as follows:

  • To gain insights: Data analysis allows you to identify patterns and trends in data, which can provide valuable insights into the underlying factors that influence a particular phenomenon or process.
  • To inform decision-making: Data analysis can help you make informed decisions based on the information that is available. By analyzing data, you can identify potential risks, opportunities, and solutions to problems.
  • To improve performance: Data analysis can help you optimize processes, products, or services by identifying areas for improvement and potential opportunities for growth.
  • To measure progress: Data analysis can help you measure progress towards a specific goal or objective, allowing you to track performance over time and adjust your strategies accordingly.
  • To identify new opportunities: Data analysis can help you identify new opportunities for growth and innovation by identifying patterns and trends that may not have been visible before.

Examples of Data Analysis

Some Examples of Data Analysis are as follows:

  • Social Media Monitoring: Companies use data analysis to monitor social media activity in real-time to understand their brand reputation, identify potential customer issues, and track competitors. By analyzing social media data, businesses can make informed decisions on product development, marketing strategies, and customer service.
  • Financial Trading: Financial traders use data analysis to make real-time decisions about buying and selling stocks, bonds, and other financial instruments. By analyzing real-time market data, traders can identify trends and patterns that help them make informed investment decisions.
  • Traffic Monitoring : Cities use data analysis to monitor traffic patterns and make real-time decisions about traffic management. By analyzing data from traffic cameras, sensors, and other sources, cities can identify congestion hotspots and make changes to improve traffic flow.
  • Healthcare Monitoring: Healthcare providers use data analysis to monitor patient health in real-time. By analyzing data from wearable devices, electronic health records, and other sources, healthcare providers can identify potential health issues and provide timely interventions.
  • Online Advertising: Online advertisers use data analysis to make real-time decisions about advertising campaigns. By analyzing data on user behavior and ad performance, advertisers can make adjustments to their campaigns to improve their effectiveness.
  • Sports Analysis : Sports teams use data analysis to make real-time decisions about strategy and player performance. By analyzing data on player movement, ball position, and other variables, coaches can make informed decisions about substitutions, game strategy, and training regimens.
  • Energy Management : Energy companies use data analysis to monitor energy consumption in real-time. By analyzing data on energy usage patterns, companies can identify opportunities to reduce energy consumption and improve efficiency.

Characteristics of Data Analysis

Characteristics of Data Analysis are as follows:

  • Objective : Data analysis should be objective and based on empirical evidence, rather than subjective assumptions or opinions.
  • Systematic : Data analysis should follow a systematic approach, using established methods and procedures for collecting, cleaning, and analyzing data.
  • Accurate : Data analysis should produce accurate results, free from errors and bias. Data should be validated and verified to ensure its quality.
  • Relevant : Data analysis should be relevant to the research question or problem being addressed. It should focus on the data that is most useful for answering the research question or solving the problem.
  • Comprehensive : Data analysis should be comprehensive and consider all relevant factors that may affect the research question or problem.
  • Timely : Data analysis should be conducted in a timely manner, so that the results are available when they are needed.
  • Reproducible : Data analysis should be reproducible, meaning that other researchers should be able to replicate the analysis using the same data and methods.
  • Communicable : Data analysis should be communicated clearly and effectively to stakeholders and other interested parties. The results should be presented in a way that is understandable and useful for decision-making.

Advantages of Data Analysis

Advantages of Data Analysis are as follows:

  • Better decision-making: Data analysis helps in making informed decisions based on facts and evidence, rather than intuition or guesswork.
  • Improved efficiency: Data analysis can identify inefficiencies and bottlenecks in business processes, allowing organizations to optimize their operations and reduce costs.
  • Increased accuracy: Data analysis helps to reduce errors and bias, providing more accurate and reliable information.
  • Better customer service: Data analysis can help organizations understand their customers better, allowing them to provide better customer service and improve customer satisfaction.
  • Competitive advantage: Data analysis can provide organizations with insights into their competitors, allowing them to identify areas where they can gain a competitive advantage.
  • Identification of trends and patterns : Data analysis can identify trends and patterns in data that may not be immediately apparent, helping organizations to make predictions and plan for the future.
  • Improved risk management : Data analysis can help organizations identify potential risks and take proactive steps to mitigate them.
  • Innovation: Data analysis can inspire innovation and new ideas by revealing new opportunities or previously unknown correlations in data.

Limitations of Data Analysis

  • Data quality: The quality of data can impact the accuracy and reliability of analysis results. If data is incomplete, inconsistent, or outdated, the analysis may not provide meaningful insights.
  • Limited scope: Data analysis is limited by the scope of the data available. If data is incomplete or does not capture all relevant factors, the analysis may not provide a complete picture.
  • Human error : Data analysis is often conducted by humans, and errors can occur in data collection, cleaning, and analysis.
  • Cost : Data analysis can be expensive, requiring specialized tools, software, and expertise.
  • Time-consuming : Data analysis can be time-consuming, especially when working with large datasets or conducting complex analyses.
  • Overreliance on data: Data analysis should be complemented with human intuition and expertise. Overreliance on data can lead to a lack of creativity and innovation.
  • Privacy concerns: Data analysis can raise privacy concerns if personal or sensitive information is used without proper consent or security measures.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer

