Scientific Research and Methodology: An introduction to quantitative research and statistics

10 Collecting data

So far, you have learnt to ask a RQ and design the study. In this chapter, you will learn how to:

  • record the important steps in data collection.
  • describe study protocols.
  • ask survey questions.


10.1 Protocols

If the RQ is well-constructed, terms are clearly defined, and the study is well designed and explained, then the process for collecting the data should be easy to describe. Data collection is often time-consuming, tedious and expensive, so collecting the data correctly first time is important.

Before collecting the data, a plan should be established and documented that explains exactly how the data will be obtained, which will include operational definitions (Sect. 2.11 ). This plan is called a protocol .

Definition 10.1 (Protocol) A protocol is a procedure documenting the details of the design and implementation of studies, and for data collection.

Unforeseen complications are not unusual, so often a pilot study (or a practice run ) is conducted before the real data collection, to identify problems with the study design or data collection, or ways to improve the study design or data collection. The pilot study may suggest changes to the protocol. (Pilot studies may also be useful for determining the size of the sample; see Sect.  30.4 .)

Definition 10.2 (Pilot study) A pilot study is a small test run of the study protocol used to check that the protocol is appropriate and practical, and to identify (and hence fix) possible problems with the study design or protocol.

A pilot study allows the researcher to:

  • determine the feasibility of the data collection protocol.
  • identify unforeseen challenges.
  • obtain data to determine appropriate sample sizes (Sect. 30 ).
  • potentially save time and money.
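One common use of pilot data is to feed the standard sample-size formula for estimating a proportion, \(n = z^2 p(1-p)/E^2\). A minimal sketch in Python (the pilot proportion and margin of error are hypothetical values, not from any study in this chapter):

```python
import math

def sample_size_for_proportion(p_pilot: float, margin: float, z: float = 1.96) -> int:
    """Sample size needed to estimate a proportion to within `margin`
    with approximately 95% confidence (z = 1.96), using a pilot estimate."""
    n = (z ** 2) * p_pilot * (1 - p_pilot) / margin ** 2
    return math.ceil(n)  # always round up: a fraction of a subject is not possible

# Hypothetical pilot study: 30% of respondents answered 'yes';
# we want the final estimate to be within 5 percentage points.
print(sample_size_for_proportion(0.30, 0.05))  # 323
```

Note that \(p = 0.5\) gives the largest (most conservative) sample size, which is why it is often used when no pilot estimate is available.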


The data can be collected once the protocol has been finalised. Protocols ensure studies are repeatable (Sect. 4.4), so that others can confirm or compare results, and can understand exactly what was done and how. Protocols should indicate how design aspects (such as blinding the individuals, random allocation of treatments, etc.) will be implemented. The final protocol should be reported, without pedantic detail. Diagrams can be useful to support explanations. All studies should have a well-established protocol describing how the study was done.

A protocol usually has at least three components that describe:

  • how individuals are chosen from the population (i.e., external validity);
  • how information is collected from the individuals (i.e., internal validity); and
  • the analyses, and what software (and version) was used.

Example 10.1 (Protocol) To increase the nutritional value of cookies, researchers made cookies using pureed green peas in place of margarine (Romanchik-Cerpovicz, Jeffords, and Onyenwoke 2018). The researchers wanted to assess the acceptance of these cookies by college students.

The protocol discussed how the individuals were chosen (p. 4):

...through advertisement across campus from students attending a university in the southeastern United States.

This voluntary sample comprised \(80.6\)% women, a higher percentage of women than in either the general population or the college population. (Other extraneous variables were also recorded.)

Exclusion criteria were also applied, excluding people "with an allergy or sensitivity to an ingredient used in the preparation of the cookies" (p. 5). The researchers also described how the data was obtained from the individuals (p. 5):

During the testing session, panelists were seated at individual tables. Each cookie was presented one at a time on a disposable white plate. Samples were previously coded and randomized. The presentation order for all samples was \(25\) %, \(0\) %, \(50\) %, \(100\) % and \(75\) % substitution of fat with puree of canned green peas. To maintain standard procedures for sensory analysis [...], panelists cleansed their palates between cookie samples with distilled water ( \(25^\circ\) C) [...] characteristics of color, smell, moistness, flavor, aftertaste, and overall acceptability, for each sample of cookies [was recorded]...

Thus, internal validity was managed using random allocation, blinding individuals, and washouts. Details are also given of how the cookies were prepared, and how objective measurements (such as moisture content) were determined.
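Randomisation and blinding steps like these are easy to script, and recording the script (and seed) in the protocol makes them repeatable. A sketch (the seed and sample codes below are illustrative, not those used in the cookie study):

```python
import random

substitutions = [0, 25, 50, 75, 100]  # % of fat replaced with pea puree

random.seed(2018)  # record the seed so the randomisation is reproducible

# Blind the samples: give each substitution level an arbitrary, unique
# 3-digit code, so panelists cannot infer the treatment from the label.
codes = dict(zip(substitutions, random.sample(range(100, 1000), k=5)))

# Randomise the presentation order once for all panelists
# (per-panelist shuffles are another common choice).
order = substitutions.copy()
random.shuffle(order)

for s in order:
    # The panelist sees only the code; the researcher keeps this key.
    print(f"Present sample {codes[s]} ({s}% substitution)")
```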

The analyses and software used were also given.

Consider this partial protocol, which is admirably honest:

Fresh cow dung was obtained from free-ranging, grass fed, and antibiotic-free Milking Shorthorn cows ( Bos taurus ) in the Tilden Regional Park in Berkeley, CA. Resting cows were approached with caution and startled by loud shouting, whereupon the cows rapidly stood up, defecated, and moved away from the source of the annoyance. Dung was collected in ZipLoc bags ( \(1\) gallon), snap-frozen and stored at \(-80^\circ\) C. --- Hare et al. (2008), p. 10

10.2 Collecting data using questionnaires

10.2.1 Writing questions

Collecting data using questionnaires is common for both observational and experimental studies. Questionnaires are very difficult to do well: question wording is crucial, and surprisingly difficult to get right (Fink 1995), so pilot testing questionnaires is essential!

Definition 10.3 (Questionnaire) A questionnaire is a set of questions for respondents to answer.

A questionnaire is a set of questions used to obtain information from individuals. A survey is an entire methodology that includes gathering data using a questionnaire, finding a sample, and other components.

Questions in a questionnaire may be open-ended (respondents can write their own answers) or closed (respondents select from a small number of possible answers, as in multiple-choice questions). Open and closed questions both have advantages and disadvantages. Answers to open questions more easily lend themselves to qualitative analysis.

This section briefly discusses writing questions (Sect. 10.2 ).

Example 10.2 (Open and closed questions) German students were asked a series of questions about microplastics ( Raab and Bogner 2021 ) , including:

  • Name sources of microplastics in the household.
  • In which ecosystems are microplastics in Germany? Tick the answer (multiple ticks are possible). Options : (a) sea; (b) rivers; (c) lakes; (d) groundwater.
  • Assess the potential danger posed by microplastics. Options : (a) very dangerous; (b) dangerous; (c) hardly dangerous; (d) not dangerous.

The first question is open: respondents could provide their own answers. The second question is closed, where multiple options can be selected. The third question is closed, where only one option can be selected.

When framing questionnaire questions, remember:

  • Avoid leading questions , which may lead respondents to answer a certain way. Question wording is the usual reason for leading questions.
  • Avoid ambiguity : avoid unfamiliar terms and unclear questions.
  • Avoid asking the uninformed : avoid asking respondents about issues they don't know about. Many people will give a response even if they do not understand (such responses are worthless). For example, people may give directions to places that do not even exist ( Collett and O’Shea 1976 ) .
  • Avoid complex and double-barrelled questions , which can be hard to understand.
  • Avoid problems with ethics : avoid questions about people breaking laws, or revealing confidential or private information. In special cases and with justification, ethics committees may allow such questions.
  • Ensure clarity in question wording.
  • Ensure options are mutually exclusive , so that each answer fits into only one category.
  • Ensure options are exhaustive , so that the categories cover all possible answers.

Example 10.3 (Poor question wording) Consider a questionnaire asking these questions:

  • Because bottles from bottled water create enormous amounts of non-biodegradable landfill and hence threaten native wildlife, do you support banning bottled water?
  • Do you drink more water now?
  • Are you more concerned about Coagulase-negative Staphylococcus or Neisseria pharyngis in bottled water?
  • Do you drink water in plastic and glass bottles?
  • Do you have a water tank installed illegally, without permission?
  • Do you avoid purchasing water in plastic bottles unless it is carbonated, unless the bottles are plastic but not necessarily if the lid is recyclable?

Question 1 is leading because the expected response is obvious.

Question 2 is ambiguous : it is unclear what 'more water now' is being compared to.

Question 3 is unlikely to be answerable, as most people will be uninformed . Nonetheless, many people will still give an opinion. This data will be effectively useless, but the researcher may not realise this.

Question 4 is double-barrelled , and would be better asked as two separate questions (one asking about plastic bottles, and one about glass bottles).

Question 5 is unlikely to be given ethical approval or to obtain truthful answers, as respondents are unlikely to admit to breaking rules.

Question 6 is unclear , since it is hard to know what a 'yes' or 'no' answer would even mean.

Example 10.4 (Question wording) Question wording can be important ( Jardina 2018 ) .

In the 2014 General Social Survey ( https://gss.norc.org ), when white Americans were asked for their opinion of the amount America spends on welfare , \(58\) % of respondents answered 'Too much'.

However, when white Americans were asked for their opinion of the amount America spends on assistance to the poor , only \(16\) % of respondents answered 'Too much'.

Example 10.5 (Leading question) Consider this question:

Do you like this new orthotic?

This question is leading , since liking is the only option presented. Better would be:

Do you like or dislike this new orthotic?

Example 10.6 (Mutually exclusive options) In a study to determine the time doctors spent on patients (from Chan et al. ( 2008 ) ), doctors were given the options:

  • \(0\) -- \(5\)  mins;
  • \(5\) -- \(10\)  mins; or
  • more than \(10\)  mins.

This is a poor question, because a respondent does not know which option to select for an answer of ' \(5\)  minutes'. The options are not mutually exclusive .
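A standard fix is to use half-open intervals, so that every possible answer falls into exactly one category. A sketch (the category labels are invented for illustration):

```python
def time_category(minutes: float) -> str:
    """Mutually exclusive, exhaustive categories using half-open intervals:
    [0, 5), [5, 10), [10, infinity). An answer of exactly 5 minutes now
    has exactly one home."""
    if minutes < 5:
        return "under 5 mins"
    elif minutes < 10:
        return "5 to under 10 mins"
    else:
        return "10 mins or more"

print(time_category(5))  # '5 to under 10 mins': no ambiguity at the boundary
```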

The following (humorous) video shows how questions can be manipulated by those not wanting to be ethical:

10.2.2 Challenges using questionnaires

Using questionnaires presents myriad challenges.

  • Non-response bias (Sect. 5.10): Non-response bias is common with questionnaires, as they are often used with voluntary-response samples. The people who do not respond to the survey may be different from those who do respond.
  • Response bias (Sect. 5.10): People do not always answer truthfully; for example, what people say may not correspond with what people do (Sect. 9.4). Sometimes this is unintentional (e.g., due to poor question wording); sometimes it is due to embarrassment, or because questions are controversial. Sometimes, respondents repeatedly provide the same answer when a series of multiple-choice questions is presented (perhaps due to boredom).
  • Recall bias : People may not be able to accurately recall past events, or when they happened.
  • Question order : The order of the questions can influence the responses.
  • Interpretation : Phrases and words such as "Sometimes" and "Somewhat disagree" may mean different things to different people.

Many of these can be managed with careful questionnaire design, but discussing these methods is beyond the scope of this book.

10.3 Chapter summary

Having a detailed procedure for collecting the data (the protocol ) is important. Using a pilot study to trial the protocol can often reveal unexpected changes necessary for a good protocol. Creating good questionnaire questions is difficult, but important.

10.4 Quick review questions

What is the biggest problem with this question: 'Do you have bromodosis?'

What is the biggest problem with this question: 'Do you spend too much time connected to the internet?'

What is the biggest problem with this question: 'Do you eat fruits and vegetables?'

Which of these are reasons for producing a well-defined protocol?

  • It allows the researchers to make the study externally valid. TRUE FALSE
  • It ensures that others know exactly what was done. TRUE FALSE
  • It ensures that the study is repeatable for others. TRUE FALSE

Which of the following questionnaire questions are likely to be leading questions?

  • Do you, or do you not, believe that permeable pavements are a viable alternative to traditional pavements? TRUE FALSE
  • Do you support a ban on bottled water? TRUE FALSE
  • Do you believe that double-gloving by paramedics reduces the risk of infection, increases the risk of infection, or makes no difference to the risk of infection? TRUE FALSE
  • Should Ireland ban breakfast cereals with unhealthy sugar levels? TRUE FALSE

10.5 Exercises

Selected answers are available in App.  E .

Exercise 10.1 What is the problem with this question?

What is your age? (Select one option) Under \(18\) ; Over \(18\)

Exercise 10.2 What is the problem with this question?

How many children do you have? (Select one option) None; 1 or 2; 2 or 3; More than 4

Exercise 10.3 Which of these questionnaire questions is better? Why?

  • Should concerned cat owners vaccinate their pets?
  • Should domestic cats be required to be vaccinated or not?
  • Do you agree that pet-owners should have their cats vaccinated?

Exercise 10.4 Which of these questionnaire questions is better? Why?

  • Do you own an environmentally-friendly electric vehicle?
  • Do you own an electric vehicle?
  • Do you own or do you not own an electric vehicle?

Exercise 10.5 In a study of sunscreen use ( Falk and Anderson 2013 ) , participants were asked questions, including these:

  • How often do you sun bathe with the intention to tan during the summer in Sweden? (Possible answers: never, seldom, sometimes, often, always).
  • How long do you usually stay in the sun between \(11\) am and \(3\) pm, during a typical day-off in the summer (June--August)? (Possible answers: \(<30\)  min, \(30\) min-- \(1\) h, \(1\) -- \(2\) h, \(2\) -- \(3\) h, \(>3\) h).

Critique these questions. What biases may be present?

Exercise 10.6 In a study of children's knowledge of their natural environment ( Morón-Monge, Hamed, and Morón Monge 2021 ), primary school children (from Andalusia, Spain) were asked three questions; the response options included:

  • No, I don’t like parks.
  • No, I don’t usually visit it.
  • Yes, once per week.
  • Yes, more than once a week
  • Two to three times
  • More than three times
  • Write a story
  • Draw a picture

Which questions are open and which are closed ? Critique the questions.

Data Collection | Definition, Methods & Examples

Published on June 5, 2020 by Pritha Bhandari. Revised on June 21, 2023.

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The  aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Step 1: Define the aim of your research

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement : what is the practical or scientific issue that you want to address, and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analyzed through statistical methods .
  • Qualitative data is expressed in words and analyzed through interpretations and categorizations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data. If you have several aims, you can use a mixed methods approach that collects both types of data.

For example, suppose you are studying employee perceptions of their managers:

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.



Step 2: Choose your data collection method

Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Step 3: Plan your data collection procedures

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design (e.g., determine inclusion and exclusion criteria ).

Operationalization

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalization means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.
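An operational definition like this can be made completely explicit in code. A sketch, assuming (hypothetically) that the three 5-point items are averaged into one score:

```python
from statistics import mean

def leadership_score(delegation: int, decisiveness: int, dependability: int) -> float:
    """Hypothetical operationalisation: 'leadership skill' measured as the
    mean of three 5-point rating items."""
    items = (delegation, decisiveness, dependability)
    assert all(1 <= i <= 5 for i in items), "items are rated on a 1-5 scale"
    return mean(items)

print(leadership_score(4, 3, 5))  # mean of the three item ratings
```

Writing the definition down this precisely (which items, which scale, how they are combined) is exactly what distinguishes an operational definition from a conceptual one.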

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and timeframe of the data collection.
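For a simple random sample, the selection step itself is short to script. A sketch under a hypothetical sampling frame of 500 numbered individuals:

```python
import random

# Hypothetical sampling frame: individuals numbered 1..500.
population = list(range(1, 501))

random.seed(42)  # record the seed in your protocol for repeatability
sample = random.sample(population, k=50)  # simple random sample, n = 50

# Sampling is without replacement, so no individual appears twice.
print(sorted(sample)[:5])  # first few selected IDs
```

The hard parts of sampling, of course, are constructing the frame and recruiting the selected individuals, not the selection itself.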

Standardizing procedures

If multiple researchers are involved, write a detailed manual to standardize data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorize observations. This helps you avoid common research biases like omitted variable bias or information bias .

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organize and store your data.

  • If you are collecting data from people, you will likely need to anonymize and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers).
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimize distortion.
  • You can prevent loss of data by having an organization system that is routinely backed up.

Step 4: Collect the data

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1–5. The data produced is numerical and can be statistically analyzed for averages and patterns.

To ensure that high quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality.
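Range checks can automate part of that double-checking. A minimal sketch for a hypothetical 1-to-5 rating item:

```python
def validate_rating(raw: str) -> int:
    """Check a manually entered 1-5 rating at entry time,
    rather than discovering bad values at analysis time."""
    value = int(raw)  # raises ValueError for non-numeric entries
    if not 1 <= value <= 5:
        raise ValueError(f"rating {value} is outside the 1-5 scale")
    return value

print(validate_rating("3"))  # a valid entry passes through unchanged
```

Running every manually keyed value through a validator like this catches typos (e.g. "33" for "3") immediately, when the original paper form is still at hand.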



Frequently asked questions about data collection

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organizations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g. understanding the needs of your consumers or user testing your website)
  • You can control and standardize the process for high reliability and validity (e.g. choosing appropriate measurements and sampling methods )

However, there are also some drawbacks: data collection can be time-consuming, labor-intensive and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the  consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity   refers to the  accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

Operationalization means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioral avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalize the variables that you want to measure.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .


Enago Academy

Write an Error-free Research Protocol As Recommended by WHO: 21 Elements You Shouldn’t Miss!


Principal Investigator: Did you draft the research protocol?

Student: Not yet. I have too many questions about it. Why is it important to write a research protocol? Is it similar to a research proposal? What should I include in it? How should I structure it? Is there a specific format?

Early-stage researchers often fall short in understanding the purpose and importance of some supplementary documents, let alone how to write them. Let's improve your understanding of how to write an acceptance-worthy research protocol.


What Is a Research Protocol?

The research protocol is a document that describes the background, rationale, objective(s), design, methodology, statistical considerations and organization of a clinical trial. It is a document that outlines the clinical research study plan, and it should be designed to provide a satisfactory answer to the research question. In effect, the protocol is the cookbook for conducting your study.

Why Is a Research Protocol Important?

In clinical research, the research protocol is of paramount importance. It forms the basis of a clinical investigation. It ensures the safety of the clinical trial subjects and the integrity of the data collected. Serving as a binding document, the research protocol states what you are, and are not, allowed to study as part of the trial. Furthermore, it is also considered to be the most important document in your application to your Institutional Review Board (IRB).

It is written with contributions and input from a medical expert, a statistician, a pharmacokinetics expert, the clinical research coordinator, and the project manager, to ensure all aspects of the study are covered in the final document.

Is a Research Protocol the Same As a Research Proposal?

Often misinterpreted as such, a research protocol is not the same as a research proposal: the protocol documents how the study will be conducted, whereas the proposal makes the case for conducting (and often funding) it.

What Are the Elements/Sections of a Research Protocol?

According to Good Clinical Practice guidelines laid by WHO, a research protocol should include the following:

Research Protocol

1. General Information

  • Protocol title, protocol identifying number (if any), and date.
  • Name and address of the funder.
  • Name(s) and contact details of the investigator(s) responsible for conducting the research, the research site(s).
  • Responsibilities of each investigator.
  • Name(s) and address(es) of the clinical laboratory(ies), other medical and/or technical department(s) and/or institutions involved in the research.

2. Rationale & Background Information

  • The rationale and background information provides specific reasons for conducting the research in light of pertinent knowledge about the research topic.
  • It is a statement that includes the problem that is the basis of the project, the cause of the research problem, and its possible solutions.
  • It should be supported with a brief description of the most relevant literature published on the research topic.

3. Study Objectives

  • The study objectives state what the investigators hope to accomplish. The research is planned based on this section.
  • The objectives should be simple, clear, specific, and stated prior to conducting the research.
  • They may be divided into primary and secondary objectives based on their relevance to the research problem and its solution.

4. Study Design

  • The study design justifies the scientific integrity and credibility of the research study.
  • The study design should include information on the type of study, the research population or the sampling frame, participation criteria (inclusion, exclusion, and withdrawal), and the expected duration of the study.

5. Methodology

  • The methodology section is the most critical section of the research protocol.
  • It should include detailed information on the interventions to be made, procedures to be used, measurements to be taken, observations to be made, laboratory investigations to be done, etc.
  • The methodology should be standardized and clearly defined if multiple sites are engaged in a specified protocol.

6. Safety Considerations

  • The safety of participants is a top priority when conducting clinical research.
  • Safety aspects of the research should be scrutinized and provided in the research protocol.

7. Follow-up

  • The research protocol should clearly indicate what follow-up will be provided to the participating subjects.
  • It must also include the duration of the follow-up.

8. Data Management and Statistical Analysis

  • The research protocol should include information on how the data will be managed, including data handling and coding for computer analysis, monitoring and verification.
  • It should clearly outline the statistical methods proposed to be used for the analysis of data.
  • For qualitative approaches, specify in detail how the data will be analysed.

9. Quality Assurance

  • The research protocol should clearly describe the quality control and quality assurance system.
  • These include Good Clinical Practice (GCP), follow-up by clinical monitors, a data and safety monitoring board (DSMB), data management, etc.

10. Expected Outcomes of the Study

  • This section indicates how the study will contribute to the advancement of current knowledge, how the results will be utilized beyond publications.
  • It must mention how the study will affect health care, health systems, or health policies.

11. Dissemination of Results and Publication Policy

  • The research protocol should specify not only how the results will be disseminated in the scientific media, but also to the community and/or the participants, the policy makers, etc.
  • The publication policy should be clearly discussed as to who will be mentioned as contributors, who will be acknowledged, etc.

12. Duration of the Project

  • The protocol should clearly mention the time likely to be taken for completion of each phase of the project.
  • Furthermore, a detailed timeline for each activity to be undertaken should also be provided.

13. Anticipated Problems

  • The investigators may face some difficulties while conducting the clinical research. This section must list all anticipated problems in successfully completing the project.
  • Furthermore, it should also provide possible solutions to deal with these difficulties.

14. Project Management

  • This section includes detailed specifications of the role and responsibility of each investigator in the team.
  • Everyone involved in the research project must be mentioned here, along with the specific duties they will perform in completing the research.

15. Ethical Considerations

  • The research protocol should also describe the ethical considerations relating to the study.
  • These should not be limited to providing ethics approval, but should also cover the issues that are likely to raise ethical concerns.
  • Additionally, the ethics section must describe how the investigator(s) plan to obtain informed consent from the research participants.

16. Budget

  • This section should include a detailed commodity-wise and service-wise breakdown of the requested funds.
  • It should also include a justification for the utilization of each listed item.

17. Supplementary Support for the Project

  • This section should include information about the received funding and other anticipated funding for the specific project.

18. Collaboration With Other Researchers or Institutions

  • Every researcher or institute that has been a part of the research project must be mentioned in detail in this section of the research protocol.

19. Curriculum Vitae of All Investigators

  • The CVs of the principal investigator along with all the co-investigators should be attached with the research protocol.
  • Ideally, each CV should be limited to one page only, unless a full-length CV is requested.

20. Other Research Activities of Investigators

  • A list of all current research projects being conducted by all investigators must be listed here.

21. References

  • All relevant references should be mentioned and cited accurately in this section to avoid plagiarism.

How Do You Write a Research Protocol? (Research Protocol Example)

Main Investigator    

Number of Involved Centers (for multi-centric studies)

Indicate the reference center

Title of the Study

Protocol ID (acronym)

Keywords (up to 7 specific keywords)

Study Design

Mono-centric/multi-centric

Prospective/retrospective

Controlled/uncontrolled

Open-label/single-blinded or double-blinded

Randomized/non-randomized

n parallel branches/n overlapped branches

Experimental/observational

Endpoints (main primary and secondary endpoints to be listed)

Expected Results                                                

Analyzed Criteria

Main variables/endpoints of the primary analysis

Main variables/endpoints of the secondary analysis

Safety variables

Health Economy (if applicable)

Visits and Examinations

Therapeutic plan and goals

Visits/controls schedule (also with graphics)

Comparison to treatment products (if applicable)

Dose and dosage for the study duration (if applicable)

Formulation and strength of the studied drugs (if applicable)

Method of administration of the studied drugs (if applicable)

Informed Consent

Study Population

Short description of the main inclusion, exclusion, and withdrawal criteria

Sample Size

Estimated Duration of the Study

Safety Advisory

Classification Needed

Requested Funds

Additional Features (based on study objectives)
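For teams that track protocols electronically, the synopsis fields above can be mirrored in a small data structure. The sketch below is one possible shape in Python; the field names are assumptions drawn from the template, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProtocolSynopsis:
    """Illustrative machine-readable protocol synopsis."""
    title: str
    protocol_id: str                 # the acronym from the template
    main_investigator: str
    design: str                      # e.g. "multi-centric, randomized, double-blinded"
    keywords: list = field(default_factory=list)  # template allows up to 7

    def __post_init__(self):
        # Enforce the template's limit of 7 specific keywords.
        if len(self.keywords) > 7:
            raise ValueError("template allows at most 7 keywords")

syn = ProtocolSynopsis(
    title="Example trial",
    protocol_id="EXTR",
    main_investigator="A. Researcher",
    design="mono-centric, prospective, randomized",
    keywords=["example", "protocol"],
)
```

Keeping the synopsis structured makes it trivial to validate (e.g. the keyword limit) and to generate registry submissions from the same source.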


Be prepared for your clinical research by writing a detailed research protocol. Follow the structure outlined above to write a clear and impactful protocol. All the best!


Case Western Reserve University

  • Research Data Lifecycle Guide

Data Collection

Data collection is the process of gathering and measuring information used for research. Collecting data is one of the most important steps in the research process, and is part of all disciplines, including the physical and social sciences, humanities, and business. Data comes in many forms and can be stored and recorded in different ways, whether written in a lab notebook or recorded digitally on a computer system.

While methods may differ across disciplines, good data management processes begin with accurately and clearly describing the information recorded, the process used to collect the data, and the practices that ensure data quality, and end with sharing the data to enable reproducibility. This section breaks down the different topics that need to be addressed while collecting and managing data for research.
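One common way to accurately and clearly describe the information recorded is a data dictionary (codebook) that gives each variable a label, type, and units. A minimal sketch, with illustrative variable names and units:

```python
# Hypothetical data dictionary: for each recorded variable, note its
# meaning, type, and units so others can interpret the data file.
DATA_DICTIONARY = {
    "temp_c":  {"label": "Sample temperature",  "type": "float", "units": "degrees Celsius"},
    "site_id": {"label": "Collection site code", "type": "str",   "units": None},
    "reading": {"label": "Instrument reading",   "type": "float", "units": "mg/L"},
}

def describe(var):
    """Return a one-line human-readable description of a variable."""
    d = DATA_DICTIONARY[var]
    units = f" ({d['units']})" if d["units"] else ""
    return f"{var}: {d['label']}{units}, {d['type']}"
```

A data dictionary shipped with the dataset is often the single biggest aid to reproducibility, since column names alone rarely convey units or coding.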

Learn more about what’s required for data collection as a researcher at Case Western Reserve University. 

Ensuring Accurate and Appropriate Data Collection

Accurate data collection is vital to ensure the integrity of research. It is important, when planning and executing a research project, to consider the methods of data collection and storage so that the results can be used for publications and reporting. The consequences of improper data collection include:

  • inability to answer research questions accurately
  • inability to repeat and validate the study
  • distorted findings resulting in wasted resources
  • misleading other researchers to pursue fruitless avenues of investigation
  • compromising decisions for public policy
  • causing harm to human participants and animal subjects

While the degree of impact from inaccurate data may vary by discipline, there is a potential to cause disproportionate harm when data is misrepresented and misused. This includes fraud or scientific misconduct.

Any data collected in the course of your research should follow RDM best practices to ensure accurate and appropriate data collection. This includes, as appropriate, developing data collection protocols and processes that ensure inconsistencies and other errors are caught and corrected in a timely manner.
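Such protocol-level checks can be as simple as automated range and completeness tests run as records arrive. A sketch, with assumed variables and valid ranges:

```python
# Hypothetical valid ranges for two collected variables.
RANGES = {"age_years": (0, 120), "systolic_bp": (60, 260)}

def check_record(record, ranges=RANGES):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    for var, (lo, hi) in ranges.items():
        value = record.get(var)
        if value is None:
            problems.append(f"{var}: missing")
        elif not (lo <= value <= hi):
            problems.append(f"{var}: {value} outside [{lo}, {hi}]")
    return problems

# A clean record passes; a typo (340 instead of 34) is caught immediately.
check_record({"age_years": 34, "systolic_bp": 118})   # -> []
check_record({"age_years": 340, "systolic_bp": 118})  # -> ["age_years: 340 outside [0, 120]"]
```

Running such checks at entry time, rather than at analysis time, is what makes corrections "timely": the source of an error is still known and fixable.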

Examples of Research Data

Research data is any information that has been collected, observed, generated or created in association with research processes and findings.

Much research data is digital in format, but research data can also extend to non-digital formats such as laboratory notebooks, diaries, or written responses to surveys. Examples may include (but are not limited to):

  • Excel spreadsheets that contain instrument data
  • Documents (text, Word) containing study results
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audiotapes, videotapes
  • Photographs, films
  • Protein or genetic sequences
  • Test responses
  • Slides, artifacts, specimens, samples
  • Collection of digital objects acquired and generated during the process of research
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Source code used in application development

To ensure reproducibility of experiments and results, be sure to include and document information such as: 

  • Methodologies and workflows
  • Standard operating procedures and protocols

Data Use Agreements 

When working with data, it is important to understand any restrictions that apply due to the sensitivity of the data. This includes how you download and share data with other collaborators, and how it must be properly secured.

Datasets can include potentially sensitive data that needs to be protected, not openly shared. In this case, the dataset cannot be shared or downloaded without permission from CWRU Research Administration and may require an agreement between collaborators and their institutions. All parties will need to abide by the agreement terms, including the destruction of data once the collaboration is complete.

Storage Options 

UTech provides cloud and on-premise storage to support the university research mission. This includes Google Drive , Box , Microsoft 365 , and various on-premise solutions for high speed access and mass storage. A listing of supported options can be found on UTech’s website .

In addition to UTech-supported storage solutions, CWRU also maintains an institutional subscription to OSF (Open Science Framework) . OSF is a cloud-based data storage, sharing, and project collaboration platform that connects to many other cloud services like Drive, Box, and Github to amplify your research and data visibility and discoverability. OSF storage is functionally unlimited.

When selecting a storage platform it is important to understand how you plan to analyze and store your data. Cloud storage provides the ability to store and share data effortlessly and provides capabilities such as revisioning and other means to protect your data. On-premise storage is useful when you have large storage demands and require a high speed connection to instruments that generate data and systems that process data. Both types of storage have their advantages and disadvantages that you should consider when planning your research project.

Data Security

Data security is a set of processes and ongoing practices designed to protect information and the systems used to store and process data. This includes computer systems, files, databases, applications, user accounts, networks, and services on institutional premises, in the cloud, and remotely at the location of individual researchers. 

Effective data security takes into account the confidentiality, integrity, and availability of the information and its use. This is especially important when data contains personally identifiable information, intellectual property, trade secrets, and or technical data supporting technology transfer agreements (before public disclosure decisions have been made).

Data Categorization 

CWRU uses a 3-tier system to categorize research data based on information types and sensitivity. Determination is based upon risk to the University in the areas of confidentiality, integrity, and availability of data in support of the University's research mission. In this context, confidentiality measures the extent to which information can be disclosed to others, integrity is the assurance that the information is trustworthy and accurate, and availability is a guarantee of reliable access to the information by authorized users.

Information (or data) owners are responsible for determining the impact levels of their information, i.e. what happens if the data is improperly accessed or lost accidentally, implementing the necessary security controls, and managing the risk of negative events including data loss and unauthorized access.

Loss, corruption, or inappropriate access to information can interfere with CWRU's mission, interrupt business and damage reputations or finances. 

Securing Data

The classification of data requires certain safeguards or countermeasures, known as controls, to be applied to systems that store data. This can include restricting access to the data, detecting unauthorized access, preventative measures to avoid loss of data, encrypting the transfer and storage of data, keeping the system and data in a secure location, and receiving training on best practices for handling data. Controls are classified according to their characteristics, for example:

  • Physical controls e.g. doors, locks, climate control, and fire extinguishers;
  • Procedural or administrative controls e.g. policies, incident response processes, management oversight, security awareness and training;
  • Technical or logical controls e.g. user authentication (login) and logical access controls, antivirus software, firewalls;
  • Legal and regulatory or compliance controls e.g. privacy laws, policies and clauses.
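As one concrete example of a technical control, a checksum lets you detect whether a file was corrupted or altered in transit or storage. A small illustrative sketch using Python's standard library:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Record a digest when the file is created...
original = b"participant_id,measurement\nP001,4.2\n"
recorded_digest = sha256_of(original)

# ...then, after the file has been moved or copied, verify it.
unchanged = sha256_of(original) == recorded_digest            # True
tampered = sha256_of(original + b"X") == recorded_digest      # False: alteration detected
```

In practice, checksums complement rather than replace access controls and encryption: they detect a problem but do not prevent it.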

Principal Investigator (PI) Responsibilities

The CWRU Faculty Handbook provides guidelines for PIs regarding the custody of research data. This includes, where applicable, appropriate measures to protect confidential information. It is everyone’s responsibility to ensure that our research data is kept securely and available for reproducibility and future research opportunities.

University Technology provides many services and resources related to data security including assistance with planning and securing data. This includes processing and storing restricted information used in research. 

Data Collected as Part of Human Subject Research 

To ensure the privacy and safety of individuals participating in a human subject research study, additional rules and processes are in place that describe how one can use and disclose the data collected. The Office of Research Administration provides information relevant to conducting this type of research. This includes:

  • Guidance on data use agreements and processes for agreements that involve human-related data or human-derived samples coming in or going out of CWRU.
  • Compliance with human subject research rules and regulations.

According to 45 CFR 46 , a human subject is "a living individual about whom an investigator (whether professional or student) conducting research:

  • Obtains information or biospecimens through intervention or interaction with the individual, and uses, studies, or analyzes the information or biospecimens; or
  • Obtains, uses, studies, analyzes, or generates identifiable private information or identifiable biospecimens."

The CWRU Institutional Review Board reviews social science/behavioral studies, and low-risk biomedical research not conducted in a hospital setting for all faculty, staff, and students of the University. This includes data collected and used for human subjects research. 

Research conducted in a hospital setting including University Hospitals requires IRB protocol approval.

Questions regarding the management of human subject research data should be addressed to the CWRU Institutional Review Board .

Getting Help With Data Collection

If you are looking for datasets and other resources for your research you can contact your subject area librarian for assistance.

  • Kelvin Smith Library

If you need assistance with administrative items such as data use agreements or finding the appropriate storage solution please contact the following offices.

  • Research Administration
  • UTech Research Computing
  • Information Security Office

Guidance and Resources

  • Information Security Policy
  • Research Data Protection
  • CWRU Faculty Handbook
  • CWRU IRB Guidance

Collect Data

Data will be collected and subsequently analyzed during the Manage stage, using the protocol for collecting data developed in the Plan/Propose stage and the processes and technical resources established during the Setup stage. Strict adherence to the data collection protocol design is critical to ensuring that the data collected, as well as the results of their analysis, can be validated.

  • The Principal Investigator (PI) is responsible for the overall conduct (including administration and compliance) and results of the research, including the collection of data.
  • Members of the study team who will be involved in any aspect of data collection are each responsible for observing best research, administrative and compliance practices, appropriate to their role in the project.
  • A statistician or other data analyst on the study team may be involved in monitoring data collection.
  • Clinical studies may engage a Data Safety and Monitoring Board (DSMB) responsible for periodically monitoring the data collection protocol.
  • Information technology professionals may be involved in providing support for hardware and software used for data collection, according to best technical practices.


© 2024 University of Washington | Seattle, WA


Data Collection – Methods, Types, and Examples


Data Collection

Definition:

Data collection is the process of gathering information from various sources to analyze and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation.

In order for data collection to be effective, it is important to have a clear understanding of what data is needed and what the purpose of the data collection is. This can involve identifying the population or sample being studied, determining the variables to be measured, and selecting appropriate methods for collecting and recording data.

Types of Data Collection

Types of Data Collection are as follows:

Primary Data Collection

Primary data collection is the process of gathering original and firsthand information directly from the source or target population. This type of data collection involves collecting data that has not been previously gathered, recorded, or published. Primary data can be collected through various methods such as surveys, interviews, observations, experiments, and focus groups. The data collected is usually specific to the research question or objective and can provide valuable insights that cannot be obtained from secondary data sources. Primary data collection is often used in market research, social research, and scientific research.

Secondary Data Collection

Secondary data collection is the process of gathering information from existing sources that have already been collected and analyzed by someone else, rather than conducting new research to collect primary data. Secondary data can be collected from various sources, such as published reports, books, journals, newspapers, websites, government publications, and other documents.

Qualitative Data Collection

Qualitative data collection is used to gather non-numerical data such as opinions, experiences, perceptions, and feelings, through techniques such as interviews, focus groups, observations, and document analysis. It seeks to understand the deeper meaning and context of a phenomenon or situation and is often used in social sciences, psychology, and humanities. Qualitative data collection methods allow for a more in-depth and holistic exploration of research questions and can provide rich and nuanced insights into human behavior and experiences.

Quantitative Data Collection

Quantitative data collection is used to gather numerical data that can be analyzed using statistical methods. This data is typically collected through surveys, experiments, and other structured data collection methods. Quantitative data collection seeks to quantify and measure variables, such as behaviors, attitudes, and opinions, in a systematic and objective way. This data is often used to test hypotheses, identify patterns, and establish correlations between variables. Quantitative data collection methods allow for precise measurement and generalization of findings to a larger population. It is commonly used in fields such as economics, psychology, and the natural sciences.
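Quantitative analyses of such data often start with simple summaries, for example a Pearson correlation between two measured variables. A pure-Python sketch with made-up data (in practice a statistics package would be used):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical survey data: study time vs. test score.
hours_studied = [1, 2, 3, 4, 5]
test_score    = [52, 55, 61, 68, 70]
r = pearson_r(hours_studied, test_score)  # strong positive correlation
```

Note that correlation established this way describes association only; the discussion of experiments above is what licenses causal claims.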

Data Collection Methods

Data Collection Methods are as follows:

Surveys

Surveys involve asking questions to a sample of individuals or organizations to collect data. Surveys can be conducted in person, over the phone, or online.

Interviews

Interviews involve a one-on-one conversation between the interviewer and the respondent. Interviews can be structured or unstructured and can be conducted in person or over the phone.

Focus Groups

Focus groups are group discussions that are moderated by a facilitator. Focus groups are used to collect qualitative data on a specific topic.

Observation

Observation involves watching and recording the behavior of people, objects, or events in their natural setting. Observation can be done overtly or covertly, depending on the research question.

Experiments

Experiments involve manipulating one or more variables and observing the effect on another variable. Experiments are commonly used in scientific research.

Case Studies

Case studies involve in-depth analysis of a single individual, organization, or event. Case studies are used to gain detailed information about a specific phenomenon.

Secondary Data Analysis

Secondary data analysis involves using existing data that was collected for another purpose. Secondary data can come from various sources, such as government agencies, academic institutions, or private companies.

How to Collect Data

The following are some steps to consider when collecting data:

  • Define the objective : Before you start collecting data, you need to define the objective of the study. This will help you determine what data you need to collect and how to collect it.
  • Identify the data sources : Identify the sources of data that will help you achieve your objective. These sources can be primary sources, such as surveys, interviews, and observations, or secondary sources, such as books, articles, and databases.
  • Determine the data collection method : Once you have identified the data sources, you need to determine the data collection method. This could be through online surveys, phone interviews, or face-to-face meetings.
  • Develop a data collection plan : Develop a plan that outlines the steps you will take to collect the data. This plan should include the timeline, the tools and equipment needed, and the personnel involved.
  • Test the data collection process: Before you start collecting data, test the data collection process to ensure that it is effective and efficient.
  • Collect the data: Collect the data according to the data collection plan you developed. Make sure you record the data accurately and consistently.
  • Analyze the data: Once you have collected the data, analyze it to draw conclusions and make recommendations.
  • Report the findings: Report the findings of your data analysis to the relevant stakeholders. This could be in the form of a report, a presentation, or a publication.
  • Monitor and evaluate the data collection process: After the data collection process is complete, monitor and evaluate the process to identify areas for improvement in future data collection efforts.
  • Ensure data quality: Ensure that the collected data is of high quality and free from errors. This can be achieved by validating the data for accuracy, completeness, and consistency.
  • Maintain data security: Ensure that the collected data is secure and protected from unauthorized access or disclosure. This can be achieved by implementing data security protocols and using secure storage and transmission methods.
  • Follow ethical considerations: Follow ethical considerations when collecting data, such as obtaining informed consent from participants, protecting their privacy and confidentiality, and ensuring that the research does not cause harm to participants.
  • Use appropriate data analysis methods : Use appropriate data analysis methods based on the type of data collected and the research objectives. This could include statistical analysis, qualitative analysis, or a combination of both.
  • Record and store data properly: Record and store the collected data properly, in a structured and organized format. This will make it easier to retrieve and use the data in future research or analysis.
  • Collaborate with other stakeholders : Collaborate with other stakeholders, such as colleagues, experts, or community members, to ensure that the data collected is relevant and useful for the intended purpose.
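The privacy and confidentiality point above is often implemented by pseudonymization: replacing direct identifiers with stable, non-reversible codes before analysis. A sketch only; the salt is hypothetical and would be stored separately under the project's approved procedure:

```python
import hashlib

SALT = b"project-specific-secret"   # hypothetical; kept out of the dataset

def pseudonym(identifier: str, salt: bytes = SALT) -> str:
    """Map a direct identifier to a short, stable, non-reversible code."""
    return hashlib.sha256(salt + identifier.encode()).hexdigest()[:12]

# The same participant always maps to the same code...
same = pseudonym("jane.doe@example.org") == pseudonym("jane.doe@example.org")
# ...while different participants get different codes.
different = pseudonym("jane.doe@example.org") != pseudonym("john.roe@example.org")
```

Stability of the code is what lets records from repeated visits be linked without storing names or emails in the analysis file.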

Applications of Data Collection

Data collection methods are widely used in different fields, including social sciences, healthcare, business, education, and more. Here are some examples of how data collection methods are used in different fields:

  • Social sciences : Social scientists often use surveys, questionnaires, and interviews to collect data from individuals or groups. They may also use observation to collect data on social behaviors and interactions. This data is often used to study topics such as human behavior, attitudes, and beliefs.
  • Healthcare : Data collection methods are used in healthcare to monitor patient health and track treatment outcomes. Electronic health records and medical charts are commonly used to collect data on patients’ medical history, diagnoses, and treatments. Researchers may also use clinical trials and surveys to collect data on the effectiveness of different treatments.
  • Business : Businesses use data collection methods to gather information on consumer behavior, market trends, and competitor activity. They may collect data through customer surveys, sales reports, and market research studies. This data is used to inform business decisions, develop marketing strategies, and improve products and services.
  • Education : In education, data collection methods are used to assess student performance and measure the effectiveness of teaching methods. Standardized tests, quizzes, and exams are commonly used to collect data on student learning outcomes. Teachers may also use classroom observation and student feedback to gather data on teaching effectiveness.
  • Agriculture : Farmers use data collection methods to monitor crop growth and health. Sensors and remote sensing technology can be used to collect data on soil moisture, temperature, and nutrient levels. This data is used to optimize crop yields and minimize waste.
  • Environmental sciences : Environmental scientists use data collection methods to monitor air and water quality, track climate patterns, and measure the impact of human activity on the environment. They may use sensors, satellite imagery, and laboratory analysis to collect data on environmental factors.
  • Transportation : Transportation companies use data collection methods to track vehicle performance, optimize routes, and improve safety. GPS systems, on-board sensors, and other tracking technologies are used to collect data on vehicle speed, fuel consumption, and driver behavior.

Examples of Data Collection

Examples of Data Collection are as follows:

  • Traffic Monitoring: Cities collect real-time data on traffic patterns and congestion through sensors on roads and cameras at intersections. This information can be used to optimize traffic flow and improve safety.
  • Social Media Monitoring : Companies can collect real-time data on social media platforms such as Twitter and Facebook to monitor their brand reputation, track customer sentiment, and respond to customer inquiries and complaints in real-time.
  • Weather Monitoring: Weather agencies collect real-time data on temperature, humidity, air pressure, and precipitation through weather stations and satellites. This information is used to provide accurate weather forecasts and warnings.
  • Stock Market Monitoring : Financial institutions collect real-time data on stock prices, trading volumes, and other market indicators to make informed investment decisions and respond to market fluctuations in real-time.
  • Health Monitoring : Medical devices such as wearable fitness trackers and smartwatches can collect real-time data on a person’s heart rate, blood pressure, and other vital signs. This information can be used to monitor health conditions and detect early warning signs of health issues.

Purpose of Data Collection

The purpose of data collection can vary depending on the context and goals of the study, but generally, it serves to:

  • Provide information: Data collection provides information about a particular phenomenon or behavior that can be used to better understand it.
  • Measure progress : Data collection can be used to measure the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Support decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions.
  • Identify trends : Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Monitor and evaluate : Data collection can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.

When to use Data Collection

Data collection is used when there is a need to gather information or data on a specific topic or phenomenon. It is typically used in research, evaluation, and monitoring and is important for making informed decisions and improving outcomes.

Data collection is particularly useful in the following scenarios:

  • Research: When conducting research, data collection is used to gather information on variables of interest to answer research questions and test hypotheses.
  • Evaluation: Data collection is used in program evaluation to assess the effectiveness of programs or interventions, and to identify areas for improvement.
  • Monitoring: Data collection is used in monitoring to track progress towards achieving goals or targets, and to identify any areas that require attention.
  • Decision-making: Data collection is used to provide decision-makers with information that can be used to inform policies, strategies, and actions.
  • Quality improvement: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Characteristics of Data Collection

Data collection can be characterized by several important characteristics that help to ensure the quality and accuracy of the data gathered. These characteristics include:

  • Validity: Validity refers to the accuracy and relevance of the data collected in relation to the research question or objective.
  • Reliability: Reliability refers to the consistency and stability of the data collection process, ensuring that the results obtained are consistent over time and across different contexts.
  • Objectivity: Objectivity refers to the impartiality of the data collection process, ensuring that the data collected is not influenced by the biases or personal opinions of the data collector.
  • Precision: Precision refers to the degree of accuracy and detail in the data collected, ensuring that the data is specific and accurate enough to answer the research question or objective.
  • Timeliness: Timeliness refers to the efficiency and speed with which the data is collected, ensuring that the data is collected in a timely manner to meet the needs of the research or evaluation.
  • Ethical considerations: Ethical considerations refer to the ethical principles that must be followed when collecting data, such as ensuring confidentiality and obtaining informed consent from participants.

Advantages of Data Collection

There are several advantages of data collection that make it an important process in research, evaluation, and monitoring. These advantages include:

  • Better decision-making: Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions, leading to better decision-making.
  • Improved understanding: Data collection helps to improve our understanding of a particular phenomenon or behavior by providing empirical evidence that can be analyzed and interpreted.
  • Evaluation of interventions: Data collection is essential in evaluating the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Identifying trends and patterns: Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Increased accountability: Data collection increases accountability by providing evidence that can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.
  • Validation of theories: Data collection can be used to test hypotheses and validate theories, leading to a better understanding of the phenomenon being studied.
  • Improved quality: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Limitations of Data Collection

While data collection has several advantages, it also has some limitations that must be considered. These limitations include:

  • Bias: Data collection can be influenced by the biases and personal opinions of the data collector, which can lead to inaccurate or misleading results.
  • Sampling bias: Data collection may not be representative of the entire population, resulting in sampling bias and inaccurate results.
  • Cost: Data collection can be expensive and time-consuming, particularly for large-scale studies.
  • Limited scope: Data collection is limited to the variables being measured, which may not capture the entire picture or context of the phenomenon being studied.
  • Ethical considerations: Data collection must follow ethical principles to protect the rights and confidentiality of the participants, which can limit the type of data that can be collected.
  • Data quality issues: Data collection may result in data quality issues such as missing or incomplete data, measurement errors, and inconsistencies.
  • Limited generalizability: Data collection may not be generalizable to other contexts or populations, limiting the generalizability of the findings.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer



Data Collection Methods | Step-by-Step Guide & Examples

Published on 4 May 2022 by Pritha Bhandari.

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Step 1: Define the aim of your research

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement: what is the practical or scientific issue that you want to address, and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analysed through statistical methods .
  • Qualitative data is expressed in words and analysed through interpretations and categorisations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data.

If you have several aims, you can use a mixed methods approach that collects both types of data.

For example:

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.


Step 2: Choose your data collection method

Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research, and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Step 3: Plan your data collection procedures

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design .

Operationalisation

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalisation means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

For example, to operationalise managers’ leadership quality:

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness, and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.
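As a minimal illustration of operationalisation, the abstract concept can be reduced to a numeric score. The sketch below averages three hypothetical 5-point sub-scale ratings into one measure; all names and data are invented for the example, not taken from the guide.

```python
# Hypothetical sketch: operationalising "leadership quality" as the mean of
# three 5-point sub-scale ratings (delegation, decisiveness, dependability).

def leadership_score(ratings: dict) -> float:
    """Average the three sub-scale ratings into one operational measure."""
    subscales = ("delegation", "decisiveness", "dependability")
    return sum(ratings[s] for s in subscales) / len(subscales)

manager_a = {"delegation": 4, "decisiveness": 3, "dependability": 5}
print(leadership_score(manager_a))  # 4.0
```

Whether a simple mean is the right aggregation is itself an operationalisation decision that should be documented in the protocol.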

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and time frame of the data collection.
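A sampling plan can be made concrete and reproducible in code. The sketch below (population frame, names, and sizes are invented) draws a simple random sample without replacement, with a fixed seed so the draw can be reproduced:

```python
import random

# Illustrative sketch: drawing a simple random sample from a population frame.
population = [f"employee_{i}" for i in range(500)]  # hypothetical sampling frame

rng = random.Random(42)                 # fixed seed for reproducibility
sample = rng.sample(population, k=50)   # simple random sample, no replacement

print(len(sample), len(set(sample)))    # prints: 50 50
```

For other designs (stratified, cluster, systematic sampling), the same principle applies: record the procedure and seed so the sample can be audited.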

Standardising procedures

If multiple researchers are involved, write a detailed manual to standardise data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorise observations.

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.
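One way to standardise data collection across a team is to make the codebook executable, so every record passes the same checks before it enters the dataset. The field names and rules below are illustrative only, not part of the original guide:

```python
# Sketch of a standardisation rule: every researcher's records must pass the
# same validation before being accepted into the shared dataset.

CODEBOOK = {
    "participant_id": str,
    "rating": int,        # must fall in the range 1..5
    "department": str,
}

def validate(record: dict) -> bool:
    """Accept a record only if its fields, types, and ranges match the codebook."""
    if set(record) != set(CODEBOOK):
        return False
    if not all(isinstance(record[k], t) for k, t in CODEBOOK.items()):
        return False
    return 1 <= record["rating"] <= 5

print(validate({"participant_id": "p01", "rating": 4, "department": "sales"}))  # True
print(validate({"participant_id": "p02", "rating": 9, "department": "sales"}))  # False
```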

Creating a data management plan

Before beginning data collection, you should also decide how you will organise and store your data.

  • If you are collecting data from people, you will likely need to anonymise and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers).
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimise distortion.
  • You can prevent loss of data by having an organisation system that is routinely backed up.
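As one concrete anonymisation step, direct identifiers can be replaced with keyed one-way hashes before storage. This is only a sketch (the salt and field names are invented); real projects should follow their ethics protocol for de-identification.

```python
import hashlib

# Sketch: replace a direct identifier with a salted one-way hash so the
# stored record no longer contains the name. Keep the salt separate from
# the data, or linkage back to identities becomes trivial.
SALT = b"project-specific-secret"

def pseudonymise(identifier: str) -> str:
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()[:12]

record = {"name": "Jane Doe", "rating": 4}
safe = {"pid": pseudonymise(record["name"]), "rating": record["rating"]}
print(safe)  # the name no longer appears in the stored record
```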

Step 4: Collect the data

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

For example, the closed-ended questions in the employee survey ask participants to rate their manager’s leadership skills on scales from 1 to 5. The data produced is numerical and can be statistically analysed for averages and patterns.

To ensure that high-quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality.
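The reliability check mentioned above can be made concrete: Cronbach's alpha is one widely used statistic for multi-item quantitative scales. The sketch below uses invented ratings (rows are respondents, columns are scale items); values near 1 suggest the items measure the same construct consistently.

```python
# Illustrative sketch: Cronbach's alpha as a reliability check for a
# multi-item rating scale, computed from scratch in pure Python.

def cronbach_alpha(rows):
    k = len(rows[0])                              # number of items
    def var(xs):                                  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in rows]) for i in range(k)]
    total_var = var([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]
print(round(cronbach_alpha(ratings), 2))  # 0.96
```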

Frequently asked questions about data collection

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organisations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g., understanding the needs of your consumers or user testing your website).
  • You can control and standardise the process for high reliability and validity (e.g., choosing appropriate measurements and sampling methods ).

However, there are also some drawbacks: data collection can be time-consuming, labour-intensive, and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to test a hypothesis by systematically collecting and analysing data, while qualitative methods allow you to explore ideas and experiences in depth.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research , you also have to consider the internal and external validity of your experiment.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalise the variables that you want to measure.


Bhandari, P. (2022, May 04). Data Collection Methods | Step-by-Step Guide & Examples. Scribbr. Retrieved 11 March 2024, from https://www.scribbr.co.uk/research-methods/data-collection-guide/


Data collection protocols for VANETs: a survey

  • Survey and State of the Art
  • Open access
  • Published: 10 January 2022
  • Volume 8 , pages 2593–2622, ( 2022 )


  • Maryam Gillani 1 ,
  • Hafiz Adnan Niaz   ORCID: orcid.org/0000-0002-2020-417X 1 ,
  • Muhammad Umar Farooq 2 &
  • Ata Ullah 3  


We live in the era of Intelligent Transport Systems (ITS), which is an extension of Vehicular AdHoc Networks (VANETs). In VANETs, vehicles act as nodes connected with each other and sometimes with a public station. Vehicles continuously exchange and collect information to provide innovative transportation services; for example, traffic management, navigation, autonomous driving, and the generation of alerts. However, VANETs are extremely challenging for data collection, due to their high mobility and dynamic network topologies that cause frequent link disruptions and make path discovery difficult. In this survey, various state-of-the-art data collection protocols for VANETs are discussed, based on three broad categories, i.e., delay-tolerant, best-effort, and real-time protocols. A taxonomy is designed for data collection protocols for VANETs that adds precision and eases understanding. A detailed comparative analysis among various data collection protocols is provided to highlight their functionalities and features. Protocols are evaluated in three parametric phases. First, protocols are investigated based on six necessary parameters, including delivery ratio, drop ratio, efficiency, and recovery strategy. Second, a 4-D functional framework is designed to fit most data collection protocols for quick classification and mobility model identification, thus eradicating the need to read extensive literature. Last, in-depth categorical mapping is performed for deeper and more targeted interpretation. In addition, some open research challenges for ITS and VANETs are discussed to highlight research gaps. Our work can thus be employed as a quick guide for researchers to identify the technical relevance of data collection protocols of VANETs.


Introduction

VANETs are composed of vehicles with dynamic connections and rapidly changing, abrupt movements [ 1 ]. Different vehicles are connected in an ad hoc manner, with vehicles joining or leaving the group without generating any pre-indication [ 2 ]. A massive increase in vehicles has imposed significant challenges to fast, reliable and secure data collection [ 3 ]. Recently, vehicular data collection has attracted greater interest, due to increasing developments in Intelligent Transport Systems (ITS) [ 4 ]. VANETs are highly dependent on data exchanges, such as receiving data for traffic monitoring, notification of accidents, weather alerts, all aimed at maintaining a secure transportation system for the world [ 5 ]. VANETs integrate wireless LANs, ad hoc and cellular networks and differ from other ad hoc networks due to the hybrid nature of the network architecture, high node mobility and dynamic application scenarios.

Data collection is critical because vehicles are equipped with various sensors for speed, location, temperature and pressure estimation, as well as audio and video streaming support that demand intelligent data collection [ 7 , 8 ]. Vehicles can share the data using Vehicle to Vehicle (V2V) or Vehicle to Infrastructure (V2I) communications. Source and destination vehicles can also exchange data, as shown in Fig.  1 . The data collection process begins by taking data from various incorporated sensors and then performing aggregation and compression to reduce communication cost and time, depending upon the protocol and adapted scheme [ 9 ]. Data collection protocols are designed by considering both predictable movement patterns and abrupt variation in speed. Data is collected from distinct vehicles by involving intermediaries, where the shortest path is selected by means of routing protocols [ 10 ].

figure 1

VANETs architecture and data collection scenarios

ITS can process the data to control environmental pollution, traffic prediction, accident analysis and road congestion control [ 11 ]. Data collection schemes and protocols in VANETs perform their task through deployment in cell phones or embedded systems integrated into vehicles to monitor sensor readings, while regularly fulfilling data collection requirements. Collected data is transferred after pre-determined intervals or in real-time to the data centres via backbone networks for further analysis, processing and storage [ 12 ]. Vehicles may also communicate directly with the Roadside Unit (RSU), or share data via a base station, as illustrated in Fig.  1 .

Three kinds of data collection protocols exist in VANETs, i.e., Real-Time (RT) data collection protocols, Delay-Tolerant (DTN) and Best-Effort (BE) protocols. Real-time data collection protocols in VANETs are time-sensitive, and data must be collected and transmitted within a tolerable time-delay. However, real-time protocols are extremely sensitive to packet delays and packet loss. Therefore, efficient data dissemination is one of the most crucial elements to propel the desired flow of VANETs services, particularly related to real-time specifications. These protocols are essential in medical emergencies, security agencies, or when amassing sensitive military and defence-related information [ 13 ]. Real-time data collection is challenging due to the density of vehicles, dense road topologies, and high dependability requirements for a large set of intelligent applications, combined with cloud services [ 14 ]. Changes in traffic patterns, whether urban, rural or highway scenarios, also affect the functionality of real-time data collection protocols [ 15 ].

DTN-based data collection protocols in VANETs can manage delays in receiving and sending data within relaxed, pre-defined thresholds [ 16 ]. In the case of frequent network failures, DTNs can achieve better performance than real-time protocols [ 17 , 18 ]. DTNs are well suited to highly mobile and terrestrial environments, assuring delivery of data by means of automatic store-and-forward mechanisms. Immediate data forwarding is also possible in a DTN when the required resources are available. However, the timeliness of data transmission is sometimes affected in DTNs, which is acceptable for specific applications in VANETs, such as routine matters, weather prediction systems, audio/video streaming applications, underwater communication, and wildlife monitoring.

Best-effort protocols are designed to achieve the “best” possible attainable workload at a designated time, with a probability of violation at run time [ 19 ]. These kinds of protocols do not guarantee reliability or functionality, and offer no definite bounds on delivery time. The Internet Protocol (IP) is a best-effort delivery system, trying to reduce data loss as much as possible, though data loss is inevitable in exceptional cases, such as network hardware failure. Different packets may take different routes through the network and be subjected to random delays. However, these kinds of protocols still work for scenarios where data timeliness and reliability can afford to be somewhat compromised. Higher-layer protocols are usually used to add reliability and cost-effectiveness to take full advantage of the network’s capabilities.

Recent work over the past decade has considered new algorithms, protocol refinement and standardization, resulting in the IEEE 802.11p [ 20 ] and IEEE 1609 standards [ 21 ]. Due to the dynamic requirements of VANETs, an intelligent protocol for routing aware data collection is essential [ 22 , 23 , 24 ]. Routing involves the best path selection between source and destination. This affects timely data transmissions; optimizing data collection mechanisms alone does not guarantee timely delivery to the destination. Furthermore, routing can be expensive when two-way paths are set up among vehicles and the base station (BS) [ 25 ].

Optimized data collection is carried out using additional information, like vehicle location, direction and average road speeds, to find long-lived one-way data paths from data sources to the BS. In some cases, no data path exists between vehicles and the BS; this can be resolved using the ‘store, carry and forward’ strategy, where data resides at the link layer until delivered to the next hop towards the BS, as in a DTN approach.

Various state-of-the-art projects are based on data collections throughout the world [ 26 ]. These include The Car-2-Car Communication Consortium (C2C-CC) [ 27 ], which particularly focuses on improving road safety through the Cooperative Intelligent Transport System (C-ITS) [ 28 ]. Networks on Wheels (NOW) [ 29 ] adopts the same theme, while additionally considering security and Vehicle Infrastructure Integration (VII) [ 30 , 31 , 32 , 33 ]. Secure Vehicle Communication (SeVeCOM) [ 34 , 35 ], Internet Intelligent Transport System Consortium [ 36 , 37 ] and the Advanced Safety Vehicles Projects [ 38 , 39 , 40 ] are also among the highlighted projects.

The aim of this work is to explore, taxonomize, and discuss existing data acquisition techniques in VANETs. We are particularly interested in the communication and routing aspects of these techniques in DTN, best-effort, and real-time based VANET models. The motive is to provide an all-inclusive overview of data acquisition techniques to help users choose the right protocol. In the literature, numerous works focus on summarizing the data acquisition techniques for vehicular ad hoc networks. In [ 41 ], the authors investigate topology-based, cluster-based, location-based and fog-based data collection techniques and highlight important issues that need to be addressed in data collection protocols for VANETs. In [ 42 ], prediction-based protocols for vehicular ad hoc networks are summarized. The article highlights the effects of traffic conditions, driving conditions and urban layout on the predictability of vehicle locations. A thorough investigative study and taxonomy of clustering protocols for VANETs is presented in [ 43 ]. The authors also provide a comparison of different parameters, including stability, density and convergence, for a thorough understanding of VANET clustering algorithms. In [ 44 ], the authors provide a critical representation and taxonomy of Named Data Networking (NDN)-based data dissemination algorithms for VANETs. The authors provide a qualitative comparison on the basis of forwarding strategy, granularity, caching scheme, latency, etc.

However, one or more of the following limitations have been found in the existing works:

The existing surveys are not comprehensive, and do not cover all types of data collection techniques;

The existing surveys are not up to date and do not include recent works on data collection protocols;

Application-specific analysis of data collection approaches is not provided.

The existing surveys lack standard classification/evaluation criteria, and consequently omit application-specific analysis of data collection approaches.

In addition to addressing these limitations, this article makes the following contributions:

It provides more profound knowledge and summarizes diverse sets of current and/or important data collection protocols, while covering two decades of advancements made in the area of data collection protocols. Specifically, the work includes DTNs, Best-effort, and real-time data collection protocols for VANETs.

A detailed taxonomy is provided for data collection protocols in VANETs, with a four-step hierarchical division. This taxonomy enables researchers to quickly get up to speed without digging into details and lengthy discussions (Fig. 2 ).

An area-based critical analysis and comparison is performed by considering six fundamental parameters for relevant schemes.

A 4-D functional framework is designed for explaining and analysing data collection protocols under standard application requirements. The proposed 4-D functional framework critically covers deep underlying factors to analyse various protocols presented in each category. It also accommodates almost every protocol to fit the VANETs category, due to its flexible and adaptable design.

A detailed qualitative analysis is presented to guide readers on the merits and demerits of data collection protocols.

Some open research challenges are covered with respect to recent literature with a detailed dendrogram that can act as a guide to potential research areas for researchers looking for research gaps.

figure 2

Taxonomy of data collection protocols for VANETs

To assist readers, a list of acronyms is provided in the appendix of this work.

The remaining part of the survey is organized as follows: “Delay tolerant network protocols” provides details of delay-tolerant network protocols, with related subsections (Random and probability, Flooding and variance, Geo-location, Movement vector), followed by detailed summary and analysis in table form. “Best-effort protocols” gives details of best-effort protocols, along with subsections for hierarchical protocols and flat data acquisition protocols. Hierarchical protocols are further categorized based on predictive, infrastructure, distributed, and zone-based clustering forms. In subsequent subsections, we also provide sub-classifications of Infrastructure based clustering while covering multi-hop, MAC-assisted (contention-based and contention-free) forms, followed by a detailed summary and supporting analysis in table form. “Real-time protocols” presents real-time protocols, with subsections related to cluster-based and position-based protocols, followed by a detailed summary and deep analysis in table form. “4-D Functional analysis” presents a multi-dimensional model, named 4D functional analysis, that covers further categorical analysis based on various parameters, both in table and text-based format. “Open research challenges” discusses open research challenges and “Concluding notes” concludes the survey.

Figure  2 illustrates the taxonomy of data collection protocols for VANETs. There are three main categories: delay-tolerant, best-effort and real-time protocols. Delay-tolerant protocols are sub-divided into random, flooding, geolocation and movement-vector-based protocols. Best-effort protocols in the hierarchical category are sub-divided into predictive, infrastructure, distributed and zone-based clustering. Best-effort flat data collection protocols are sub-divided into reactive and proactive protocols. Real-time protocols are discussed and categorized as cluster-based and position-based protocols. The aforementioned categories and subdivisions, with associated examples, are briefly discussed in their respective sections.

Delay tolerant network protocols

Networks with inconsistent connectivity and non-consecutive end-to-end paths between nodes are named Delay-Tolerant Networks (DTNs) [ 45 ]. DTNs follow a “store-carry-forward” mechanism to manage frequent connectivity disruption. For example, suppose a node receives data when there is no connection available to transmit it: in that case, the message resides in a buffer until the node recognises a chance to disseminate it after establishing a connection. The question then arises: what is the best strategy for deciding how long to hold the data and when to forward it efficiently? For this purpose, dynamic and static network information can be used for decision making.
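The store-carry-forward mechanism can be sketched in a few lines of Python; the class, method, and message names below are illustrative, not taken from any cited protocol:

```python
# Minimal sketch of the DTN "store-carry-forward" idea: a node buffers a
# message while disconnected and replicates it to a peer on the next contact.

class DTNNode:
    def __init__(self, name):
        self.name = name
        self.buffer = []            # messages stored and carried while moving

    def receive(self, msg):
        self.buffer.append(msg)     # store

    def contact(self, other):
        """On an encounter, forward everything the peer does not hold yet."""
        for msg in self.buffer:
            if msg not in other.buffer:
                other.receive(msg)  # forward

a, b, c = DTNNode("a"), DTNNode("b"), DTNNode("c")
a.receive("accident-alert")
a.contact(b)        # a meets b: the message is replicated to b
b.contact(c)        # later, b meets c: the message reaches c
print(c.buffer)     # ['accident-alert']
```

Real protocols differ in exactly this `contact` decision: which messages to hand over, to which peers, and when to drop carried copies.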

Dynamic network information includes area, collision and vehicular information, while static network information contains social connectivity details among nodes [ 45 ]. Both types of information play a role in seeking the best node and time to transmit messages. In VANETs, links are periodically disconnected to save energy or frequently fail due to high mobility. Ad hoc networks are decentralized networks that do not require any pre-existing infrastructure [ 46 ]. The lack of fixed infrastructure contributes to new research problems, such as network configuration, device discovery, topology maintenance, ad hoc addressing and self-routing [ 47 ]. DTNs are sub-divided into various categories, discussed below.

Random and probabilistic protocols

In random forwarding protocols, nodes forward data packets to whichever other nodes contact them first. Forwarding of packets starts with a random search for the destination. Sometimes, packets keep moving within a specific set of nodes or may reach a dead end. Random forwarding is further divided into two categories, i.e., deterministic routing and stochastic DTN routing. In deterministic routing, knowledge of the current topology is extrapolated to future behaviour, and changes are predicted accordingly. Stochastic routing, on the other hand, is based on the unknown or random movement of nodes, in which few or no future predictions can be made. Packet distribution in this scenario is achieved through random, hop-by-hop movements with a certain probability of reaching the destination; i.e., there is no guaranteed delivery.

In random probabilistic protocols, data is sent to the hosts in a specific order. This particular order is taken into account for hop counts and data delivery, based on previous encounters. Maurice et al. [ 48 ] proposed a Probabilistic Bundle Relaying Scheme (PBRS) for a two-hop vehicular DTN. This scheme probabilistically determines the carrying capacity of each vehicle according to the suitability of transmitted data packets. Data Bundles are given to the current vehicles only if doing so contributes to reducing the mean transit delay. PBRS is functional with minimal knowledge of the network.

Nidhi et al. [ 49 ] proposed a probabilistic relaying scheme (PRS) that extends PBRS [ 48 ] to multi-copy vehicular DTNs. In PRS, multiple copies are generated to increase the packet delivery ratio, giving advantages over multi-hop protocols: transmitting multiple copies increases data consumption, but also the packet delivery ratio. Anders et al. proposed a Probability Routing Protocol using History of Encounters and Transitivity (PRoPHET) [ 50 ]. This is a probabilistic routing protocol in which data from past encounters is utilized to optimize the packet delivery ratio; better performance is attained by determining the next suitable hops for given packets. Distance-based PRoPHET (DiPRoPHET) [ 51 ] enhances PRoPHET’s delivery ratio, and addresses message delivery delay, through a cross-layer process that retrieves a distance value from the lower layer and uses it in the upper layer.
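As a concrete illustration of history-based probabilistic routing, the core update rules of PRoPHET [ 50 ] can be sketched as follows. The parameter values are the defaults suggested by its authors; this is a simplified sketch, not a full implementation:

```python
# Tunable PRoPHET parameters (defaults suggested by the protocol's authors).
P_INIT, GAMMA, BETA = 0.75, 0.98, 0.25

def on_encounter(p_ab):
    """Direct update: when nodes a and b meet, a's delivery
    predictability for b increases towards 1."""
    return p_ab + (1.0 - p_ab) * P_INIT

def age(p_ab, k):
    """Aging: predictability decays over k time units without contact."""
    return p_ab * (GAMMA ** k)

def transitive(p_ac, p_ab, p_bc):
    """Transitivity: if a often meets b, and b often meets c,
    then a gains some predictability for c."""
    return p_ac + (1.0 - p_ac) * p_ab * p_bc * BETA

p = on_encounter(0.0)
print(p)  # 0.75 after the first encounter
```

A carrier then forwards a message to a contacted node only if that node's predictability for the destination exceeds the carrier's own.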

Flooding and variance

Flooding is a strategy in which data packets keep replicating through a sufficient number of nodes until destination nodes receive them. Network knowledge can be utilized to select a subset of nodes to create a path and reach the destination. Each node is supposed to retain a copy or multiple copies of each message to transmit opportunistically. However, flooding causes network congestion and reduces the message delivery ratio. It also increases competition for network resources like bandwidth and storage.

Flooding can be classified into Single-Copy (SC) and Multi-Copy (MC) methods. In the single-copy form, a single data packet in the network is forwarded by various nodes, whereas in multi-copy, replicated data packets are forwarded through contact-based sharing [ 52 ]. Guizhu et al. [ 53 ] proposed a Transfer Utility of Node’s Buffer Scheduling Strategy (BSP) to forward multiple copies dynamically. Amrita et al. proposed a Seasonality Aware Social (SAS) [ 54 ] forwarding technique that focuses on controlled forwarding through modelling contact history between node-pairs. It focuses on the weighted similarity index through repetitive contact patterns in real mobility traces via direct connections.
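The difference between single-copy and multi-copy forwarding can be made concrete with a minimal sketch; buffers are plain Python lists, and the function names are illustrative rather than taken from any cited scheme:

```python
def forward_single_copy(msg, sender_buf, receiver_buf):
    """Single-copy: custody of the message moves; no duplicate remains."""
    sender_buf.remove(msg)
    receiver_buf.append(msg)

def forward_multi_copy(msg, contact_bufs, max_copies):
    """Multi-copy: replicate the message to up to max_copies contacted
    nodes; the sender keeps its own copy for later opportunities."""
    for peer_buf in contact_bufs[:max_copies]:
        peer_buf.append(msg)

# Multi-copy: the source still holds "m" after replicating it.
src, n1, n2 = ["m"], [], []
forward_multi_copy("m", [n1, n2], max_copies=2)

# Single-copy: the sender's buffer no longer holds "x" after handover.
s, d = ["x"], []
forward_single_copy("x", s, d)
```

The trade-off noted in the text is visible here: multi-copy raises the chance of delivery at the cost of storing and transmitting duplicates.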

Geolocation-based protocols

Geolocation-based protocols are suitable when the source knows the coordinates of the destination node [ 6 , 55 ].

Pierpaolo et al. [ 56 ] proposed a Floating Car Data (FCD) collection protocol for urban scenarios to provide connectivity through Dedicated Short Range Communication (DSRC) and cellular communication, such as Long-Term Evolution (LTE) offloading. Vehicles with onboard LTE radio modules are considered as collecting vehicles, and FCD can collect data from them directly over individual LTE channels. In this scenario, delays increase during dense cellular traffic. Moreover, Cooperative Awareness Messages (CAM) are periodically exchanged to report vehicle mobility and interaction scenarios.

Rui et al. [ 57 ] proposed a Gateway Location Awareness (GLA) scheme. This is a location-aware ranking classification that chooses vehicles with a higher tendency to forward information within a short span, while interpreting nodes according to moving patterns. GLA is combined with Aging Social Aware Ranking (ASAR) for improved performance. ASAR additionally allows the selection of a vehicle with more frequent connections, rather than selecting one with little or no connectivity with corresponding vehicles. This hybrid approach maximizes the data delivery with a lower data overhead.

Bilgin et al. proposed a Data Collection Mechanism for smart grids using Public Transportation Buses (DCMPTB) [ 58 ], built around smart meters and smart grid communication systems. This protocol utilizes I2V and V2V: data is transferred from smart meters to public buses (I2V) and then carried from one bus stop to the next through V2V. The source already knows the coordinates of the destination node; i.e., buses know the next bus stop.

Vasco et al. originated the idea of GeoSpray [ 59 ], a hybrid approach designed for both single- and multiple-copy requirements. It follows asynchronous communication with the store-carry-forward mechanism. To exploit alternative paths, GeoSpray starts with a limited number of copies and then switches to a forwarding scheme to take the best possible advantage of all the vehicles in contact. In other words, GeoSpray combines two schemes in one design to gain maximum benefit.

Movement vector

Movement vectors specify the speed and direction of movement of a vehicle, and are shared by vehicles to update their current location. Position-based routing protocols use them to choose the shortest paths with low delays. They also help to decide on path re-establishment and on whether data packets should be replicated, according to link characteristics and vehicle mobility: packets are replicated if a neighbour moves with high velocity and is close to leaving the group in a particular region, and are not replicated when vehicles are moving in the same direction towards the destination.
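A minimal sketch of this replication rule follows; the speed threshold and heading tolerance are assumed values for illustration, not taken from any cited protocol:

```python
def should_replicate(neighbor_speed, neighbor_heading, self_heading,
                     speed_threshold=25.0, heading_tolerance=30.0):
    """Replication decision based on movement vectors (illustrative
    thresholds): do not replicate when both vehicles head the same way;
    replicate when a fast neighbour is diverging and likely to leave
    the region."""
    diff = abs(neighbor_heading - self_heading) % 360
    diff = min(diff, 360 - diff)        # wrap-around angle difference
    if diff <= heading_tolerance:
        return False                    # same direction: one carrier suffices
    return neighbor_speed >= speed_threshold

print(should_replicate(30.0, 180.0, 0.0))  # True: fast, diverging neighbour
print(should_replicate(30.0, 350.0, 0.0))  # False: effectively same heading
```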

Hyunwoo et al. proposed History-based Vector Routing (HVR) [ 60 ], which allows each node to maintain the vector information of other encountered nodes and share this information with other nodes. Utilizing this historic information, nodes predict the location of each packet’s destination to achieve accurate forwarding. Zhaojun et al. proposed the Pass and Run protocol [ 61 ], which is specifically designed to protect the privacy of communication in DTNs. This protocol is tracking-resistant (i.e., does not allow tracking) and works by addressing the vehicle location and considering the driving patterns and history of vehicles to prevent misuse of information. Pass and Run uses greedy and random strategies to decide whether to submit a data packet to the RSU or to transfer it to the next vehicle.

A summary analysis of the aforementioned DTN protocols is given in Table 1 , based on parameters like end-to-end delay, average forwarded messages, packet delivery ratio, packet drop ratio, recovery strategy, and effect of traffic density. End-to-end delay is the time taken by a data packet to traverse from source to destination; for a packet of length L sent over N links with transmission rate R , it is calculated as d  =  N * L / R . ‘Average forwarded messages’ is defined as the number of messages forwarded at a given time to deliver a data packet. ‘Packet delivery ratio’ is the number of packets received at the destination divided by the number of packets initiated from the source. ‘Packet drop ratio’ is the number of packets lost in transit divided by the number of packets sent from the source, i.e., one minus the packet delivery ratio. ‘Recovery strategy’ refers to the capability of the protocol to respond to unpredictable failure or collapse; in other words, whether a ‘plan B’ exists to deal with accidental and unusual scenarios. The effect of traffic density on a protocol’s performance is another integral aspect in the analysis of performance and the success ratio; traffic density indicates the number of vehicles present and interacting at a given time and location (road segment).
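These delay and ratio metrics can be computed directly; note that the drop ratio here is taken as the complement of the delivery ratio (lost packets over sent packets):

```python
def end_to_end_delay(n_links, packet_bits, rate_bps):
    """d = N * L / R: transmission delay of a packet of L bits
    over N store-and-forward links at R bits per second."""
    return n_links * packet_bits / rate_bps

def delivery_ratio(received, sent):
    """Packets received at the destination over packets initiated."""
    return received / sent

def drop_ratio(received, sent):
    """Packets lost in transit over packets sent (= 1 - delivery ratio)."""
    return (sent - received) / sent

print(end_to_end_delay(3, 8000, 1e6))  # 0.024 (seconds)
print(delivery_ratio(90, 100))         # 0.9
print(drop_ratio(90, 100))             # 0.1
```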

From Table 1 , it can be seen that low end-to-end delay tends to coincide with a high number of forwarded messages, a better packet delivery ratio and a low packet drop ratio; a high packet delivery ratio thus indicates better overall performance. There are exceptions, however: Pass and Run [ 61 ] has a high packet drop ratio and only a medium packet delivery rate despite a high number of average forwarded messages, showing that different design constraints can yield moderate results. Although the Pass and Run protocol tries to keep delay to a minimum by transmitting data packets to the nearest nodes, it cannot assure the sender that the packet will reach the destination in due course, because vehicle paths are unpredictable. DTNs are thus better suited to weather prediction systems, underwater communication and wildlife monitoring, where some delay is tolerable and the value of the data and time are equally important; i.e., slight delays are fine as long as data is securely received in a cost-effective way.

Best-effort protocols

Best-effort protocols try to achieve the best attainable workload per given time, with a probability of violation at run time [ 62 ]. These protocols seek to maximize application benefits by meeting most of the requirements. They do not claim complete reliability, which is supposed to be provided by higher-layer protocols, but aim to deliver packets to a destination within designated time constraints.

Hierarchical protocols

Hierarchical protocols are distributed at multiple levels of clustering, along with sub-groups [ 43 , 63 ]. They tend to manage their assigned tasks individually: data forwarded to one group is not necessarily given to another group. These protocols are applicable for wide-area grouping of vehicles, where standalone sub-groups can also be set up, each with a Cluster Head (CH). Each CH exchanges information with member vehicles in a hierarchy.

Predictive clustering

Predictive clustering utilizes the recent geographic positions, specific interests and predicted future behaviour of vehicles to structure a cluster [ 64 ]. Clusters and cluster-head selection are based on predictable movements of vehicles [ 65 ], assigning priorities for controlled access of cluster formation. Although vehicles keep on changing their positions, they are somewhat detectable due to the routes of roads [ 66 ]. Saliha et al. proposed Fitness Clustering [ 67 ] based on rapid and real heuristics. It primarily targets data dissemination in emergency cases.

In Fitness Clustering, the original message is optimized to reduce the number of exchanged packets. It focuses on making clustering stable by considering parameters like transmission period, degree of connectivity, relative velocity and the lifetime validity of the link. Islam et al. proposed a Prediction-Based Efficient Clustering Scheme (MPECS) [ 68 ] that uses a Voronoi Diagram to divide the area into distinct regions and then allows every vehicle to estimate its own longevity and its cost as cluster head in its current area. This technique evaluates the vehicle’s impact on clustering stability and cost, so that longer cluster stability is achieved with minimal overhead and cost.

Intelligence-based clustering

Intelligence-based clustering maintains the hierarchy to eradicate unbalanced cluster formation by utilizing machine learning and artificial intelligence [ 69 , 70 ]. In these protocols, CH election is mainly performed after cluster formation [ 71 ]. Neeraj et al. proposed an Agent-Learning-based Clustering Algorithm (ALCA) [ 72 ] to eradicate the issues related to high density, random mobility and finding an exact route. Agents learn from the deployed environment, where neighbouring agents also collaborate for information sharing, and estimation of vehicles is maintained through clustering. The CH is elected through node density and the direction of vehicle mobility.

Reward or penalty functions are suggested based on various parameters, such as the agent's ID, action set, learning rate, and learning factor. Learning agents decide to increment or decrement the parametric values through these functions until maximum values are achieved. Manisha et al. proposed an intelligent forwarding–based stable and reliable data dissemination scheme (IF-BS) [ 73 ]. IF-BS works intelligently to let vehicles decide the next forwarding node by considering the stability of connecting edges and waiting-time metrics. When the next node is assured from source to destination, less link disruption occurs, and more data delivery is assured.

Position based clustering

In position-based Clustering, position coordinates of vehicles and CHs are the main consideration for clustering. Cluster structure depends on the vehicle's geographic positioning, and its CH is elected based on priorities associated with vehicle requirements [ 74 ]. Stable CHs are preferred, and their stability is evaluated through various factors, including a more extended trip of the vehicle, high speed and proximity to the base station (BS) [ 75 ]. These clusters can manage the rapid movement of vehicles and are considered a key clustering protocol for VANETs.

Ismail et al. [ 76 ] proposed a geographical Clustered Gathering Protocol (CGP) where the CH performs data collection, aggregation and dissemination. It then transfers this data to a sink or BS. CGP uses an opportunistic approach, with a store-and-forward mechanism that is used when the next road segment is empty, and the vehicle has to wait for the CH to come closer to it. CGP works on single-way communication and is applicable to single and straightforward road topologies. However, it is not easy to manage for more extensive and complex regions. Cluster management overhead is also a constraint associated with CGP.

Position-based routing can be managed efficiently by utilizing RSUs, BSs and smart vehicles with internet access to reach servers directly, reducing the communication overheads associated with V2V messaging. Shahab et al. proposed Probabilistic Direction Aware Cooperative Collision Avoidance (P-DACCA) [ 77 ], which estimates the probability of a collision based on the expected state of nodes. Through this calculation, an early warning is generated when the probability exceeds a predefined threshold. This not only avoids upcoming threats of collisions and reduces their number, but also reduces communication overhead and gives low latency.
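The threshold-based early warning described for P-DACCA can be sketched as follows, using a deliberately simplified constant-speed position prediction; the horizon and safety gap are illustrative assumptions, not values from [ 77 ]:

```python
def predicted_gap(x1, v1, x2, v2, horizon):
    """Smallest gap (in metres) between two vehicles over the prediction
    horizon (in seconds), assuming constant speeds; a simplification of
    P-DACCA's expected-state model."""
    gaps = [abs((x1 + v1 * t) - (x2 + v2 * t)) for t in range(horizon + 1)]
    return min(gaps)

def collision_warning(x1, v1, x2, v2, horizon=5, safe_gap=10.0):
    # Warn early when the predicted minimum gap drops below the margin.
    return predicted_gap(x1, v1, x2, v2, horizon) < safe_gap

print(collision_warning(0, 20, 100, 0))   # True: closing fast on a stopped car
print(collision_warning(0, 20, 1000, 0))  # False: still far away
```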

Zhenzehn et al. proposed the Software-Defined Vehicular Network (SDVN) [ 78 ], the first-ever algorithm that successfully utilized cooperative cellular and ad hoc network accesses for extensive data collection. After SDVN, multiple protocols have been designed and suggested that follow the same mechanism. SDVN possesses high monetary cost, but it fulfils the data delivery ratio through a cellular predictive process. It takes a predictive decision based on real-time network status, rather than empirical knowledge.

Rakesh et al. proposed the Data Relationship Degree-based Clustering Data Aggregation (DRDCDA) [ 79 ], which is based on universal delegate sensing vehicle selection. This kind of selection is used as a unique factor to calculate the vehicle data and then measure a correlation with data from neighbouring vehicles to perform local cluster formation.

Lane based clustering

Lane based clustering protocols estimate the road lanes, based on the traffic flow of vehicles. In some schemes, it is assumed that each vehicle knows its lane, while other schemes often consider a virtual lane to assume lane-based connectivity. The CH is elected based on levels given in a lane in relation to other lane nodes. For example, a vehicle with a high lane level will be elected as CH, and so on.

Mohammad et al. [ 80 ] proposed a Lane-Based Clustering Algorithm (LBCA) as a stable clustering method where each vehicle identifies its lane using a lane detection system. Lane analysis is performed using weight-based metrics for right lane, left lane and no turn. It shows a longer CH lifetime than the Lowest-ID, Highest-Degree and Utility Function algorithms.

Interest-Based clustering

Interest-based clustering utilizes the area of interest for specified intentions of getting results. It is a “concern centred” technique that is suitable when data collection and delivery is desired for a specified area of particular interest. It is primarily applicable in emergency scenarios [ 81 ]. For example, from the complete road topology data, it extracts data where an accident or emergency occurs. Thus, it achieves high efficiency with low communication costs.

Yaoyao et al. [ 82 ] discussed a Partitioning-Based Scheduling (PBS) algorithm that utilizes mobile devices as mechanical information carriers in partitioned networks. PBS stores the partitioning of nodes and cluster formations in a KD-Tree, where powerful nodes maintain records [ 83 ].

Tarek presented a hybrid, dynamically allocated resource policy for quality of service and fair data packet scheduling via a Call Admission Control (CAC) scheme. The CAC scheme dynamically adjusts vehicle transmission powers according to vehicle density to provide the desired throughput for real-time communication. The performance of vehicular communication is enhanced when all stations are considered greedy to transfer packets. For contention-based channel access, a back-off mechanism is used to provide fairness among greedy 802.11p users of the same access category [ 84 ].

Ghada et al. proposed All-Member-Interests-based Merging (AMIM) [ 85 ], in which decisions are taken for the benefit of all members of clusters, rather than only considering the CH. AMIM considers vehicle speed, position and direction, focussing on Link Expiration Time (LET) and Signal-To-Noise Ratio (SNR). AMIM works together with Double-Head Cluster (DHC), where two functioning CHs are selected to overcome frequent re-clustering. When a cluster member loses its connection with its CH, the alternative CH takes over to continue data delivery in the area of interest.

Infrastructure-based clustering

Infrastructure-based clustering is about cluster formation based on a partially defined infrastructure of clusters for the communication process [ 86 ]. Clustering focuses on a low degree of velocity and high node connectivity for cluster leadership [ 65 ]. CH is elected based on a vehicle’s relative velocity [ 87 ]. This scheme also plays a vital role in preventing cluster re-ordering when two CHs come within range simultaneously.

Multi-hop clustering

Cluster formation is accomplished through multiple hop distances, where every node is at most K hops away from at least one CH [ 88 ]. This is significant in extending cluster sizes and reducing the number of cluster heads. Two primary factors are the number of K-hop neighbours and identification IDs. Grzegorz presented a Modified Mobility-Adaptive Clustering (M-DMAC) [ 89 ] for high-mobility nodes, where a vehicle with the highest neighbouring ratio is selected as CH. M-DMAC is a modified version of DMAC, whose generic clustering is not reliable in dealing with changing mobility patterns. M-DMAC focuses on avoiding re-clustering by increasing the stability of clusters through estimating connection time for moving nodes, i.e., freshness checking.
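The defining K-hop membership condition can be checked with a simple breadth-first search over a neighbour graph; this is a generic sketch of the condition, not M-DMAC itself:

```python
from collections import deque

def within_k_hops(adj, cluster_head, k):
    """Return the set of nodes at most k hops from cluster_head (BFS),
    i.e., the members a K-hop cluster around that head could cover."""
    seen = {cluster_head}
    frontier = deque([(cluster_head, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # neighbours of this node would exceed k hops
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

adj = {"CH": ["a", "d"], "a": ["b"], "b": ["c"]}
print(within_k_hops(adj, "CH", 2))  # nodes within 2 hops of the head
```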

MAC-assisted clustering

MAC-assisted clustering uses link-layer information for cluster formation. The topological insights available to the link layer help in selecting CHs with minimum relative speeds and distances to all neighbours. For instance, a point coordinator (access point) in the IEEE 802.11 family can be considered as a CH. Similarly, parameters like neighbour count, mean squared deviation in received signal strengths, and relative distances to the registered RSUs can be used in a fully distributed MAC. However, MAC-assisted clustering increases link-level traffic, which also raises the probability of collisions in contention-based MAC protocols. Collisions trigger retransmissions at the application and/or transport layers, resulting in lower transmission efficiency. This issue is addressed through the reduction of channel contention for timely and reliable message delivery [ 87 ]. In addition, these protocols are less affected by variation in vehicle speeds [ 88 ].

CSMA/CA-based protocols are inherently designed to combat collisions. In CSMA/CA, every associated terminal should be able to detect the transmission of all other terminals [ 91 ]. However, not all packets transmitted from different terminals can be sensed, due to the hidden node problem, mobility, and various other infrastructure-based obstacles, which degrades CSMA's performance. Increased collisions result in extended delays, rescheduling of transmissions, data loss, and wastage of resources. To improve CSMA performance in VANET protocols, various modifications have been proposed; these are discussed thoroughly in this section.

Contention-based protocols

Contention-based protocols are flexible and responsive to the dynamic nature of networks, with appropriate intermittent and short message sharing, and are reliable for sharing safety messages [ 90 ]. They allow multiple users to utilize the same channel without predefined coordination. Bouziane et al. discussed a Distributed Data Gathering Protocol (DDGP) [ 91 ] that uses vehicles and mobile collectors for data collection. DDGP enables vehicles to access the channel in a distributed way, based on their location information. The efficiency of the protocol is increased by removing expired and redundant data. A Col packet is sent that contains the length of the collection area, data packet type, acknowledgement packet, and announcement packet. Data collection is accomplished through these parameters, along with segmentation and clustering [ 92 ]. Two segments, i.e., Collection Segments (CS) and Silence Segments (SS), increase efficiency. However, these segments cannot deal with hurdles or blockages in the road topology, such as road blockage due to accidents.

Contention-free protocols

Contention-free protocols require centralized scheduling through proper allocation of resources, i.e., time slots, channels and positioning of nodes, to avoid collisions [ 90 ]. These protocols require time synchronization of nodes, which is not applicable for large-scale VANETs and results in slow responses in distributed networks. For example, Time Division Multiple Access (TDMA) provides collision-free medium access but results in low throughput, and becomes problematic under low traffic loads due to idle slots [ 93 ].

TDMA-based systems encounter issues in the synchronization of nodes due to rapid topology change. Secondly, changing time slots in a decentralized scenario is also a tedious task to perform. Hassan et al. proposed a TDMA-based protocol, VeMAC [ 94 ], that exclusively targets hidden terminal problems through single- and multi-hop broadcast services at the control channel level. VeMAC eliminates the transmission collisions caused by high node mobility by allocating disjoint sets of time slots to vehicles moving in opposite directions and to road side units. This characteristic makes VeMAC favourable for attaining higher throughput at the control channel.
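The disjoint-slot idea behind VeMAC can be sketched as follows; the three-way split of the frame and the slot counts are illustrative assumptions, not VeMAC's actual frame layout:

```python
def assign_slot(node_type, node_id, frame_len=12):
    """Sketch of disjoint TDMA slot sets: left-moving vehicles,
    right-moving vehicles and RSUs draw from non-overlapping thirds of
    the frame, so nodes of different types can never collide."""
    third = frame_len // 3
    offsets = {"left": 0, "right": third, "rsu": 2 * third}
    return offsets[node_type] + (node_id % third)

print(assign_slot("left", 5))   # a slot in the first third of the frame
print(assign_slot("right", 5))  # a slot in the second third
print(assign_slot("rsu", 5))    # a slot in the last third
```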

Frequency Division Multiple Access (FDMA) is a collision-free medium-access technique that handles communication over different radio channels [ 95 ]. However, it increases the cost of sensor nodes, because each channel is assigned to only one user at a time. Therefore, Alessandro et al. [ 96 ] proposed an orthogonal FDMA-based Obvious (OBV) protocol that uses carrier frequency and available bandwidth through control channels. OBV is divided into segments, or frames, with each frame separated into a contention period (CP) and a contention-free period (CFP). The CP is accessed through a contention-based algorithm and utilized to negotiate the resources to be used in the CFP for data transmission.

In Spatial Division Multiple Access (SDMA), a geographical zone is partitioned into multiple divisions, each mapped to a respective channel. It allows scheduled access time slots based on a vehicle’s location on the road topology. Although it does not maximize the usage of available bandwidth, it is a widely accepted technique for stable resource allocation under vehicular mobility [ 97 ]. However, SDMA has the least applicability in multi-hop message delivery. In SDMA-based protocols, each vehicle is given a time slot, also called its allowed time, in which to transmit data.

Bouziane et al. proposed a Clustered Data Gathering Protocol (CDGP) [ 98 ] that minimizes the extent of collisions in a highly dense network and enhances the robustness and reliability of data collection. Its clustering technique is based on a hybrid architecture, a data collection phase through Dynamic SDMA (D-SDMA) and a retransmission mechanism to handle faulty messages. CDGP comprises three main tasks, i.e., propagation of collection messages through a Road Side Unit (RSU) for initiation of the data collection process; formation of clusters, along with CH election, in each collection segment; and a data collection phase. The CH in each segment allocates a time slot to each block of road containing a vehicle. In the propagation of collection messages phase, the RSU starts the collection process by dispatching a beacon packet (Col) that contains the RSU position (RSU-POS), length (A), direction (DC), data type (DT) and validation time (VT).

In the cluster formation process, clusters and CHs are formed within the validation time. However, if the CH allocates a time slot to an empty block, the whole duration of the slot is wasted, increasing the waiting time. Another protocol based on CDGP is Extended Cluster-Based Data Aggregation (ECDGP) [ 99 ]. ECDGP is designed to be applicable to DTNs as well as real-time scenarios, with the additional features of supporting multiple data types and aggregating data before delivering it to the initiator. Moreover, it offers flexible data collection through aggregation and segmentation, and a retransmission mechanism ensures reliability.
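A CDGP-style per-block slot allocation, including the empty-block waste just noted, can be sketched as follows; the data layout is an assumption for illustration:

```python
def allocate_slots(blocks):
    """Sketch of a CH assigning one time slot per road block, in order.
    blocks is a list of (block_id, occupied) pairs; a slot mapped to an
    empty block is simply wasted (recorded as None here)."""
    schedule = {}
    for slot, (block_id, occupied) in enumerate(blocks):
        schedule[slot] = block_id if occupied else None  # None = wasted slot
    return schedule

print(allocate_slots([("b0", True), ("b1", False), ("b2", True)]))
```

Token-based variants described later avoid exactly these `None` entries by reserving slots only for blocks that have data to send.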

Wang et al. proposed TrafficGather [ 100 ], adapting the same SDMA concept. TrafficGather divides roads into road blocks with separate clusters, allowing each vehicle to transmit traffic information in a designated time slot. The main drawback of this protocol is that a large number of time slots are lost when they are allocated to empty cells, especially in a sparse network. Furthermore, the use of a flooding strategy during the last phase may cause the ‘broadcast storm’ problem.

In Dynamic Spatial Division Multiple Access (D-SDMA), reallocation and retransmission functions for erroneous data packets are further included [ 91 ]. Bouziane et al. [ 101 ] presented a Token-based Clustered Data Gathering Protocol (TCDGP) one year after the previously presented CDGP; TCDGP is meant to overcome gaps in the earlier protocol.

TCDGP is slightly different from CDGP in its functionalities. It inherits all the characteristics of CDGP and adds one, i.e., token-based dynamic SDMA (TD-SDMA). In TD-SDMA, each CH periodically sends a token packet with two fields: a Block_num of the packet intended for sending and an Ack field (a single bit) that is used as receipt of data to retransmit the data in case of error. TCDGP allows reservation of a time slot only for vehicles having data to send, thus resolving a slot wastage problem. On the other hand, the TCDGP protocol has more message overhead, due to transmitting a token packet to each block segment to assign time slots.
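The two-field token at the heart of TD-SDMA can be sketched as a small structure; the field types are assumptions based on the description above:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """TD-SDMA token sketch: the block number the slot is reserved for,
    plus a one-bit Ack used as a receipt of the data."""
    block_num: int
    ack: bool  # False -> the CH asks the block to retransmit

def needs_retransmission(token: Token) -> bool:
    # A cleared Ack bit signals an error, triggering retransmission.
    return not token.ack

print(needs_retransmission(Token(block_num=3, ack=False)))  # True
```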

Distributed clustering

Distributed clustering protocols are designed to control CH allocation in a distributed environment [ 102 ]. In distributed clustering, the number of links increases with the number of channels, making inter-cluster communication more effective. However, it shows reduced effectiveness in realistic vehicular speed scenarios, due to high transmission overhead and low vehicle density. Distributed clustering is also known as decentralized clustering, because of its shorter cluster connection time and lower cluster stability.

Inter-cluster links initially increase due to overlapping regions of vehicles, but keep declining and disconnecting because of rapid vehicle movement. Transmission efficiency is affected by this rapid vehicle movement, which makes the network more decentralized.

Leandros et al. [ 103 ] proposed a distributed Spring Clustering (Sp-Cl) scheme for stable clustering. It focuses on making fewer clusters as compared to other lower ID clustering schemes. Cluster stability is measured through cluster configuration against vehicle mobility. Nodes keep on joining and leaving the clusters with Sp-Cl dealing with vehicle transitions among clusters by reducing re-clustering.

Oliveria et al. proposed Adaptive Data Dissemination Protocol (AddP) [ 104 ], a multi-hop broadcasting protocol that deals explicitly with the high-density area, following distributed clustering. In AddP, CH is selected based on vehicles' position and velocity, thus making relay selection dependent upon density and distance factors. It dynamically adjusts the periodicity of beacon messages, while reducing the communication overhead.

A modified version of AddP was later proposed as the Optimal Adaptive data dissemination Protocol (OAddP) [ 105 ], which deals with different traffic flows, utilizing prediction-based decision-making schemes to generate clusters and disseminate data. Selo et al. proposed a distributed approach based on Coalitional Game Clustering (CGC) [ 106 ] that allows every vehicle to form a distributed cluster with other vehicles based on a coalition value. This value is based on the connection lifetime and speed difference among vehicles; in CGC, distributed clustering only requires values like link quality and the speed of neighbouring vehicles. This approach helps in achieving a high SNR with balanced distributed clusters.
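A hypothetical coalition-value score in the spirit of CGC, combining connection lifetime and speed difference, could look like this; the linear form and the weights are assumptions, not the formula from [ 106 ]:

```python
def coalition_value(link_lifetime, speed_diff, w_lifetime=1.0, w_speed=1.0):
    """Illustrative CGC-style score: a long expected connection lifetime
    raises the value, a large speed difference lowers it. Weights are
    assumed, not taken from the cited work."""
    return w_lifetime * link_lifetime - w_speed * speed_diff

# A slow, long-lived neighbour scores higher than a fast, short-lived one.
print(coalition_value(30.0, 2.0) > coalition_value(5.0, 15.0))  # True
```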

Zone-based clustering

Zone-based clustering forms clusters in different zones, selecting in each zone the node with the highest residual energy for providing location and detecting objects for real-time reporting [ 107 ]. Zones can be formed based on different parameters, i.e., N-hop neighbours, interest group, energy, etc. Data acquisition schemes may vary within zones, outside zones and among different zones, and clusters are formed based on zone interests. In zone-based clustering, the clustering process is relatively controlled, which ultimately makes the CH election and packet delivery ratio effective [ 108 ].

Abderrahim et al. [ 109 ] proposed a Clustering-based, Multi-metric adaptive mobile Gateway Management Mechanism (CMGM) for a VANET-3G integrated network architecture that further uses the concept of mobile gateways. CMGM works for clustering gateway candidates, in which the CH acts as a gateway to interface VANET with the 3G environment.

Brendha et al. proposed a Zone-Based Cluster Head for Routing (CZCHR) [ 110 ]. Forwarding collector packets (FCPs) are generated from one road end to another, based on various parameters like buffer queue length and link lifetime. Rather than sending FCPs to all nodes in the zone, it follows a traffic-aware technique, sending FCPs when the nodes in the zone leave the current road for the intersection. This strategy makes it lightweight, less crowded and energy efficient.

Flat data collection protocols

In flat data collection protocols, all nodes connected to the network are treated as equally operational on a flat topology. A sink is supposed to receive data from sources through multi-hop paths [ 111 ]. Flat protocols are also known as homogeneous protocols, because all nodes have the same capabilities. Flat data acquisition protocols suit small networks, which give better results with a flat topology [ 112 ]. Reactive routing protocols, unlike conventional routing protocols, maintain only active routes: routes are maintained only for the nodes currently being used to send data packets.

Nodes are supposed to take path information from packet headers or their internally maintained routing tables. Reactive routing is suitable in larger and mobile networks, especially in VANETs [ 113 ]. Anjana et al. [ 114 ] proposed the Data Gathering based Routing Protocol (DGRP). Without maintaining a routing table, it adapts to changes and opts for the best possible choice by considering Quality of Service parameters. All routing paths are created through source nodes present in the network. After collecting all the information, source nodes send a query to destinations through the network system. The complexity and overhead of the required distributed location database service can be considered DGRP's constraints.

Omar et al. [ 115 ] proposed an intelligent Unmanned Aerial Vehicle Assisted Routing protocol (UVAR) for urban VANETs. UVAR improves data routing and vehicle connectivity through an aerial UAV, targeting the ground only when the network is sparse. Furthermore, UVAR forwards data packets through aerial vehicles to the ground reactively, to outperform conventional V2V communications. Thus, it is functional in both environments: on the ground, to improve data delivery efficiency, and in the sky, to transmit data packets using reactive routing.

UVAR is remarkable in re-establishing communication links and re-linking disconnected road segments. However, UAVs, despite being efficient, add extra costs for batteries, fuel and maintenance, and this additional cost makes it difficult to adopt UVAR as a better solution. The protocol complements UAV-to-UAV and UAV-to-ground-vehicle communication to retain diverse information about the connectivity status [ 116 ]. Moreover, UVAR does not utilize GPS information or trajectory calculation during route discovery and data forwarding.

Boangoat et al. designed PROMPT [ 117 ], a cross-layer, position-based, delay-aware communication protocol. PROMPT works on positions independently of vehicle movement and relies on statistics from vehicle-monitored information exchanges, which help in selecting the most suitable paths. In contrast, proactive protocols rely on shortest-path algorithms; they are table-based and keep all the required information about connected nodes in tables. Tables are usually shared with neighbouring nodes, so every change must also be propagated to the other nodes [ 118 ]. Proactive protocols are not suitable in VANETs, due to rapid changes in positioning; moreover, their higher bandwidth consumption and large tables of information make them inappropriate for VANETs. Wassim et al. proposed a proactive protocol, the Adaptive Data Collection Scheme (ADCS), using 3G/LTE [ 119 ]. Being proactive, ADCS ends up with a high packet loss ratio.
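The table-exchange behaviour of proactive protocols described above can be sketched as a minimal distance-vector-style merge; the table layout (destination mapped to next hop and hop count) is an illustrative assumption, not a specific protocol's format.

```python
def merge_tables(own, neighbor_table, neighbor_id):
    """Proactive-style table merge: adopt a neighbour's route whenever it
    offers a shorter path. Tables map destination -> (next_hop, hop_count).
    """
    updated = dict(own)
    for dest, (_, hops) in neighbor_table.items():
        candidate = (neighbor_id, hops + 1)  # route via the advertising neighbour
        if dest not in updated or candidate[1] < updated[dest][1]:
            updated[dest] = candidate
    return updated

# Node A knows a direct route to B; B advertises a route to C.
table_a = merge_tables({"B": ("B", 1)}, {"C": ("C", 1)}, "B")
```

Because every topology change must be re-propagated this way to all neighbours, table size and update traffic grow quickly under VANET mobility, which is the bandwidth cost the text refers to.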

A summary of best-effort protocols is given in Table 2 . The summary illustrates that best-effort protocols are better suited to high traffic density, due to their adaptability to traffic conditions [ 76 , 77 , 78 , 79 ]. Second, unlike DTN protocols [ 48 , 51 , 60 , 61 ], the average level of forwarded messages is low or medium for the majority of best-effort protocols [ 72 , 73 , 79 , 83 , 84 , 94 , 96 , 98 , 99 , 100 , 101 , 104 , 105 , 110 , 114 , 117 , 120 ], yet they still give a better packet delivery ratio: another positive factor of best-effort protocols. A better packet delivery ratio, along with a lower packet drop ratio, distinguishes them from DTNs. Furthermore, best-effort protocols perform significantly better in terms of flexible adaptation to traffic density. However, a recovery strategy is less often followed in best-effort protocols, due to the nature of their operations: they are neither time-bound nor guarantee reliability. Best-effort protocols are thus considered suitable when cost and storage matter more than timely and reliable data collection and delivery, as, for example, in retrieving data about weekly accident cases on a specific road for future traffic analysis [ 121 ].

  • Real-time protocols

In real-time protocols, the value of data decreases rapidly with time, which limits the tolerable delay [ 122 ]. The value of data is of prime concern because VANETs depend heavily on current data to make decisions, improve user safety and traffic flows, and assist automated driving. Data delivered after the designated time is of little help in rapidly changing traffic conditions. Outdated or delayed data does not offer the real-time traffic monitoring that is an essential aspect of the smart traffic navigation services currently used by millions of drivers. Thus, real-time protocols are significant and better suited to analysing current, live traffic conditions. In this case, primitive sensor readings need not be stored at the device, reducing storage overhead. However, the data sent to the network can be relatively large, so the communication overhead increases. In addition to time constraints, the diversity of areas for information extraction and dynamic path selection are also critical factors for evaluation. We have categorized the protocols as follows.
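The idea that data value decreases rapidly with time and becomes useless past a tolerable delay can be captured by a simple decay model; the linear decay, threshold and field names below are illustrative assumptions, not a formula from the cited works.

```python
def data_value(base_value, age_s, deadline_s):
    """Illustrative linear decay: full value when fresh, zero at or after
    the tolerable deadline. Real protocols may use sharper decay curves.
    """
    if age_s >= deadline_s:
        return 0.0
    return base_value * (1.0 - age_s / deadline_s)

def worth_forwarding(base_value, age_s, deadline_s, threshold=1.0):
    """Drop stale packets whose remaining value has fallen below a threshold,
    so relays spend bandwidth only on data that is still actionable."""
    return data_value(base_value, age_s, deadline_s) >= threshold
```

Such a freshness check is what lets real-time schemes avoid storing primitive readings: a reading that has lost its value is simply discarded rather than buffered.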

Cluster-based protocols

Clustering involves grouping nodes according to density, velocity, position and geography. Due to frequent mobility in VANETs, clustering algorithms dynamically restructure the connectivity patterns among neighbouring vehicles. They operate on a small number of clusters, to retain control over the structure without excessive communication overhead. These protocols build a virtual backbone infrastructure to efficiently deliver and collect data in VANETs [ 11 , 123 ].

Zongjian et al. proposed a Real-time traffic-Information-aware Data Extraction (RIDE) [ 122 ] scheme for satisfying data collection time constraints. It treats data collection as a schedule optimization problem and proves the problem NP-complete. RIDE is a real-time, traffic-adaptive data collection protocol that considers the criticality of time to minimize the data transmission overhead [ 122 ]. It is based on a Dynamic Programming (DP) solution designed to manage small-scale data collection, where only a small number of vehicles are involved. DP gives an optimal solution in which the problem is divided into sub-problems whose solutions are memoized and reused instead of recalculated. RIDE further uses a Genetic Algorithm (GA) for substantial road segments and complex situations. The GA works by choosing a random neighbour of the source in the next layer and then assigning data forwarding counts randomly, according to the constraints. The iteration repeats until it reaches the BS, thus achieving real-time data collection [ 122 ].
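The GA step described above, randomly assigning data forwarding counts under constraints and iterating, can be sketched as a toy genetic algorithm; the fitness function, penalty weight and all parameters are illustrative assumptions rather than RIDE's actual formulation.

```python
import random

def ga_schedule(demands, capacity, pop=30, gens=40, seed=0):
    """Toy GA: assign a forwarding count per road segment, penalising unmet
    demand and keeping counts within a per-segment capacity.

    demands: required forwarding count per segment (illustrative constraint).
    """
    rng = random.Random(seed)
    n = len(demands)

    def fitness(ind):
        # Lower is better: total forwarded traffic plus a heavy
        # penalty for any segment whose demand is not met.
        shortfall = sum(max(d - x, 0) for d, x in zip(demands, ind))
        return sum(ind) + 100 * shortfall

    population = [[rng.randint(0, capacity) for _ in range(n)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness)
        survivors = population[: pop // 2]          # selection
        children = []
        while len(survivors) + len(children) < pop:
            a, b = rng.sample(survivors, 2)         # parents
            cut = rng.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]               # crossover
            if rng.random() < 0.3:                  # mutation
                child[rng.randrange(n)] = rng.randint(0, capacity)
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)

best = ga_schedule([3, 1, 2], capacity=5)
```

The same penalty-based fitness could encode RIDE-style time constraints; the point of the sketch is only the select/crossover/mutate loop the text describes.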

Clustering Adaptation Near Intersection (CANI) [ 85 ], uses Online Sequential Extreme Learning Machine (OS-ELM) to let vehicles continuously learn and update in real-time to predict behaviour and adapt clustering near intersections to accomplish data collection.

Position-based protocols

In position-based protocols, each vehicle keeps track of its neighbouring vehicles by periodically sending beacon messages [ 124 ]. Dynamically updatable data storage is maintained at the sink, which causes communication overhead; the sender traces the position of the destination by utilizing coordinates. Salim et al. have explored a Hybrid Bee Swarm Routing (HyBR) protocol based on a continuous learning paradigm for maximum data packet delivery with minimum delay. It combines two routing methods: topology-based and geography-based routing. The former works through beacon messages, by which all nodes are informed of their neighbours and active links, and each node possesses its own routing table containing various routes toward the desired destination. In the latter, a fitness function is used in which optimal route discovery is made by selecting mutation operators, parents and crossovers according to the geographical information of the vehicles [ 125 ]. The main problems in HyBR are route poisoning and outdated information in the routing table.
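Beacon-based neighbour tracking, common to position-based protocols, can be sketched as a neighbour table pruned by beacon age; the timeout value and record fields are illustrative assumptions.

```python
def prune_neighbors(table, now, timeout=2.0):
    """Keep only neighbours heard from within `timeout` seconds; a vehicle
    whose beacons have stopped is assumed to have moved out of range.
    Pruning also limits the 'outdated routing table' problem noted for HyBR.
    """
    return {nid: info for nid, info in table.items()
            if now - info["last_beacon"] <= timeout}

neighbors = {
    "v1": {"last_beacon": 9.5, "pos": (120.0, 40.0)},
    "v2": {"last_beacon": 5.0, "pos": (80.0, 42.0)},   # stale entry
}
fresh = prune_neighbors(neighbors, now=10.0)
```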

Ion et al. proposed a protocol named DISCOVER [ 123 ] that collects data in a large city area using a single network structure, i.e., a multi-hop network made up of vehicles only. DISCOVER is distributed and adapts to different traffic densities and traffic conditions in real time. There are two designated waves: a forward wave, meant for dissemination, and a reverse wave, which performs data collection. In this protocol, floating car data (FCD) provides periodic delivery of vehicular data via the RSU.

Tarek et al. proposed the Secure-Greedy Traffic-Aware Routing protocol (S-GyTAR) [ 126 ]. Real-time traffic evaluation is performed to identify malicious nodes and thus stop them from forwarding data; the protocol continuously monitors traffic for secure data communication in real-time scenarios. In S-GyTAR, the CH evaluates the trustworthiness of cluster members through a Reputation Model (RM). The protocol is a modified version of GyTAR (a position-based protocol) [ 127 ], in which data is sent through the network, intersection by intersection, until it reaches its final destination. Although part of the protocol relies on clustering, it is best classified as position-based, due to its ability to locate suspicious vehicles and then stop them from operating further.

Julio et al. presented a Real-Time Adaptive Dissemination (RTAD) [ 128 ] that allows each vehicle to automatically adopt the best-suited dissemination scheme for specific situations. RTAD utilizes parameters like vehicle density and topological characteristics. As a result, more vehicles are informed through fewer messages, thus mitigating broadcasting storms.

A summary of real-time protocols is presented in Table 3 . In real-time protocols, the levels of average forwarded messages are relatively high, which ensures guaranteed packet delivery with a lower packet drop ratio. One of the critical aspects of real-time protocols is the presence of recovery strategies in their design that makes them preferable when dealing with any accidental situation when it is important to deliver data within a designated time. Real-time data collection protocols add considerable cost for the functioning of the network, but facilitate updated, timely, and reliable data transfers. We can see in Table 3 that end-to-end delay is low for all protocols, with high levels of forwarded messages. However, exceptions like [ 122 , 125 ] exist that give high packet delivery ratios with low average forwarded messages and low packet drop ratios. Real-time protocols are now considered through hybrid and intelligent operational domains that facilitate low forwarded messages with a better delivery ratio. [ 122 , 125 ] are designed based on artificial intelligence approaches that shift the paradigm gradually from high-cost protocols to cost-effective protocols.

Another strength of real-time protocols, which can be seen from Table 3 , is their ability to deal appropriately with different traffic patterns. Real-time protocols can occasionally be compromised by the packet drop ratio [ 123 , 126 , 127 ], a high number of forwarded messages [ 85 , 123 , 126 , 127 , 128 ] and their high cost, but they cannot compromise on data delivery. Thus, the packet delivery ratio cannot be low in real-time protocols, to ensure guaranteed data delivery within the specified time. These protocols are ideally suited to defence services and emergency applications, where the value of data is more critical than cost.

4-D Functional analysis

A functional framework is designed to analyse data collection protocols based on four characteristics: node mobility, protocol class, protocol type and data acquisition initiation, as illustrated in Fig.  3 . We analyse the protocols according to these four characteristics, as explored in Table 4 . Node mobility identifies whether a protocol supports static or mobile sinks and sensors. It has been observed that most protocols follow the pattern of mobile sensor, static sink (MS) or mobile sensor, mobile sink (MM). Only PBS [ 83 ] and Sp-Cl [ 103 ] follow the static sensors, mobile sink (SM) option, where the mobile data collector takes the data from sensing points that collect it from vehicles. No protocol adopts the static sensors, static sink approach, as vehicles, together with their sensors, are in motion. The next characteristic, “protocol class”, identifies the category of the protocol: flat protocol (FP), hierarchical or cluster-based (CB), and semi-structured (SST), which are hybrid.

figure 3

4-D Functional Framework

The next characteristic, “Protocol Classification”, identifies the suitability of the protocol for Real-Time (RT), Delay-Tolerant (DT) or Best-Effort (BE) scenarios, as shown separately in each category. The “Data Initiator” characteristic indicates how data acquisition is initiated: either the sink node pulls data (“Sink Pull (PS)”) or the source node pushes it (“Sensor Push (SP)”). Most protocols follow the SP approach, which is adopted to continuously push the sensing data to central repositories; however, it may cause a bottleneck, and data may be lost during transmission in case of congestion. PS-based approaches forward a query to all sensors; when the query is weak, critical data can be lost.
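The four characteristics can be encoded as a small record type that validates a protocol's profile against the framework's allowed labels; the class itself and the example assignment for RIDE are our illustrative construction, reusing the survey's abbreviations.

```python
from dataclasses import dataclass

MOBILITY = {"MS", "MM", "SM", "SS"}    # sensor/sink static-mobile combinations
PROTOCOL_CLASS = {"FP", "CB", "SST"}   # flat, cluster-based, semi-structured
PROTOCOL_TYPE = {"RT", "DT", "BE"}     # real-time, delay-tolerant, best-effort
INITIATOR = {"SP", "PS"}               # sensor push, sink pull

@dataclass(frozen=True)
class ProtocolProfile:
    """One protocol placed in the 4-D functional framework."""
    name: str
    mobility: str
    pclass: str
    ptype: str
    initiator: str

    def __post_init__(self):
        # Reject labels outside the framework's vocabulary.
        assert self.mobility in MOBILITY
        assert self.pclass in PROTOCOL_CLASS
        assert self.ptype in PROTOCOL_TYPE
        assert self.initiator in INITIATOR

# Example placement (the MS/SP choice here is our assumption, not Table 4's).
ride = ProtocolProfile("RIDE", "MS", "CB", "RT", "SP")
```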

Table 5 explores a comparison of data collection protocols categorized as DTN, BE or RT, covering the system model, proposed technique and related metrics. It shows, for example, how a protocol may be good at privacy protection but not energy-efficient, and similar combinations of advantages and limitations. We also consider efficiency, latency, motion estimation, area of application, environment and mechanism. It has been observed that the majority of protocols are designed and tested to be functional in an urban environment, which gives irregular patterns, a high density of vehicles, a high ratio of traffic jams and accidents, and greater chances of attack by malicious nodes. The urban environment is thus the more challenging setting for newly developed protocols, and most researchers choose it to obtain beneficial results and analysis.

The protocol’s system model or deployment environment is based on an urban (U) or highway (H) scenario. In the next column, we identify the mechanism adopted in the protocol. Next, clustering and cross-layer support are identified as Yes or No. The source of delivery is mostly the Vehicle (V), but other options are Transport Buses (TB), CANI and UAV in [ 58 , 78 ] and [ 115 ], respectively. It can be noticed that protocols involved in multi-tasking, covering multiple attributes in addition to data collection, are likely to be less energy efficient [ 49 , 56 , 59 , 115 ], because more energy is consumed when a protocol deals with multiple attributes.

Another critical aspect of protocols designed for VANETs is that not every protocol uses vehicles as the source for data collection; for example, unmanned aerial vehicles [ 115 ] and public transport [ 58 ] are relatively new trends to consider for data collection in VANETs. Next, we explore whether the protocols consider privacy protection and which routing approach is used: Reactive (R), Proactive (PR) or Hybrid (HY). In the following column, we note that most of the protocols adopt the AWS mobility model, and a number of simulation tools are also presented. Moreover, the protocols are analysed for position verification, latency and motion estimation support. These metrics are essential for designing a dependable, optimised, adaptive and scalable protocol for data collection in VANETs (Fig.  4 ).

figure 4

Dendrogram of open research challenges for VANETs

In Table 5 , parameters such as energy efficiency, privacy protection and routing approach are amenable to improving protocol performance. Energy efficiency, a key component of data collection protocols, is also tied to the protocol category. For example, DTN protocols are mostly not energy efficient: while waiting for data to be delivered, vehicles keep transferring the data to other vehicles, and sometimes retain it themselves in large memory units, so energy efficiency is largely compromised in DTNs. Best-effort protocols, on the other hand, are mostly energy efficient in comparison with DTN and real-time protocols, because of their nature of collecting data only when it is feasible, without replicating or transferring it unnecessarily.

Real-time protocols can also be energy efficient, due to their time-bound characteristics: they generate highly targeted forwarded messages, reducing regeneration and avoiding keeping data for long periods. Privacy protection is a matter of overall protocol priorities; to save protocol resources and completion time, researchers often omit it. Consequently, irrespective of category, only 12 of the almost 60 protocols incorporate a privacy protection factor. Routing approaches are mainly hybrid, reactive and proactive, and the choice depends on the protocol's need to interact with other vehicles and the BS. As Table 5 shows, hybrid, proactive and reactive approaches are distributed across the different categories with no clear pattern.

The layered perspective identifies protocols by the layer at which they operate: cross-layer, network layer, physical layer, application layer or link layer. The majority are cross-layer protocols, because a cross-layer design allows a protocol to share and exchange network information among different layers; this ensures best-route selection that considers energy consumption as well as other performance requirements. Network-layer protocols are better at handling routing and sending data between different networks. Link-layer protocols operate only on the local network segment (link). Application-layer protocols are mostly shared communication protocols that define how application processes communicate between clients and servers.

From the discussion above, factors such as energy efficiency, routing approach and privacy protection can be amended, but a change in one factor may change the protocol's category as well. A protocol can be made energy efficient, but increasing efficiency may limit the delivery time, turning a DTN or best-effort protocol into a real-time one. Similarly, adding privacy protection may require compromising on time and efficiency, so a best-effort protocol may switch to a DTN as a result. A slight change in one parameter influences the others and shifts the protocol's classification.

Protocol standardization indicates whether a protocol has been standardized in industry or merely proposed by academia. Of the surveyed protocols, 74% are academic research proposals, while 26% have been practically implemented in various industrial scenarios and actual cities. Academically proposed protocols are well supported and tested through virtual environments created with SUMO and Mobisim; the industrially standardized ones are those actually deployed as data collection solutions in the real environments of cities and counties. Thus, fewer protocols (26%) are standardized in industry, and more are merely proposed in academia.

It appears that energy- and time-efficient protocols are not being deployed swiftly in actual cities. This trend is expected to change, and the future may bring more industrially supported protocols, as interest in smart cities and IoV-based projects is rising considerably. An important factor related to standardization is the lack of standardizing bodies. A few bodies, such as the Internet Engineering Task Force (IETF), define standardization criteria for Internet routing protocols and document the corresponding Internet standards. However, there is no prominent VANET standardization body that deals specifically with intelligent transport systems, VANETs and, in particular, data collection protocols for VANETs. Protocols implemented in cities and counties are approved through the IETF, IEEE and related umbrella standardization bodies.

Open research challenges

VANETs have drawn remarkable interest in both industrial and academic sectors due to their potential applications and services. The boom in self-driving cars and other prediction-based traffic services has significantly increased the demand for improvement in VANETs. However, locating vehicles’ positioning, maintaining, and interpreting an exact view of the entire network, a high number of nodes, rapidly changing node mobility, swift topological changes, and frequent network disconnections add potential challenges in the area of VANETs.

Designing energy-efficient and cost-effective communication approaches for data collection is a dire need at this time [ 142 , 143 ]. VANETs are challenging for data communication because of frequent node disruptions, high node density, and limited infrastructure availability to cope with the change [ 144 , 145 ]. Furthermore, the intervention of suspicious and malicious vehicles affects normal vehicle operations in terms of data collection and makes it a critical task [ 146 ]. Data is collected from all kinds of vehicles together, raising authenticity concerns and adding doubt in analysing it. Therefore, different protocols are being designed nowadays to meet the challenging data communication needs in the best possible manner [ 147 ].

When a protocol is developed, researchers focus on single or multiple objectives. Some of them focus on making lightweight protocols, and others target energy efficiency to save considerable space. It is nearly impossible to incorporate all the desirable factors in one protocol without compromising any of the factors of storage, time or efficiency. In other words, there are always trade-offs.

Other challenges include geographical mapping and addressing, risk management and trust analysis, data confirmation and authenticity evaluation, inter-vehicular and intra-vehicular communication baselines, reliability checks, data prioritization, addressing and monitoring issues, privacy and anonymity, and real-time changes together with protocol competency to deal with them. The challenges are numerous and exploit different parameters, making VANETs vulnerable to various attacks. A few of the latest and most crucial research challenges are discussed here.

Hardware hacks

VANETs are directly and indirectly dependent on hardware, as vehicles use hardware-based sensors for speed, temperature and location, and various other devices for monitoring and security purposes. Manipulation of sensors by hardware hackers might replace actual, real-time data with fabricated information, altering planned routes and affecting vehicles' navigation and data communication. Attackers find it easy to exploit hardware along the data's path to the destination: the data passes through various hardware-assisted nodes, creating chances to overwrite it. Hardware authentication is challenging in VANETs because of the millions of hardware devices involved at any given time, and data trustworthiness becomes complicated when a vast number of devices communicate concurrently to effect the desired data collection. Hardware hacks of safety-critical systems are even more critical to handle [ 148 ]. A lack of tamper resistance in sensors and devices (which would prevent users from intentionally causing malfunctions) allows physical access to the system and vehicular products to be exploited. Although there are many data protection techniques for VANETs, energy-efficient hardware protection from hacks is still an area to be resolved [ 23 ].

Delay constraints for data delivery

Due to rapidly changing traffic conditions, data needs to be delivered within a specific amount of time; updated data is most credible and valuable for different real-time applications. Security-based data is especially important to receive within a certain deadline due to its high sensitivity, costs, and the risk factors associated with it. The time factor is incorporated with private data to give the least possible time for attackers to exploit data authenticity [ 149 ]. The contamination of redundant data, along with rebroadcasting, affects data delivery and leads to unforeseeable delays. Delay is purposefully added through slow message propagation to limit or pause the next data initiation phase, which results in scheduling delays. With redundant, rebroadcasted and outdated data, come quality compromises and delayed transmissions.

Maliciously induced delays significantly damage traffic routes, causing jams and poor monitoring. Plausibility checks also occasionally add delays; in other words, delay occurs while the reasonability of data quality is being checked. Unidentifiable encryption, route breakage, accountability issues, encryption failures and authentication failures are other critical causes of delays in data delivery. Collaboration among heterogeneous networks to accomplish data communication is another leading cause of extreme network congestion and network isolation, requiring the reconstruction of network scheduling strategies.

When the desired information takes longer than expected or exceeds the allowed threshold, this may indicate theft of the data by attackers before it could reach the required destination [ 147 ]. The problem arises in data communication because of rapidly changing traffic conditions and other factors, such as link disruption or road jams, which delay data delivery and thus give attackers more time for data theft. The delaying factors mentioned above need considerable attention to fully explore the desired solutions and ensure timely data delivery [ 144 ].

Massive data management

A considerable amount of work has been done on data communication in nearly all possible scenarios, but managing the enormous data blocks coming from a massive number of vehicles sending data at high rates per second is still challenging. Such massive data management becomes even more complicated with poor identification of data sources. Unpredictable data sizes, duplicated data, malicious data involvement and manipulated data also make data management a complex task; duplicated and manipulated data considerably increase data sizes, adding to the complexity of data handling. Data analysis sometimes requires data rearrangement upon locating, for instance, identification spam, broadcasting failures and unidentifiable data.

Data management under signalling disruptions, interconnected topologies and complicated cryptographic structures hinders the smooth functioning of data algorithms. Exploiting already-saved data depends on storing it in a way that avoids data theft, protects saved data blocks, and allows data to be changed and overwritten without creating opportunities for manipulation by potential attackers. Algorithms are needed to deal with massive numbers of data storage units. These concerns remain largely unanswered and give researchers room to investigate these areas and derive potential solutions [ 150 ].
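One small building block for the duplicated-data problem noted above is content-hash deduplication before storage; this is a generic sketch, not a scheme from the cited works.

```python
import hashlib

def dedupe(records):
    """Drop byte-identical duplicate records by content hash, keeping the
    first occurrence: a minimal pre-storage trim for duplicated data."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique
```

Hashing detects only exact duplicates; rebroadcasts that alter timestamps or padding would still pass, which is part of why massive data management remains open.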

Standardization in VANETs

When the term ‘vehicle’ is mentioned in this context, it does not indicate only a normal car used for personal tasks. The word ‘vehicle’ is a broader term that includes public buses, trains, electric scooters, motorbikes, six-wheelers, loaders, taxis and ambulances, some better suited for urban scenarios and some for highways. There are even some categories of vehicles that are not yet identified; unidentified vehicle categories with broad, ambiguous contexts make it complicated to properly define the functionalities and scope of VANETs. For example, electric two-wheelers are not allowed on some motorways and big highways, and a taxi might cause less link disruption than a privately owned car that spends less time on the road. What, then, are the standards by which a data communication protocol with variable clustering requirements should treat all vehicles [ 77 ]? The word ‘vehicle’ itself is generic and broad, and standardizing these small things can significantly impact the design of protocols and algorithms.

Integrity and data trust

VANETs are highly dependent upon V2V and V2I communication. At this time, there is a critical need to establish protocols to resolve data integrity issues in both V2I and V2V communication [ 150 ]. When data is transferred hop by hop (vehicle to vehicle), data must be received from source to destination without any alteration [ 151 ]. Malicious vehicles try to drop or fabricate received data and then send manipulated data without anyone even knowing about suspicious activities [ 152 ].

An efficient and effective detection scheme to deal with such V2V frauds is needed to handle fabrication and modification in a real-time manner, or preferably before it happens [ 153 ]. Data trust can be exploited during data cloud failures, especially during storage, synthesis, updates and movement. Data duplication also challenges integrity, along with various other trust management concerns, e.g., localization challenges, privacy breaches, or anonymity with geographical addressing.
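A minimal building block for hop-by-hop integrity is a keyed digest computed at the source and verified at the destination, so that any alteration by an intermediate vehicle is detected; key distribution is assumed out of scope, and this sketch is generic rather than a protocol from the literature.

```python
import hashlib
import hmac

def make_tag(key: bytes, payload: bytes) -> str:
    """Keyed digest (HMAC-SHA256) attached by the source vehicle."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_tag(key: bytes, payload: bytes, received: str) -> bool:
    """Destination-side check: any in-transit modification of the payload
    changes the digest, so fabricated data fails verification."""
    return hmac.compare_digest(make_tag(key, payload), received)

key = b"shared-secret"  # assumed pre-shared; real schemes need key management
tag = make_tag(key, b"speed=62;seg=14")
```

This detects modification but not dropping; detecting dropped data needs complementary mechanisms such as sequence numbers or acknowledgements.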

Traceability and revocation

Extensive and remarkable work is being carried out to catch malicious and suspicious nodes during data communication, dissemination and routing processes [ 154 ]. However, the least focus is given to permanently blocking malicious nodes from contacting the network again. In other words, protocols work to highlight malicious nodes, but do they keep records of those nodes to permanently block their access to the same network? Every time a malicious node is detected and rejected, it can pass through the same process repeatedly to regain access, and the protocol ends up detecting it again and again to stop each attempt. Researchers could maintain a database of malicious nodes to save time, energy and effort. Moreover, non-traceable packet transfer with redundant information and the re-entry of suspicious nodes make this area of VANETs under-explored and challenging at the same time.
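The malicious-node database suggested above can be sketched as a simple revocation list consulted at admission time; the interface and identifiers are illustrative assumptions.

```python
class RevocationList:
    """Persistent record of detected malicious nodes, so a node rejected
    once is denied re-entry without re-running the whole detection process.
    Illustrative sketch; a deployment would need persistence and expiry policy.
    """

    def __init__(self):
        self._revoked = set()

    def revoke(self, node_id):
        """Record a node detected as malicious."""
        self._revoked.add(node_id)

    def is_admitted(self, node_id):
        """Admission check run before a node rejoins the network."""
        return node_id not in self._revoked

rl = RevocationList()
rl.revoke("v-attacker-17")
```

A real system would also have to handle identity changes (a revoked node re-entering under a new identifier), which ties this challenge back to traceability.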

Concluding notes

VANETs have been critically challenged recently, due to their extensive applicability in ITS and the Internet of Vehicles (IoV), and the growing interest in smart cities. Data collection has been a widely studied aspect of VANETs for secure and smooth communication flows, and data collection protocols are paramount for ITS and IoV in terms of efficiency, efficacy, time and cost-effectiveness. This paper presented a detailed overview of data collection protocols based on the three primary categories in VANETs, i.e., DTN, BE and RT. We examined these protocols with a structured taxonomy to provide broader insight into categories, sub-categories and relevant supporting examples suitable for data collection in VANETs. Each technique is thoroughly interpreted and investigated against various evaluative parameters, such as End-to-End Delay, Packet Delivery Ratio, Packet Drop Ratio, Average Forwarded Messages and Recovery Strategy, since these parameters affect the success of a VANET.

We then supported our comparative analysis with a 4-D functional framework comprising four integral data collection characteristics, i.e., Node Mobility, Data Initiator, Protocol Class and Protocol Classification. Our proposed 4-D functional framework can categorize any VANET-based protocol without requiring researchers to work through extended literature readings. Finally, a comparison table of data collection protocols with different evaluative parameters is provided to assist users in determining the better choice based on suitability and credibility. Due to the diversity of data collection protocols, we have mapped them into various general categories for feasibility and ease of understanding. This diversity demands selection criteria based on the mechanism adopted, network layers, routing approach, latency, privacy approach, motion estimation and sources of delivery used. In this regard, we have performed an exhaustive categorical comparison to highlight the advantages and disadvantages of data collection schemes under different metrics, reflecting varying network characteristics.

Moreover, we have included the simulation tools used in the selected schemes to guide researchers as to the credibility of the reported experimentation. This parameter enables us to analyse the worth and actual capabilities of DTN, RT and BE protocols for data collection schemes associated with VANETs. A three-step analysis (parametric, 4-D functional and categorical) allows readers to instantly identify a protocol's advantages, disadvantages, operational benefits, constraints, and other valuable features necessary to fully understand its domain. Furthermore, we have marked and highlighted every protocol in a tabular format to spare researchers lengthy literature reads. Current open research challenges, accompanied by a brief dendrogram, are presented in the last section to specify the barriers and application gaps of VANETs. These challenges can help researchers in this area propose solutions to the gaps identified. The open research challenges were deduced after studying more than sixty data collection protocols; they draw researchers' attention to unexplored and underserved areas, since bridging these gaps will let us enjoy the full benefits and services of VANETs.

Abbreviations

  • Unmanned aerial vehicles
  • UAV-assisted VANETs routing protocol
  • Floating car data collection
  • Two Exponents Model
  • Simple Obstacle Shadowing Model
  • Adaptive data dissemination protocol
  • Adaptive beacon congestion control mechanism
  • Partitioning gradient based
  • Distributed optimized time
  • Real-time traffic information aware data collection solution
  • Distributed data gathering protocol
  • Carrier sense multiple access/collision avoidance
  • Data collection mechanism for smart grids using public transportation buses
  • Software Defined Vehicular Networks
  • Data relationship degree-based clustering data aggregation
  • Preferred group based
  • Zone Routing Protocol
  • Data gather based routing protocol
  • Real-time adaptive dissemination system
  • Extended cluster-based data gathering protocol
  • Transfer Utility of Node's Buffer Scheduling Strategy
  • Secure Greedy Traffic-Aware Routing Protocol
  • Adaptive Data Collection Scheme
  • Call Admission Control
  • Optimized Link State Routing, Version 2
  • Token-based clustering data gathering protocol
  • Token-based space division multiple access
  • Adaptive data collection protocol using reinforcement learning
  • Hybrid bio-inspired bee swarm routing protocol
  • Secure dynamic aggregation
  • Secure Data Aggregation Scheme
  • Compressive sensing based data collection
  • Agent learning-based clustering algorithm
  • MAC protocol for VANETs
  • OFDMA-based MAC protocol for VANETs
  • Distance-based probability routing protocol using history of encounters and transitivity
  • Clustering data gathering protocol
  • Quality of Information Data Gathering
  • Delay-Bounded Vehicular Data Gathering
  • Strategy Selection Algorithm
  • Spring clustering
  • Hybrid Traffic-Aware Routing
  • Stochastic compressive data collection
  • Dynamic Clustering-based Adaptive Mobile Gateway Management
  • Probabilistic Bundle Relaying Schemes
  • Lane Based Clustering Algorithm
  • Clustered Gathering protocol
  • Greedy Traffic-Aware Routing Protocol
  • History-based vector routing
  • Cluster-based accurate syntactic compression of aggregated data
  • Modified Distributed and Mobility-Adaptive Clustering
  • Partitioning-Based Scheduling
  • Whale Optimization Algorithm
  • Cooperative Collision Avoidance


This research was supported by University College Dublin.

Author information

Authors and Affiliations

School of Computer Science, University College Dublin (UCD), Dublin, Ireland

Maryam Gillani & Hafiz Adnan Niaz

College of EME, National University of Sciences and Technology (NUST), Islamabad, Pakistan

Muhammad Umar Farooq

National University of Modern Languages, Islamabad, Pakistan


Corresponding author

Correspondence to Hafiz Adnan Niaz .

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Gillani, M., Niaz, H.A., Farooq, M.U. et al. Data collection protocols for VANETs: a survey. Complex Intell. Syst. 8, 2593–2622 (2022). https://doi.org/10.1007/s40747-021-00629-x


Received : 08 July 2021

Accepted : 17 December 2021

Published : 10 January 2022

Issue Date : June 2022

DOI : https://doi.org/10.1007/s40747-021-00629-x


Keywords

  • Delay tolerant protocols
  • Data collection protocols
  • Intelligent transport systems


J Am Med Inform Assoc, v.18(3); May–Jun 2011

Data standards for clinical research data collection forms: current status and challenges

Rachel L Richesson

1 Department of Pediatrics, University of South Florida College of Medicine, Tampa, Florida, USA

Prakash Nadkarni

2 Yale University, New Haven, Connecticut, USA

Case report forms (CRFs) are used for structured-data collection in clinical research studies. Existing CRF-related standards encompass structural features of forms and data items, content standards, and specifications for using terminologies. This paper reviews existing standards and discusses their current limitations. Because clinical research is highly protocol-specific, forms-development processes are more easily standardized than is CRF content. Tools that support retrieval and reuse of existing items will enable standards adoption in clinical research applications. Such tools will depend upon formal relationships between items and terminological standards. Future standards adoption will depend upon standardized approaches for bridging generic structural standards and domain-specific content standards. Clinical research informatics can help define tools requirements in terms of workflow support for research activities, reconcile the perspectives of varied clinical research stakeholders, and coordinate standards efforts toward interoperability across healthcare and research data collection.

Introduction

Data collection for clinical research involves gathering variables relevant to research hypotheses. These variables (‘patient parameters,’ ‘data items,’ ‘data elements,’ or ‘questions’) are aggregated into data-collection forms (‘Case Report Forms’ or CRFs) for study implementation. The International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 11179 technical standard 1 defines a data element as ‘a unit of data for which the definition, identification, representation, and permissible values are specified through a set of attributes.’ Such attributes include: the element's internal name, data type, caption presented to users, detailed description, and basic validation information such as range checks or set membership.
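A minimal sketch of such a data element, with the attribute names chosen for illustration rather than taken from the standard, might look like:

```python
from dataclasses import dataclass, field

@dataclass
class DataElement:
    """Sketch of an ISO/IEC 11179-style data element: definition,
    identification, representation, and permissible values carried
    as attributes (names here are illustrative, not normative)."""
    name: str              # internal name
    data_type: str         # e.g. "integer", "text", "date"
    caption: str           # label presented to users
    description: str       # detailed definition
    permissible_values: list = field(default_factory=list)  # empty = free entry
    valid_range: tuple = None                               # basic range check

    def validate(self, value):
        """Apply the element's basic validation rules to a candidate value."""
        if self.permissible_values and value not in self.permissible_values:
            return False
        if self.valid_range is not None:
            low, high = self.valid_range
            return low <= value <= high
        return True

# A systolic-blood-pressure item with a simple range check
sbp = DataElement("SYSBP", "integer", "Systolic BP (mm Hg)",
                  "Systolic blood pressure, sitting", valid_range=(60, 260))
```

Set-membership validation works the same way via `permissible_values`.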

Data element and CRF reuse can reduce study implementation time, and facilitate sharing and analyzability of data aggregated from multiple sources. 2 3 In this paper, we summarize relevant CRFs standards and their limitations, and highlight important unaddressed informatics-standardization challenges in optimizing research processes and facilitating interoperability of research and healthcare data.

Background and significance

CRFs support either primary (real-time) data collection, or secondary recording of data originating elsewhere (eg, the electronic health record (EHR) or paper records). EHR and research data capture differ in that the latter records a subset of patient parameters—the research protocol's variables—in much greater depth and in maximally structured form; narrative text is de-emphasized except to record unanticipated information.

Historically, CRFs were paper-based. While primary electronic data capture (EDC) has steadily increased, 4 paper is still used when EDC is infeasible for logistic or financial reasons. The existence of secondary EDC also influences manual workflow processes related to verification of paper-based primary data, for example, checks for completeness, legibility, and valid codes. The present limbo between paper and EDC complicates standardization efforts.

CRF standards: current activities

Currently, no universal CRF-design standards exist, though conventions and some ‘best’ practices do. 5–9 The Clinical Data Interchange Standards Consortium (CDISC, http://www.cdisc.org ), which focuses primarily on regulated studies, has proposed such standards. However, these proposals, while valuable for general areas such as drug safety, do not address broader issues of clinical research, including observational research, genetic studies, and studies using patient-reported experience as key study endpoints.

Clinical Data Acquisition Standards Harmonization

In response to the Food and Drug Administration (FDA)'s 2004 report, ‘Innovation/Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products,’ a CDISC project, Clinical Data Acquisition Standards Harmonization (CDASH), addresses data collection standards through standardized CRFs. 10 Initial CDASH standards focused on cross-specialty areas such as clinical-trials safety. Disease- or therapeutic-area-specific standards are now being considered, along with tools and process development to facilitate data-element reuse across diseases.

OpenEHR archetypes

The OpenEHR foundation has proposed archetypes 11 12 as a basis for HL7 Clinical Document Architecture templates. 13 Archetypes are agreed-upon specifications that support rigorous computable definitions of clinical concepts. For example, the archetype for Blood Pressure measurement includes type of measure (eg, diastolic, systolic, mean arterial), measurement conditions (eg, activity level, position), body site where measured, time of day when measured, and measurement units. While Logical Observation Identifiers, Names, and Codes (LOINC) covers similar ground, it does so through numerous, unlinked concepts rather than a unified template. Also, measurement aspects idiosyncratic to BP—for example, body position—are handled by incorporation into the concept's Component Name, for example, ‘INTRAVASCULAR SYSTOLIC^SITTING.’ OpenEHR, by contrast, allows the semantic model's structure to vary with the parameter being described.

Clinical researchers have been specifying parameter measurement with precision long before ‘archetypes’ were conceived. For example, to accurately compare two medications for a chronic illness, one must control all the conditions that can influence a parameter's measurement.

Standards for medical subdomains

Domain-specific common data elements (CDEs) are emerging from groups such as the American College of Cardiology, 14 15 the National Cancer Institute (NCI)'s Cancer Bioinformatics Grid (caBIG), 16 17 NIH-Roadmap-Initiative interoperability demonstration projects, 18 the National Institute of Neurological Disorders and Stroke, 19 20 the Consensus Measures for Phenotypes and EXposures (PhenX) project (for clinical phenotyping standards for Genome-wide Association Studies), 21 the Diabetes Data Strategy (Diabe-DS) project, 22 23 and the Health Information Advisory Committee (HITAC) effort. 24 25

Data-element repositories

The NCI's Cancer Data Standards Repository (caDSR) uses the ISO/IEC 11179 design to ‘bank’ CRF questions and answer sets. 26–28 Criticisms of caDSR include lack of curation, redundancy, and the absence of a representation of CRFs. 26 CaDSR's utility will possibly improve as NCI redesigns it with requirements from collaborating organizations. CDISC is using the ISO/IEC 11179 design for CSHARE, a repository for domain-specific research questions and answer sets.

The Agency for Healthcare Research and Quality (AHRQ)-hosted United States Health Information Knowledgebase (USHIK) metadata registry 29 includes artifacts emerging from federal healthcare-standardization task forces, such as information models, data elements, data-collection requirements, functional requirements, system specifications, and supporting documentation. Controlled terminologies such as LOINC and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), also possess certain characteristics of data-element repositories: for example, LOINC encodes questions and answers for many patient-directed surveys and clinical patient assessment instruments.

Harmonization of healthcare and clinical research data standards

Clinical information models such as the Health Level 7 (HL7) Reference Information Model (RIM) use terminologies differently than the research-oriented CDISC Operational Data Model (ODM). 30 While HL7 interoperability depends in part on mapping data elements to concepts in standard terminologies, ODM only cares that a terminology may act as a source for a data element's contents—for example, an element ‘ICD9CM_Code’ is populated with terms from ICD-9-CM, 2010 edition. ODM does not support mapping of data elements themselves (eg, serum total cholesterol, systolic BP) to terminologies. Consequently, although intended to support data interchange, ODM cannot address the mapping problem, where semantically identical data elements may have different names across different systems.

The Biomedical Research Integrated Domain Group (BRIDG) domain-analysis model was developed jointly by FDA, NCI, CDISC, and HL7 to overcome this gap. 31 BRIDG, however, still does not specify use of standard terminologies, and to date only pilot applications have been developed using BRIDG.

Limitations of existing standards and methodology

CDASH best-practice recommendations

While valuable overall, some CDASH recommendations reflect historical paper-based workflows and off-line or non-electronic operations. For example, for parameters that must be computed in real time, the CDASH specifications advocate worksheets that require data-entry staff to use calculators, instead of programming computations directly into electronic CRFs.

CDASH controversially recommends not providing coding dictionaries for adverse events, medications, or medical history to research staff when interviewing patients, supposedly to minimize potential bias. This advice, if followed, risks introducing errors (eg, misspelled drug names owing to faulty patient recall) that can only be resolved by recontacting the patient, whereas online drug-name lists are searchable with algorithms such as Double Metaphone that support spelling-error recovery. 32 Further, online access is almost mandated for adverse-event grading using NCI's Common Terminology Criteria for Adverse Events (CTC AE), where adverse-event severity grades are defined unambiguously to minimize interobserver variation but are far too numerous to commit to human memory.

ISO/IEC 11179 data model

ISO/IEC 11179, used by CaDSR and CSHARE, was originally intended for descriptive-metadata registries. Applying it to the significantly different problem of clinical research has unearthed numerous concerns. While able to capture isolated data elements' semantics reasonably well, ISO/IEC 11179 cannot represent interelement constraints—for example, the sum of the differential white-blood-cell-count components must equal 100, and systolic blood pressure must be greater than diastolic. There is no concept of higher-level element groupings (such as CRFs), of element order within groupings, of calculated elements, or of rules where certain elements are only editable conditional to specific values being entered in previous elements (so-called skip logic).
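The kind of interelement constraint the standard cannot express is straightforward to state in code. The sketch below checks the two examples from the text; the rules and element names are illustrative, not part of any standard API:

```python
def check_form(values):
    """Cross-element checks that per-element metadata cannot express."""
    errors = []
    # The differential white-cell percentages must sum to 100.
    diff = ["NEUT_PCT", "LYMPH_PCT", "MONO_PCT", "EOS_PCT", "BASO_PCT"]
    if all(k in values for k in diff):
        if abs(sum(values[k] for k in diff) - 100) > 0.5:
            errors.append("differential WBC count does not sum to 100")
    # Systolic blood pressure must exceed diastolic.
    if values.get("SYSBP") is not None and values.get("DIABP") is not None:
        if values["SYSBP"] <= values["DIABP"]:
            errors.append("systolic BP must be greater than diastolic BP")
    return errors

ok = check_form({"NEUT_PCT": 60, "LYMPH_PCT": 30, "MONO_PCT": 6,
                 "EOS_PCT": 3, "BASO_PCT": 1, "SYSBP": 120, "DIABP": 80})
bad = check_form({"SYSBP": 80, "DIABP": 120})
```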

The standard has a limited concept of data-element presentation: the relationship between an element and its associated caption and explanation text is modeled as one to one. However, for many research applications, the relationship is actually one to many. For example, in multinational clinical studies, the same CRF may be deployed in different languages. Here, data elements, while having fixed internal names, will have alternative (language-specific) captions and explanations.
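The one-to-many relationship between an element and its presentations can be sketched as follows; the locale keys and field names are hypothetical:

```python
# One data element, many language-specific presentations: the internal
# name is fixed, captions vary per deployment locale.
element = {
    "name": "DOB",
    "data_type": "date",
    "captions": {
        "en": "Date of birth",
        "fr": "Date de naissance",
        "de": "Geburtsdatum",
    },
}

def caption_for(elem, locale, default="en"):
    """Resolve the caption for a locale, falling back to a default."""
    return elem["captions"].get(locale, elem["captions"][default])
```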

Finally, while ISO/IEC 11179 repositories are effectively thesauri, their data model differs radically from the standard concepts–terms–relationships design used for thesauri such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) or the Unified Medical Language System (UMLS). While concepts (with or without associated definitions) are central to thesauri, in ISO/IEC 11179, data elements (equivalent to terms) are central: concepts merely categorize elements, and narrative definitions are (incorrectly) associated with data elements instead of concepts.

The Extended Metadata Registry (XMDR) consortium, http://xmdr.org , aims to extend ISO/IEC 11179 to address terminology issues. This group, however, appears to be inactive—its last publicly posted group meeting was in late 2007—and its impact is uncertain.

CDISC operational data model

The XML-based CDISC operational data model (ODM), a metadata- and data-interchange model, rectifies many limitations of ISO/IEC 11179. It explicitly models CRFs, and addresses the multilanguage issue through a TranslatedText element—a Unicode string plus a language identifier. The ODM also partially addresses calculated parameters and cross-element-based validation. However, it is not comprehensive enough to allow receiving systems to use imported metadata directly for CRF generation.

Computations and validation expressions

Computations and validation expressions need to be expressed in the syntax of a specific programming language. Unless both systems use the same language, manual modifications of the metadata by programmers are required. The ODM accepts this unavoidable limitation. The FormalExpression element used to specify computations must contain a ‘context’ subelement naming the language used—for example, ‘Oracle PL/SQL.’
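To illustrate, here is a simplified, ODM-like XML fragment parsed with the Python standard library. It is not schema-valid ODM — the element and attribute names only approximate the CDISC schema — but it shows a TranslatedText pair and a FormalExpression whose context names the expression language:

```python
import xml.etree.ElementTree as ET

# Illustrative ODM-like fragment (simplified, not schema-valid ODM).
fragment = """
<ItemDef OID="IT.BMI" Name="BMI" DataType="float">
  <Question>
    <TranslatedText lang="en">Body mass index</TranslatedText>
    <TranslatedText lang="de">Body-Mass-Index</TranslatedText>
  </Question>
  <FormalExpression Context="Python">weight_kg / height_m ** 2</FormalExpression>
</ItemDef>
"""

root = ET.fromstring(fragment)
# Language-tagged captions for the same item
texts = {t.get("lang"): t.text for t in root.iter("TranslatedText")}
# The computation, usable only by systems that understand its named context
expr = root.find("FormalExpression")
```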

Script/instructions

For CRFs used as interview questionnaires, cross-interviewer variation is minimized through standardized scripts—sentences spoken verbatim to the subject to elicit the desired information. Instructions provide guidance in CRF usage and data-gathering. Electronically, scripts/instructions are typically displayed on demand during data capture but are hidden during review or modification. ODM lacks both script and instruction definitions.

Validation of text

ODM lacks support for regular-expression validation of text data elements.
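Such pattern-based validation is simple to express outside the model; the subject-identifier format below is hypothetical:

```python
import re

# The kind of pattern check ODM cannot express: constrain a text
# element (a hypothetical subject identifier) to a fixed format.
SUBJECT_ID = re.compile(r"^[A-Z]{2}-\d{4}$")  # e.g. site code + number

def valid_subject_id(value):
    return bool(SUBJECT_ID.match(value))
```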

Interelement dependencies

Interchange models are generally simpler than data-storage models. 33 Metadata essential to storage-model robustness, but irrelevant for interchange purposes, may be omitted from the interchange model. An example of ODM omission is interelement dependencies: an element being validated, computed, or skipped depends on other elements. Dependencies may be complex—for example, calculation of renal function status depends on computation of estimated glomerular filtration rate, which in turn depends on serum creatinine, age, sex, and race.

Dependency checking prevents accidental removal or renaming of independent elements from a CRF, which would cause the CRF to operate incorrectly.
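A dependency map of this kind, and the check that guards removals, might be sketched as follows (element names illustrative):

```python
# Each derived element lists the elements it is computed from
# (the eGFR chain described in the text).
depends_on = {
    "RENAL_STATUS": ["EGFR"],
    "EGFR": ["SERUM_CREAT", "AGE", "SEX", "RACE"],
}

def dependents_of(name):
    """Elements that would break if `name` were removed or renamed."""
    return sorted(d for d, srcs in depends_on.items() if name in srcs)

def can_remove(name):
    """Allow removal only when nothing else depends on the element."""
    return not dependents_of(name)
```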

Other important ODM omissions include ownership and context of use. Practically all CRFs used in autism research, for example, are copyrighted. Copyright information must be part of the CRF definition. Context of use includes documentation about the clinical conditions where the CRF applies and prerequisites for the CRF's use.

CRF standards characterization and status

CRF standards can be conceptualized at several levels: form, group, section, and item. We summarize the areas of agreement and dispute at each level, and also consider aspects of CRF-design processes that impact consistent research data collection.

Form level

Little consensus exists on the choice and content of CRF standardization candidates. Few CRFs can be reused unchanged across all protocols. Even for seemingly common activities such as physical exam and medical history, structured data capture—explicit recording of findings—varies vastly by disease and protocol. For example, gastrointestinal bleeding or hepatic encephalopathy is recorded explicitly in a cirrhosis study, but not in a schizophrenia study.

Within a tightly defined disease domain, standard CRFs seem feasible and useful, though their content may change with future variations in study designs. For example, the venerable Hamilton Depression Rating Scale originated in 1960 as a 17-item questionnaire. 34 Later, some researchers created different subsets, while others incorporated additional questions. 35 Many proposed ‘standard’ CRFs may well meet a similar fate. Long-term content stability may be one measure of CRF-standard success.

The segregation of data items relevant to a research protocol into individual CRFs is often based on considerations other than logical grouping, and may vary with the study design. For example, in a one-time survey, one may well designate a single CRF to capture all items if these are not too numerous. In a longitudinal study, however, items recorded only once at the start of the study are placed in a CRF separate from items that are sampled repeatedly over multiple visits.

One concern about ‘standard’ CRF use is that users should not be pressured to collect parameters defined within the CRF that are not directly related to a given protocol's research objectives: such collection costs resources and violates Good Clinical Practice guidelines. 36 Even instructing research staff to ignore specific parameters constitutes unnecessary information overload: presenting extraneous parameters onscreen is poor interface design. Dynamic CRF-rendering offers one way out of this dilemma: protocol-specific CRF customization allows individual investigators to specify, at design time, the subset of parameters that they consider relevant. Web-application software can read the customization metadata and render only applicable items.

Group level

A group is a set of semantically closely related parameters. For example, a Concomitant Medications group would include the medication name; how recorded (eg, generic or brand name); dosage details—numeric value, units, frequency, and duration; a start date, end date, whether this was a continuation of previous therapy, therapeutic indications, and possibly compliance information.

Other parameter groupings, such as the components of a differential white-blood-cell count or a liver-function panel, occur naturally in medicine. Typically, a group is associated with a single time-stamp that records when the event (eg, a blood draw) related to its parameters occurred, or two time-stamps to record the start and end of events that have a duration (eg, a course of radiotherapy).

Explicit associations between related parameters within the group include skip logic and expressions for calculated elements. Both LOINC and PhenX standards consider groups (‘panels’) as a series of observations. OpenEHR archetypes can also be used as section building-blocks.

Section level

A section encompasses one or more groups. The division of CRFs into sections is often arbitrary. In paper-based data capture, CRFs consisting of a single, giant section are not unknown. For example, the 1989 revision of the Minnesota Multiphasic Personality Inventory for psychiatric assessment has 567 questions. In real-time EDC, by contrast, subdivision into smaller sections is generally preferred, allowing (or requiring) the subject to save data changes before moving to another section. This minimizes the risks of inadvertent data loss due to failure to save, timeouts, or service interruption. Section size is often determined by the number of items that can be presented on a single desktop-computer screen.

The requirement for CRF-content flexibility to deal with disease and protocol variations impacts the involved sections/groups. It is doubtful whether section names/captions should be standardized. The designation of section headings and explanation that serve to describe the section's purpose is, we believe, best left to individual investigators.

Item level

Standardization of items is non-controversial, being the linchpin of semantic interoperability. Survey design and measurement theory provide well-accepted best practices for the design of good items, such as mutually exclusive and exhaustive answer choices, 37 non-leading question text, 7 8 and consistency of scale escalation in answer sets. 6 A review of the literature, including the CDASH recommendations, gives useful general guidance on constructing yes/no questions, scale direction, date/time formats, scope of CRF data collection, prepopulated data, and collection of calculated or derived data. 5–9

All the standards discussed earlier emphasize use of narrative definitions for items. Such definitions need to be made maximally granular—that is, divided into separate fields—because different parts of the definition such as explanatory text, scripts, instructions, and context of use serve different purposes.

Certain items (especially questionnaire-based ones) have a discrete set of permissible values (also called ‘responses’ or ‘answers’). The set elements may be unordered (eg, ‘Yes, No, Don't Know’) or ordered (eg, severity grades such as ‘Absent, Mild, Moderate, Severe’ or Likert scales). One must record whether enumerations are unordered or ordered, because they impact how data based on these items can be queried. Thus, one can ask for patients who had a severity greater than or equal to ‘Moderate,’ but data based on unordered enumerations can only be compared for equality or inequality to a value.
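The distinction can be illustrated by modelling an ordered answer set as an ordered enumeration (names illustrative):

```python
from enum import IntEnum

class Severity(IntEnum):
    """Ordered answer set: the ordering is part of the metadata, so
    range queries like 'at least Moderate' are meaningful."""
    ABSENT = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3

# Unordered sets (e.g. Yes / No / Don't Know) support only (in)equality;
# ordered ones additionally support range comparison.
records = [Severity.MILD, Severity.SEVERE, Severity.MODERATE]
at_least_moderate = [r for r in records if r >= Severity.MODERATE]
```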

CRF development process

The notion of process as vital to quality metrics and outcomes is reinforced through standards such as ISO 9000 38 39 and the health-outcomes research literature. 40–42 While CRF content is necessarily variable, consensus is more readily reached on standards for the explicit processes by which quality data are identified and developed.

The CDASH standards document, ‘Recommended Methodologies for Creating Data Collection Instruments,’ presents important and necessary features of the CRF development process. The techniques described include: adequate and ‘cross-functional’ team review, version control, and documented procedures for design, training, and form updates. The FDA also requires rigor in the development, validation, and use of data elements related to patient-reported outcomes as study endpoints in investigational new drug studies. 43

Future challenges for clinical research informatics

As the field of clinical research informatics matures, it will need to move from a mode of primarily reacting to clinical researchers' needs through service provision, to one of active leadership by suggesting directions for standardization. We now identify several challenges for clinical research informatics related to data element and CRF definition and data capture.

Articulating the data-collection standards needs for all of research

The limited focus of disease-specific consortia makes comprehensive coverage of individual areas more likely. However, it may lead to proliferation of multiple, possibly incompatible, definitions for overlapping subject areas, such as tobacco exposure or dietary history. Similarly, researchers would benefit from a clear understanding of the extensive overlap of various clinical terminologies (eg, SNOMED CT and LOINC, SNOMED CT and RxNorm), as well as advice regarding which standards are appropriate for a particular research context.

CDISC's focus on regulated research leaves many standardization issues unaddressed. An AMIA Clinical Research Informatics group could be well poised to identify the gaps and devise strategies to fill them. They would also be able to address relationships between clinical research data collection standards and EHR specifications, as well as the broad issue of secondary use of clinical data for research. Additional tasks could include the review of standards and their scope, and relating them to needs of clinical research.

Banking of research-data elements

Reuse of standard CRFs and higher-level groupings can be facilitated by publicly available repositories. A greatly extended database counterpart of the CDISC ODM may possibly meet the requirements of the repository data model. Comprehensive documentation of individual items and groupings, as well as links between these and concepts in standard biomedical terminologies, will increase usability and utility. 44 When additionally supported by robust search tools, repositories can serve as educational tools for researchers. As suggested by Brandt et al, 45 item repositories can reduce the burden on new investigators to create their own items, because existing, validated items or sets of items can be reused.

We now discuss some significant unsolved challenges for such repositories.

Modeling of questions

Repositories must distinguish between apparently identical items that have different presentations, and provide detailed recommendations for choosing from these. Consider a questionnaire regarding past history of several clinical conditions (eg, diabetes, myocardial infarction, etc), where the response to each can be ‘Yes,’ ‘No,’ or ‘Don't Know’ (representation 1). A second questionnaire presents the same clinical conditions with check boxes, which can be either checked (Yes) or unchecked (No) (representation 2).

Because both healthcare and research generally require recording unknowns explicitly, CDASH correctly recommends representation 1. However, for paper-based CRFs, if the list of clinical conditions is extensive with most responses expected to be ‘No,’ CDASH recognizes that representation 2 (a series of checkboxes) is significantly more ergonomic, at the risk of introducing some data-capture error for ‘Don't Know’ responses. This risk depends on the patients under study, being lower for highly self-knowledgeable patients. (CDASH, however, does not currently document that if primary EDC is an option, one can use representation 1 and still support good ergonomics. An electronic CRF could present all items during initial data entry with the default ‘No’ preselected, with onscreen instructions to click ‘Yes’ or ‘Don't Know’ as applicable.)
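The defaulting behaviour suggested in the parenthetical remark can be sketched as follows (condition list and answer codes illustrative):

```python
# Every condition starts at the preselected default "No"; staff
# explicitly override only the exceptions ("Yes" / "Don't Know").
CONDITIONS = ["diabetes", "myocardial infarction", "asthma"]

def collect(overrides):
    """Apply explicit overrides on top of the 'No' default."""
    answers = {c: "No" for c in CONDITIONS}
    answers.update(overrides)
    return answers
```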

While ‘best practice’ recommendations clearly depend on the clinical setting, repositories that are intended to guide investigators must also include recommendations and guidance.

Value set or terminology ‘binding’

The linking of repository elements to concepts in clinical terminologies presents several challenges.

Clarity regarding the motivation and strategy for clinical coding

Best-practice approaches for employing controlled terminologies must be defined and documented. For example, while SNOMED CT has a complex concept model, its use can involve a simple approach if the use case supports it. For instance, the Patient Registry Item Specification and Metadata (PRISM) project, which applies SNOMED CT to data elements related to rare-disease registries, employs only certain SNOMED CT hierarchies and does not require post coordination for situational context. 46 Other use cases, particularly those that involve interoperability between disparate systems, could increase mapping complexity and mandate post coordination.

Different ways of encoding the same item

Because many data items contain question and answer components, there are multiple approaches to encoding them. In SNOMED CT, for example, one could use the Concept ID ‘Abnormal Breath Sounds’ (concept ID #301273002) with the qualifiers ‘Present’ (#52101004) or ‘Absent’ (#2667000). Alternatively, the combination of question-plus-answer (‘Abnormal breath sounds=absent’) could be represented by a single SNOMED CT concept, ‘Normal Breath Sounds’ (#48348007).
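The two encodings, and the mapping problem they create, can be sketched with the concept IDs quoted above; the structures and the normalization rule are illustrative, not a terminology API:

```python
# Two ways to encode the same finding.
post_coordinated = {            # question + qualifier
    "concept": ("301273002", "Abnormal breath sounds"),
    "qualifier": ("2667000", "Absent"),
}
pre_coordinated = {             # single combined concept
    "concept": ("48348007", "Normal breath sounds"),
}

def normalized(entry):
    """Collapse either encoding to one canonical assertion.
    The mapping table is hypothetical: real systems need such
    mappings curated for every equivalent encoding pair."""
    if entry.get("concept", ("",))[0] == "48348007":
        return ("301273002", "Absent")
    if entry.get("qualifier", ("",))[0] == "2667000":
        return (entry["concept"][0], "Absent")
    return None
```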

Any standardization effort will need to specify guidelines for consistent use of SNOMED CT, to help eliminate most of the terminology–information model interactions that plague standards implementation in healthcare. 47–49 However, a fully modeled SNOMED CT expression to represent the question and assign its semantic aspects, relying on existing SNOMED CT modeling guidelines, 50 is probably unwarranted.

Clinical research informaticians may need to create a SNOMED CT extension to support fully modeled expressions. While coordination of multiple parallel efforts may become an issue, identifying and comparing modeling, implementation, and coding strategies is a high priority.

Absence of clear standards for psychosocial assessment items

While most data elements related to clinical disease might be expected to match to SNOMED CT concepts exactly, this is not the case for questionnaires that deal with psychiatric/psycho-social areas. For example, the 24-question Center for Epidemiological Studies Depression (CES-D) Scale 51 is widely used for self-rating by patients undergoing cancer therapy. One CES-D question is ‘You were bothered by things that usually don't bother you,’ with the four-level ordinal response set, ‘0, Rarely; <1 day/week’; ‘1, Some of the time (1–2 days/week)’; ‘2, Moderately (3–4 days/week)’ and ‘3, Most/all of the time (5–7 days).’ Responses to all items are summed to yield an overall depression score.
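The scoring rule is a simple sum over per-item ordinal codes; the range check below is illustrative:

```python
def depression_score(responses):
    """Total score over per-item ordinal codes (0-3 per item),
    as scales such as CES-D are scored."""
    if any(not 0 <= r <= 3 for r in responses):
        raise ValueError("each response must be coded 0-3")
    return sum(responses)
```

Precisely this kind of arithmetic over ordered codes is what current terminology-based representations cannot support.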

Trying to fully map either the question or the answers to existing SNOMED CT concepts using post coordination is a formidable challenge, especially given that SNOMED CT's compositional model does not support the NOT operator (SNOMED CT User Guide 2010, appendix B; Negation). Representing CES-D items through pre coordination would require the creation of 24×4=96 new SNOMED CT concepts. While many applications of SNOMED CT aspire to use it as an ontology—a collection of information that supports reasoning—it is doubtful whether either approach would enable useful reasoning, either with the concepts themselves or with data indexed by them (eg, ‘identify patients with a score greater than 2’). SNOMED CT, like most terminologies, currently has no notion of ordered sets or of numeric operations. Similarly, trivial reasoning problems, such as determining that a score of 3 is worse than a score of 2, are impossible with SNOMED CT's current knowledge representation but would be readily addressed with a modestly augmented 11179-based representation.

An alternative standard for representing observations and measures is LOINC, which currently (June 2010) contains 58 967 terms, of which 15 608 are clinical terms, including those from standardized assessment instruments. While CES-D is not currently included, Bakken et al have verified LOINC's suitability for ordinal scales. 52 Hopefully, the LOINC-IHTSDO cooperation will include coordination of content for such items, as well as discourage redundant efforts by independent researchers and consortia.

Data aggregation

Automated or semiautomated facilitation of meta-analysis of multiple data sets by electronic inspection of element definitions is an open problem. Unless element definitions across two or more studies are determined to be semantically identical—same terminology mapping, same data type, units, enumeration—or allow a mathematical transformation into a common grain, sometimes with minor or major information loss, it is not possible to combine elements across studies. To illustrate a worst-case information-loss scenario, if one study measures smoking by number of years smoked currently and in the past, while another measures the same by cigarettes per week, all that the merged data can tell us is whether a given individual is a non-smoker, ex-smoker, or current smoker.
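The smoking example can be sketched as two study-specific encodings that share only a coarse common category (variable names and coding illustrative):

```python
def status_study_a(years_current, years_past):
    """Study A records years smoked currently and in the past,
    so it can distinguish current, ex-, and non-smokers."""
    if years_current > 0:
        return "current smoker"
    return "ex-smoker" if years_past > 0 else "non-smoker"

def status_study_b(cigarettes_per_week):
    """Study B records only current consumption: dose and duration
    detail is lost, and a zero cannot separate ex- from never-smokers
    without further data -- the information loss described in the text."""
    return "current smoker" if cigarettes_per_week > 0 else "non-smoker"
```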

While terminological mappings can facilitate intervariable comparisons in theory, the practical issues with terminological binding discussed above create formidable challenges. Unless non-redundant terminology subsets are created for clinical research, and the capabilities of terminologies significantly enhanced, this problem cannot even begin to be tackled except in straightforward cases.

Data-capture standards can facilitate efficacious development and implementation of new studies, element reuse, data quality and consistent data collection, and interoperability. Because of the protocol-centric nature of clinical research, opportunities for shared standards at levels higher than individual items are relatively limited compared with item-level standards. Nevertheless, disease-specific CRF standardization efforts have helped identify standard pools of data items within focused research and professional communities, and consequently helped achieve research efficiencies within their application areas. It will be interesting to see whether disease-specific efforts such as the NCI CRF standardization initiatives can remain in harmony with evolving national research standards specifications.

Of more immediate and widespread (pan-disease) relevance are standardization efforts toward the development of sound processes and workflow for CRF and CRF section development, as well as data collection and validation. Such development should also emphasize the use of terminologies to facilitate semantic interoperability. As good CRF design principles and community collaboration become best practices in clinical research, the structure and content of individual CRFs/sections can be left reasonably flexible to allow adaptation to individual protocol requirements.

Acknowledgments

We wish to thank our colleagues at the University of South Florida, M Nahm of Duke University for insightful comments on early drafts, and T Patrick of University of Wisconsin-Milwaukee for his stimulating comments on terminology aspects.

Funding: Funding and/or programmatic support for this project was provided by Grant Numbers RR019259-01 and RR019259-02 from the National Center for Research Resources and the National Institute of Neurological Disorders and Stroke, respectively, both National Institutes of Health components, and by the National Institutes of Health Office of Rare Diseases Research.

Competing interests: None.

Provenance and peer review: Not commissioned; externally peer reviewed.

FTC Cracks Down on Mass Data Collectors: A Closer Look at Avast, X-Mode, and InMarket

Three recent FTC enforcement actions reflect a heightened focus on pervasive extraction and mishandling of consumers’ sensitive personal data.

Proposed Settlements with Avast [1], X-Mode [2], and InMarket [3]

In mid February, the FTC announced a proposed settlement to resolve allegations that Avast, a security software company, unfairly sold consumers’ granular and re-identifiable browsing information—information that Avast amassed through its antivirus software and browser extensions after telling consumers that Avast’s software would protect their privacy, and that any disclosure of their browsing information would only be in aggregate and anonymous form.

In January of this year, the FTC announced proposed settlements with two data aggregators, X-Mode Social and InMarket, to resolve a host of allegations stemming from how those companies handled consumers’ location data. Both companies, the FTC alleged, collected precise location data from consumers’ phones through the data aggregators’ own mobile apps and those of third parties (via software development kits, or “SDKs,” provided by the data aggregators). X-Mode, the FTC alleged, sold consumers’ location data to private government contractors without first telling consumers or obtaining consumers’ consent to do so. And InMarket, the agency alleged, used consumers’ location data to sort them into particularized audience segments—like “parents of preschoolers,” “Christian church goers,” “wealthy and not healthy,” etc.—that InMarket then provided to advertisers.

Taken together, these matters reflect several common themes that highlight serious privacy threats imposed on consumers by business models that monetize people’s personal information.

  • Browsing and location data paint an intimate picture of a person’s life, including their religious affiliations, health and medical conditions, financial status, and sexual orientation.

The FTC’s proposed complaint against Avast alleges that a sample of just 100—of the trillions—of data points maintained by Avast showed visits by people to the following websites: an academic paper on a study of symptoms of breast cancer; Sen. Elizabeth Warren’s presidential candidacy announcement; a CLE course on tax exemptions; government jobs in Fort Meade, Maryland with a salary greater than $100,000; a link (then broken) to the mid-point of a FAFSA (financial aid) application; directions on Google Maps from one location to another; a Spanish-language children’s YouTube video; a link to a French dating website, including a unique member ID; and cosplay erotica.

X-Mode, the FTC alleges, ingested more than 10 billion location data points—which the company advertised as being 70% accurate within 20 meters or less—that were linked to timestamps and unique persistent identifiers. Plotting this data on a map reveals each person’s movements, and the unique persistent identifiers make it easy to sync up a person’s movements with information—like each person’s name, email address, etc.—from publicly available sources or other data brokers.
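To make the linkage risk concrete, here is a toy sketch with entirely invented data, showing how a persistent identifier reduces re-identification to a single join against any broker table that maps the identifier to a name:

```python
# Toy illustration (all data invented) of why "anonymous" location pings
# with a unique persistent identifier are easy to re-identify: one join
# against any source that maps the identifier to a person.
pings = [
    {"ad_id": "a1b2-c3d4", "lat": 39.108, "lon": -76.771,
     "ts": "2023-05-01T08:02"},
    {"ad_id": "a1b2-c3d4", "lat": 39.110, "lon": -76.770,
     "ts": "2023-05-01T17:45"},
]
# A hypothetical broker table keyed by the same persistent identifier.
broker_records = {
    "a1b2-c3d4": {"name": "Jane Doe", "email": "jd@example.com"},
}

# Joining on ad_id attaches a name to every movement in the history.
linked = [{**p, **broker_records.get(p["ad_id"], {})} for p in pings]
```

Because the identifier is persistent, every timestamped point in the history inherits the identity at once; no traditional PII needs to be present in the location dataset itself.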

Similarly, the FTC’s proposed complaint against InMarket alleges the company collected the precise geolocation information from 100 million unique devices each year from 2016 to the present, and cross-referenced these location histories with points of interest to identify consumers who had visited particular locations.

Browsing and location data are sensitive. Full stop.  None of the underlying datasets at issue in the FTC’s proposed complaints against Avast, X-Mode, or InMarket are alleged to have contained people’s names, social security numbers, or other traditional standalone elements of personally identifiable information (or “PII”). Indeed, the FTC’s proposed complaint against Avast acknowledges Avast’s use of a proprietary algorithm to find and remove these elements from its users’ browsing data before selling it. What makes the underlying data sensitive springs from the insights they reveal and the ease with which those insights can be attributed to particular people.

Years of research shows that datasets often contain sensitive and personally identifiable information even when they do not contain any traditional standalone elements of PII, [4] and re-identification gets easier every day—especially for datasets with the precision of those at issue in the FTC’s proposed complaints against Avast, X-Mode, and InMarket. Accordingly, the FTC’s proposed orders would require Avast, X-Mode, and InMarket to treat people’s browsing and location information as the sensitive data that it is. These companies, for example, would be subject to bans prohibiting the disclosure or use of browsing (Avast) and location (X-Mode and InMarket) information in various circumstances, and all three companies must establish and maintain robust privacy programs designed to protect their users’ browsing (Avast) and location (X-Mode and InMarket) information.

  • People have no way to object to—let alone control—how their data is collected, retained, used, and disclosed when these practices are hidden from them. 

Avast, the FTC alleges, claimed its browser extensions and antivirus software would “block[] annoying tracking cookies that collect data on your browsing activities” and “[p]rotect your privacy by preventing […] web services from tracking your online activity.” But for years, the FTC alleges, Avast sold the very browsing information they promised to protect—often without any notice to users at all. Where Avast did describe its information practices, the FTC’s proposed complaint alleges Avast deceptively promised that any sharing would be in “anonymous and aggregate” form.

The FTC’s proposed complaint against X-Mode alleges in detail how the company misled people by asserting their location data would be used solely for “ad personalization and location-based analytics”—meaning consumers had no way to know that X-Mode also sold their location data to government contractors for national security purposes.

And as the FTC alleges in the proposed InMarket complaint, users of the company’s “CheckPoints” and “ListEase” apps had no way to know InMarket would collect their precise location information (often multiple times per hour) and combine it with data collected from multiple other sources to build extensive profiles for precise ad targeting because the apps’ consent interfaces only told people their data would be used for the app’s functionality:

  • “Allow CheckPoints to access your location? This allows us to award you extra points for walking into stores” (CheckPoints app on iOS)
  • “Allow Location Permissions to unlock reminders. Get a reminder when you’re in the store so you never forget to grab the items you need!” (ListEase app on Android)

Compounding the problem, the FTC alleges, were X-Mode’s and InMarket’s use of SDKs embedded in other developers’ apps to expand X-Mode’s and InMarket’s reach. When a developer incorporates a company’s code into their app through an SDK, that developer amplifies any privacy risks inherent in the SDK by exposing their app’s users to it. Often, such code may have location and other data tracking capabilities and, because the app developer is not the company that created the SDK, the app developer may not know how their users’ data will ultimately be stored, used, and disclosed. The developer, however, will know if an SDK requires access to location permissions before they add the SDK to their app.

Purpose matters.  Data handling must align with the purposes for which it was collected. Helping people prepare their taxes does not mean tax preparation services can use a person’s information to advertise, sell, or promote products or services. [5] Similarly, offering people a flashlight app does not mean app developers can collect, use, store, and share people’s precise geolocation information. [6] The law and the FTC have long recognized that a need to handle a person’s information to provide them a requested product or service does not mean companies are free to collect, keep, use, or share that person’s information for any other purpose—like marketing, profiling, or background screening.

The FTC alleges that Avast, X-Mode, and InMarket each ignored this basic principle, and the proposed orders seek to hold them to account. Under the proposed orders, for example, Avast will have to pay $16.5 million (which the FTC plans to return to affected consumers), and all three companies will have to comply with substantial limits on how they handle people’s browsing (Avast) and location (X-Mode and InMarket) data going forward—including provisions ensuring that people are able to actually consent to how their data is collected and used.

  • Any safeguards used to maintain people’s privacy are often outstripped by companies’ incentives and abilities to match data to particular people.

The value proposition for many data purchasers is often the same thing that exposes people’s privacy: ever-more granular data, and the insights and inferences such data convey. Companies that sell or license data sometimes include language in their contracts prohibiting recipients from re-identifying the people in the data, or restricting how recipients use the data they buy. But not all contracts contain such prohibitions. Those that do are often still insufficient to maintain consumers’ privacy, even when bolstered by technical safeguards.

As the FTC’s proposed complaint against Avast alleges, some of the company’s underlying contracts did not prohibit data buyers from re-identifying Avast users. Under one such contract, for example, the FTC alleges that an Avast subsidiary granted a company specializing in identity services a “world-wide license” to use Avast users’ browsing information for “targeting, messaging and other data driven marketing activities served to consumers and businesses”—including “ID Syncing Services” and “Data Distribution Services.” And even where Avast’s underlying contracts included a re-identification prohibition, the FTC alleges that recipients were still permitted to match information with Avast users’ browsing data so long as the information was not “personally-identifiable,” and Avast never audited or otherwise confirmed that recipients complied with such prohibitions.

While the FTC’s proposed complaint against X-Mode recognizes that the company included some use restrictions in its contracts, [7] even when paired with technical measures and auditing requirements, such use restrictions may not deter misuse by downstream actors. And at least twice, the FTC alleges, X-Mode sold location data to customers who violated restrictions in their contracts by reselling the data they bought from X-Mode to companies even further downstream.

Companies must do better. Honoring privacy promises and obligations means implementing and adhering to safeguards that actually maintain people’s privacy. Promises and contract clauses are important, but they must be backed up by action. Going forward, the FTC’s proposed orders against Avast, X-Mode, and InMarket seek to ensure these companies comply with the law. In addition to prohibiting Avast, X-Mode, and InMarket from misrepresenting how they handle people’s information—including the extent to which consumers’ browsing (Avast) and location (all three) information is aggregated or anonymized (Avast) or deidentified (X-Mode and InMarket)—the FTC’s proposed orders require these companies to design, implement, maintain, and document safeguards to protect the personal information they handle.

As these actions underscore, the FTC is committed to protecting people from the unlawful collection, retention, use, and disclosure of their information.

  • Browsing and location data are sensitive. Full stop.
  • Purpose matters: Firms do not have free license to market, sell, and monetize people’s information beyond purposes to provide their requested product or service.
  • Companies must do better: Safeguards used to maintain people’s privacy are often outstripped by companies’ incentives and abilities to match data to particular people. Firms should not let business model incentives that focus on the bottom line outweigh the need for meaningful privacy safeguards.

“Across these cases, we have established that businesses by default cannot sell people’s sensitive data or disclose it to third parties for advertising purposes,” Chair Khan emphasized in her statement [8] accompanying the proposed Avast settlement. Collecting, storing, using, and sharing people’s sensitive information without their informed consent violates their privacy, and exposes them to substantial secondary harms like stigma, discrimination, physical violence, and emotional distress. The FTC will not stand for it. The Commission will use all of its tools to continue to protect Americans from abusive data practices and unlawful commercial surveillance. [9]

Thank you to the attorneys who led the investigations, and to all who contributed to this post: Andy Hasty, Noam Kantor, Aaron Alva, Elizabeth Averill, Bhavna Changrani, Simon Fondrie-Teitler, Alex Gaynor, Julia Horwitz, Amritha Jayanti, Nick Jones, Kevin Moriarty, Gorana Neskovic, Stephanie Nguyen, Brian Shull, Ben Swartz, Cathlin Tully, David Walko, Ben Wiseman, and Daniel Zhao. 

[1] FTC Order Will Ban Avast from Selling Browsing Data for Advertising Purposes, Require It to Pay $16.5 Million Over Charges the Firm Sold Browsing Data After Claiming Its Products Would Block Online Tracking (February 22, 2024), available at  https://www.ftc.gov/news-events/news/press-releases/2024/02/ftc-order-will-ban-avast-selling-browsing-data-advertising-purposes-require-it-pay-165-million-over .

[2] FTC Order Prohibits Data Broker X-Mode Social and Outlogic from Selling Sensitive Location Data (January 9, 2024), available at https://www.ftc.gov/news-events/news/press-releases/2024/01/ftc-order-prohibits-data-broker-x-mode-social-outlogic-selling-sensitive-location-data .

[3] FTC Order Will Ban InMarket from Selling Precise Consumer Location Data (January 18, 2024), available at  https://www.ftc.gov/news-events/news/press-releases/2024/01/ftc-order-will-ban-inmarket-selling-precise-consumer-location-data .

[4]  See, e.g., Luc Rocher, Julien M. Hendrickx, and Yves-Alexandre de Montjoye, Estimating The Success of Re-Identifications in Incomplete Datasets Using Generative Models, 10 Nature Commc’ns 3069 (2019), available at  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6650473/ .

[5]  Notice of Penalty Offenses Concerning Misuse of Information Collected in Confidential Contexts (Sept. 18, 2023),  https://www.ftc.gov/system/files/ftc_gov/pdf/NPO-Misuse-Information-Collected-Confidential-Contexts.pdf .

[6]  In the Matter of Goldenshores Technologies, LLC and Erik M. Geidl, FTC File No. 1323087 (2014),  https://www.ftc.gov/legal-library/browse/cases-proceedings/132-3087-goldenshores-technologies-llc-erik-m-geidl-matter .

[7]  For example, purporting to restrict recipients from using “the X-Mode Data (alone or combined with other data) to associate any user, device or individual with any venue that is related to healthcare, addiction, pregnancy or pregnancy termination, or sexual orientation.”

[8]  Statement of Chair Lina M. Khan, Joined by Commissioner Rebecca Kelly Slaughter and Commissioner Alvaro M. Bedoya, In the Matter of Avast Limited Commission File No. 202-3033 (February 21, 2024), available at  https://www.ftc.gov/system/files/ftc_gov/pdf/2024.02.21StatementofChairKhanRegardingAvast.pdf .

[9]   FTC Order Prohibits Data Broker X-Mode Social and Outlogic from Selling Sensitive Location Data (January 9, 2024), available at  https://www.ftc.gov/news-events/news/press-releases/2024/01/ftc-order-prohibits-data-broker-x-mode-social-outlogic-selling-sensitive-location-data .



Published on 11.3.2024 in Vol 13 (2024)

Development of a Multiplatform Tool for the Prevention of Prevalent Mental Health Pathologies in Adults: Protocol for a Randomized Control Trial

Authors of this article:


  • Nadia Ramos 1, 2, PhD
  • Felipe Besoain 3, PhD
  • Natalia Cancino 2, 4, MSc
  • Ismael Gallardo 1, PhD
  • Paula Albornoz 2, MSc
  • Andres Fresno 1, PhD
  • Rosario Spencer 1, PhD
  • Soledad Schott 5, MSc
  • Daniel Núñez 1, PhD
  • Carolina Salgado 2, 6, MD
  • Susana Campos 2, MSc

1 Faculty of Psychology, University of Talca, Talca, Chile

2 Center of Applied Psychology, Faculty of Psychology, University of Talca, Talca, Chile

3 Faculty of Engineering, University of Talca, Talca, Chile

4 Doctorate in Psychology, Faculty of Psychology, University of Talca, Talca, Chile

5 Instituto Nacional de Capacitación, Talca, Chile

6 Medical School, Universidad Catolica del Maule, Talca, Chile

Corresponding Author:

Susana Campos, MSc

Center of Applied Psychology

Faculty of Psychology

University of Talca

1 Poniente 1141

Talca, 3460000

Phone: 56 71 2 201783

Email: [email protected]

Background: The prevalence of depression and anxiety has increased in recent years, with many individuals having trouble accessing mental health support. Smartphones have become an integral part of modern life, with apps offering new ways to deliver evidence-based self-help strategies to cope with common mental health symptoms. However, most of them do not have empirical evidence of their overall effectiveness or the effectiveness of their components, which could pose a risk for users.

Objective: The aim of this study is to evaluate the effectiveness of the modules of evaluation, psychoeducation, and emotional regulation strategies in a multiplatform self-help mental health mobile app in the Maule region of Chile.

Methods: A sample of 196 adults will be selected, who will be randomly assigned to different components of the app for a fixed period to assess its ability to reduce symptomatology.

Results: The trial is not yet recruiting and is expected to end in October 2024. The first results are expected in April 2024.

Conclusions: This is the first study in Chile to develop and test the effectiveness of a mobile app to manage anxiety and depression symptoms in adults. The intervention proposed is based on evidence suggesting that the internet or remote intervention tools and self-management of prevalent symptomatology could be the future of mental health care systems in the digital era. If the effects of the intervention are positive, wide implementation in Chile and other Spanish-speaking countries could be possible in the future.

International Registered Report Identifier (IRRID): PRR1-10.2196/52324

Introduction

Background and Rationale

Mental health disorders are common in the general population. In Chile, depression and anxiety are the most frequently observed psychopathologies, with a prevalence of 5% for depression and 6.5% for any form of anxiety [ 1 ]. However, the population affected by depression or anxiety symptoms may be greater, with 49.2% of Chileans reporting some level of depression symptoms [ 2 ]. These prevalence rates have become a significant public health issue: waitlists in Chile's public mental health services are largely stagnant [ 3 ] and have grown extensively over the past 2 years due to the COVID-19 pandemic. Thus, it becomes necessary to find alternatives that allow providers to respond to this need for mental health access [ 4 ].

In this context, the use of remote self-guided psychological interventions, such as those found in mental health mobile apps, has gained popularity [ 5 ]. It has been reported that this remote modality improves access, reduces waiting times for appointments, saves time, and reduces the effect of attitudinal barriers in patients seeking psychological support [ 6 ]. Studies indicate a high effectiveness rate of web-based care, with smartphone-based psychological interventions reducing user anxiety [ 7 ] and depression [ 8 , 9 ]. Accordingly, there has been an increase in the number of mental health mobile apps available to users. However, many of them lack sufficient theoretical support for their effectiveness. In 2015, a total of 447 apps related to cognitive behavioral therapy (CBT) were found to be available in app stores [ 10 ]. In contrast, only 9 studies regarding the use of smartphones in cognitive-behavioral interventions were found, of which only 2 had readily available apps in the market [ 10 ]. Moreover, 10% of the apps claiming cognitive behavioral interventions as their theoretical basis were not evidence-based [ 11 ]. It has been suggested that, without empirical support, these apps can be potentially harmful to users [ 12 ].

Among the small portion of mobile apps that have empirical support for their tools, 3 specific components or modules have the highest rates of applicability and effectiveness for reducing depression and anxiety: assessment, education, and self-regulation digital tools [ 13 ]. The assessment component allows for the recognition of the user's current symptomatic state, while constant evaluation supports monitoring and the development of awareness of the process of change in symptomatology [ 14 ]. The psychoeducation process allows users to be informed about what they are experiencing. Although psychoeducation alone is an effective strategy for improving some anxiety and depression symptoms [ 15 , 16 ], its effectiveness increases when it is accompanied by other strategies [ 17 ]. The final component, the use of self-regulation tools, offers a series of brief, easy-to-use self-guided activities that attempt to regulate, express, or otherwise manage psychological distress, low mood, and symptoms of anxiety.

The presentation of these modules normally follows the theoretical framework of CBT. This treatment model can be easily adapted to non–face-to-face environments and has been repeatedly tested in web-based modalities [ 18 , 19 ]. Moreover, other approaches theoretically derived from CBT, such as behavioral activation therapy or mindfulness-based behavioral therapy, also exhibit evidence of their effectiveness in non–face-to-face settings [ 20 ]. The behavioral activation treatment for depression attempts to regulate depression symptoms through the programming of pleasurable activities [ 21 ], while mindfulness-based CBT focuses primarily on breathing techniques, relaxation, self-compassion, and mindfulness abilities [ 22 ], aimed at regulating anxiety states.

According to the literature, there are no mobile apps with scientific support to address depression and anxiety symptoms in the adult population in Chile. The regular and widespread use of such apps could help provide self-care tools to individuals who are reticent about therapy.

The main aim of this study is to develop a multiplatform self-help mental health mobile app for adults and to test the effectiveness of its assessment, psychoeducation, and emotional regulation strategy modules post intervention and at 1-month follow-up. We hypothesize that the combination of psychoeducation and any self-regulatory strategies will prove more effective than the evaluation and psychoeducation components alone in decreasing symptomatology.

Trial Design

This is a protocol for a double-blind, 4-armed randomized controlled trial (ClinicalTrials.gov NCT05997849) evaluating changes in primary and secondary outcomes (symptomatology and well-being variables) post intervention and at follow-up. The 4 arms are as follows:

  • Group 1 (control): access only to the monitoring and psychoeducation modules for 30 days.
  • Group 2: access to the monitoring and psychoeducation modules plus mindfulness strategies for 30 days.
  • Group 3: access to the monitoring and psychoeducation modules plus behavioral activation strategies for 30 days.
  • Group 4: access to the monitoring and psychoeducation modules plus cognitive strategies for 30 days.

Additionally, 1 focus group per condition will be carried out to qualitatively assess the user experience with the app and its overall usability.

Study Setting

Participants will be mixed-sex adults (18 years of age and older) in the Maule Region of Chile. We expect to recruit 49 participants per arm, for 196 participants in total.

Eligibility Criteria

Adult Chilean citizens (18 years of age or older) with access to a computer, tablet, or smartphone (Android or iOS) with internet access, and with no untreated mental health diagnosis, will be included in the study. Mental health diagnoses and current treatment will be assessed solely by participant self-report.

Individuals who self-report substance abuse problems or any current serious mental health disorder and participants reporting scores greater than 1 on question 9 (suicidal ideation) on the Patient Health Questionnaire–9 (PHQ-9) will be excluded from the study.
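The stated inclusion and exclusion rules can be summarized as a single screening check. The sketch below uses illustrative field names rather than the study's actual forms, and assumes all values are self-reported:

```python
# Hedged sketch of the protocol's eligibility rules; argument names are
# illustrative, not taken from the study's instruments.
def eligible(age: int, has_device: bool, untreated_diagnosis: bool,
             substance_abuse: bool, serious_disorder: bool,
             phq9_item9: int) -> bool:
    # Inclusion: adult with an internet-connected device.
    if age < 18 or not has_device:
        return False
    # Exclusion: untreated diagnosis, substance abuse problems, or any
    # current serious mental health disorder (all self-reported).
    if untreated_diagnosis or substance_abuse or serious_disorder:
        return False
    # Exclusion: score greater than 1 on PHQ-9 question 9
    # (suicidal ideation).
    return phq9_item9 <= 1
```

Encoding the criteria this way makes the screening reproducible and keeps the cutoff for PHQ-9 item 9 explicit.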

Ethical Considerations

This study and all associated documents were approved by the Scientific Ethics Committee of the University of Talca (14/2022). Informed consent will be obtained from all participants through web-based forms before the allocation process begins. The collected data are completely deidentified and stored in secure servers (see Data Management section). Participants will not receive any type of compensation for their participation in the study.

Recruitment and Informed Consent

For recruitment, invitations will be made through the social networks of the Center for Applied Psychology and the Faculty of Psychology of the University of Talca. In addition, the educational and health services of 4 districts in the region will be contacted to disseminate the multiplatform application Cuidandome to adults. The call for participation will include an explanatory video and an informed consent form, which will be available on and collected through the project website (under development). Participants will be recruited through convenience sampling and randomly assigned to 1 of the 4 groups by simple randomization (using the PHP [Hypertext Preprocessor] rand() function).

Interventions

Description of Interventions

Cuidandome is a multiplatform self-help mental health app that does not constitute a psychological treatment. The mobile app will be available for iOS and Android operating systems, and it is composed of 3 modules: monitoring, psychoeducation, and strategies for people to learn to manage their emotional states and their depression or anxiety symptoms.

The modules of the app are given below.

Monitoring (M) Module

The aim of this assessment module is to collect information on participants' symptoms and well-being while they use the app, and on their satisfaction with the strategies. This cross-cutting component comprises a weekly assessment and a brief daily mood assessment so that each person can monitor their own mood.

Psychoeducation (P) Module

The aim of this module is to provide information to the user about different aspects of mental health care, understanding depression and anxiety symptoms, and the fundamentals of the app and how to get the most out of it. Several studies have shown the relevance of psychoeducation in mental health care.

Regulation Strategies (RS) Module

This module is based on cognitive behavioral change techniques and strategies derived from the basic principles of cognitive behavioral therapy (CBT) for the treatment of depression and generalized anxiety, behavioral activation therapy for depression, and mindfulness strategies for managing anxiety and depression. The module is organized into 3 types of strategies: cognitive (strategies to help modify cognitions that cause discomfort), behavioral activation (strategies focused on modifying behavior that causes discomfort), and mindfulness (strategies to help stay in the present moment and decrease activation of the sympathetic nervous system). Each strategy will be developed sequentially through 3 tasks: understanding why the strategy works, learning how it should be done, and finally putting it into practice. Figure 1 shows an overview of the dashboards.


Criteria for Discontinuing or Modifying Allocated Interventions

Participants in any group may withdraw from the study at any time without consequences; in that case, their information and collected data will not be analyzed.

Strategies to Improve Adherence to Interventions

No group will be monitored for adherence, as the app is essentially self-guided. However, the app sends daily messages reminding participants to access the app and complete at least 1 activity.

Primary Outcomes

Anxiety symptoms are 1 of the 2 primary outcomes and will be measured by the Generalized Anxiety Disorder–7 Scale (GAD-7). It consists of 7 items that the person rates on a scale of 0-3 according to the frequency with which each symptom has disturbed them during the past 2 weeks [ 23 ]. It shows high reliability in Chilean samples (Cronbach α=0.86 [ 24 ]).

Depression symptoms will be measured by the PHQ-9 depression scale, which consists of 9 items that evaluate the presence of depression symptoms present in the past 2 weeks, with a response scale of 0=never, 1=some days, 2=more than half the days, and 3=almost every day. With a Cronbach α of 0.83, its reliability is high in Chilean samples [ 25 ]. Additionally, depression and anxiety symptoms in the past week will be assessed through the Patient Health Questionnaire–4 (PHQ-4) [ 26 ].
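Both the GAD-7 and the PHQ-9 are scored the same way: each item response on the 0-3 frequency scale is summed into a total. A minimal sketch of this scoring rule (the function name and example responses are illustrative, not from the study):

```python
# Minimal scoring sketch for sum-scored scales such as the GAD-7 (7 items,
# total range 0-21) and the PHQ-9 (9 items, total range 0-27).
def scale_score(item_responses):
    """Sum item responses, each rated 0-3 by the participant."""
    if not all(0 <= r <= 3 for r in item_responses):
        raise ValueError("each response must be on the 0-3 scale")
    return sum(item_responses)

phq9_responses = [1, 2, 0, 3, 1, 0, 2, 1, 0]  # nine hypothetical PHQ-9 items
print(scale_score(phq9_responses))             # 10
```

In this scheme, item 9 of the PHQ-9 (suicidal ideation) also feeds the exclusion rule stated in the eligibility criteria.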

Primary and secondary outcomes will be assessed at 3 time points in all groups: preintervention (baseline), post intervention (at 30 days, immediately after the intervention period), and 1-month follow-up.

Secondary Outcomes

Perceived psychological and overall well-being is measured by the Pemberton Happiness Index (PHI), a 21-item instrument comprising 2 subscales: experienced well-being (5 positive and 5 negative experiences) and remembered (subjective) well-being. The scale is Likert-type, with scores per item from 0 to 10 [ 27 ].

Perceived resilience is measured by the Brief Resilience Scale (BRS), a 6-item instrument on a Likert scale from 1 to 5. It has 3 reverse-scored items, and the total rating is the average of the item scores [ 28 ].

Ruminative Thoughts

Rumination and ruminative thoughts are assessed by the Ruminative Response Scale (RRS, short version), consisting of 10 items that measure ruminative thoughts in 2 dimensions: reflection and brooding. It has a high level of internal consistency (Cronbach α=0.85). Each item is scored on a 4-point Likert scale from 1 (almost never) to 4 (almost always) [ 29 ].

Emotional Regulation

Emotional regulation is assessed by the Emotional Regulation Questionnaire (ERQ), a 10-item questionnaire designed to measure respondents' tendency to regulate their emotions in 2 ways: (1) cognitive reappraisal (items 1, 3, 5, 7, 8, and 10) and (2) expressive suppression (items 2, 4, 6, and 9) [ 30 ].

Mindfulness Skills

Mindfulness skills are assessed with the Mindful Attention Awareness Scale (MAAS), an instrument composed of 14 items on a Likert scale from 1 to 6. The item scores are summed; the higher the score, the greater the ability to pay full, conscious attention. Nonclinical participants average 65 points out of a total of 84 [ 31 ].

Participant Timeline

See Table 1 for the participant timeline.

Sample Size

To achieve the aims of the study, we have considered 4 arms. To obtain a statistical power of 0.80, a minimum sample size of 36 participants is required for each arm. Allowing for a dropout rate of 26%, as reported in the literature on behavioral experiments with mobile apps for mood- and anxiety-related symptoms [ 32 , 33 ], a final sample size of 49 participants per group is required. The sample size approximation was based on the literature on statistical power and sample sizes in clinical and social science research [ 34 ], while the inflation to offset expected dropout in the experimental conditions was calculated based on Wang and Ji [ 33 ]. Thus, an estimate of 196 participants satisfies these parameters. In addition to the quantitative data collection, qualitative information will be gathered through focus groups: 4 in total, 1 per study group. Following recommendations in the literature [ 35 ], the focus groups are estimated at 32 participants in total (an estimate of the saturation point), that is, 8 participants per focus group.
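The dropout adjustment described above reduces to one line of arithmetic: the 36 participants required per arm are inflated by the expected 26% dropout, giving ceil(36 / 0.74) = 49 per arm and 49 × 4 = 196 overall. A quick sketch of the calculation (the function name is illustrative):

```python
import math

def adjust_for_dropout(n_required, dropout_rate):
    """Smallest enrollment per arm so n_required participants remain after dropout."""
    return math.ceil(n_required / (1 - dropout_rate))

per_arm = adjust_for_dropout(36, 0.26)
print(per_arm, 4 * per_arm)  # 49 196
```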

Recruitment

The strategies for achieving adequate participant enrollment to reach the target sample size will include contacting and presenting the study to municipality authorities, who are expected to help establish contact with public health authorities. Interested individuals can sign up to participate through the QR code available in pamphlets delivered to primary health clinics in the Maule area. Health staff in the selected public clinics will also encourage patients to participate in the study.

Assignment of Interventions: Allocation

Sequence Generation

Participants will be randomly assigned to any group with a 1:1 allocation through a website. The web system will generate randomization based on the rand() function of the PHP language. Through this function, a random number between 0 and 3 is generated and subsequently assigned to the participants upon recruitment.
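The protocol specifies PHP's rand() on the server side; as a rough equivalent (sketched here in Python, with illustrative group labels), simple randomization draws one integer from 0 to 3 per participant:

```python
# Sketch of simple (unrestricted) randomization as described above, using
# Python's random module in place of PHP's rand(); labels are illustrative.
import random

GROUPS = {
    0: "group 1: monitoring + psychoeducation (control)",
    1: "group 2: + mindfulness strategies",
    2: "group 3: + behavioral activation strategies",
    3: "group 4: + cognitive strategies",
}

def allocate():
    """Draw a random integer 0-3 (randint is inclusive on both ends)."""
    return random.randint(0, 3)

print(GROUPS[allocate()])
```

Note that simple randomization does not guarantee equal arm sizes; block randomization would, but the protocol specifies the simpler scheme.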

Concealment Mechanism

After the randomization and allocation, participants will not receive information regarding the group they belong to. Instead, they will be instructed to access the app and complete the emotional regulation activities, regardless of the type. The participants will only have access to a version of the app that includes the strategies that correspond with their assigned group. Additionally, this information will not be disclosed to the assessment research team (outcome evaluators) to keep the participant allocation blind.

Allocation Implementation

As described under Sequence Generation, the web system will assign each participant, upon recruitment, a random number between 0 and 3 generated by the PHP rand() function. Depending on the number assigned, a QR code will be sent to the participant, which will allow them to download the version of the app containing the strategies pertinent to their study group.

Assignment of Interventions: Blinding

Who Will Be Blinded

This is a double-blind trial: both the participants and the research team will be blinded. A data analyst will work with the final data set, in which the group condition will be masked.

Procedure of Unblinding If Needed

Unblinding will not occur in this study.

Data Collection and Management

Plans for Assessment and Collection of Outcomes

Self-report questionnaires assessing primary and secondary outcomes will be administered at baseline (immediately before intervention), post intervention, and follow-up. Primary outcomes will be assessed through the website and the app, and secondary outcomes will be measured only on the website.

Plans to Promote Participant Retention and Complete Follow-Up

Participants will receive extensive information about the study setup and requirements during recruitment. This information will stress the importance of completing the follow-up assessments.

Data Management

After the participants have completed the web-based questionnaires on both the website and the app, we will enter the confidential data into a secure platform without identifying information, as each participant will be assigned an encrypted ID number. Only the lead investigator, the research assistants in charge of partial data entry, and the statistician will have access to the complete database. Everyone with access to the data set will be required to sign a confidentiality agreement committing not to reveal identifying information.

Statistical Methods

Statistical Methods for Primary and Secondary Outcomes

We will use descriptive statistics to assess balance across groups at baseline. The primary between-group analysis will be carried out on an intention-to-treat basis for GAD-7 and PHQ-9 scores. We will use linear mixed model for repeated measures (MMRM) analyses to compare the intervention and control groups on the change in all outcome measures from baseline to post intervention and at follow-up. All models will be adjusted for sex, age, educational level, and concurrent psychological or psychiatric treatment. Analyses will be carried out with SPSS 26 (IBM Corp) and Stata 15.1 (StataCorp).

Statistical Methods for Additional Analyses

Secondary analyses will mainly explore associations among the outcomes. To this end, correlation analyses will be carried out between symptomatic and other psychological variables. Multiple and logistic regression analyses will be carried out among all variables, particularly between mindfulness skills and anxiety symptoms, rumination and depression symptoms, and gender and symptomatology variables. Lastly, group comparisons by gender and by hours of app use will also be performed. These analyses will be carried out with SPSS 26.

Interim Analyses

There will be no interim analyses because the data will be analyzed at the end of the trial.

Methods in Analysis to Handle Protocol Nonadherence and Any Statistical Methods to Handle Missing Data

Primary outcomes will be assessed using an intention-to-treat analysis. Missing data will be minimized by (1) preventing participants from advancing in the app until a questionnaire is completed in full and (2) taking appropriate measures to encourage participants to fill out the follow-up questionnaires. Multiple imputation will be used to handle any remaining missing data in the analyses.

Plans to Give Access to the Full Protocol, Participant-Level Data, and Statistical Code

The data set produced during the study will be available upon reasonable request from the lead researcher NR.

Oversight and Monitoring

Composition of the Coordinating Center and Trial Steering Committee

Day-to-day support and supervision of the trial will be provided by the lead investigator. Additionally, the study coordinator helps with trial registration and will coordinate and oversee the study visits and reports while organizing data collection and assuring data quality. The data analyst will handle the database and carry out all primary and secondary analyses once all data are collected. Lastly, the app developer will design and implement all content of the app, ensuring the proper functioning of fundamental features such as the randomization of participants, completeness of data collected, automatic creation of the database, etc. The main study team will meet weekly for the duration of the study and monthly with a secondary group of expert collaborators. There is no trial steering committee, stakeholder, or public involvement group. The Ethical Scientific Committee of the Universidad de Talca will check the completeness of the investigation.

Composition of the Data Monitoring Committee, Its Role, and Reporting Structure

A monitor from the Ethical Scientific Committee of the Universidad de Talca will check once a year the presence and completeness of the investigation. This committee is independent of the sponsor and has no competing interests. For further details, please contact [email protected].

Adverse Event Reporting and Harms

The intervention is not expected to cause any harm to the participants. However, if participants experience emotional distress while using the app, they can contact the lead investigator at any time, who can provide referral options to other available public mental health services. This procedure is explained in detail in the informed consent and initial training videos.

Frequency and Plans for Auditing Trial Conduct

A monitor from the Ethical Scientific Committee will check annually the presence and completeness of the investigation files, such as informed consent, inclusion and exclusion criteria, and data collection and storage.

Plans for Communicating Important Protocol Amendments to Relevant Parties (eg, Trial Participants and Ethical Committees)

All substantial amendments will be notified to the Ethics Committee of the Universidad de Talca. In case amendments concern or affect participants in any way, they will be informed about the changes. If needed, additional consent will be requested and registered. Additionally, web-based trial registries will be updated accordingly.

Dissemination Plans

All results of this research will be published in full in international peer-reviewed journals. Executive summaries of the results will be given to government authorities and public entities acting as stakeholders. Lastly, preliminary relevant results will be presented at an international internet-based mental health seminar organized by the research team, to take place in mid-2024. The full protocol will be made publicly available on the original registry's website when the study is completed. Publications resulting from this protocol will consider the principal investigators and any participating analyst as authors.

The trial is not yet recruiting and is expected to end in October 2024. The first results are expected in late April 2024.

This study aims to develop a multiplatform self-help mental health mobile app for adults and test the effectiveness of its modules post intervention and at 1-month follow-up. This is the first study in Chile to develop and test the effectiveness of a mobile app to manage anxiety and depression symptoms in adults. The intervention proposed is based on evidence suggesting that the internet or remote intervention tools and self-management of prevalent symptomatology could be the future of mental health care systems in the digital era. If the effects of the intervention are positive, wide implementation in Chile and other Spanish-speaking countries could be possible in the future.

However, this project faces some challenges regarding feasibility and adoption of the intervention, where dropout rates could potentially hinder the results of the experimental procedure, even when accounted for [ 36 ]. Additionally, probable limitations are the exclusion of part of the clinical population from the study sample (ie, individuals not currently in treatment or at suicide risk) and the difficulty of controlling the effects of therapy on the primary outcomes in individuals simultaneously undergoing psychotherapy.

Implementing this research protocol, we expect to provide evidence for the effectiveness of a mental health mobile app in reducing depression and anxiety symptoms in Chilean adults. Moreover, we expect to provide evidence on the associations of these symptom variables and other risk or protective factors for mental health in adults in the context of digital platforms and interventions for health. The potential clinical and research implications derived from this study could help the mental health and well-being of the Chilean population significantly.

Acknowledgments

Funding for this study was granted by Gobierno Regional del Maule, through Fondo de Innovación para la Competitividad initiative number 40027682. NC received sponsorship from Agencia Nacional de Investigación y Desarrollo–Subdirección de Capital Humano/Doctorado Nacional/2022/21221538.

Data Availability

The data sets generated and analyzed during this study will be available from the corresponding author upon reasonable request.

Authors' Contributions

NR conceived, designed, and applied for funding for the study. FB designed the mobile technology. NC contributed to the methodological design and reviewed the app content. IG contributed to the methodological design. PA created the media and content for the app and coordinated the project. AF, RS, SS, DN, and CS acted as theoretical and methodological advisors and reviewed app content. SC created content for the mobile app and wrote the protocol. Funder and committees did not have a role in study design, data management, or other areas.

Conflicts of Interest

None declared.

  • The burden of mental disorders in the region of the Americas. Pan American Health Organization. 2018. URL: https://iris.paho.org/handle/10665.2/49578?show=full [accessed 2023-08-25]
  • Hojman D, Krauze M, Llaupi M, Rojas G, Vergés Á. Resultados Primera Ola Estudio Longitudinal Social de Chile (ELSOC). Módulo 6: salud y bienestar. Salud mental en el Chile de hoy. Notas COES de política pública no.15. Centro de Estudios de Conflicto y Cohesión Social—COES. Santiago, Chile. COES; 2018. URL: https://coes.cl/publicaciones/no15-elsoc-salud-mental-en-el-chile-de-hoy/ [accessed 2024-02-07]
  • Gómez L, Núñez A. Vigilancia del Acceso a la salud en chile: Un sistema de indicadores Para Monitoreo multidimensional. [A system aimed at monitoring healthcare access in Chile]. Rev Med Chil. 2021;149(1):62-75. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Dos Santos Costa AC, Menon V, Phadke R, Dapke K, Miranda AV, Ahmad S, et al. Mental health in the post COVID-19 era: future perspectives. Einstein (Sao Paulo). 2022;20:eCE6760. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Torous J, Myrick KJ, Rauseo-Ricupero N, Firth J. Digital mental health and COVID-19: using technology today to accelerate the curve on access and quality tomorrow. JMIR Ment Health. 2020;7(3):e18848. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Miralles I, Granell C, Díaz-Sanahuja L, Van Woensel W, Bretón-López J, Mira A, et al. Smartphone apps for the treatment of mental disorders: systematic review. JMIR Mhealth Uhealth. 2020;8(4):e14897. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Firth J, Torous J, Nicholas J, Carney R, Rosenbaum S, Sarris J. Can smartphone mental health interventions reduce symptoms of anxiety? A meta-analysis of randomized controlled trials. J Affect Disord. 2017;218:15-22. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bert F, Giacometti M, Gualano MR, Siliquini R. Smartphones and health promotion: a review of the evidence. J Med Syst. 2014;38(1):9995. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Watts S, Mackenzie A, Thomas C, Griskaitis A, Mewton L, Williams A, et al. CBT for depression: a pilot RCT comparing mobile phone vs. computer. BMC Psychiatry. 2013;13:49. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Torous J, Levin ME, Ahern DK, Oser ML. Cognitive behavioral mobile applications: clinical studies, marketplace overview, and research agenda. Cogn Behav Pract. 2017;24(2):215-225. [ CrossRef ]
  • Stawarz K, Preist C, Tallon D, Wiles N, Coyle D. User experience of cognitive behavioral therapy apps for depression: an analysis of app functionality and user reviews. J Med Internet Res. 2018;20(6):e10120. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Baumel A, Torous J, Edan S, Kane JM. There is a non-evidence-based app for that: a systematic review and mixed methods analysis of depression- and anxiety-related apps that incorporate unrecognized techniques. J Affect Disord. 2020;273:410-421. [ CrossRef ] [ Medline ]
  • Lagan S, D'Mello R, Vaidyam A, Bilden R, Torous J. Assessing mental health apps marketplaces with objective metrics from 29,190 data points from 278 apps. Acta Psychiatr Scand. 2021;144(2):201-210. [ CrossRef ] [ Medline ]
  • Ritterband LM, Thorndike FP, Cox DJ, Kovatchev BP, Gonder-Frederick LA. A behavior change model for internet interventions. Ann Behav Med. 2009;38(1):18-27. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Carlbring P, Hägglund M, Luthström A, Dahlin M, Kadowaki Å, Vernmark K, et al. Internet-based behavioral activation and acceptance-based treatment for depression: a randomized controlled trial. J Affect Disord. 2013;148(2-3):331-337. [ CrossRef ] [ Medline ]
  • Qu C, Sas C, Roquet CD, Doherty G. Functionality of top-rated mobile apps for depression: systematic search and evaluation. JMIR Ment Health. 2020;7(1):e15321. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Shen N, Levitan MJ, Johnson A, Bender JL, Hamilton-Page M, Jadad AAR, et al. Finding a depression app: a review and content analysis of the depression app marketplace. JMIR Mhealth Uhealth. 2015;3(1):e16. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Andersson G, Hesser H, Veilord A, Svedling L, Andersson F, Sleman O, et al. Randomised controlled non-inferiority trial with 3-year follow-up of internet-delivered versus face-to-face group cognitive behavioural therapy for depression. J Affect Disord. 2013;151(3):986-994. [ CrossRef ] [ Medline ]
  • Andrews G, Basu A, Cuijpers P, Craske MG, McEvoy P, English CL, et al. Computer therapy for the anxiety and depression disorders is effective, acceptable and practical health care: an updated meta-analysis. J Anxiety Disord. 2018;55:70-78. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Mancinelli E, Dell'Arciprete G, Pattarozzi D, Gabrielli S, Salcuni S. Digital behavioral activation interventions during the perinatal period: scoping review. JMIR Pediatr Parent. 2023;6:e40937. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bianchi J, Henao Á. Activación conductual y depresión: conceptualización, evidencia y aplicaciones en Iberoamérica. [Behavioral activation and depression: conceptualization, evidence and applications in Latin America]. Ter Psicol. 2015;33(2):69-80. [ FREE Full text ] [ CrossRef ]
  • Schanche E, Vøllestad J, Visted E, Svendsen JL, Osnes B, Binder PE, et al. The effects of mindfulness-based cognitive therapy on risk and protective factors of depressive relapse—a randomized wait-list controlled trial. BMC Psychol. 2020;8(1):57. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Spitzer RL, Kroenke K, Williams JBW, Löwe B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch Intern Med. 2006;166(10):1092-1097. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Crockett MA, Martínez V, Ordóñez-Carrasco JL. Propiedades psicométricas de la escala Generalized Anxiety Disorder 7-Item (GAD-7) en una muestra comunitaria de adolescentes en Chile. Rev Med Chil. 2022;150(4):458-464. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Baader TM, Molina JLF, Venezian SB, Rojas CC, Farías RS, Fierro-Freixenet C, et al. Validación y utilidad de la encuesta PHQ-9 (Patient Health Questionnaire) en el diagnóstico de depresión en pacientes usuarios de atención primaria en Chile. Rev Chil Neuro Psiquiatr. 2012;50(1):10-22. [ FREE Full text ] [ CrossRef ]
  • Kroenke K, Spitzer RL, Williams JBW, Löwe B. An ultra-brief screening scale for anxiety and depression: the PHQ-4. Psychosomatics. 2009;50(6):613-621. [ CrossRef ] [ Medline ]
  • Paiva BSR, de Camargos MG, Demarzo MMP, Hervás G, Vázquez C, Paiva CE. The Pemberton Happiness Index: validation of the universal Portuguese version in a large Brazilian sample. Medicine (Baltimore). 2016;95(38):e4915. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Smith BW, Dalen J, Wiggins K, Tooley E, Christopher P, Bernard J. The brief resilience scale: assessing the ability to bounce back. Int J Behav Med. 2008;15(3):194-200. [ CrossRef ] [ Medline ]
  • Treynor W, Gonzalez R, Nolen-Hoeksema S. Rumination reconsidered: a psychometric analysis. Cognit Ther Res. 2003;27:247-259. [ CrossRef ]
  • Gross JJ, John OP. Individual differences in two emotion regulation processes: implications for affect, relationships, and well-being. J Pers Soc Psychol. 2003;85(2):348-362. [ CrossRef ] [ Medline ]
  • Brown KW, Ryan RM. The benefits of being present: mindfulness and its role in psychological well-being. J Pers Soc Psychol. 2003;84(4):822-848. [ CrossRef ] [ Medline ]
  • Torous J, Andersson G, Bertagnoli A, Christensen H, Cuijpers P, Firth J, et al. Towards a consensus around standards for smartphone apps and digital mental health. World Psychiatry. 2019;18(1):97-98. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Wang X, Ji X. Sample size estimation in clinical research: from randomized controlled trials to observational studies. Chest. 2020;158(1S):S12-S20. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39(2):175-191. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Carlsen B, Glenton C. What about N? A methodological study of sample-size reporting in focus group studies. BMC Med Res Methodol. 2011;11:26. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Hernández-Rodríguez JC, García-Muñoz C, Ortiz-Álvarez J, Saigí-Rubió F, Conejo-Mir J, Pereyra-Rodriguez JJ. Dropout rate in digital health interventions for the prevention of skin cancer: systematic review, meta-analysis, and metaregression. J Med Internet Res. 2022;24(12):e42397. [ FREE Full text ] [ CrossRef ] [ Medline ]

Abbreviations

BRS: Brief Resilience Scale
CBT: cognitive behavioral therapy
ERQ: Emotional Regulation Questionnaire
GAD-7: Generalized Anxiety Disorder–7 Scale
MAAS: Mindful Attention Awareness Scale
MMRM: mixed model for repeated measures
PHI: Pemberton Happiness Index
PHQ-4: Patient Health Questionnaire–4
PHQ-9: Patient Health Questionnaire–9
RRS: Ruminative Response Scale

Edited by A Mavragani; submitted 31.08.23; peer-reviewed by T Benham, M Musker; comments to author 03.01.24; revised version received 18.01.24; accepted 01.02.24; published 11.03.24.

©Nadia Ramos, Felipe Besoain, Natalia Cancino, Ismael Gallardo, Paula Albornoz, Andres Fresno, Rosario Spencer, Soledad Schott, Daniel Núñez, Carolina Salgado, Susana Campos. Originally published in JMIR Research Protocols (https://www.researchprotocols.org), 11.03.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on https://www.researchprotocols.org, as well as this copyright and license information must be included.


  13. Protocol Template for Data/Sample Collection Studies ...

    Use this protocol template if your study will use ONLY: Data (retrospective or prospective data collection study with no patient contact). Samples from a biobank or leftover clinical samples that typically do not require. obtaining consent/authorization from subjects. If your study involves recruitment of subjects, please use the Prospective ...

  14. Data Collection

    Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes. The data collection component of research is common to all fields of study including physical and social sciences, humanities, business, etc.

  15. Data Collection Methods and Tools for Research; A Step-by-Step Guide to

    Data Collection, Research Methodology, Data Collection Methods, Academic Research Paper, Data Collection Techniques. I. INTRODUCTION Different methods for gathering information regarding specific variables of the study aiming to employ them in the data analysis phase to achieve the results of the study, gain the answer of the research ...

  16. Collect Data

    Collect Data. Data will be collected and subsequently analyzed during the Manage stage, using the protocol for collecting data developed in the Plan/Propose stage and the processes and technical resources established during the Setup stage. Strict adherence to the data collection protocol design is critical to assuring that the data collected ...

  17. PDF DFCI IRB Review of Data and/or Specimen Collection Protocols

    Retrospective vs. Prospective: Research which is entirely retrospective (all data or specimens are in existence prior to the initial IRB submission date), may qualify for exempt review. This does not include research that involves the ongoing collection of data and/or specimens regardless of identifiability or reason for collection.

  18. Data Collection

    Data collection is the process of gathering and collecting information from various sources to analyze and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation. In order for data collection to be effective, it is important to have a clear understanding ...

  19. Data Collection Methods

    Step 2: Choose your data collection method. Based on the data you want to collect, decide which method is best suited for your research. Experimental research is primarily a quantitative method. Interviews, focus groups, and ethnographies are qualitative methods. Surveys, observations, archival research, and secondary data collection can be ...

  20. Data collection protocols for VANETs: a survey

    In this survey, various state-of-the-art data collection protocols for VANETs are discussed, based on three broad categories, i.e., delay-tolerant, best-effort, and real-time protocols. ... In addition, some open research challenges for ITS and VANETs are discussed to highlight research gaps. Our work can thus be employed as a quick guide for ...

  21. The data collection protocol

    The data collection protocol. The data collection protocol is the procedure for executing the above steps of the Plan to collect and record the data. It deals with management and adminstrative issues such as who does what and when. It also includes a plan for monitoring the data as they are collected to ensure quality. Michelson gives us no ...

  22. Qualitative Research: Data Collection, Analysis, and Management

    Doing qualitative research is not easy and may require a complete rethink of how research is conducted, particularly for researchers who are more familiar with quantitative approaches. There are many ways of conducting qualitative research, and this paper has covered some of the practical issues regarding data collection, analysis, and management.

  23. What is Evidence-Based Practice in Nursing?

    Evidence-based practice in nursing involves providing holistic, quality care based on the most up-to-date research and knowledge rather than traditional methods, advice from colleagues, or personal beliefs. Nurses can expand their knowledge and improve their clinical practice experience by collecting, processing, and implementing research findings.

  24. A patient-centred and multi-stakeholder co-designed observational

    The importance of patient centricity and keeping the patient at the heart of research design is now well recognised within the healthcare community. The involvement of patient, caregiver and clinician representatives in the study design process may help researchers to achieve this goal and to ensure robust and meaningful data generation. Real-world data collection allows for a more flexible ...

  25. JMIR Research Protocols

    The final data collection including the health check, focus group, and survey data was completed in November 2023, and data analysis and reporting are expected to conclude in 2024. Conclusions: This study provides a blueprint for PWDWP. Collaborative partnerships with community organizations and stakeholders are crucial for program success and ...

  26. JMIR Research Protocols

    Finally, selected alcohol diagnostic tools will be tested in a district hospital setting with a further 1000 patients alongside community-based participatory research on the use of the selected tools. Results: Pilot data are being collected, and the protocol will be modified based on the results.

  27. JMIR Research Protocols

    Support and coordination provided by CLHSS include the structure of engagement, survey development, data collection, statistical analysis, and dissemination. The project was initially started in August 2022. The pilot was launched in February 2023 and is still running, with the data last counted in August 2023.

  28. Data standards for clinical research data collection forms: current

    Abstract. Case report forms (CRFs) are used for structured-data collection in clinical research studies. Existing CRF-related standards encompass structural features of forms and data items, content standards, and specifications for using terminologies. This paper reviews existing standards and discusses their current limitations.

  29. FTC Cracks Down on Mass Data Collectors: A Closer Look at Avast, X-Mode

    Any safeguards used to maintain people's privacy are often outstripped by companies' incentives and abilities to match data to particular people. The value proposition for many data purchasers is often the same thing that exposes people's privacy: ever-more granular data, and the insights and inferences such data convey.

  30. JMIR Research Protocols

    Background: The prevalence of depression and anxiety has increased in recent years, with many individuals having trouble accessing mental health support. Smartphones have become an integral part of modern life, with apps offering new ways to deliver evidence-based self-help strategies to cope with common mental health symptoms. However, most of them do not have empirical evidence of their ...