Top 10 Real-World Data Science Case Studies

Frequently Asked Questions

Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives. These case studies reflect the complexities data scientists face when translating data into actionable insights in the corporate world.

Real-world data science projects come with common challenges. Data quality issues, including missing or inaccurate data, can hinder analysis. Domain expertise gaps may result in misinterpretation of results. Resource constraints might limit project scope or access to necessary tools and talent. Ethical considerations, like privacy and bias, demand careful handling.

Lastly, as data and business needs evolve, data science projects must adapt and stay relevant, posing an ongoing challenge.

Real-world data science case studies play a crucial role in helping companies make informed decisions. By analyzing their own data, businesses gain valuable insights into customer behavior, market trends, and operational efficiencies.

These insights empower data-driven strategies, aiding in more effective resource allocation, product development, and marketing efforts. Ultimately, case studies bridge the gap between data science and business decision-making, enhancing a company's ability to thrive in a competitive landscape.

Key takeaways from these case studies for organizations include the importance of cultivating a data-driven culture that values evidence-based decision-making. Investing in robust data infrastructure is essential to support data initiatives. Collaborating closely between data scientists and domain experts ensures that insights align with business goals.

Finally, continuous monitoring and refinement of data solutions are critical for maintaining relevance and effectiveness in a dynamic business environment. Embracing these principles can lead to tangible benefits and sustainable success in real-world data science endeavors.

Data science is a powerful driver of innovation and problem-solving across diverse industries. By harnessing data, organizations can uncover hidden patterns, automate repetitive tasks, optimize operations, and make informed decisions.

In healthcare, for example, data-driven diagnostics and treatment plans improve patient outcomes. In finance, predictive analytics enhances risk management. In transportation, route optimization reduces costs and emissions. Data science empowers industries to innovate and solve complex challenges in ways that were previously unimaginable.


Top 12 Data Science Case Studies: Across Various Industries


Data science has become popular in the last few years due to its successful application in making business decisions. Data scientists have been using data science techniques to solve challenging real-world issues in healthcare, agriculture, manufacturing, automotive, and many more domains. For this purpose, a data enthusiast needs to stay updated with the latest technological advancements in AI. An excellent way to achieve this is by reading industry data science case studies. I recommend checking out the Data Science With Python course syllabus to start your data science journey.

In this discussion, I will present some case studies that contain detailed and systematic data analysis of people, objects, or entities, focusing on multiple factors present in the dataset. Aspiring and practising data scientists can use them to learn more about a sector, discover an alternative way of thinking, or find methods to improve their organization based on comparable experiences. Almost every industry uses data science in some way; you can learn more about the fundamentals in this data science course content. From my standpoint, insurance data scientists may use it to spot fraudulent conduct in claims, automotive data scientists may use it to improve self-driving cars, and e-commerce data scientists can use it to add more personalization for their consumers. The possibilities are unlimited and largely unexplored.

Let's look at the top 12 data science case studies in this article so you can understand how businesses from many sectors have benefitted from data science to boost productivity, revenues, and more. Read on to explore them all, or use the following links to go straight to the case study of your choice.


Examples of Data Science Case Studies

  • Hospitality: Airbnb focuses on growth by analyzing customer voice using data science. Qantas uses predictive analytics to mitigate losses.
  • Healthcare: Novo Nordisk is driving innovation with NLP. AstraZeneca harnesses data for innovation in medicine.
  • Covid-19: Johnson and Johnson uses data science to fight the pandemic.
  • E-commerce: Amazon uses data science to personalize shopping experiences and improve customer satisfaction.
  • Supply chain management: UPS optimizes its supply chain with big data analytics.
  • Meteorology: IMD leveraged data science to achieve a record 1.2m evacuation before cyclone "Fani".
  • Entertainment industry: Netflix uses data science to personalize content and improve recommendations. Spotify uses big data to deliver a rich user experience for online music streaming.
  • Banking and finance: HDFC utilizes big data analytics to increase income and enhance the banking experience.

Top 12 Data Science Case Studies [For Various Industries]

1. Data Science in Hospitality Industry

In the hospitality sector, data analytics assists hotels in better pricing strategies, customer analysis, brand marketing , tracking market trends, and many more.

Airbnb focuses on growth by analyzing customer voice using data science. A famous example in this sector is the unicorn "Airbnb", a startup that focused on data science early to grow and adapt to the market faster. The company witnessed 43,000 percent hypergrowth in as little as five years using data science. It applied data science techniques to process data, translate it into a better understanding of the voice of the customer, and use the insights for decision-making, and it scaled the approach to cover all aspects of the organization. Airbnb uses statistics to analyze and aggregate individual experiences and establish trends throughout the community. These trends inform its business choices while helping it grow further.

Travel industry and data science

Predictive analytics benefits many parameters in the travel industry. Travel companies can use recommendation engines with data science to achieve higher personalization and improved user interactions, and they can cross-sell by recommending relevant products to drive sales and increase revenue. Data science is also employed in analyzing social media posts for sentiment analysis, bringing invaluable travel-related insights. Knowing whether these views are positive, negative, or neutral helps agencies understand user demographics, the experiences their target audiences expect, and so on. These insights are essential for developing competitive pricing strategies to draw customers and for better customizing travel packages and allied services. Travel agencies like Expedia and Booking.com use predictive analytics for personalized recommendations, product development, and effective marketing of their products. Not just travel agencies but airlines also benefit from the same approach. Airlines frequently face losses due to flight cancellations, disruptions, and delays. Data science helps them identify patterns and predict possible bottlenecks, thereby effectively mitigating the losses and improving the overall customer traveling experience.

How Qantas uses predictive analytics to mitigate losses  

Qantas, one of Australia's largest airlines, leverages data science to reduce losses caused by flight delays, disruptions, and cancellations. It also uses data science to provide a better traveling experience for its customers by reducing the number and length of delays caused by heavy air traffic, weather conditions, or operational difficulties. Back in 2016, when heavy storms badly struck Australia's east coast, only 15 out of 436 Qantas flights were cancelled thanks to its predictive analytics-based system, against its competitor Virgin Australia, which saw 70 of 320 flights cancelled.

2. Data Science in Healthcare

The healthcare sector is benefiting immensely from advancements in AI. Data science, especially in medical imaging, has been helping healthcare professionals come up with better diagnoses and effective treatments for patients. Similarly, several advanced healthcare analytics tools have been developed to generate clinical insights for improving patient care. These tools also assist in defining personalized medications for patients, reducing operating costs for clinics and hospitals. Apart from medical imaging and computer vision, Natural Language Processing (NLP) is frequently used in the healthcare domain to study published textual research data.

A. Pharmaceutical

Driving innovation with NLP: Novo Nordisk. Novo Nordisk uses the Linguamatics NLP platform to mine text from internal and external data sources, including scientific abstracts, patents, grants, news, and tech transfer offices from universities worldwide. These NLP queries run across sources for the key therapeutic areas of interest to the Novo Nordisk R&D community, and NLP algorithms have been developed for topics such as safety, efficacy, randomized controlled trials, patient populations, dosing, and devices. Novo Nordisk employs a data pipeline to extend the tools' success to real-world data and uses interactive dashboards and cloud services to visualize the standardized, structured information from the queries, exploring commercial effectiveness, market situations, potential, and gaps in product documentation. Through data science, the company automates the process of generating insights, saving time and providing better insights for evidence-based decision-making.

How AstraZeneca harnesses data for innovation in medicine. AstraZeneca is a globally known biotech company that leverages data and AI technology to discover and deliver new, effective medicines faster. Within their R&D teams, they use AI to decode big data so that diseases like cancer, respiratory disease, and heart, kidney, and metabolic diseases can be better understood and effectively treated. Using data science, they can identify new targets for innovative medications. In 2021, they selected their first two AI-generated drug targets, in Chronic Kidney Disease and Idiopathic Pulmonary Fibrosis, in collaboration with BenevolentAI.

Data science is also helping AstraZeneca redesign better clinical trials, achieve personalized medication strategies, and innovate the process of developing new medicines. Their Centre for Genomics Research uses data science and AI with the aim of analyzing around two million genomes by 2026. For imaging, they are also training AI systems to check samples for disease and biomarkers, an approach that helps them analyze samples accurately and more effortlessly and can cut analysis time by around 30%.

AstraZeneca also utilizes AI and machine learning to optimize the process at different stages and minimize the overall time for the clinical trials by analyzing the clinical trial data. Summing up, they use data science to design smarter clinical trials, develop innovative medicines, improve drug development and patient care strategies, and many more.

B. Wearable Technology

Wearable technology is a multi-billion-dollar industry. With an increasing awareness about fitness and nutrition, more individuals now prefer using fitness wearables to track their routines and lifestyle choices.  

Fitness wearables are convenient to use, help users track their health, and encourage them to lead a healthier lifestyle. Medical devices in this domain are beneficial since they help monitor a patient's condition and communicate in an emergency. The widely used fitness trackers and smartwatches from renowned companies like Garmin, Apple, and Fitbit continuously collect physiological data from the individuals wearing them, and these providers offer user-friendly dashboards to their customers for analyzing and tracking progress in their fitness journeys.

3. Covid-19 and Data Science

In the past two years of the pandemic, the power of data science has been more evident than ever. Pharmaceutical companies across the globe could synthesize Covid-19 vaccines quickly by analyzing data to understand the trends and patterns of the outbreak. Data science made it possible to track the virus in real time, predict patterns, devise effective strategies to fight the pandemic, and much more.

How Johnson and Johnson uses data science to fight the Pandemic   

The  data science team  at  Johnson and Johnson  leverages real-time data to track the spread of the virus. They built a global surveillance dashboard (granulated to county level) that helps them track the Pandemic's progress, predict potential hotspots of the virus, and narrow down the likely place where they should test its investigational COVID-19 vaccine candidate. The team works with in-country experts to determine whether official numbers are accurate and find the most valid information about case numbers, hospitalizations, mortality and testing rates, social compliance, and local policies to populate this dashboard. The team also studies the data to build models that help the company identify groups of individuals at risk of getting affected by the virus and explore effective treatments to improve patient outcomes.

4. Data Science in E-commerce  

In the e-commerce sector, big data analytics can assist in customer analysis, reduce operational costs, forecast trends for better sales, provide personalized shopping experiences to customers, and much more.

Amazon uses data science to personalize shopping experiences and improve customer satisfaction. Amazon is a globally leading e-commerce platform that offers a wide range of online shopping services. As a result, Amazon generates a massive amount of data that can be leveraged to understand consumer behavior and generate insights into competitors' strategies. Amazon uses this data to recommend products and services to its users, and with this approach it nudges consumers into making additional purchases: this recommendation technique is reported to drive around 35% of Amazon's annual revenue. Additionally, Amazon collects consumer data for faster order tracking and better deliveries.

Similarly, Amazon's virtual assistant, Alexa, can converse in different languages and uses speakers and a camera to interact with users. Amazon utilizes the audio commands from users to improve Alexa and deliver a better user experience.

5. Data Science in Supply Chain Management

Predictive analytics and big data are driving innovation in the supply chain domain. They offer greater visibility into company operations and enable demand forecasting, predictive maintenance, product pricing, route optimization, and fleet management, helping companies reduce costs and overheads, minimize supply chain interruptions, and drive better performance.

Optimizing supply chain with big data analytics: UPS

UPS is a renowned package delivery and supply chain management company. With thousands of packages being delivered every day, a UPS driver makes about 100 deliveries each business day on average, and on-time, safe package delivery is crucial to UPS's success. Hence, UPS built an optimized navigation tool, "ORION" (On-Road Integrated Optimization and Navigation), which uses highly advanced big data processing algorithms to provide drivers with routes optimized for fuel, distance, and time. UPS utilizes supply chain data analysis in all aspects of its shipping process: data about packages and deliveries are captured through radars and sensors, and deliveries and routes are optimized using big data systems. Overall, this approach has helped UPS save 1.6 million gallons of gasoline in transportation every year, significantly reducing delivery costs.

6. Data Science in Meteorology

Weather prediction is an interesting application of data science. Businesses like aviation, agriculture and farming, construction, consumer goods, and sporting events depend on climatic conditions. The success of these businesses is closely tied to the weather, as decisions are made after considering the forecasts from the meteorological department.

Besides, weather forecasts are extremely helpful for individuals to manage their allergic conditions. One crucial application of weather forecasting is natural disaster prediction and risk management.  

Weather forecasts begin with the collection of a large amount of data on current environmental conditions (wind speed, temperature, humidity, and cloud cover captured at a specific location and time) using sensors on IoT (Internet of Things) devices and satellite imagery. This gathered data is then analyzed using an understanding of atmospheric processes, and machine learning models are built to make predictions about upcoming weather conditions, such as rainfall or snow. Although data science cannot help avoid natural calamities like floods, hurricanes, or forest fires, tracking these natural phenomena well ahead of their arrival is beneficial: such predictions give governments sufficient time to take the steps and measures necessary to ensure the safety of the population.

IMD leveraged data science to achieve a record 1.2m evacuation before cyclone "Fani"

Meteorological data scientists rely on satellite images to make short-term forecasts, decide whether a forecast is correct, and validate models. Machine learning is also used for pattern matching in this case: if it recognizes a past pattern, it can forecast future weather conditions. With dependable equipment, sensor data also helps produce accurate local forecasts. IMD (India Meteorological Department) used satellite pictures to study the low-pressure zones forming off the Odisha coast (India). In April 2019, thirteen days before cyclone "Fani" reached the area, IMD warned that a massive storm was underway, and the authorities began preparing safety measures.

It was one of the most powerful cyclones to strike India in the past 20 years, and a record 1.2 million people were evacuated in less than 48 hours, thanks to the power of data science.

7. Data Science in the Entertainment Industry

Due to the pandemic, demand for OTT (over-the-top) media platforms has grown significantly. People prefer watching movies and web series, or listening to music of their choice, at leisure in the convenience of their homes. This sudden growth in demand has given rise to stiff competition, and every platform now uses data analytics in different capacities to provide better personalized recommendations to its subscribers and improve the user experience.

How Netflix uses data science to personalize the content and improve recommendations  

Netflix is an extremely popular internet television platform with streamable content offered in several languages, catering to varied audiences. In 2006, Netflix announced a $1 million prize for any team that could improve the prediction accuracy of its existing "Cinematch" recommendation platform by 10%. The approach was successful: at the end of the competition, the BellKor team delivered a solution that increased prediction accuracy by 10.06%, the result of over 2,000 work hours and an ensemble of 107 algorithms. These winning algorithms became part of the Netflix recommendation system.

Netflix also employs ranking algorithms to generate personalized recommendations of movies and TV shows appealing to its users.

Spotify uses big data to deliver a rich user experience for online music streaming  

Personalized online music streaming is another area where data science is being used. Spotify is a well-known on-demand music service provider, launched in 2008, that has effectively leveraged big data to create personalized experiences for each user. It is a huge platform with more than 24 million subscribers and a database of nearly 20 million songs, and it uses this big data and various algorithms to train machine learning models that deliver personalized content. Spotify's "Discover Weekly" feature generates a personalized playlist of fresh, unheard songs matching the user's taste every week, while the "Wrapped" feature gives users an overview of their most frequently played songs over the year each December. Spotify also leverages the data to run targeted ads to grow its business. Thus, Spotify combines its user data with some external data to deliver a high-quality user experience.

8. Data Science in Banking and Finance

Data science is extremely valuable in the banking and finance industry. It powers several high-priority aspects of the domain: credit risk modeling (estimating the likelihood that a loan will be repaid), fraud detection (spotting malicious activity or irregularities in transactional patterns using machine learning), customer lifetime value (predicting bank performance based on existing and potential customers), and customer segmentation (profiling customers based on behavior and characteristics to personalize offers and services). Finally, data science is also used in real-time predictive analytics (computational techniques to predict future events).

How HDFC utilizes Big Data Analytics to increase revenues and enhance the banking experience    

One of the major private banks in India, HDFC Bank, was an early adopter of AI. It started with big data analytics in 2004, intending to grow its revenue and understand its customers and markets better than its competitors. Back then, it was a trendsetter, setting up an enterprise data warehouse to track the differentiation given to customers based on their relationship value with the bank. Data science and analytics have been crucial in helping HDFC Bank segment its customers and offer customized personal or commercial banking services. Its analytics engine and SaaS tools have been assisting the bank in cross-selling relevant offers to its customers; beyond regular fraud prevention, they help track customer credit histories and are a key reason for the speedy loan approvals the bank offers.

9. Data Science in Urban Planning and Smart Cities  

Data science can help the dream of smart cities come true! Everything, from traffic flow to energy usage, can be optimized using data science techniques. Data fetched from multiple sources can be used to understand trends and plan urban living in an organized manner.

A significant data science case study is traffic management in the city of Pune. The city controls and modifies its traffic signals dynamically by tracking the traffic flow: real-time data is fetched from the signals through installed cameras and sensors, and traffic is managed based on this information. With this proactive approach, congestion in the city is kept under control and traffic flows more smoothly. A similar case study comes from Bhubaneswar, where the municipality provides platforms for residents to give suggestions and actively participate in decision-making. The government reviews all the inputs before making decisions, framing rules, or arranging the things its residents actually need.

10. Data Science in Agricultural Yield Prediction   

Have you ever wondered how helpful it would be to predict your agricultural yield? That is exactly what data science is helping farmers with. They can estimate how much they can produce in a given area based on different environmental factors and soil types, and use this information to make informed decisions about their yield, benefiting both buyers and themselves in multiple ways.


Farmers across the globe use various data science techniques to understand multiple aspects of their farms and crops. A famous example of data science in the agricultural industry is the work done by Farmers Edge, a Canadian company that takes real-time images of farms across the globe and combines them with related data. Farmers use this data to make decisions relevant to their yield and improve their produce. Similarly, farmers in countries like Ireland use satellite-based information to move beyond traditional methods and multiply their yield strategically.

11. Data Science in the Transportation Industry   

Transportation keeps the world moving. People and goods commute from one place to another for various purposes, and it is fair to say that the world would come to a standstill without efficient transportation. That is why it is crucial to keep the transportation industry running smoothly, and data science helps a great deal here. With technological progress, various devices such as traffic sensors, monitoring display systems, and mobility management devices have emerged.

Many cities have already adopted multi-modal transportation systems, using GPS trackers, geo-location, and CCTV cameras to monitor and manage transportation. Uber is the perfect case study for understanding the use of data science in the transportation industry. It optimizes its ride-sharing feature and tracks delivery routes through data analysis, and this data science approach has enabled it to serve more than 100 million users, making transportation easy and convenient. Moreover, Uber uses the data it fetches from users daily to offer cost-effective and quickly available rides.

12. Data Science in the Environmental Industry    

Increasing pollution, global warming, climate change, and other negative environmental impacts have forced the world to pay attention to the environmental industry. Multiple initiatives are being taken across the globe to preserve the environment and make the world a better place. Though industry recognition and these efforts are still in their initial stages, the impact is significant and the growth is fast.

A popular use of data science in the environmental industry comes from NASA and other research organizations worldwide. NASA collects data on current climate conditions, and this data is used to shape remedial policies that can make a difference. Data science also helps researchers predict natural disasters well ahead of time, preventing or at least considerably reducing the potential damage. A similar case study comes from the World Wildlife Fund, which uses data science to track deforestation data and reduce the illegal cutting of trees, helping preserve the environment.

Where to Find Full Data Science Case Studies?  

Data science is a highly evolving domain with many practical applications and a huge open community. Hence, the best way to keep updated with the latest trends in this domain is by reading case studies and technical articles. Usually, companies share their success stories of how data science helped them achieve their goals to showcase their potential and benefit the greater good. Such case studies are available online on the respective company websites and dedicated technology forums like Towards Data Science or Medium.  

Additionally, we can get some practical examples in recently published research papers and textbooks in data science.  

What Are the Skills Required for Data Scientists?  

Data scientists play an important role in the data science process, as they are the ones who work on the data end to end. To work on a data science case study, a data scientist needs several skills: a good grasp of the fundamentals of data science; deep knowledge of statistics; excellent programming skills in Python or R; exposure to data manipulation and data analysis; the ability to generate creative and compelling data visualizations; and good knowledge of big data, machine learning, and deep learning concepts for model building and deployment. Apart from these technical skills, data scientists also need to be good storytellers and should have an analytical mind with strong communication skills.


Conclusion  

These were some interesting data science case studies across different industries. There are many more domains where data science has exciting applications, like the education domain, where data can be utilized to monitor student and instructor performance and to develop an innovative curriculum that is in sync with industry expectations.

Almost all companies looking to leverage the power of big data begin with a SWOT analysis to narrow down the problems they intend to solve with data science. They then assess their competitors to develop relevant data science tools and strategies to address the challenging issues. This approach allows them to differentiate themselves from their competitors and offer something unique to their customers.

With data science, companies have become smarter and more data-driven, bringing about tremendous growth; data science has also made these organizations more sustainable. The utility of data science in several sectors is clearly visible, yet a lot is left to be explored and more is yet to come. Nonetheless, data science will continue to boost the performance of organizations in this age of big data.

Frequently Asked Questions (FAQs)

A case study in data science requires a systematic and organized approach to solving the problem. Generally, four main steps are needed to tackle every data science case study (a minimal code sketch follows the list):

  • Define the problem statement and a strategy to solve it
  • Gather and pre-process the data, making relevant assumptions
  • Select appropriate tools and algorithms to build machine learning or deep learning models
  • Make predictions, validate the solution against evaluation metrics, and improve the model if necessary

Getting data for a case study starts with a reasonable understanding of the problem, which gives us clarity about what we expect the dataset to include. Finding relevant data for a case study requires some effort. Although it is possible to collect relevant data using traditional techniques like surveys and questionnaires, good-quality datasets are also available online on platforms like Kaggle, the UCI Machine Learning Repository, Azure Open Datasets, government open datasets, Google Public Datasets, data.world, and so on.

Data science projects involve multiple steps to process data and extract valuable insights. A typical project includes defining the problem statement, gathering the relevant data required to solve the problem, data pre-processing, data exploration and analysis, algorithm selection, model building, model prediction, model optimization, and communicating the results through dashboards and reports.


6 of my favorite case studies in Data Science!

Data scientists are numbers people. They have a deep understanding of statistics and algorithms, programming and hacking, and communication skills. Data science is about applying these three skill sets in a disciplined and systematic manner, with the goal of improving an aspect of the business. That's the data science process.

In order to stay abreast of industry trends, data scientists often turn to case studies. Reviewing these is a helpful way for both aspiring and working data scientists to challenge themselves and learn more about a particular field, a different way of thinking, or ways to better their own company based on similar experiences. If you're not familiar with case studies, they've been described as "an intensive, systematic investigation of a single individual, group, community or some other unit in which the researcher examines in-depth data relating to several variables."

Data science is used by pretty much every industry out there. Insurance claims analysts can use data science to identify fraudulent behavior, e-commerce data scientists can build personalized experiences for their customers, music streaming companies can use it to create different genres of playlists—the possibilities are endless. Allow us to share a few of our favorite data science case studies with you so you can see first hand how companies across a variety of industries leveraged big data to drive productivity, profits, and more.

6 case studies in Data Science

  • How Airbnb characterizes data science
  • How data science is involved in decision-making at Airbnb
  • How Airbnb has scaled its data science efforts across all aspects of the company

Airbnb says that "we're at a point where our infrastructure is stable, our tools are sophisticated, and our warehouse is clean and reliable. We're ready to take on exciting new problems."

3. Spotify's "This Is" Playlists: The Ultimate Song Analysis For 50 Mainstream Artists

If you're a music lover, you've probably used Spotify at least once. If you're a regular user, you've likely taken note of their personalized playlists and been impressed at how well the songs catered to your music preferences. But have you ever thought about how Spotify categorizes their music? You can thank their data science teams for that. The goal of the "This Is" case study is to analyze the music of various Spotify artists, segment the styles, and categorize them by loudness, danceability, energy, and more. To start, a data scientist looked at Spotify's API, which collects and provides data from Spotify's music catalog. Once the data researcher accessed the data from Spotify's API, he:

  • Processed the data to extract audio features for each artist
  • Visualized the data using D3.js.
  • Applied k-means clustering to separate the artists into different groups (a toy version of this step is sketched after the list)
  • Analyzed each feature for all the artists

Want a sneak peek at the results? James Arthur and Post Malone are in the same cluster, Kendrick Lamar is the "fastest" artist, and Marshmello beat Martin Garrix in the energy category.

4. A Leading Online Travel Agency Increases Revenues by 16 Percent with Actionable Analytics

One of the largest online travel agencies in the world generated the majority of its revenue through its website and directed most of its resources there, but its clients were still using offline channels such as faxes and phone calls to ask questions. The agency brought in WNS, a travel-focused business process management company, to help it determine how to rethink and redesign its roadmap to capture missed revenue opportunities. WNS determined that the agency lacked an adequate offline strategy, which resulted in a dip in revenue and market share. After a deep dive into customer segments, the performance of offline sales agents, ideal hours for sales agents, and more, WNS was able to help the agency increase offline revenue by 16 percent and increase conversion rates by 21 percent.

5. How Mint.com Grew from Zero to 1 Million Users

Mint.com is a free personal finance management service that asks users to input their personal spending data to generate insights about where their money goes. When Noah Kagan joined Mint.com as its marketing director, his goal was to find 100,000 new members in just six months. He didn't just meet that goal. He destroyed it, generating one million members. How did he do it? Kagan says his success was two-fold. The first part was having a product he believed in. The second he attributes to "reverse engineering marketing." "The key focal point to this strategy is to work backward," Kagan explained. "Instead of starting with an intimidating zero playing on your mind, start at the solution and map your plan back from there." He went on: "Think of it as a road trip. You start with a set destination in mind and then plan your route there. You don't get in your car and start driving in the hope that you magically end up where you wanted to be."

6. Netflix: Using Big Data to Drive Big Engagement

One of the best ways to explain the benefits of data science to people who don't quite grasp the industry is by using Netflix-focused examples. Yes, Netflix is the largest internet-television network in the world. But what most people don't realize is that, at its core, Netflix is a customer-focused, data-driven business. Founded in 1997 as a mail-order DVD company, it now boasts more than 53 million members in approximately 50 countries. If you watch The Fast and The Furious on Friday night, Netflix will likely serve up a Mark Wahlberg movie among your personalized recommendations for Saturday night. This is due to data science. But did you know that the company also uses its data insights to inform the way it buys, licenses, and creates new content? House of Cards and Orange is the New Black are two examples of how the company leveraged big data to understand its subscribers and cater to their needs. The company's most-watched shows are generated from recommendations, which in turn foster consumer engagement and loyalty. This is why the company is constantly working on its recommendation engines. The Netflix story is a perfect case study for those who require engaged audiences in order to survive. In summary, data scientists are companies' secret weapons when it comes to understanding customer behavior and leveraging it to drive conversion, loyalty, and profits.
These six data science case studies show you how a variety of organizations—from a nature conservation group to a finance company to a media company—leveraged their big data to not only survive but to beat out the competition.


Data Science Interview Case Studies: How to Prepare and Excel


In the realm of data science interviews, case studies play a crucial role in assessing a candidate's problem-solving skills and analytical mindset. To stand out and excel in these scenarios, thorough preparation is key. Here's a comprehensive guide on how to prepare and shine in data science interview case studies.

Understanding the Basics

Before delving into case studies, it's essential to have a solid grasp of fundamental data science concepts. Review key topics such as statistical analysis, machine learning algorithms, data manipulation, and data visualization. This foundational knowledge will form the basis of your approach to solving case study problems.

Deconstructing the Case Study

When presented with a case study during the interview, take a structured approach to deconstructing the problem. Begin by defining the business problem or question at hand. Break down the problem into manageable components and identify the key variables involved. This analytical framework will guide your problem-solving process.


Utilizing Data Science Techniques

Apply your data science skills to analyze the provided data and derive meaningful insights. Utilize statistical methods, predictive modeling, and data visualization techniques to explore patterns and trends within the dataset. Clearly communicate your methodology and reasoning to demonstrate your analytical capabilities.

Problem-Solving Strategy

Develop a systematic problem-solving strategy to tackle case study challenges effectively. Start by outlining your approach and assumptions before proceeding to data analysis and interpretation. Implement a logical and structured process to arrive at well-supported conclusions.

Practice Makes Perfect

Engage in regular practice sessions with mock case studies to hone your problem-solving skills. Participate in data science forums and communities to discuss case studies with peers and gain diverse perspectives. The more you practice, the more confident and proficient you will become in tackling complex data science challenges.

Communicating Your Findings

Effectively communicating your findings and insights is crucial in a data science interview case study. Present your analysis in a clear and concise manner, highlighting key takeaways and recommendations. Demonstrate your storytelling ability by structuring your presentation in a logical and engaging manner.


Excelling in data science interview case studies requires a combination of technical proficiency, analytical thinking, and effective communication. By mastering the art of case study preparation and problem-solving, you can showcase your data science skills and secure coveted job opportunities in the field.



Data Science Case Study Interview: Your Guide to Success

by Enterprise DNA Experts | Careers


Ready to crush your next data science interview? Well, you’re in the right place.

This type of interview is designed to assess your problem-solving skills, technical knowledge, and ability to apply data-driven solutions to real-world challenges.

So, how can you master these interviews and secure your next job?

To master your data science case study interview:

Practice Case Studies: Engage in mock scenarios to sharpen problem-solving skills.

Review Core Concepts: Brush up on algorithms, statistical analysis, and key programming languages.

Contextualize Solutions: Connect findings to business objectives for meaningful insights.

Clear Communication: Present results logically and effectively using visuals and simple language.

Adaptability and Clarity: Stay flexible and articulate your thought process during problem-solving.

This article will delve into each of these points and give you additional tips and practice questions to get you ready to crush your upcoming interview!

After you’ve read this article, you can enter the interview ready to showcase your expertise and win your dream role.

Let’s dive in!


What to Expect in the Interview?

Data science case study interviews are an essential part of the hiring process. They give interviewers a glimpse of how you approach real-world business problems and demonstrate your analytical thinking, problem-solving, and technical skills.

Furthermore, case study interviews are typically open-ended, which means you'll be presented with a problem that doesn't have a single right or wrong answer.

Instead, you are expected to demonstrate your ability to:

Break down complex problems

Make assumptions

Gather context

Provide data points and analysis

This type of interview allows your potential employer to evaluate your creativity, technical knowledge, and attention to detail.

But what topics will the interview touch on?

Topics Covered in Data Science Case Study Interviews


In a case study interview, you can expect inquiries that cover a spectrum of topics crucial to evaluating your skill set:

Topic 1: Problem-Solving Scenarios

In these interviews, your ability to resolve genuine business dilemmas using data-driven methods is essential.

These scenarios reflect authentic challenges, demanding analytical insight, decision-making, and problem-solving skills.

Real-world Challenges: Expect scenarios like optimizing marketing strategies, predicting customer behavior, or enhancing operational efficiency through data-driven solutions.

Analytical Thinking: Demonstrate your capacity to break down complex problems systematically, extracting actionable insights from intricate issues.

Decision-making Skills: Showcase your ability to make informed decisions, emphasizing instances where your data-driven choices optimized processes or led to strategic recommendations.

Your adeptness at leveraging data for insights, analytical thinking, and informed decision-making defines your capability to provide practical solutions in real-world business contexts.


Topic 2: Data Handling and Analysis

Data science case studies assess your proficiency in data preprocessing, cleaning, and deriving insights from raw data.

Data Collection and Manipulation: Prepare for data engineering questions involving data collection, handling missing values, cleaning inaccuracies, and transforming data for analysis.

Handling Missing Values and Cleaning Data: Showcase your skills in managing missing values and ensuring data quality through cleaning techniques.

Data Transformation and Feature Engineering: Highlight your expertise in transforming raw data into usable formats and creating meaningful features for analysis.

Mastering data preprocessing—managing, cleaning, and transforming raw data—is fundamental. Your proficiency in these techniques showcases your ability to derive valuable insights essential for data-driven solutions.

Topic 3: Modeling and Feature Selection

Data science case interviews prioritize your understanding of modeling and feature selection strategies.

Model Selection and Application: Highlight your prowess in choosing appropriate models, explaining your rationale, and showcasing implementation skills.

Feature Selection Techniques: Understand the importance of selecting relevant variables and methods, such as correlation coefficients, to enhance model accuracy.

Ensuring Robustness through Random Sampling: Consider techniques like random sampling to bolster model robustness and generalization abilities.

Excel in modeling and feature selection by understanding contexts, optimizing model performance, and employing robust evaluation strategies.


Topic 4: Statistical and Machine Learning Approach

These interviews require proficiency in statistical and machine learning methods for diverse problem-solving. This topic is significant for anyone applying for a machine learning engineer position.

Using Statistical Models: Utilize logistic and linear regression models for effective classification and prediction tasks.

Leveraging Machine Learning Algorithms: Employ models such as support vector machines (SVM), k-nearest neighbors (k-NN), and decision trees for complex pattern recognition and classification.

Exploring Deep Learning Techniques: Consider neural networks, convolutional neural networks (CNN), and recurrent neural networks (RNN) for intricate data patterns.

Experimentation and Model Selection: Experiment with various algorithms to identify the most suitable approach for specific contexts.

Combining statistical and machine learning expertise equips you to systematically tackle varied data challenges, ensuring readiness for case studies and beyond.

Topic 5: Evaluation Metrics and Validation

In data science interviews, understanding evaluation metrics and validation techniques is critical to measuring how well machine learning models perform.

Choosing the Right Metrics: Select metrics like precision, recall (for classification), or R² (for regression) based on the problem type. Picking the right metric defines how you interpret your model’s performance.

Validating Model Accuracy: Use methods like cross-validation and holdout validation to test your model across different data portions. These methods prevent errors from overfitting and provide a more accurate performance measure.

Importance of Statistical Significance: Evaluate if your model’s performance is due to actual prediction or random chance. Techniques like hypothesis testing and confidence intervals help determine this probability accurately.

Interpreting Results: Be ready to explain model outcomes, spot patterns, and suggest actions based on your analysis. Translating data insights into actionable strategies showcases your skill.

Finally, focusing on suitable metrics, using validation methods, understanding statistical significance, and deriving actionable insights from data underline your ability to evaluate model performance.


Also, being well-versed in these topics and having hands-on experience through practice scenarios can significantly enhance your performance in these case study interviews.

Prepare to demonstrate technical expertise and adaptability, problem-solving, and communication skills to excel in these assessments.

Now, let’s talk about how to navigate the interview.

Here is a step-by-step guide to get you through the process.

Step-by-Step Guide Through the Interview


This section’ll discuss what you can expect during the interview process and how to approach case study questions.

Step 1: Problem Statement: You’ll be presented with a problem or scenario—either a hypothetical situation or a real-world challenge—emphasizing the need for data-driven solutions within data science.

Step 2: Clarification and Context: Seek more profound clarity by actively engaging with the interviewer. Ask pertinent questions to thoroughly understand the objectives, constraints, and nuanced aspects of the problem statement.

Step 3: State your Assumptions: When crucial information is lacking, make reasonable assumptions to proceed with your final solution. Explain these assumptions to your interviewer to ensure transparency in your decision-making process.

Step 4: Gather Context: Consider the broader business landscape surrounding the problem. Factor in external influences such as market trends, customer behaviors, or competitor actions that might impact your solution.

Step 5: Data Exploration: Delve into the provided datasets meticulously. Cleanse, visualize, and analyze the data to derive meaningful and actionable insights crucial for problem-solving.

Step 6: Modeling and Analysis: Leverage statistical or machine learning techniques to address the problem effectively. Implement suitable models to derive insights and solutions aligning with the identified objectives.

Step 7: Results Interpretation: Interpret your findings thoughtfully. Identify patterns, trends, or correlations within the data and present clear, data-backed recommendations relevant to the problem statement.

Step 8: Results Presentation: Effectively articulate your approach, methodologies, and choices coherently. This step is vital, especially when conveying complex technical concepts to non-technical stakeholders.

Remember to remain flexible throughout the process and be prepared to adapt your approach to each situation.

Now that you have a guide on navigating the interview, let us give you some tips to help you stand out from the crowd.

Top 3 Tips to Master Your Data Science Case Study Interview


Approaching case study interviews in data science requires a blend of technical proficiency and a holistic understanding of business implications.

Here are practical strategies and structured approaches to prepare effectively for these interviews:

1. Comprehensive Preparation Tips

To excel in case study interviews, a blend of technical competence and strategic preparation is key.

Here are concise yet powerful tips to equip yourself for success:

Practice with Mock Case Studies : Familiarize yourself with the process through practice. Online resources offer example questions and solutions, enhancing familiarity and boosting confidence.

Review Your Data Science Toolbox: Ensure a strong foundation in fundamentals like data wrangling, visualization, and machine learning algorithms. Comfort with relevant programming languages is essential.

Simplicity in Problem-solving: Opt for clear and straightforward problem-solving approaches. While advanced techniques can be impressive, interviewers value efficiency and clarity.

Interviewers also highly value someone with great communication skills. Here are some tips to highlight your skills in this area.

2. Communication and Presentation of Results


In case study interviews, communication is vital. Present your findings in a clear, engaging way that connects with the business context. Tips include:

Contextualize results: Relate findings to the initial problem, highlighting key insights for business strategy.

Use visuals: Charts, graphs, or diagrams help convey findings more effectively.

Logical sequence: Structure your presentation for easy understanding, starting with an overview and progressing to specifics.

Simplify ideas: Break down complex concepts into simpler segments using examples or analogies.

Mastering these techniques helps you communicate insights clearly and confidently, setting you apart in interviews.

Lastly, here are some preparation strategies to employ before you walk into the interview room.

3. Structured Preparation Strategy

Prepare meticulously for data science case study interviews by following a structured strategy.

Here’s how:

Practice Regularly: Engage in mock interviews and case studies to enhance critical thinking and familiarity with the interview process. This builds confidence and sharpens problem-solving skills under pressure.

Thorough Review of Concepts: Revisit essential data science concepts and tools, focusing on machine learning algorithms, statistical analysis, and relevant programming languages (Python, R, SQL) for confident handling of technical questions.

Strategic Planning: Develop a structured framework for approaching case study problems. Outline the steps and tools/techniques to deploy, ensuring an organized and systematic interview approach.

Understanding the Context: Analyze business scenarios to identify objectives, variables, and data sources essential for insightful analysis.

Ask for Clarification: Engage with interviewers to clarify any unclear aspects of the case study questions. For example, you may ask ‘What is the business objective?’ This exhibits thoughtfulness and aids in better understanding the problem.

Transparent Problem-solving: Clearly communicate your thought process and reasoning during problem-solving. This showcases analytical skills and approaches to data-driven solutions.

Blend technical skills with business context, communicate clearly, and prepare to systematically ace your case study interviews.

Now, let’s really make this specific.

Each company is different and may need slightly different skills and specializations from data scientists.

However, here is some of what you can expect in a case study interview with some industry giants.

Case Interviews at Top Tech Companies


As you prepare for data science interviews, it’s essential to be aware of the case study interview format utilized by top tech companies.

In this section, we’ll explore case interviews at Facebook, Twitter, and Amazon, and provide insight into what they expect from their data scientists.

Facebook predominantly looks for candidates with strong analytical and problem-solving skills. The case study interviews here usually revolve around assessing the impact of a new feature, analyzing monthly active users, or measuring the effectiveness of a product change.

To excel during a Facebook case interview, you should break down complex problems, formulate a structured approach, and communicate your thought process clearly.

Twitter , similar to Facebook, evaluates your ability to analyze and interpret large datasets to solve business problems. During a Twitter case study interview, you might be asked to analyze user engagement, develop recommendations for increasing ad revenue, or identify trends in user growth.

Be prepared to work with different analytics tools and showcase your knowledge of relevant statistical concepts.

Amazon is known for its customer-centric approach and data-driven decision-making. In Amazon’s case interviews, you may be tasked with optimizing customer experience, analyzing sales trends, or improving the efficiency of a certain process.

Keep in mind Amazon’s leadership principles, especially “Customer Obsession” and “Dive Deep,” as you navigate through the case study.

Remember, practice is key. Familiarize yourself with various case study scenarios and hone your data science skills.

With all this knowledge, it’s time to practice with the following practice questions.

Mockup Case Studies and Practice Questions


To better prepare for your data science case study interviews, it’s important to practice with some mockup case studies and questions.

One way to practice is by finding typical case study questions.

Here are a few examples to help you get started:

Customer Segmentation: You have access to a dataset containing customer information, such as demographics and purchase behavior. Your task is to segment the customers into groups that share similar characteristics. How would you approach this problem, and what machine-learning techniques would you consider? (A sketch of one approach follows this list.)

Fraud Detection: Imagine your company processes online transactions. You are asked to develop a model that can identify potentially fraudulent activities. How would you approach the problem and which features would you consider using to build your model? What are the trade-offs between false positives and false negatives?

Demand Forecasting: Your company needs to predict future demand for a particular product. What factors should be taken into account, and how would you build a model to forecast demand? How can you ensure that your model remains up-to-date and accurate as new data becomes available?
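To make the first question concrete, here is a minimal customer-segmentation sketch using k-means in scikit-learn; the two columns and the choice of three clusters are illustrative assumptions, and in an interview you would justify k (for example, with the elbow method):

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical customer data; a real dataset would have many more features
    customers = pd.DataFrame({
        "age":          [22, 25, 47, 52, 46, 56, 55, 60, 34, 30],
        "annual_spend": [400, 420, 2600, 3100, 2800, 800, 750, 900, 1500, 1400],
    })

    # Scale features so no single unit dominates the distance metric
    X = StandardScaler().fit_transform(customers)

    # Assign each customer to one of three segments
    customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(customers.groupby("segment").mean())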

By practicing case study interview questions, you can sharpen your problem-solving skills and walk into future data science interviews more confidently.

Remember to practice consistently and stay up-to-date with relevant industry trends and techniques.

Final Thoughts

Data science case study interviews are more than just technical assessments; they’re opportunities to showcase your problem-solving skills and practical knowledge.

Furthermore, these interviews demand a blend of technical expertise, clear communication, and adaptability.

Remember, understanding the problem, exploring insights, and presenting coherent potential solutions are key.

By honing these skills, you can demonstrate your capability to solve real-world challenges using data-driven approaches. Good luck on your data science journey!

Frequently Asked Questions

How would you approach identifying and solving a specific business problem using data?

To identify and solve a business problem using data, you should start by clearly defining the problem and identifying the key metrics that will be used to evaluate success.

Next, gather relevant data from various sources and clean, preprocess, and transform it for analysis. Explore the data using descriptive statistics, visualizations, and exploratory data analysis.

Based on your understanding, build appropriate models or algorithms to address the problem, and then evaluate their performance using appropriate metrics. Iterate and refine your models as necessary, and finally, communicate your findings effectively to stakeholders.

Can you describe a time when you used data to make recommendations for optimization or improvement?

Recall a specific data-driven project you have worked on that led to optimization or improvement recommendations. Explain the problem you were trying to solve, the data you used for analysis, the methods and techniques you employed, and the conclusions you drew.

Share the results and how your recommendations were implemented, describing the impact it had on the targeted area of the business.

How would you deal with missing or inconsistent data during a case study?

When dealing with missing or inconsistent data, start by assessing the extent and nature of the problem. Consider applying imputation methods, such as mean, median, or mode imputation, or more advanced techniques like k-NN imputation or regression-based imputation, depending on the type of data and the pattern of missingness.

For inconsistent data, diagnose the issues by checking for typos, duplicates, or erroneous entries, and take appropriate corrective measures. Document your handling process so that stakeholders can understand your approach and the limitations it might impose on the analysis.
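As a quick illustration of the options above, here is a sketch using pandas and scikit-learn on a small hypothetical table with missing values:

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    # Hypothetical data with gaps in both columns
    df = pd.DataFrame({"income": [52_000, np.nan, 61_000, 58_000, np.nan, 75_000],
                       "age":    [31, 42, 38, np.nan, 29, 50]})

    # Simple approach: fill each column with its median
    median_filled = df.fillna(df.median(numeric_only=True))

    # More advanced: k-NN imputation estimates missing entries from similar rows
    knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                              columns=df.columns)
    print(knn_filled)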

What techniques would you use to validate the results and accuracy of your analysis?

To validate the results and accuracy of your analysis, use techniques like cross-validation or bootstrapping, which can help gauge model performance on unseen data. Employ metrics relevant to your specific problem, such as accuracy, precision, recall, F1-score, or RMSE, to measure performance.

Additionally, validate your findings by conducting sensitivity analyses, sanity checks, and comparing results with existing benchmarks or domain knowledge.
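For instance, a simple bootstrap, sketched below with hypothetical labels and predictions, resamples the test set to put a confidence interval around a metric such as accuracy:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical test-set labels and predictions from a roughly 80%-accurate model
    y_true = rng.integers(0, 2, size=200)
    y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)

    # Bootstrap: resample (label, prediction) pairs with replacement many times
    accs = []
    for _ in range(2000):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        accs.append((y_true[idx] == y_pred[idx]).mean())

    lo, hi = np.percentile(accs, [2.5, 97.5])
    print(f"Accuracy 95% CI: [{lo:.3f}, {hi:.3f}]")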

How would you communicate your findings to both technical and non-technical stakeholders?

To effectively communicate your findings to technical stakeholders, focus on the methodology, algorithms, performance metrics, and potential improvements. For non-technical stakeholders, simplify complex concepts and explain the relevance of your findings, the impact on the business, and actionable insights in plain language.

Use visual aids, like charts and graphs, to illustrate your results and highlight key takeaways. Tailor your communication style to the audience, and be prepared to answer questions and address concerns that may arise.

How do you choose between different machine learning models to solve a particular problem?

When choosing between different machine learning models, first assess the nature of the problem and the data available to identify suitable candidate models. Evaluate models based on their performance, interpretability, complexity, and scalability, using relevant metrics and techniques such as cross-validation, AIC, BIC, or learning curves.

Consider the trade-offs between model accuracy, interpretability, and computation time, and choose a model that best aligns with the problem requirements, project constraints, and stakeholders’ expectations.

Keep in mind that it’s often beneficial to try several models and ensemble methods to see which one performs best for the specific problem at hand.


Data science case interviews (what to expect & how to prepare)


Data science case studies are tough to crack: they’re open-ended, technical, and specific to the company. Interviewers use them to test your ability to break down complex problems and your use of analytical thinking to address business concerns.

So we’ve put together this guide to help you familiarize yourself with case studies at companies like Amazon, Google, and Meta (Facebook), as well as how to prepare for them, using practice questions and a repeatable answer framework.

Here’s the first thing you need to know about tackling data science case studies: always start by asking clarifying questions, before jumping in to your plan.

Let’s get started.

  • What to expect in data science case study interviews
  • How to approach data science case studies
  • Sample cases from FAANG data science interviews
  • How to prepare for data science case interviews


1. What to expect in data science case study interviews

Before we get into an answer method and practice questions for data science case studies, let’s take a look at what you can expect in this type of interview.

Of course, the exact interview process for data scientist candidates will depend on the company you’re applying to, but case studies generally appear in both the pre-onsite phone screens and during the final onsite or virtual loop.

These questions may take anywhere from 10 to 40 minutes to answer, depending on the depth and complexity that the interviewer is looking for. During the initial phone screens, the case studies are typically shorter and interspersed with other technical and/or behavioral questions. During the final rounds, they will likely take longer to answer and require a more detailed analysis.

While some candidates may have the opportunity to prepare in advance and present their conclusions during an interview round, most candidates work with the information the interviewer offers on the spot.

1.1 The types of data science case studies

Generally, there are two types of case studies:

  • Analysis cases , which focus on how you translate user behavior into ideas and insights using data. These typically center around a product, feature, or business concern that’s unique to the company you’re interviewing with.
  • Modeling cases , which are more overtly technical and focus on how you build and use machine learning and statistical models to address business problems.

The number of case studies that you’ll receive in each category will depend on the company and the position that you’ve applied for. Facebook , for instance, typically doesn’t give many machine learning modeling cases, whereas Amazon does.

Also, some companies break these larger groups into smaller subcategories. For example, Facebook divides its analysis cases into two types: product interpretation and applied data . 

You may also receive in-depth questions similar to case studies, which test your technical capabilities (e.g. coding, SQL), so if you’d like to learn more about how to answer coding interview questions, take a look here .

We’ll give you a step-by-step method that can be used to answer analysis and modeling cases in section 2 . But first, let’s look at how interviewers will assess your answers.

1.2 What interviewers are looking for

We’ve researched accounts from ex-interviewers and data scientists to pinpoint the main criteria that interviewers look for in your answers. While the exact grading rubric will vary per company, this list from an ex-Google data scientist is a good overview of the biggest assessment areas:

  • Structure : candidate can break down an ambiguous problem into clear steps
  • Completeness : candidate is able to fully answer the question
  • Soundness : candidate’s solution is feasible and logical
  • Clarity : candidate’s explanations and methodology are easy to understand
  • Speed : candidate manages time well and is able to come up with solutions quickly

You’ll be able to improve your skills in each of these categories by practicing data science case studies on your own, and by working with an answer framework. We’ll get into that next.

2. How to approach data science case studies

Approaching data science cases with a repeatable framework will not only add structure to your answer, but also help you manage your time and think clearly under the stress of interview conditions.

Let’s go over a framework that you can use in your interviews, then break it down with an example answer.

2.1 Data science case framework: CAPER

We've researched popular frameworks used by real data scientists, and consolidated them to be as memorable and useful in an interview setting as possible.

Try using the framework below to structure your thinking during the interview. 

  • Clarify : Start by asking questions. Case questions are ambiguous, so you’ll need to gather more information from the interviewer, while eliminating irrelevant data. The types of questions you’ll ask will depend on the case, but consider: what is the business objective? What data can I access? Should I focus on all customers or just in X region?
  • Assume : Narrow the problem down by making assumptions and stating them to the interviewer for confirmation. (E.g. the statistical significance is X%, users are segmented based on XYZ, etc.) By the end of this step you should have constrained the problem into a clear goal.
  • Plan : Now, begin to craft your solution. Take time to outline a plan, breaking it into manageable tasks. Once you’ve made your plan, explain each step that you will take to the interviewer, and ask if it sounds good to them.
  • Execute : Carry out your plan, walking through each step with the interviewer. Depending on the type of case, you may have to prepare and engineer data, code, apply statistical algorithms, build a model, etc. In the majority of cases, you will need to end with business analysis.
  • Review : Finally, tie your final solution back to the business objectives you and the interviewer had initially identified. Evaluate your solution, and whether there are any steps you could have added or removed to improve it. 

Now that you’ve seen the framework, let’s take a look at how to implement it.

2.2 Sample answer using the CAPER framework

Below you’ll find an answer to a Facebook data science interview question from the Applied Data loop. This is an example that comes from Facebook’s data science interview prep materials, which you can find here .

Try this question:

Imagine that Facebook is building a product around high schools, starting with about 300 million users who have filled out a field with the name of their current high school. How would you find out how much of this data is real?

1. Clarify

First, we need to clarify the question, eliminating irrelevant data and pinpointing what is most important. For example:

  • What exactly does “real” mean in this context?
  • Should we focus on whether the high school itself is real, or whether the user actually attended the high school they’ve named?

After discussing with the interviewer, we’ve decided to focus on whether the high school itself is real first, followed by whether the user actually attended the high school they’ve named.

2. Assume

Next, we’ll narrow the problem down and state our assumptions to the interviewer for confirmation. Here are some assumptions we could make in the context of this problem:

  • The 300 million users are likely teenagers, given that they’re listing their current high school
  • We can assume that a high school that is listed too few times is likely fake
  • We can assume that a high school that is listed too many times (e.g. 10,000+ students) is likely fake

The interviewer has agreed with each of these assumptions, so we can now move on to the plan.

3. Plan

Next, it’s time to make a list of actionable steps and lay them out for the interviewer before moving on.

First, there are two approaches that we can identify:

  • A high precision approach, which provides a list of people who definitely went to a confirmed high school
  • A high recall approach, more similar to market sizing, which would provide a ballpark figure of people who went to a confirmed high school

As this is for a product that Facebook is currently building, the product use case likely calls for an estimate that is as accurate as possible. So we can go for the first approach, which will provide a more precise estimate of confirmed users listing a real high school. 

Now, we list the steps that make up this approach:

  • To find whether a high school is real: Draw a distribution with the number of students on the X axis, and the number of high schools on the Y axis, in order to find and eliminate the lower and upper bounds
  • To find whether a student really went to a high school: use a user’s friend graph and location to determine the plausibility of the high school they’ve named

The interviewer has approved the plan, which means that it’s time to execute.

4. Execute 

Step 1: Determining whether a high school is real

Going off of our plan, we’ll first start with the distribution.

We can use x1 to denote the lower bound, below which the number of times a high school is listed would be too small for a plausible school. x2 then denotes the upper bound, above which the high school has been listed too many times for a plausible school.

Here is what that would look like:

(Figure: distribution of high schools by the number of students listing them, with lower and upper cutoffs x1 and x2.)

Be prepared to answer follow up questions. In this case, the interviewer may ask, “looking at this graph, what do you think x1 and x2 would be?”

Based on this distribution, we could say that x1 is approximately the 5th percentile, or somewhere around 100 students. So, out of 300 million students, if fewer than 100 students list “Applebee” high school, then this is most likely not a real high school.

x2 is likely around the 95th percentile, or potentially as high as the 99th percentile. Based on intuition, we could estimate that number around 10,000. So, if more than 10,000 students list “Applebee” high school, then this is most likely not real. Here is how that looks on the distribution:

(Figure: the same distribution with x1 at roughly the 5th percentile and x2 at roughly the 99th percentile marked.)
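If we had the per-school listing counts in hand, finding x1 and x2 would be a short computation. The sketch below uses NumPy on hypothetical counts; the lognormal shape is only an assumption for illustration:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical: number of users listing each of 50,000 distinct school names
    listings_per_school = rng.lognormal(mean=6, sigma=1.5, size=50_000).astype(int)

    # x1 at the 5th percentile and x2 at the 99th, per the reasoning above
    x1, x2 = np.percentile(listings_per_school, [5, 99])
    plausible = (listings_per_school >= x1) & (listings_per_school <= x2)
    print(f"x1={x1:.0f}, x2={x2:.0f}, plausible schools: {plausible.sum()}")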

At this point, the interviewer may ask more follow-up questions, such as “how do we account for different high schools that share the same name?”

In this case, we could group by the schools’ name and location, rather than name alone. If the high school does not have a dedicated page that lists its location, we could deduce its location based on the city of the user that lists it. 

Step 2: Determining whether a user went to the high school

A strong signal as to whether a user attended a specific high school would be their friend graph: a set number of friends would have to have listed the same current high school. For now, we’ll set that number at five friends.

Don’t forget to call out trade-offs and edge cases as you go. In this case, there could be a student who has recently moved, and so the high school they’ve listed does not reflect their actual current high school. 

To solve this, we could rely on users to update their location to reflect the change. If users do not update their location and high school, this would present an edge case that we would need to work out later.

5. Review

To conclude, we could use the data from both the friend graph and the initial distribution to confirm the two signifiers: the high school is real, and the user really went there.

If enough users in the same location list the same high school, then it is likely that the high school is real, and that the users really attend it. If there are not enough users in the same location that list the same high school, then it is likely that the high school is not real, and the users do not actually attend it.

3. Sample cases from FAANG data science interviews

Having worked through the sample problem above, try out the different kinds of case studies that have been asked in data science interviews at FAANG companies. We’ve divided the questions into types of cases, as well as by company.

For more information about each of these companies’ data science interviews, take a look at these guides:

  • Facebook data scientist interview guide
  • Amazon data scientist interview guide
  • Google data scientist interview guide

Now let’s get into the questions. This is a selection of real data scientist interview questions, according to data from Glassdoor.


Facebook - Analysis (product interpretation)

  • How would you measure the success of a product?
  • What KPIs would you use to measure the success of the newsfeed?
  • Friends acceptance rate decreases 15% after a new notifications system is launched - how would you investigate?

Facebook - Analysis (applied data)

  • How would you evaluate the impact for teenagers when their parents join Facebook?
  • How would you decide to launch or not if engagement within a specific cohort decreased while all the rest increased?
  • How would you set up an experiment to understand feature change in Instagram stories?

Amazon - Modeling

  • How would you improve a classification model that suffers from low precision?
  • When you have monthly time series data with a large number of records, how would you find significant differences between this month and the previous month?

Google - Analysis

  • You have a Google app and you make a change. How do you test if a metric has increased or not?
  • How do you detect viruses or inappropriate content on YouTube?
  • How would you compare if upgrading the android system produces more searches?

4. How to prepare for data science case interviews

Understanding the process and learning a method for data science cases will go a long way in helping you prepare. But this information is not enough to land you a data science job offer. 

To succeed in your data scientist case interviews, you're also going to need to practice under realistic interview conditions so that you'll be ready to perform when it counts. 

For more information on how to prepare for data science interviews as a whole, take a look at our guide on data science interview prep .

4.1 Practice on your own

Start by answering practice questions alone. You can use the list in section 3 , and interview yourself out loud. This may sound strange, but it will significantly improve the way you communicate your answers during an interview. 

Play the role of both the candidate and the interviewer, asking questions and answering them, just like two people would in an interview. This will help you get used to the answer framework and get used to answering data science cases in a structured way.

4.2 Practice with peers

Once you’re used to answering questions on your own , then a great next step is to do mock interviews with friends or peers. This will help you adapt your approach to accommodate for follow-ups and answer questions you haven’t already worked through.

This can be especially helpful if your friend has experience with data scientist interviews, or is at least familiar with the process.

4.3 Practice with ex-interviewers

Finally, you should also try to practice data science mock interviews with expert ex-interviewers, as they’ll be able to give you much more accurate feedback than friends and peers.

If you know a data scientist or someone who has experience running interviews at a big tech company, then that's fantastic. But for most of us, it's tough to find the right connections to make this happen. And it might also be difficult to practice multiple hours with that person unless you know them really well.

Here's the good news. We've already made the connections for you. We’ve created a coaching service where you can practice 1-on-1 with ex-interviewers from leading tech companies. Learn more and start scheduling sessions today .


20+ Data Science Case Study Interview Questions (with Solutions)


Case studies are often the most challenging aspect of data science interview processes. They are crafted to resemble a company’s existing or previous projects, assessing a candidate’s ability to tackle prompts, convey their insights, and navigate obstacles.

To excel in data science case study interviews, practice is crucial. It will enable you to develop strategies for approaching case studies, asking the right questions to your interviewer, and providing responses that showcase your skills while adhering to time constraints.

The best way of doing this is by using a framework for answering case studies. For example, you could use the product metrics framework and the A/B testing framework to answer most case studies that come up in data science interviews.

There are four main types of data science case studies:

  • Product Case Studies - This type of case study tackles a specific product or feature offering, often tied to the interviewing company. Interviewers are generally looking for a sense of business sense geared towards product metrics.
  • Data Analytics Case Study Questions - Data analytics case studies ask you to propose possible metrics in order to investigate an analytics problem. Additionally, you must write a SQL query to pull your proposed metrics, and then perform analysis using the data you queried, just as you would do in the role.
  • Modeling and Machine Learning Case Studies - Modeling case studies are more varied and focus on assessing your intuition for building models around business problems.
  • Business Case Questions - Similar to product questions, business cases tackle issues or opportunities specific to the organization that is interviewing you. Often, candidates must assess the best option for a certain business plan being proposed, and formulate a process for solving the specific problem.

How Case Study Interviews Are Conducted

Oftentimes as an interviewee, you want to know the setting and format in which to expect the above questions to be asked. Unfortunately, this is company-specific: Some prefer real-time settings, where candidates actively work through a prompt after receiving it, while others offer some period of days (say, a week) before settling in for a presentation of your findings.

It is therefore important to have a system for answering these questions that will accommodate all possible formats, such that you are prepared for any set of circumstances (we provide such a framework below).

Why Are Case Study Questions Asked?

Case studies assess your thought process in answering data science questions. Specifically, interviewers want to see that you have the ability to think on your feet, and to work through real-world problems that likely do not have a right or wrong answer. Real-world case studies that are affecting businesses are not binary; there is no black-and-white, yes-or-no answer. This is why it is important that you can demonstrate decisiveness in your investigations, as well as show your capacity to consider impacts and topics from a variety of angles. Once you are in the role, you will be dealing directly with the ambiguity at the heart of decision-making.

Perhaps most importantly, case interviews assess your ability to effectively communicate your conclusions. On the job, data scientists exchange information across teams and divisions, so a significant part of the interviewer’s focus will be on how you process and explain your answer.

Quick tip: Because case questions in data science interviews tend to be product- and company-focused, it is extremely beneficial to research current projects and developments across different divisions , as these initiatives might end up as the case study topic.


How to Answer Data Science Case Study Questions (The Framework)


There are four main steps to tackling case questions in data science interviews, regardless of the type: clarify, make assumptions, propose a solution, and provide data points and analysis.

Step 1: Clarify

Clarifying is used to gather more information. More often than not, these case studies are designed to be confusing and vague. There will be unorganized data intentionally supplemented with extraneous or omitted information, so it is the candidate’s responsibility to dig deeper, filter out bad information, and fill gaps. Interviewers will be observing how an applicant asks questions and reaches their solution.

For example, with a product question, you might take into consideration:

  • What is the product?
  • How does the product work?
  • How does the product align with the business itself?

Step 2: Make Assumptions

Once you have evaluated and understood the dataset, start investigating and discarding possible hypotheses. Developing insights on the product at this stage complements your ability to glean information from the dataset, and the exploration of your ideas is paramount to forming a successful hypothesis. You should communicate your hypotheses with the interviewer, so that they can provide clarifying remarks on how the business views the product and help you discard unworkable lines of inquiry. If we continue to think about a product question, some important questions to evaluate and draw conclusions from include:

  • Who uses the product? Why?
  • What are the goals of the product?
  • How does the product interact with other services or goods the company offers?

The goal of this is to reduce the scope of the problem at hand, and ask the interviewer questions upfront that allow you to tackle the meat of the problem instead of focusing on less consequential edge cases.

Step 3: Propose a Solution

Now that a hypothesis is formed that has incorporated the dataset and an understanding of the business-related context, it is time to apply that knowledge in forming a solution. Remember, the hypothesis is simply a refined version of the problem that uses the data on hand as its basis to being solved. The solution you create can target this narrow problem, and you can have full faith that it is addressing the core of the case study question.

Keep in mind that there isn’t a single expected solution, and as such, there is a certain freedom here to determine the exact path for investigation.

Step 4: Provide Data Points and Analysis

Finally, providing data points and analysis in support of your solution involves choosing and prioritizing a main metric. As with all prior factors, this step must be tied back to the hypothesis and the main goal of the problem. From that foundation, it is important to trace through and analyze different examples drawn from the main metric in order to validate the hypothesis.

Quick tip: Every case question tends to have multiple solutions. Therefore, you should absolutely consider and communicate any potential trade-offs of your chosen method. Be sure you are communicating the pros and cons of your approach.

Note: In some special cases, solutions will also be assessed on the ability to convey information in layman’s terms. Regardless of the structure, applicants should always be prepared to solve through the framework outlined above in order to answer the prompt.

The Role of Effective Communication

Multiple articles and discussions by interviewers who run the data science case study portion boil success in case studies down to one main factor: effective communication.

All the analysis in the world will not help if interviewees cannot verbally work through and highlight their thought process within the case study. Again, interviewers are keyed at this stage of the hiring process to look for well-developed “soft-skills” and problem-solving capabilities. Demonstrating those traits is key to succeeding in this round.

To this end, the best advice possible would be to practice actively going through example case studies, such as those available in the Interview Query questions bank . Exploring different topics with a friend in an interview-like setting with cold recall (no Googling in between!) will be uncomfortable and awkward, but it will also help reveal weaknesses in fleshing out the investigation.

Don’t worry if the first few times are terrible! Developing a rhythm will help with gaining self-confidence as you become better at assessing and learning through these sessions.

Product Case Study Questions


With product data science case questions , the interviewer wants to get an idea of your product sense intuition. Specifically, these questions assess your ability to identify which metrics should be proposed in order to understand a product.

1. How would you measure the success of private stories on Instagram, where only certain close friends can see the story?

Start by answering: What is the goal of the private story feature on Instagram? You can’t evaluate “success” without knowing what the initial objective of the product was, to begin with.

One specific goal of this feature would be to drive engagement. A private story could potentially increase interactions between users, and grow awareness of the feature.

Now, what types of metrics might you propose to assess user engagement? For a high-level overview, we could look at:

  • Average stories per user per day
  • Average Close Friends stories per user per day

However, we would also want to further bucket our users to see the effect that Close Friends stories have on user engagement. By bucketing users by age, date joined, or another metric, we could see how engagement is affected within certain populations, giving us insight on success that could be lost if looking at the overall population.

2. How would you measure the success of acquiring new users through a 30-day free trial at Netflix?

More context: Netflix is offering a promotion where users can enroll in a 30-day free trial. After 30 days, customers will automatically be charged based on their selected package. How would you measure acquisition success, and what metrics would you propose to measure the success of the free trial?

One way we can frame the concept specifically to this problem is to think about controllable inputs, external drivers, and then the observable output . Start with the major goals of Netflix:

  • Acquiring new users to their subscription plan.
  • Decreasing churn and increasing retention.

Looking at acquisition output metrics specifically, there are several top-level stats that we can look at, including:

  • Conversion rate percentage
  • Cost per free trial acquisition
  • Daily conversion rate

With these conversion metrics, we would also want to bucket users by cohort. This would help us see the percentage of free users who were acquired, as well as retention by cohort.

3. How would you measure the success of Facebook Groups?

Start by considering the key function of Facebook Groups . You could say that Groups are a way for users to connect with other users through a shared interest or real-life relationship. Therefore, the user’s goal is to experience a sense of community, which will also drive our business goal of increasing user engagement.

What general engagement metrics can we associate with this value? An objective metric like Groups monthly active users would help us see if Facebook Groups user base is increasing or decreasing. Plus, we could monitor metrics like posting, commenting, and sharing rates.

There are other products that Groups impact, however, specifically the Newsfeed. We need to consider Newsfeed quality and examine if updates from Groups clog up the content pipeline and if users prioritize those updates over other Newsfeed items. This evaluation will give us a better sense of if Groups actually contribute to higher engagement levels.

4. How would you analyze the effectiveness of a new LinkedIn chat feature that shows a “green dot” for active users?

Note: Given engineering constraints, the new feature is impossible to A/B test before release.

When you approach case study questions, remember always to clarify any vague terms. In this case, “effectiveness” is very vague. To help you define that term, you would want first to consider what the goal is of adding a green dot to LinkedIn chat.


5. How would you diagnose why weekly active users are up 5%, but email notification open rates are down 2%?

What assumptions can you make about the relationship between weekly active users and email open rates? With a case question like this, you would want to first answer that line of inquiry before proceeding.

Hint: Open rate can decrease when its numerator decreases (fewer people open emails) or its denominator increases (more emails are sent overall). Taking these two factors into account, what are some hypotheses we can make about our decrease in the open rate compared to our increase in weekly active users?
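As a quick worked example of the denominator effect: if 30 of 100 sent emails were opened last month (a 30% open rate), and growth pushes sends up to 120 while opens hold at 30, the open rate falls to 25% even though no individual user changed their behavior. A rise in weekly active users could drive exactly this kind of increase in total emails sent.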

Data Analytics Case Study Questions

Data analytics case studies ask you to dive into analytics problems. Typically these questions ask you to examine metrics trade-offs or investigate changes in metrics. In addition to proposing metrics, you also have to write SQL queries to generate the metrics, which is why they are sometimes referred to as SQL case study questions .

6. Using the provided data, generate some specific recommendations on how DoorDash can improve.

In this DoorDash analytics case study take-home question you are provided with the following dataset:

  • Customer order time
  • Restaurant order time
  • Driver arrives at restaurant time
  • Order delivered time
  • Customer ID
  • Amount of discount
  • Amount of tip

With a dataset like this, there are numerous recommendations you can make. A good place to start is by thinking about the DoorDash marketplace, which includes drivers, riders and merchants. How could you analyze the data to increase revenue, driver/user retention and engagement in that marketplace?
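For example, a short pandas sketch (the rows and column names below are assumed from the field list above) could decompose total delivery time into stages to see where delays accumulate:

    import pandas as pd

    # Hypothetical rows from the take-home dataset; column names are assumed
    orders = pd.DataFrame({
        "customer_order_time":   pd.to_datetime(["2024-01-01 18:00", "2024-01-01 19:10"]),
        "restaurant_order_time": pd.to_datetime(["2024-01-01 18:06", "2024-01-01 19:13"]),
        "driver_at_restaurant":  pd.to_datetime(["2024-01-01 18:25", "2024-01-01 19:40"]),
        "delivered_time":        pd.to_datetime(["2024-01-01 18:50", "2024-01-01 20:05"]),
    })

    # Break the delivery pipeline into stage durations, in minutes
    def minutes(start, end):
        return (orders[end] - orders[start]).dt.total_seconds() / 60

    orders["restaurant_lag"] = minutes("customer_order_time", "restaurant_order_time")
    orders["driver_wait"]    = minutes("restaurant_order_time", "driver_at_restaurant")
    orders["transit"]        = minutes("driver_at_restaurant", "delivered_time")

    # The slowest stage on average is the first candidate for a recommendation
    print(orders[["restaurant_lag", "driver_wait", "transit"]].mean())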

7. After implementing a notification change, the total number of unsubscribes increases. Write a SQL query to show how unsubscribes are affecting login rates over time.

This is a Twitter data science interview question , and let’s say you implemented this new feature using an A/B test. You are provided with two tables: events (which includes login, nologin and unsubscribe ) and variants (which includes control or variant ).

We are tasked with comparing multiple different variables at play here. There is the new notification system, along with its effect of creating more unsubscribes. We can also see how login rates compare for unsubscribes for each bucket of the A/B test.

Given that we want to measure two different changes, we know we have to use GROUP BY for the two variables: date and bucket variant. What comes next?
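One way to sketch that GROUP BY logic, shown here in pandas rather than SQL and with the A/B bucket already joined onto the events for brevity, is to compute the share of events that are logins per day and per variant:

    import pandas as pd

    # Hypothetical events, one row per user event, with the variant pre-joined
    events = pd.DataFrame({
        "date":    ["2024-01-01"] * 4 + ["2024-01-02"] * 4,
        "variant": ["control", "control", "variant", "variant"] * 2,
        "event":   ["login", "nologin", "login", "unsubscribe",
                    "login", "login", "nologin", "unsubscribe"],
    })

    # Login rate per day per A/B bucket: logins divided by all events
    rates = (events.assign(is_login=events["event"].eq("login"))
                   .groupby(["date", "variant"])["is_login"].mean()
                   .reset_index(name="login_rate"))
    print(rates)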

8. Write a query to disprove the hypothesis: Data scientists who switch jobs more often end up getting promoted faster.

More context: You are provided with a table of user experiences representing each person’s past work experiences and timelines.

This question requires a bit of creative problem-solving to understand how we can prove or disprove the hypothesis. The hypothesis is that a data scientist that ends up switching jobs more often gets promoted faster.

Therefore, in analyzing this dataset, we can test the hypothesis by separating the data scientists into segments based on how often they switch jobs.

For example, if we looked at the number of job switches for data scientists that have been in their field for five years, we could prove the hypothesis that the number of data science managers increased as the number of career jumps also rose.

  • Never switched jobs: 10% are managers
  • Switched jobs once: 20% are managers
  • Switched jobs twice: 30% are managers
  • Switched jobs three times: 40% are managers

9. Write a SQL query to investigate the hypothesis: Click-through rate is dependent on search result rating.

More context: You are given a table with search results on Facebook, which includes query (search term), position (the search position), and rating (human rating from 1 to 5). Each row represents a single search and includes a column has_clicked that represents whether a user clicked or not.

This question requires us to formulaically do two things: create a metric that can analyze a problem that we face and then actually compute that metric.

Think about the data we want to display to prove or disprove the hypothesis. Our output metric is CTR (clickthrough rate). If CTR is high when search result ratings are high and CTR is low when the search result ratings are low, then our hypothesis is proven. However, if the opposite is true, CTR is low when the search result ratings are high, or there is no proven correlation between the two, then our hypothesis is not proven.

With that structure in mind, we can then look at the results split into different search rating buckets. If we measure the CTR for queries that all have results rated at 1 and then measure CTR for queries that have results rated at lower than 2, etc., we can measure to see if the increase in rating is correlated with an increase in CTR.
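A pandas version of that bucketed comparison might look like the following, with a tiny hypothetical table standing in for the Facebook search results:

    import pandas as pd

    # Hypothetical search results: one row per search
    results = pd.DataFrame({
        "query":       ["cats", "cats", "dogs", "dogs", "fish", "fish"],
        "rating":      [1, 5, 2, 4, 3, 5],
        "has_clicked": [0, 1, 0, 1, 0, 1],
    })

    # CTR per rating bucket: if CTR rises with rating, the hypothesis holds
    print(results.groupby("rating")["has_clicked"].mean())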

10. How would you help a supermarket chain determine which product categories should be prioritized in their inventory restructuring efforts?

You’re working as a Data Scientist in a local grocery chain’s data science team. The business team has decided to allocate store floor space by product category (e.g., electronics, sports and travel, food and beverages). Help the team understand which product categories to prioritize as well as answering questions such as how customer demographics affect sales, and how each city’s sales per product category differs.


Modeling and Machine Learning Case Questions

Machine learning case questions assess your ability to build models to solve business problems. These questions can range from applying machine learning to solve a specific case scenario to assessing the validity of a hypothetical existing model . The modeling case study requires a candidate to evaluate and explain any certain part of the model building process.

11. Describe how you would build a model to predict Uber ETAs after a rider requests a ride.

Common machine learning case study problems like this are designed to explain how you would build a model. Many times this can be scoped down to specific parts of the model building process. Examining the example above, we could break it up into:

How would you evaluate the predictions of an Uber ETA model?

What features would you use to predict the Uber ETA for ride requests?

Our recommended framework breaks down a modeling and machine learning case study to individual steps in order to tackle each one thoroughly. In each full modeling case study, you will want to go over:

  • Data processing
  • Feature Selection
  • Model Selection
  • Cross Validation
  • Evaluation Metrics
  • Testing and Roll Out

12. How would you build a model that sends bank customers a text message when fraudulent transactions are detected?

Additionally, the customer can approve or deny the transaction via text response.

Let’s start out by understanding what kind of model would need to be built. We know that since we are working with fraud, there has to be a case where either a fraudulent transaction is or is not present .

Hint: This problem is a binary classification problem. Given the problem scenario, what considerations do we have to think about when first building this model? What would the bank fraud data look like?
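A minimal modeling sketch for this setup, assuming a labeled and heavily imbalanced transaction history (here generated synthetically), might start like this:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for bank transactions: fraud is rare (~2% of rows)
    X, y = make_classification(n_samples=10_000, n_features=15,
                               weights=[0.98], random_state=7)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

    # class_weight="balanced" counteracts the heavy class imbalance
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

    # Precision and recall matter far more than raw accuracy for rare fraud
    print(classification_report(y_te, clf.predict(X_te)))

The false positive/false negative trade-off from the prompt then becomes a question of where to set the decision threshold on the model’s predicted probabilities.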

13. How would you design the inputs and outputs for a model that detects potential bombs at a border crossing?

Additional questions: How would you test the model and measure its accuracy? Remember the equations for precision and recall:

$\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$

Because we cannot afford false negatives (a missed bomb), recall should be high when assessing the model.

14. Which model would you choose to predict Airbnb booking prices: Linear regression or random forest regression?

Start by answering this question: What are the main differences between linear regression and random forest?

Random forest regression is based on the ensemble machine learning technique of bagging . The two key concepts of random forests are:

  • Random sampling of training observations when building trees.
  • Random subsets of features for splitting nodes.

Random forest regressions also effectively discretize continuous variables, since the underlying decision trees split both categorical and continuous features into regions.

Linear regression, on the other hand, is the standard regression technique in which relationships are modeled using a linear predictor function, the most common example represented as y = Ax + B.

Let’s see how each model is applicable to Airbnb’s bookings. One thing we need to do in the interview is to understand more context around the problem of predicting bookings. To do so, we need to understand which features are present in our dataset.

We can assume the dataset will have features like:

  • Location features.
  • Seasonality.
  • Number of bedrooms and bathrooms.
  • Private room, shared, entire home, etc.
  • External demand (conferences, festivals, sporting events).

Which model would be the best fit for this feature set?
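To ground the comparison, here is a minimal sketch that cross-validates both candidates; the synthetic data below is generated by a linear process, so linear regression will win here, which is exactly the point: the better model depends on the structure of the features, not on which algorithm is fancier:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for booking-price features (location, seasonality, etc.)
    X, y = make_regression(n_samples=2000, n_features=10, noise=20, random_state=3)

    # Compare both models on identical folds using R^2
    for model in (LinearRegression(), RandomForestRegressor(random_state=3)):
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(type(model).__name__, f"mean R^2 = {r2:.3f}")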

15. Using a binary classification model that pre-approves candidates for a loan, how would you give each rejected application a rejection reason?

More context: You do not have access to the feature weights.

Start by thinking about the problem like this: How would the problem change if we had ten, one thousand, or ten thousand applicants that had gone through the loan qualification program?

Pretend that we have three people: Alice, Bob, and Candace that have all applied for a loan. Simplifying the financial lending loan model, let us assume the only features are the total number of credit cards , the dollar amount of current debt , and credit age . Here is a scenario:

Alice: 10 credit cards, 5 years of credit age, $\$20K$ in debt

Bob: 10 credit cards, 5 years of credit age, $\$15K$ in debt

Candace: 10 credit cards, 5 years of credit age, $\$10K$ in debt

If Candace is approved, we can logically point to the fact that Candace’s $\$10K$ in debt swung the model to approve her for a loan. How did we reason this out?

If the sample size analyzed was instead thousands of people who had the same number of credit cards and credit age with varying levels of debt, we could figure out the model’s average loan acceptance rate for each numerical amount of current debt. Then we could plot these on a graph to model the y-value (average loan acceptance) versus the x-value (dollar amount of current debt). These graphs are called partial dependence plots.
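A sketch of how such a plot could be generated with scikit-learn’s inspection utilities (recent versions), on synthetic stand-in loan features, follows; matplotlib is required for the display:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import PartialDependenceDisplay

    # Synthetic stand-in for loan applications with 3 features, as in the example
    X, y = make_classification(n_samples=2000, n_features=3, n_informative=3,
                               n_redundant=0, random_state=11)

    model = GradientBoostingClassifier(random_state=11).fit(X, y)

    # Partial dependence of approval on feature 1 (the "current debt" analogue)
    PartialDependenceDisplay.from_estimator(model, X, features=[1])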

Business Case Questions

In data science interviews, business case study questions task you with addressing problems as they relate to the business. You might be asked about topics like estimation and calculation, as well as applying problem-solving to a larger case. One tip: Be sure to read up on the company’s products and ventures before your interview to expose yourself to possible topics.

16. How would you estimate the average lifetime value of customers at a business that has existed for just over one year?

More context: You know that the product costs $\$100$ per month, averages 10% in monthly churn, and the average customer stays for 3.5 months.

Remember that lifetime value is defined by the prediction of the net revenue attributed to the entire future relationship with all customers averaged. Therefore, $\$100$ * 3.5 = $\$350$… But is it that simple?

Because this company is so new, our average customer length (3.5 months) is biased from the short possible length of time that anyone could have been a customer (one year maximum). How would you then model out LTV knowing the churn rate and product cost?
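One common way to model it, assuming churn stays constant month over month (a geometric lifetime), is to take the reciprocal of the churn rate as the expected customer lifetime:

$\text{LTV} = \text{monthly price} \times \frac{1}{\text{monthly churn}} = \$100 \times \frac{1}{0.10} = \$1{,}000$

This sidesteps the truncation bias: the observed 3.5-month average is capped by the company’s one-year history, while the churn-based estimate projects the full expected lifetime.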

17. How would you go about removing duplicate product names (e.g. iPhone X vs. Apple iPhone 10) in a massive database?
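One lightweight approach, sketched below with Python’s standard-library difflib on made-up product names, is to normalize names and flag pairs whose similarity exceeds a threshold:

    from difflib import SequenceMatcher
    from itertools import combinations

    # Hypothetical product names containing likely duplicates
    names = ["iPhone X", "Apple iPhone 10", "Galaxy S10", "Samsung Galaxy S10"]

    def normalize(name):
        # Lowercase and drop known brand prefixes before comparing
        tokens = [t for t in name.lower().split() if t not in {"apple", "samsung"}]
        return " ".join(tokens)

    # Flag pairs whose normalized similarity exceeds a threshold
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score > 0.7:
            print(f"possible duplicate: {a!r} ~ {b!r} (score {score:.2f})")

At the scale of a massive database, the all-pairs loop would be replaced with blocking or locality-sensitive hashing so that only plausible candidate pairs are ever compared.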


18. What metrics would you monitor to know if a 50% discount promotion is a good idea for a ride-sharing company?

This question has no correct answer and is rather designed to test your reasoning and communication skills related to product/business cases. First, start by stating your assumptions. What are the goals of this promotion? It is likely that the goal of the discount is to grow revenue and increase retention. A few other assumptions you might make include:

  • The promotion will be applied uniformly across all users.
  • The 50% discount can only be used for a single ride.

How would we be able to evaluate this pricing strategy? An A/B test between the control group (no discount) and test group (discount) would allow us to evaluate long-term revenue against the average cost of the promotion. Using these two metrics, how could we measure whether the promotion is a good idea?

19. A bank wants to create a new partner card (e.g., a Whole Foods Chase credit card). How would you determine what the next partner card should be?

More context: Say you have access to all customer spending data. With this question, there are several approaches you can take. As your first step, think about the business reason for credit card partnerships: they help increase acquisition and customer retention.

One of the simplest solutions would be to sum all transactions grouped by merchants. This would identify the merchants who see the highest spending amounts. However, the one issue might be that some merchants have a high-spend value but low volume. How could we counteract this potential pitfall? Is the volume of transactions even an important factor in our credit card business? The more questions you ask, the more may spring to mind.
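The sum-by-merchant starting point, with transaction counts alongside to expose the high-spend/low-volume pitfall, might look like this in pandas (table and column names assumed):

    import pandas as pd

    # Hypothetical customer spending data
    txns = pd.DataFrame({
        "merchant": ["Whole Foods", "Whole Foods", "Delta", "Costco", "Costco", "Delta"],
        "amount":   [120, 85, 900, 240, 310, 1100],
    })

    # Total spend surfaces big merchants; transaction count exposes low-volume ones
    summary = (txns.groupby("merchant")["amount"]
                   .agg(total_spend="sum", n_transactions="count")
                   .sort_values("total_spend", ascending=False))
    print(summary)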

20. How would you assess the value of keeping a TV show on a streaming platform like Netflix?

Say that Netflix is working on a deal to renew the streaming rights for a show like The Office , which has been on Netflix for one year. Your job is to value the benefit of keeping the show on Netflix.

Start by trying to understand the reasons why Netflix would want to renew the show. Netflix mainly has three goals for what their content should help achieve:

  • Acquisition: To increase the number of subscribers.
  • Retention: To increase the retention of active subscribers and keep them on as paying members.
  • Revenue: To increase overall revenue.

One solution to value the benefit would be to estimate a lower and upper bound to understand the percentage of users that would be affected by The Office being removed. You could then run these percentages against your known acquisition and retention rates.
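As a back-of-envelope sketch of that bounding exercise (every input below is a made-up placeholder, not a real Netflix figure):

```python
# Sketch: bound the annual retention value of keeping the show.
subscribers = 200_000_000         # placeholder subscriber base
monthly_revenue_per_sub = 15.0    # placeholder ARPU in $

# Assumed lower/upper bounds on subscribers who would churn if the show left.
for churn_pct in (0.001, 0.005):  # 0.1% to 0.5%, placeholder bounds
    at_risk = subscribers * churn_pct * monthly_revenue_per_sub * 12
    print(f"{churn_pct:.1%} churn -> ${at_risk:,.0f}/year at risk")
```

Comparing that range against the asking price of the renewal gives a defensible first answer.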

21. How would you determine which products are to be put on sale?

Let’s say you work at Amazon. It’s nearing Black Friday, and you are tasked with determining which products should be put on sale. You have access to historical pricing and purchasing data from items that have been on sale before. How would you determine what products should go on sale to best maximize profit during Black Friday?

To start with this question, aggregate data from previous years for products that have been on sale during Black Friday or similar events. You can then compare elements such as historical sales volume, inventory levels, and profit margins.
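One hedged way to operationalize that comparison is to estimate each product's sale lift from past events and rank by expected incremental profit. The sketch below assumes a tidy history table and an illustrative 20% discount; the file and column names are hypothetical:

```python
# Sketch: rank products by expected profit when discounted, from sale history.
import pandas as pd

hist = pd.read_csv("black_friday_history.csv")  # hypothetical: product_id, on_sale, units, price, unit_cost
hist["on_sale"] = hist["on_sale"].astype(bool)

# Average units sold on-sale vs. off-sale, per product.
lift = (
    hist.groupby(["product_id", "on_sale"])["units"].mean()
        .unstack("on_sale")
        .rename(columns={False: "base_units", True: "sale_units"})
)
margins = hist.groupby("product_id")[["price", "unit_cost"]].mean()
lift["sale_margin"] = margins["price"] * 0.8 - margins["unit_cost"]  # assumed 20% discount
lift["expected_profit"] = lift["sale_units"] * lift["sale_margin"]

print(lift.nlargest(20, "expected_profit"))
```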


More Data Science Interview Resources

Case studies are one of the most common types of data science interview questions. Practice with the data science course from Interview Query, which includes product and machine learning modules.

Top 10 Data Science Case Study Interview Questions for 2024

Data Science Case Study Interview Questions and Answers to Crack Your next Data Science Interview.


According to Harvard Business Review, the data scientist role has been termed "the sexiest job of the 21st century." Data science has gained widespread importance due to the abundance of available data: as per the statistics below, worldwide data is expected to reach 181 zettabytes by 2025.

[Figure: projected growth of worldwide data volume to 181 zettabytes by 2025. Source: Statista, 2021]


“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” — Clive Humby, 2006

Table of Contents

  • What is a Data Science Case Study?
  • Why are Data Scientists Tested on Case Study-Based Interview Questions?
  • How to Answer Case Study-Based Data Science Interview Questions? (Research About the Company, Ask Questions, Discuss Assumptions and Hypotheses, Explaining the Data Science Workflow)
  • 10 Data Science Case Study Interview Questions and Answers


A data science case study is an in-depth, detailed examination of a particular case (or cases) within a real-world context: a real-world business problem that a data scientist works on, building machine learning or deep learning models and programs to construct an optimal solution. For aspiring data professionals, such a case study is typically a portfolio project that takes at least 10-16 weeks of solving real-world data science problems. Data science use cases can be found in almost every industry: e-commerce, music streaming, the stock market, and more. The possibilities are endless.


A case study evaluation allows the interviewer to understand your thought process. Questions on case studies can be open-ended; hence, you should be flexible enough to accept and appreciate approaches you might not have taken to solve the business problem. All interviews are different, but the framework below applies to most data science interviews and is a good starting point for making a solid first impression in your next data science job interview. In a data science interview, you are expected to explain your data science project lifecycle, and you should choose an approach that broadly covers all the data science lifecycle activities. The seven steps below will help you get started in the right direction.


Business Understanding — Explain the business problem and the objectives for the problem you solved.

Data Mining — How did you source the required data? Here you can talk about the connections (e.g., database connections to Oracle, SAP, etc.) you set up to acquire your data.

Data Cleaning — Explain the data inconsistencies and how you handled them.

Data Exploration — Talk about the exploratory data analysis you performed for the initial investigation of your data to spot patterns and anomalies.

Feature Engineering — Talk about the approach you took to select the essential features and how you derived new ones that added more meaning to the dataset.

Predictive Modeling — Explain the machine learning model you trained, how you chose the final algorithm, and the evaluation techniques you used to validate its accuracy.

Data Visualization — Communicate the findings through visualization and what feedback you received.


How to Answer Case Study-Based Data Science Interview Questions?

During the interview, you can also be asked to solve and explain open-ended, real-world case studies. This case study can be relevant to the organization you are interviewing for. The key to answering this is to have a well-defined framework in your mind that you can implement in any case study, and we uncover that framework here.

Ensure that you read about the company and its work on its official website before appearing for the data science job interview. Also, research the position you are interviewing for and understand the JD (job description). Read about the domain and the businesses they are associated with. This will give you a good idea of what questions to expect.

As case study interviews are usually open-ended, you can solve the problem in many ways. A common mistake is jumping to the answer straight away.

Try to understand the context of the business case and its key objective, and uncover the details the interviewer has intentionally kept hidden. Here is a list of questions you might ask if you were interviewing with a financial institution:

Does the dataset include all transactions from the bank, or only transactions from a specific department like loans, insurance, etc.?

Is the customer data provided pre-processed, or do I need to run a statistical test to check data quality?

Which segment of borrowers is the business targeting or focusing on? Which parameters can be used to avoid bias during loan disbursal?

Make informed, well-thought-out assumptions to simplify the problem. Discuss your assumptions with the interviewer and explain why you would make them. Try to narrow the problem down to key objectives you can solve. Here are a few instances:

As car sales increase consistently over time with no significant spikes, I assume seasonal changes do not impact car sales. Hence, I would prefer modeling that excludes the seasonality component.

As you confirmed, the incoming data does not require any preprocessing. Hence, I will skip running statistical tests to check data quality and move on to feature selection.

As IoT devices capture temperature data every minute but the forecast is required daily, I would average the minute-level data up to the day level so the model works with daily data.


Now that you have a clear and focused objective for the business case, you can start leveraging the 7-step framework outlined above. Think about the mining and cleaning activities you would need to perform, talk about feature selection and why you would prefer some features over others, and, lastly, explain how you would select the right machine learning model for the business problem. Here is an example for predicting car purchases from auctions:

First, prepare the relevant data by accessing the data available from various auctions. I will selectively choose data from completed auctions, and while selecting it, I need to ensure the data is not imbalanced.

Next, I will implement feature engineering and selection to create and select relevant features such as car manufacturer, year of purchase, and automatic or manual transmission. I will iterate on this process if the results on the test set are not good.

Since this is a classification problem, I will check predictions using decision trees and random forests, as these algorithms tend to do well on classification problems. If the score is unsatisfactory, I can perform hyperparameter tuning to fine-tune the model and achieve a better accuracy score.
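A minimal scikit-learn sketch of that modeling step (the dataset, file name, and label column are placeholders):

```python
# Sketch: random forest baseline for predicting good vs. bad auction purchases.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("auction_cars.csv")  # hypothetical: features + 'is_good_buy' label
X = pd.get_dummies(df.drop(columns=["is_good_buy"]))  # one-hot encode categoricals
y = df["is_good_buy"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # stratify to respect class balance
)

model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```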

In the end, summarize the answer and explain how your solution is best suited for this business case and how the team can leverage it to gain more customers. For instance, building on the car purchase prediction analogy, your response could be:

For cars predicted as good buys during an auction, dealers can purchase those cars and minimize the overall losses they would otherwise incur by buying bad cars.

Data Science Case Study Interview Questions and Answers

Often, the company you are being interviewed for would select case study questions based on a business problem they are trying to solve or have already solved. Here we list down a few case study-based data science interview questions and the approach to answering those in the interviews. Note that these case studies are often open-ended, so there is no one specific way to approach the problem statement.

1. How would you improve a bank's existing state-of-the-art credit scoring of borrowers? How would you predict whether someone will face financial distress in the next couple of years?

Assume the interviewer has given you access to the dataset. As explained earlier, you can take the following approach.

Ask Questions — 

Q: What parameters does the bank consider for borrowers while calculating credit scores? Do these parameters vary among borrowers of different categories based on age group, income level, etc.?

Q: How do you define financial distress? What features are taken into consideration?

Q: Banks lend different types of loans, such as car loans, personal loans, and bike loans. Do you want me to focus on any one loan category?

Discuss the Assumptions  — 

As the debt ratio is computed relative to monthly income, we assume that people with a very high debt ratio (i.e., loan obligations far higher than their monthly income) are outliers.

Monthly income tends to vary (mainly upward) over two years. Cases where the monthly income is perfectly constant can be treated as data entry issues and excluded from the analysis. I will use a regression model to fill in missing values.


Building end-to-end Data Science Workflows — 

First, I will carefully select the relevant data for my analysis, dropping records with implausible values such as extremely high debt ratios or inconsistent monthly incomes.

Next, I will identify essential features and ensure they do not contain missing values; if they do, I will fill them in. For instance, age seems to be a necessary feature for accepting or denying a loan. I will also ensure the data is not imbalanced, as only a meager percentage of borrowers will be defaulters compared to the complete dataset.

As this is a binary classification problem, I will start with logistic regression and slowly progress towards complex models like decision trees and random forests.
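A hedged baseline for that progression, starting with logistic regression (file and column names are illustrative):

```python
# Sketch: logistic regression baseline for 2-year financial distress.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

df = pd.read_csv("borrowers.csv")  # hypothetical: age, income, debt_ratio, ..., distress_2yr
X, y = df.drop(columns=["distress_2yr"]), df["distress_2yr"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight='balanced' compensates for the small share of defaulters.
clf = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced", max_iter=1000))
clf.fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

ROC-AUC is a reasonable first metric here precisely because the classes are imbalanced; raw accuracy would look deceptively high.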

Conclude — 

Banks play a crucial role in national economies. They decide who can get financing and on what terms, and they can make or break investment decisions. Individuals and companies need access to credit for markets and society to function.

You can leverage this credit scoring algorithm to determine whether or not a loan should be granted by predicting the probability that somebody will experience financial distress in the next two years.

2. At an e-commerce platform, how would you classify fruits and vegetables from the image data?

Q: Do the images in the dataset contain multiple fruits and vegetables, or would each image have a single fruit or a vegetable?

Q: Can you help me understand the number of estimated classes for this classification problem?

Q: What would be an ideal dimension for an image? Do the dimensions vary within the dataset? Are these color images or grayscale images?

Upon asking the above questions, let us assume the interviewer confirms that each image contains either one fruit or one vegetable, so there won't be multiple classes in a single image, and that the website has roughly 100 different varieties of fruits and vegetables. For simplicity, the dataset contains 50,000 images, each 100 x 100 pixels.

Assumptions and Preprocessing—

I need to evaluate the training and testing sets, so I will check for any imbalance within the dataset. The number of training images for each class should be consistent: if class A has n training images, class B should also have roughly n (within a variance of 5 to 10%). With 100 classes and 50,000 images, the average is close to 500 images per class.

I will then split the dataset into training and testing sets in an 80:20 ratio (or 70:30, whichever suits best). I assume the images provided might not cover all possible angles of the fruits and vegetables, and such a dataset can cause overfitting once training completes. I will keep techniques like data augmentation handy in case I face overfitting while training the model.

End to End Data Science Workflow — 

As this is a large dataset, I would first check the availability of GPUs, since processing 50,000 images requires heavy computation. I will use CUDA to move the training data to the GPU for training.

I would develop a convolutional neural network (CNN), as these networks tend to extract better features from images than feed-forward networks, and feature extraction is essential when building deep neural networks. CNNs also require far less computation than comparable feed-forward networks.

I will also consider techniques like batch normalization and learning-rate scheduling to improve the model's accuracy and overall performance. If I face overfitting on the validation set, I will use techniques like dropout and color normalization to overcome it.

Once the model is trained, I will test it on sample test images to observe its behavior. It is quite common for a model that does well on the training set to underperform on the test set, so evaluation on the test set is an important part of the process.
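A minimal PyTorch sketch of the network described above, sized for 100 x 100 RGB inputs and 100 classes (the architecture choices are illustrative, not prescriptive):

```python
# Sketch: small CNN with batch norm and dropout for 100-class produce images.
import torch
import torch.nn as nn

class FruitVegCNN(nn.Module):
    def __init__(self, num_classes: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),   # 100 -> 50
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),  # 50 -> 25
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                                                          # 25 -> 1
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Dropout(0.3), nn.Linear(128, num_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

device = "cuda" if torch.cuda.is_available() else "cpu"  # move work to the GPU when available
model = FruitVegCNN().to(device)
print(model(torch.randn(8, 3, 100, 100).to(device)).shape)  # torch.Size([8, 100])
```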

A fruit classification model can help the e-commerce industry classify images and tag fruits and vegetables with their correct categories. Fruit and vegetable processing industries can also use the model to sort produce into the correct categories and instruct devices to place items on the right conveyor belts for packaging and shipping to customers.


3. How would you determine whether Netflix focuses more on TV shows or movies?

Q: Should I include animation series and movies while doing this analysis?

Q: What is the business objective? Do you want me to analyze a particular genre like action, thriller, etc.?

Q: What is the targeted audience? Is this focus on children below a certain age or for adults?

Let us assume the interviewer responds by confirming that you must perform the analysis on both movies and TV shows, including animated content. The business intends to perform this analysis across all genres, and the target audience includes both adults and children.

Assumptions — 

It would be convenient to do this analysis by geography. As the US and India are among the largest content generators globally, I would restrict the initial analysis to these countries. Once the initial hypothesis is established, the approach can be scaled to other countries.

While analyzing movies in India, understanding releases across the months of the year can be an important metric. For example, there tend to be many releases around the holiday season (Diwali and Christmas) in November and December, which should be considered.

End to End  Data Science Workflow — 

Firstly, we need to select only the data relevant to movies and TV shows from the entire dataset. I would also need to ensure the completeness of the data, such as the year of release, month-wise release data, and country-wise data.

After preprocessing the dataset, I will do feature engineering to keep data only for the countries/geographies of interest. Then I can perform EDA to understand how movies and TV shows correlate with ratings, categories (dramas, comedies, etc.), actors, and so on.

Lastly, I would focus on recommendation clicks and revenue to understand which of the two generates the most revenue. The company would likely prefer the category generating the highest revenue (TV shows vs. movies).
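A hedged sketch of that comparison with pandas (the file and every column name below are placeholders):

```python
# Sketch: compare Movies vs. TV Shows on count, rating, and revenue.
import pandas as pd

catalog = pd.read_csv("catalog.csv")  # hypothetical: type, country, user_rating, revenue

subset = catalog[catalog["country"].isin(["United States", "India"])]

summary = subset.groupby("type").agg(   # 'Movie' vs. 'TV Show'
    titles=("type", "size"),
    avg_rating=("user_rating", "mean"),
    total_revenue=("revenue", "sum"),
)
print(summary.sort_values("total_revenue", ascending=False))
```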

This analysis would help the company invest in the right ventures and generate more revenue based on customer preferences. It would also help identify the best or most preferred categories, the best time of year to release, and the movie directors and actors their customers most want to see.


4. How would you detect fake news on social media?

Q: When you say social media, does that mean all the apps available on the internet, like Facebook, Instagram, Twitter, YouTube, etc.?

Q: Does the analysis include news titles? Does the news description carry significance?

Q: These platforms contain content in multiple languages; should the analysis be multilingual?

Let us assume the interviewer responds by confirming that the news feeds are available only from Facebook, that the news title and details appear in the same block without segregation, and that, for simplicity, we only need to categorize news written in English.

Assumptions and Data Preprocessing — 

I would first segregate the news title from the description. The news title usually contains the key phrases and the intent behind the news, and processing titles requires far less computation than processing the full text, leading to a more efficient solution.

I would also check for data imbalance, as an imbalanced dataset can bias the model toward a particular class.

I would also take a subset of news focused on a specific category like sports, finance, etc. Gradually, I will increase the model's scope; this news subset helps set up a baseline model that can be tweaked later based on requirements.

Firstly, it is essential to select the data based on the chosen category. I will take sports as the category to start my analysis with.

I will first clean the dataset by checking for null records. Once this check is done, the data needs formatting before it can be fed to a neural network. I will write a function to strip characters like !"#$%&'()*+,-./:;<=>?@[]^_`{|}~, as they add no value for the network's learning, and apply a stopword list to remove words like 'and' and 'is' from the vocabulary.

Then I will employ NLP techniques like bag of words or TF-IDF, depending on their significance. Bag of words is faster, but TF-IDF can be more accurate at the cost of speed; selecting the technique would also depend on business inputs.

I will then split the data into training and testing sets, train a machine learning model, and check its performance. Since the dataset is text-heavy, models like Naive Bayes tend to perform well in these situations.
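A minimal scikit-learn sketch of that pipeline (file and column names are illustrative):

```python
# Sketch: TF-IDF over news titles + Naive Bayes baseline for fake-news detection.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

news = pd.read_csv("sports_news.csv")  # hypothetical: 'title', 'is_fake'
X_train, X_test, y_train, y_test = train_test_split(
    news["title"], news["is_fake"], test_size=0.2, stratify=news["is_fake"], random_state=0
)

# The vectorizer lowercases text, tokenizes away punctuation, and drops stopwords.
clf = make_pipeline(TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), MultinomialNB())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```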

Conclude  — 

Social media and news outlets publish fake news to increase readership or as part of psychological warfare. In general, the goal is to profit through clickbait: flashy headlines or designs that lure users and entice curiosity so they click links, increasing advertising revenue. The trained model will help curb such news and add value to readers' time.


5. How would you forecast the price of a Nifty 50 stock?

Q: Do you want me to forecast the Nifty 50 index/tracker itself or the stock price of a specific stock within the Nifty 50?

Q: What do you want me to forecast? Is it the opening price, closing price, VWAP, highest of the day, etc.?

Q: Do you want me to forecast daily, weekly, or monthly prices?

Q: Can you tell me more about the historical data available? Do we have 10 or 15 years of recorded data?

With all these questions asked, let us assume the interviewer responds by saying that you should pick one stock among the Nifty 50 stocks and forecast its average price daily, and that the company has historical data for the last 20 years.

Assumptions and Data preprocessing — 

As we forecast the average price daily, I would consider VWAP as my target or predicted value. VWAP stands for Volume-Weighted Average Price: the ratio of the cumulative traded value (price times volume) to the cumulative volume traded over a given period.

Solving this data science case study requires tracking the average price over a period, which is a classical time series problem. Hence, I would refrain from using classical regression models on the time series data, as we have a separate family of models (ARIMA, auto-ARIMA, SARIMA, etc.) designed for such datasets.

Like any other dataset, I will first check for nulls and understand their percentage. If they are a small share of the data, I would drop those records.

Next, I will perform exploratory data analysis to understand how the average price has varied over the last 20 years. This also helps me understand the trend and seasonality components of the time series. Additionally, I will use techniques like the Dickey-Fuller test to determine whether the time series is stationary.

Usually, such a time series is not stationary. I can then decompose it to understand whether it is additive or multiplicative in nature and apply techniques like differencing, rolling statistics, or transformations to make the time series stationary.

Lastly, once the time series is stationary, I will split the train and test data based on dates and apply techniques like ARIMA or Facebook Prophet to train the machine learning model.
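A hedged sketch of that workflow with statsmodels; the file name is hypothetical, and the ARIMA order is illustrative (in practice it would come from ACF/PACF plots or an auto-ARIMA search):

```python
# Sketch: stationarity check + ARIMA forecast of daily VWAP.
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

vwap = pd.read_csv("stock_daily.csv", parse_dates=["date"], index_col="date")["vwap"]

# Dickey-Fuller test: a p-value above 0.05 suggests the series is non-stationary.
print("ADF p-value:", adfuller(vwap.dropna())[1])

train, test = vwap[:-30], vwap[-30:]  # hold out the last 30 trading days

# d=1 applies one round of differencing inside the model; (p, d, q) is illustrative.
model = ARIMA(train, order=(5, 1, 2)).fit()
print(model.forecast(steps=len(test)).head())
```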

Major applications of such time series prediction occur in stock and financial trading, analysis of online and offline retail sales, and medical records such as heart rate, EKG, MRI, and ECG.

Time series datasets invoke a lot of enthusiasm among data scientists. There are many different ways to approach a time series problem, and the process mentioned above is only one of the known techniques.


6. How would you forecast Walmart's weekly sales? Which departments are impacted most during the holidays?

Q: Walmart usually operates three different store types - supermarkets, discount stores, and neighborhood stores. Which store type's data shall I pick to get started with my analysis? Are the sales tracked in US dollars?

Q: How would I identify holidays in the historical data provided? Are the stores closed during Black Friday week, Super Bowl week, or Christmas week?

Q: What are the evaluation or the loss criteria? How many departments are present across all store types?

Let us assume the interviewer responds by saying you must forecast weekly sales department-wise (not store-type-wise) in US dollars, that a flag within the dataset marks weeks containing holidays, and that there are over 80 departments across the three store types.

As we are predicting weekly sales, I would treat weekly sales as the target variable for the model.

Since we are tracking sales weekly, we will use a regression model to predict the target variable, "Weekly_Sales," a grouped/hierarchical time series. We will explore the following categories of models, engineer features, and tune hyperparameters to choose the model with the best fit:

- Linear models

- Tree models

- Ensemble models

I will consider MAE, RMSE, and R2 as evaluation criteria.

End to End Data Science Workflow — 

The foremost step is to figure out the essential features within the dataset. I would explore store information regarding size, type, and the total number of stores present within the historical dataset.

The next step is feature engineering: since we have weekly sales data available, I would extract features like 'WeekOfYear', 'Month', 'Year', and 'Day' to help the model learn general trends.

Next, I will create store and department rank features, as identifying the most impacted departments is one of the end goals of the given problem. I would derive these features by calculating average weekly sales, as sketched below.
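A hedged sketch of those two feature-engineering steps with pandas (file and column names are placeholders):

```python
# Sketch: calendar features plus store/department rank features.
import pandas as pd

sales = pd.read_csv("walmart_weekly_sales.csv", parse_dates=["date"])  # hypothetical: date, store, dept, weekly_sales

# Calendar features so the model can learn seasonal trends.
sales["week_of_year"] = sales["date"].dt.isocalendar().week.astype(int)
sales["month"] = sales["date"].dt.month
sales["year"] = sales["date"].dt.year
sales["day"] = sales["date"].dt.day

# Average weekly sales per store and per department as simple rank features.
sales["store_rank"] = sales.groupby("store")["weekly_sales"].transform("mean")
sales["dept_rank"] = sales.groupby("dept")["weekly_sales"].transform("mean")
```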

Then I will perform exploratory data analysis (a.k.a. EDA) to understand the story the data has to tell. I will analyze store and weekly department sales across the historical data to spot seasonality and trends, plotting weekly sales against store and against department to judge their significance and decide whether these features should be retained and passed to the machine learning models.

After feature engineering and selection, I will set up a baseline model and evaluate it on MAE, RMSE, and R2. As this is a regression problem, I will begin with simple models like linear regression and an SGD regressor, and later move toward more complex models, such as a decision tree regressor or gradient-boosted regressors (e.g., LGBM), if the need arises.

Sales forecasting can play a significant role in the company’s success. Accurate sales forecasts allow salespeople and business leaders to make smarter decisions when setting goals, hiring, budgeting, prospecting, and other revenue-impacting factors. The solution mentioned above is one of the many ways to approach this problem statement.

With this, we come to the end of the post. Let us do a quick summary of the techniques we learned and how they can be implemented. We would also like to leave you with some practice case study questions to help you build up your thought process for the interview.

7. Considering an organization has a high attrition rate, how would you predict if an employee is likely to leave the organization?

8. How would you identify the best cities and countries for startups in the world?

9. How would you estimate the impact on air quality across geographies during Covid-19?

10. A company often faces machine failures at its factory. How would you develop a model for predictive maintenance?

Do not get intimidated by the problem statement; focus on your approach:

Ask questions to get clarity

Discuss assumptions rather than silently assuming things; let the data tell the story, or get your assumptions verified by the interviewer.

Build Workflows — Take a few minutes to put together your thoughts; start with a more straightforward approach.

Conclude — Summarize your answer and explain how it best suits the use case provided.

We hope these case study-based data scientist interview questions will give you more confidence to crack your next data science interview.


About the Author


ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies, with over 270 reusable project templates in data science and big data, each with step-by-step walkthroughs.


  • Open access
  • Published: 10 July 2023

The evolution of Big Data in neuroscience and neurology

Laura Dipietro, Paola Gonzalez-Mego, Ciro Ramos-Estebanez, Lauren Hana Zukowski, Rahul Mikkilineni, Richard Jarrett Rushmore & Timothy Wagner

Journal of Big Data, volume 10, Article number: 116 (2023)


Neurological diseases are on the rise worldwide, leading to increased healthcare costs and diminished quality of life in patients. In recent years, Big Data has started to transform the fields of Neuroscience and Neurology. Scientists and clinicians are collaborating in global alliances, combining diverse datasets on a massive scale, and solving complex computational problems that demand the utilization of increasingly powerful computational resources. This Big Data revolution is opening new avenues for developing innovative treatments for neurological diseases. Our paper surveys Big Data’s impact on neurological patient care, as exemplified through work done in a comprehensive selection of areas, including Connectomics, Alzheimer’s Disease, Stroke, Depression, Parkinson’s Disease, Pain, and Addiction (e.g., Opioid Use Disorder). We present an overview of research and the methodologies utilizing Big Data in each area, as well as their current limitations and technical challenges. Despite the potential benefits, the full potential of Big Data in these fields currently remains unrealized. We close with recommendations for future research aimed at optimizing the use of Big Data in Neuroscience and Neurology for improved patient outcomes.

Introduction

The field of Neuroscience was formalized in 1965 when the “Neuroscience Research Program” was established at the Massachusetts Institute of Technology with the objective of bringing together several varied disciplines including molecular biology, biophysics, and psychology to study the complexity of brain and behavior [ 1 ]. The methods employed by the group were largely data driven, with a foundation based on the integration of multiple unique data sets across numerous disciplines. As Neuroscience has advanced as a field, appreciation of the nervous system’s complexity has grown with the acquisition and analysis of larger and more complex datasets. Today, many Neuroscience subfields are implementing Big Data approaches, such as Computational Neuroscience [ 2 ], Neuroelectrophysiology [ 3 , 4 , 5 , 6 ], and Connectomics [ 7 ] to elucidate the structure and function of the brain. Modern Neuroscience technology allows for the acquisition of massive, heterogeneous data sets whose analysis requires a new set of computational tools and resources for managing computationally intensive problems [ 7 , 8 , 9 ]. Studies have advanced from small labs using a single outcome measure to large teams using multifaceted data (e.g., combined imaging, behavioral, and genetics data) collected across multiple international sites via numerous technologies and analyzed with high-performance computational methods and Artificial Intelligence (AI) algorithms. These Big Data approaches are being used to characterize the intricate structural and functional morphology of healthy nervous systems, and to describe and treat neurological disorders.

Jean-Martin Charcot (1825–1893), considered the father of Neurology, was a pioneering figure in utilizing a scientific, data-driven approach to innovate neurological treatments [ 10 ]. For example, in the study of multiple sclerosis (MS), once considered a general "nervous disorder" [ 10 ], Charcot's approach integrated multiple facets of anatomical and clinical data to delineate MS as a distinct disease. By connecting pathoanatomical data with behavioral and functional data, Charcot's work ultimately transformed our understanding and treatment of MS. Furthermore, Charcot’s use of medical photographs in his practice was an early instance of incorporating ‘imaging’ data in Neurology and Psychiatry [ 11 ]. Today, Neuroimaging, spurred on by new technologies, computational methods, and data types, is at the forefront of Big Data in Neurology [ 9 , 12 ]—see Fig.  1 . Current neurology initiatives commonly use large, highly heterogeneous datasets (e.g., neuroimaging, genetic testing, or clinical assessments from 1000s to 100,000s patients [ 13 , 14 , 15 , 16 , 17 , 18 ]) and acquire data with increasing velocity (e.g., using wearable sensors [ 6 ]) and technologies adapted from other Big Data fields (e.g., automatized clinical note assessment [ 19 ], social media-based infoveillance applications [ 16 , 20 ]). Similar to how Big Data has spurred on Neuroscience, the exponentially growing size, variety, and collection speed of datasets combined with the need to investigate their correlations is revolutionizing Neurology and patient care (see Fig.  1 ).

Figure 1. Evolution of data types [ 21 ]. The evolution of data types in the development of Computational Neuroscience can be traced from Golgi and Ramón y Cajal’s structural data descriptions of the neuron in the nineteenth century [ 22 ]; to Hodgkin, Huxley, and Eccles’s biophysical data characterization of the “all-or-none” action potential during the early to mid-twentieth century [ 23 ]; to McCulloch and Pitts’ work on the use of ‘the "all-or-none" character of nervous activity’ to model neural networks descriptive of the fundamentals of the nervous system [ 24 ]. Similarly, Connectomics’ data evolution [ 25 ] can be traced from Galen’s early dissection studies [ 26 ], to Wernicke’s and Broca’s postulations on structure and function [ 27 ], to imaging of the nervous system [ 28 , 29 ] and brain atlases (e.g., Brodmann, Talairach) and databases [ 30 , 31 ], into the Big Data field of today, as characterized by the Human Connectome Project [ 32 ] and massive whole-brain connectome models [ 7 , 33 ]. Behavioral Neuroscience and Neurology can be tracked from early brain injury studies [ 34 ], to stimulation and surgical studies [ 35 , 36 ], to Big Data assessments in cognition and behavior [ 37 ]. All these fields are prime examples of the transformative impact of the Big Data revolution on Neuroscience and Neurology sub-fields.

This paper examines the evolving impact of Big Data in Neuroscience and Neurology, with a focus on treating neurological disorders. We critically evaluate available solutions and limitations, propose methods to overcome these limitations, and highlight potential innovations that will shape the fields' future.

Problem definition

According to the United States (US) National Institutes of Health (NIH), neurological disorders affect approximately 50 million people in the US each year, with a total annual cost of hundreds of billions of dollars [ 38 ]. Globally, neurological disorders are responsible for the highest incidence of disability and rank as the second leading cause of death [ 39 ]. These numbers are expected to grow over time as the global population ages. The need for new and innovative treatments is of critical and growing importance given the tremendous personal and societal impact of diseases of the nervous system and brain.

Big Data holds great potential for advancing the understanding of neurological diseases and the development of new treatments. To comprehend how such advancements can occur and have been occurring, it is important to appreciate how this type of research is enabled, not only through methods classically used in clinical research in Neurology such as clinical trials but also via advancing Neuroscience research.

This paper aims to review how Big Data is currently used and transforming the fields of Neuroscience and Neurology to advance the treatment of neurological disorders. Our intent is not merely to survey the most prominent research in each area, but to give the reader a historical perspective on how key areas moved from an earlier Small Data phase to the current Big Data phase. For applications in Neurology, while numerous clinical areas are evolving with Big Data and exemplified herein (e.g., Depression, Stroke, Alzheimer’s Disease (AD)), we highlight its impact on Parkinson’s Disease (PD), Substance Use Disorders (SUD), and Pain to provide a varied, yet manageable, review of the impact of Big Data on patient care. To balance brevity and completeness, we summarize a fair amount of general information in tabular form and limit our narrative to exemplify the Big Data trajectories of Neurology and Neuroscience. Additionally, in surveying this literature, we have identified a common limitation; specifically, the conventional application of Big Data, as characterized by the 5 V’s (see Fig.  2 ), is often unevenly or insufficiently applied in Neurology and Neuroscience. The lack of standardization for the Big Data in studies across Neurology and Neuroscience as well as field-specific and study-specific differences in application limit the reach of Big Data for improving patient treatments. We will examine the reasons that contribute to any mismatch and areas where past studies have not reached their potential. Finally, we identify the limitations of current Big Data approaches and discuss possible solutions and opportunities for future research.

Figure 2. The 5 V’s. While the 5 V’s of Big Data (“Volume, Variety, Velocity, Veracity, and Value”) are clearly found in certain fields (e.g., social media), there are many "Big Data" Neuroscience and Neurology projects where categories are not explored or are underexplored. Many self-described “Big Data” studies are limited to Volume and/or Variety. Furthermore, most “Big Data” clinical trial speeds move at the variable pace of patient recruitment, which can pale in comparison to the speeds of Big Data Velocity in the finance and social media spaces. “Big Data” acquisition and processing times are also sporadically detailed in the fields. Finally, there is not an accepted definition of data Veracity as it pertains to healthcare (e.g., error, bias, incompleteness, inconsistency), and Veracity can be assessed on multiple levels (e.g., from data harmonization techniques to limitations in experimental methods used in studies).

Our paper differs from other Big Data review papers in Neuroscience and/or Neurology (e.g., [ 12 ], [ 40 , 41 , 42 , 43 ]) as it specifically examines the crucial role of Big Data in transforming the clinical treatment of neurological disorders. We go beyond previous papers that have focused on specific subfields (such as network data (e.g., [ 44 ]), neuroimaging (e.g., [ 12 ]), stroke (e.g., [ 45 ]), or technical methodologies related to data processing (e.g., [ 46 , 47 ]) and/or sharing (e.g., [ 48 , 49 ]). Furthermore, our review spans a broad range of treatments, from traditional pharmacotherapy to neuromodulation and personalized therapy guided by Big Data methods. This approach allows for a comparison of the evolving impact of Big Data across Neurology sub-specialties, such as Pain versus PD. Additionally, we take a cross-disciplinary approach to analyze applications in both Neuroscience and Neurology, synthesizing and categorizing available resources to facilitate insights between neuroscientists and neurologists. Finally, our study appraises the present implementation of the Big Data definition within the fields of Neuroscience and Neurology. Overall, we differentiate ourselves in terms of scope, breadth, and interdisciplinary analysis.

Existing solutions

Big Data use in Neuroscience and Neurology has matured as a result of national and multi-national projects [ 40 , 41 , 42 , 43 ]. In the early to mid-2000’s, several governments started national initiatives aimed at understanding brain function, such as the NIH Brain Initiative in the US [ 50 ], the Brain Project in Europe [ 51 , 52 ], and the Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS) project in Japan [ 53 ]. Although not always without controversy [ 40 , 51 , 52 ], many initiatives soon became global and involved increasingly larger groups of scientists and institutions focused on collecting and analyzing voluminous data including neuroimaging, genetic, biospecimen, and/or clinical assessments to unlock the secrets of the nervous system (the reader is referred to Table 1 and Additional file 1 : Table S1 for exemplary projects or reviews [ 40 , 41 , 42 , 43 ]). These projects spurred the creation of open-access databases and resource depositories (the reader is referred to Table 2 and Additional file 1 : Table S2 for exemplary databases or reviews [ 41 , 42 ]). The specific features of the collected data sets, such as large volume, high heterogeneity/variety, and inconsistencies across sites/missing data, necessitated the development of ad-hoc resources, procedures, and standards for data collection and processing. Moreover, these datasets created the need for hardware and software for data-intensive computing, such as supercomputers and machine learning techniques, which were not conventionally used in Neuroscience and Neurology [ 54 , 55 , 56 , 57 , 58 ]. Most significantly, the Big Data revolution is improving our understanding and treatment of neurological diseases, see Tables 3 – 6 and Additional file 1 : Tables S3-S6.

National projects and big data foundations: Connectomes, neuroimaging, and genetics

The human brain contains ~ 100 billion neurons connected via ~ 10^14 synapses, through which electrochemical data is transmitted [ 59 ]. Neurons are organized into discrete regions or nuclei and connect in precise and specific ways to neurons in other regions; the aggregated connections between all neurons in an individual comprise their connectome. The connectome is a term coined by Sporns et al. designed to be analogous to the genome; like the genome, the connectome is a large and complex dataset characterized by tremendous interindividual variability [ 60 ]. Connectomes, at the level of the individual or as aggregate data from many individuals, have the potential to produce a better understanding of how brains are wired as well as to unravel the “basic network causes of brain diseases” for prevention and treatment [ 60 , 61 , 62 , 63 ]. Major investments in human connectome studies in health and disease came in ~ 2009, when the NIH Blueprint for Neuroscience Research launched the Blueprint Grand Challenges to catalyze research. As part of this initiative, the Human Connectome Project (HCP) was launched to chart human brain connectivity, with two research consortia awarded approximately $40 M. The Wu-Minn-Ox consortium sought to map the brain connectivity (structural and functional) of 1200 healthy young adults and investigate the associations between behavior, lifestyle, and neuroimaging outcomes. The MGH-UCLA (Massachusetts General Hospital-University of California Los Angeles) consortium aimed to build a specialized magnetic resonance imager optimized for measuring connectome data. The Brain Activity Map (BAM) Project was later conceived during the 2011 London workshop “Opportunities at the Interface of Neuroscience and Nanoscience.” The BAM group proposed the initiation of a technology-building research program to investigate brain activity from every neuron within a neural circuit. Recordings of neurons would be carried out with timescales over which behavioral outputs or mental states occur [ 64 , 65 ]. Following up on this idea, in 2013, the NIH BRAIN Initiative was initiated by the Obama administration, to “accelerate the development and application of new technologies that will enable researchers to produce dynamic pictures of the brain that show how individual brain cells and complex neural circuits interact at the speed of thought”. Other countries and consortia generated their own initiatives, such as the European Human Brain Project, the Japan Brain/MINDS project, Alzheimer’s Disease Neuroimaging Initiative (ADNI), Enhancing Neuroimaging Genetics through Meta-analysis (ENIGMA), and the China Brain Project. These projects aimed to explore brain structure and function, with the goal of guiding the development of new treatments for neurological diseases. The scale of these endeavors, and the insights they generated into the nervous system, were made possible by the collection and analysis of Big Data (see Table 1 ). Below, we succinctly exemplify ways in which Big Data is transforming Neuroscience and Neurology through the HCP (and similar initiatives), ADNI, and ENIGMA projects.

Ways in which Big Data is transforming Neuroscience and Neurology are exemplified through advancements in elucidating the connectome (see for example Table 3 and Additional file 1 : Table S3). Early studies in organisms such as the nematode C. elegans used electron microscopy (EM) to image all 302 neurons and 5000 connections of the animal [ 66 ], while analyses on animals with larger nervous systems collated neuroanatomical tracer studies to extract partial cerebral cortex connectivity matrices, e.g., cat [ 67 ] and macaque monkey [ 68 , 69 ]. More recently, advancements in imaging and automation techniques, including EM and two-photon (2P) fluorescence microscopy, have enabled the creation of more complete maps of the nervous system in zebrafish and drosophila [ 7 , 33 , 70 , 71 ]. Despite the diminutive size of their nervous systems, the amount of data is enormous. Scheffer and colleagues generated a connectome for a portion of the central brain of the fruit fly “encompassing 25,000 neurons and 20 million chemical synapses” [ 7 ]. This effort required “numerous machine-learning algorithms and over 50 person-years of proofreading effort over ≈2 calendar years”, processing > 20 TB of raw data into a 26 MB connectivity graph, “roughly a million fold reduction in data size” (note, a review of the specific computational techniques is outside this paper’s scope, see [ 7 , 33 , 58 , 70 , 71 ] for more examples). Thus, connectomes can be delineated in simple animal models; however, without automation and the capacity to acquire Big Data of this type, such a precise reconstruction could not be accomplished. Extending this detailed analysis to the human brain will be a larger challenge, as evidenced by the stark contrast between the 25,000 neurons analyzed in the above work and the 100 billion neurons and ~ 10^14 synapses present in the human brain.

At present, the study of the human connectome has principally relied on clinical neuroimaging methods, including Diffusion Tensor Imaging (DTI) and Magnetic Resonance Imaging (MRI), to generate anatomical connectomes, and on neuroimaging techniques such as functional MRI (fMRI), to generate functional connectomes [ 9 , 12 ]. For example, in what might be considered a “Small Data” step, van den Heuvel and Sporns demonstrated “rich-club” organization in the human brain (“tendency for high-degree nodes to be more densely connected among themselves than nodes of a lower degree, providing important information on the higher-level topology of the brain”) via DTI and simulation studies based on imaging from 21 subjects focused on 12 brain regions [ 72 ]. This type of work has quickly become “Big Data” science, as exemplified by Bethlehem et al.’s study of “Brain charts for the human lifespan”, which was based on 123,984 aggregated MRI scans, “across more than 100 primary studies, from 101,457 human participants between 115 days post-conception and 100 years of age” [ 13 ]. The study provides instrumental evidence towards neuroimaging phenotypes and developmental trajectories via MRI imaging. Human connectome studies are also characterized by highly heterogeneous datasets, owing to the use of multimodal imaging, which are often integrated with clinical and/or biospecimen datasets. For example, studies conducted under the HCP [ 32 ] have implemented structural MRI (sMRI), task fMRI (tfMRI), resting-state fMRI (rs-fMRI), and diffusion MRI (dMRI) imaging modalities, with subsets undergoing Magnetoencephalography (MEG) and Electroencephalography (EEG). These studies usually involve hundreds to thousands of subjects, such as the Healthy Adult and HCP Lifespan Studies [ 73 ]. While the above connectome studies have primarily focused on anatomical, functional, and behavioral questions, connectome studies are used across the biological sciences (e.g., study evolution by comparing mouse, non-human primates, and human connectomes [ 74 ]) and as an aid in assessing and treating neuropathologies (as will be elaborated on further below).

In the same period that the NIH was launching its Neuroscience Blueprint Program (2005), it also helped launch the ADNI in collaboration with industry and non-profit organizations. The primary objectives of ADNI are to develop “biomarkers for early detection” and monitoring of AD; support “intervention, prevention, and treatment” through early diagnostics; and share data worldwide [ 75 , 76 , 77 ]. Its Informatics Core [ 78 ], which was established for data integration, analysis, and dissemination, was hosted at the University of Southern California, and highlights the Big Data underpinnings of ADNI ( https://adni.loni.usc.edu ). ADNI was originally designed to last 5 years with bi-annual data collection of cognition; brain structural and metabolic changes via Positron Emission Technology (PET) and MRIs; genetic data; “and biochemical changes in blood, cerebrospinal fluid (CSF), and urine in a cohort of 200 elderly control subjects, 400 Mild Cognitive Impairment patients, and 200 mild AD patients" [ 75 , 76 , 79 ]. The project is currently in its fourth iteration, ADNI4, with funding through 2027 [ 80 , 81 ]. To date, ADNI has enrolled > 2000 participants who undergo continuing longitudinal assessments. The ADNI study has paved the way for the diagnosis of AD through the usage of biomarker tests such as amyloid PET scans and lumbar punctures for CSF, and demonstrated that ~ 25% of people in their mid-70s have a very early stage of AD (“preclinical AD”), which would have previously gone undetected. These results have helped encourage prevention and early treatment as the most effective approach to the disease.

During the same period that major investments were beginning in connectome projects (2009), the ENIGMA Consortium was established [ 82 , 83 ]. It was founded with the initial aim of combining neuroimaging and genetic data to determine genotype–phenotype brain relationships. As of 2022, the consortium included > 2000 scientists hailing from 45 countries and collaborating across more than 50 working groups [ 82 ]. These efforts helped spur on many discoveries, including genome-wide variants associated with human brain imaging phenotypes (see the 60+ center large-scale study with >30,000 subjects that provided evidence of the genetic impact on hippocampal volume [ 84 , 85 ], whose reduction is possibly a risk factor for developing AD). The group has also conducted large-scale MRI studies in multiple pathologies and showed imaging-based abnormalities or structural changes [ 82 , 83 ] in numerous conditions, such as major depressive disorder (MDD) [ 86 ] and bipolar disorder [ 87 ]. Other genetics/imaging-based initiatives have made parallel advancements, such as the genome-wide association studies of UK Biobank [ 88 , 89 , 90 ], Japan’s Brain/MINDS work [ 53 ], and the Brainstorm Consortium [ 91 ]. For example, the Brainstorm Consortium assessed “25 brain disorders from genome-wide association studies of 265,218 patients and 784,643 control participants and assessed their relationship to 17 phenotypes from 1,191,588 individuals.” Ultimately, Big Data-based genetic and imaging assessments have permeated the Neurology space, significantly impacting patient care through enhanced diagnostics and prognostics, as will be discussed further below.

From discovery research to improved neurological disease treatment

The explosive development of studies spurred on by these national projects with growing size, variety, and speed of data, combined with the development of new technologies and analytics, has provoked a paradigm shift in our understanding of brain changes through lifespan and disease [ 7 , 92 , 93 , 94 , 95 , 96 ], leading to changes in the investigation and treatment development for neurological diseases and profoundly impacting the field of Neurology. Over the past decade, such impact has occurred in multiple ways. First, Big Data has opened the opportunity to analyze combined large, incomplete, disorganized, and heterogenous datasets [ 97 ], which may yield more impactful results as compared to clean curated, small datasets (with all their external validity questions and additional limitations). Second, Big Data studies have improved our basic understanding (i.e., mechanisms of disease) of numerous neurological conditions. Third, Big Data has aided diagnosis improvement (including phenotyping) and subsequently refined the determination of a presumptive prognosis. Fourth, Big Data has enhanced treatment monitoring, which further aids treatment outcome prediction. Fifth, Big Data studies have recently started to change clinical research methodology and design and thus directly impact the development of novel therapies. In the remainder of this section, we will elaborate on the aforementioned topics, followed by the presentation of particular case studies in select areas of Neurology.

Opportunities and improved understanding

As introduced above, Big Data solutions have impacted our understanding of the fundamentals of brain sciences and disease, such as brain structure and function (e.g., HCP) and the genetic basis of disease (e.g., ENIGMA). Advancements in connectome and genetics studies, along with improved analytics, have advanced our understanding of brain changes throughout the lifespan and supported hypotheses linking abnormal connectomes to many neurological diseases [ 13 , 72 , 92 , 98 ]. Studies have consistently shown that architecture and properties of functional brain networks (which can be quantified in many ways, e.g., with graph theoretical approaches [ 94 ]) correlate with individual cognitive performance and dynamically change through development, aging, and neurological disease states including neurodegenerative diseases, autism, schizophrenia, and cancer (see, e.g., [ 92 , 93 , 95 , 96 ]). Beyond genetics and connectomes, Big Data methods are used in vast ways in brain research and the understanding of diseases, such as from brain electrophysiology [ 99 ], brain blood-flow [ 100 ], brain material properties [ 101 ], perceptual processing [ 102 , 103 ], and motor control [ 104 ].

Diagnostics/prognostics/monitoring

Big Data methods are also increasing in prevalence in diagnostics and prognostics. For example, the US Veterans Administration recently reported on the genetic basis of depression based on analysis of >1.2 M individuals, identifying 178 genomic risk loci, and confirming it in a large independent cohort (n > 1.3 M) [ 105 ]. Subsequent to the European Union (EU) neuGRID and neuGRID4You projects, Munir et al. used fuzzy logic methods to derive a single “Alzheimer’s Disease Identification Number” for tracking disease severity [ 106 ]. Eshaghi et al. identified MS subtypes via MRI data and unsupervised machine learning [ 107 ], and Mitelpunkt et al. used multimodal data from the ADNI registry to identify dementia subtypes [ 108 ]. Big Data methods have also been used to identify common clinical risk factors for disease, such as gender, age, and geographic location for stroke [ 109 ] (and/or its genetic risk factors [ 110 ]). Big Data approaches to predict response to treatment are also increasing in frequency. For example, for depression, therapy choice often involves identifying subtypes of patients based on co-occurring symptoms or clinical history, but these variables are often not sufficient for Precision Medicine (i.e., predicting unique patient response to specific treatment) nor even at times to differentiate patients from healthy controls [ 17 , 111 ]. Noteworthy progress has been made in depression research, such as successful prediction of treatment response using connectome gradient dysfunction and gene expression [ 18 ], through resting state connectivity markers of Transcranial Magnetic Stimulation (TMS) response [ 17 ], and via a sertraline-response EEG signature [ 111 ]. As another example, the Italian I-GRAINE registry is being developed as a source of clinical, biological, and epidemiologic Big Data on migraine used to address therapeutic response rates and efficiencies in treatment [ 112 ].

Additionally, Big Data approaches that combine high volumes of varied data at high velocity are offering the potential for new "real-time" biomarkers [ 113 ]. For instance, data collected with wearable sensors are increasingly used in clinical studies to monitor patient behavior at home or in real-world settings. While the classic example is the use of EEG for epilepsy [ 114 ], numerous other examples can be found in the literature. One developing approach utilizes smartphone data to evaluate daily changes in symptom severity and sensitivity to medication in PD patients [ 115 ]; this approach has included a memory test and a simple finger-tapping task to track the status of study participants [ 116 ]. Collectively, these examples highlight Big Data’s potential for facilitating participatory Precision Medicine (i.e., tailored to each patient) in trials and clinical practice (covered in more detail in Sect. “ Proposed Solutions ”).
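As an illustration of the kind of feature extraction such smartphone-based assessments might perform, the following is a minimal sketch assuming Python with NumPy; the tap timestamps and the choice of features are hypothetical and are not drawn from the cited studies [ 115 , 116 ].

```python
# Minimal sketch: summarizing a smartphone finger-tapping test from tap
# timestamps. The timestamps and features below are hypothetical.
import numpy as np

def tapping_features(tap_times_s: np.ndarray) -> dict:
    """Compute simple speed/regularity features from tap timestamps (seconds)."""
    intervals = np.diff(tap_times_s)  # inter-tap intervals
    return {
        "taps_per_second": 1.0 / intervals.mean(),
        "interval_cv": intervals.std() / intervals.mean(),  # lower = steadier
    }

# Example: 30 taps at roughly 4 Hz with slight irregularity (synthetic).
rng = np.random.default_rng(1)
taps = np.cumsum(rng.normal(loc=0.25, scale=0.03, size=30))
print(tapping_features(taps))
```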

Evolving evaluation methods

The way in which new potential neurological therapies are developed is also changing. Traditionally, Randomized Controlled Trials (RCTs) evaluate the safety and efficacy of potential new treatments. In an RCT, the treatment group is compared to a control or placebo group in terms of outcome measures at predefined observation points. While RCTs are the gold standard for developing new treatments, they have several limitations [ 117 ], including high cost, lengthy completion times, limited generalizability of results, and restricted observations (e.g., made only at a limited number of predefined protocol time points, such as baseline and end of treatment). Clinical practice is thus currently constrained by the interpretations and limitations of RCTs and evidence-based medicine [ 118 ], which are largely responsible for physicians' predominantly reactive mindset. A wealth of recent work on Big Data analysis points toward a potential solution: predicting individual patient behavior and enabling proactive Precision Medicine management [ 119 ] by augmenting and extending RCT design [ 117 ]. Standardization and automation of procedures using Big Data make entering and extracting data easier and could reduce the effort and cost of running an RCT. Big Data can also be used to formulate hypotheses fueled by large, preliminary observational studies and/or to carry out virtual trials. For example, Peter et al. showed how Big Data could be used to move from basic scientific discovery to translation to patients in a non-linear fashion [ 120 ]. Given the potential pathophysiological connection between PD and inflammatory bowel disease (IBD), they evaluated the incidence of PD in IBD patients and investigated whether anti-tumor necrosis factor (anti-TNF) treatment for IBD affected the risk of developing PD. Rather than a traditional RCT, they ran a virtual repurposing trial using data from 170 million people in two large administrative claims databases. The study observed a 28% higher incidence rate of PD in IBD patients than in unaffected matched controls. In IBD patients, anti-TNF treatment resulted in a 78% reduction in the rate of PD incidence relative to patients who did not receive the treatment [ 120 , 121 ]. A similar approach was reported by Slade et al. They conducted experiments on rats to investigate the effects of Attention Deficit Hyperactivity Disorder (ADHD) medication (type and timing) on the “rats’ propensity to exhibit addiction-like behavior”, which led to the hypothesis that initiating ADHD medication in adolescence “may increase the risk for SUD in adulthood”. To test this hypothesis in humans, rather than running a traditional RCT, they used healthcare Big Data from a large claims database and, indeed, found that “temporal features of ADHD medication prescribing”, not subject demographics, predicted SUD development in adolescents on ADHD medication [ 122 ]. A hybrid approach was used in the study by Yu et al. [ 123 ], which examined the potential of vitamin K2 (VK2) to reduce the risk of PD, given its anti-inflammatory properties and inflammation's role in PD pathogenesis. Initially, Yu et al. assessed 93 PD patients and 95 controls and determined that the former group had lower serum VK2 levels than the healthy controls.
To confirm the connection between PD and inflammation, the study then analyzed data from a large public database, which revealed that PD patients exhibit dysregulated inflammatory responses and coagulation cascades that correlate with decreased VK2 levels [ 123 ].
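The core arithmetic of such a virtual repurposing trial is an incidence-rate comparison between exposed and unexposed cohorts. The following minimal sketch illustrates it in Python; the counts and person-years are hypothetical placeholders, chosen only so that the resulting ratio mirrors the ~78% reduction reported in [ 120 ], and are not the study's actual data.

```python
# Minimal sketch: incidence-rate comparison of the kind used in virtual
# repurposing trials. All counts and person-years below are hypothetical.
def incidence_rate(cases: int, person_years: float) -> float:
    return cases / person_years

# Hypothetical exposed (anti-TNF-treated IBD) vs. unexposed (untreated IBD).
ir_treated = incidence_rate(cases=5, person_years=60_000)
ir_untreated = incidence_rate(cases=90, person_years=240_000)

irr = ir_treated / ir_untreated  # incidence rate ratio
print(f"incidence rate ratio = {irr:.2f} "
      f"(~{(1 - irr) * 100:.0f}% lower incidence in the treated group)")
```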

Even though these pioneering studies demonstrate potential ways in which Big Data can be used to perform virtual RCTs, several challenges remain. The Big Data processing pipeline, from collection to analysis, still has to be refined. Moreover, it remains undetermined how regulatory bodies will ultimately utilize this type of data. In the US, the Food and Drug Administration (FDA) has acknowledged the future potential of “Big Data” approaches, such as using data gathered from Electronic Health Records (EHRs), pharmacy dispensing, and payor records, to help evaluate the safety and efficacy of therapeutics [ 124 ]. Furthermore, the FDA has begun exploring High-Performance Computing (HPC) to internally tackle Big Data problems [ 125 ] and concluded that Big Data methodologies could broaden “the range of investigations that can be performed in silico” and potentially improve “confidence in devices and drug regulatory decisions using novel evidence obtained through efficient big data processing”. The FDA is also employing Big Data based on Real World Evidence (RWE), such as with its Sentinel Innovation Center, which will implement data science advances (e.g., machine learning, natural language processing) to expand EHR data use for medical product surveillance [ 126 , 127 ]. Lastly, crowdsourcing of data acquisition and analysis remains largely unexplored and is outside the scope of this review [ 128 ].

Big Data case studies in neurology

To provide the reader with a sample of existing Big Data solutions for improving patient care (beyond those surveyed above), we focus on three separate disorders, PD, SUD, and Pain. While Big Data has positively impacted numerous other neuropathologies (e.g., [ 129 , 130 , 131 , 132 ]), we have chosen these three disorders due to their significant societal impact and their representation of varying stages of maturity in the application of Big Data to Neurology. Finally, we exemplify Big Data’s foreseeable role in therapeutic technology via brain stimulation, which is used in the aforementioned disorders and is particularly suitable for Precision Medicine.

Parkinson’s Disease (PD)

After AD, PD is the second most prevalent neurodegenerative disorder [ 133 , 134 , 135 ]. About 10 million people live with PD worldwide, with  ~ 1 million cases in the US. The loss of dopamine-producing neurons leads to symptoms such as tremor, rigidity, bradykinesia, and postural instability [ 136 ]. Traditional treatments include levodopa, physical therapy, and neuromodulation (including Deep Brain Stimulation (DBS) and Noninvasive Brain Stimulation (NIBS)) [ 36 , 137 , 138 ].

The increasing significance of Big Data in both PD research and patient care can be measured by the rising number of published papers over the past decade (Fig.  3 ). Several national initiatives have been aimed at building public databases to facilitate research. For example, the Michael J. Fox Foundation’s Parkinson’s Progression Markers Initiative (PPMI) gathers data from about 50 sites across the US, Europe, Israel, and Australia with the objective of identifying potential biomarkers of disease progression [ 139 , 140 ]. A major area of research involving Big Data analytics focuses on PD’s risk factors, particularly through genetic data analysis, with the goal of enhancing our comprehension of the causes of the disease and developing preventive treatments. The meta-analysis of PD genome-wide association studies by Nalls et al. illustrates this approach: it examined “7,893,274 variants” among “13,708 cases and 95,282 controls”, and the findings revealed and confirmed “28 independent risk variants” for PD “across 24 loci” [ 141 ]. Patient phenotyping for treatment outcome prediction is another research area that utilizes Big Data analytics; Wong et al. review the use of structural and functional connectivity studies to enhance the efficacy of DBS treatment for PD and other neurological diseases [ 142 ]. An emerging area of patient assessment involves wearable sensors and apps for potential real-time monitoring of symptoms and response to treatment [ 143 ]. A major project in this area is the iPrognosis mobile app, funded by the EU Research Programme Horizon 2020, which aims to accelerate PD diagnosis and develop strategies to help improve and maintain PD patients’ quality of life by capturing data during user interaction with smart devices, including smartphones and smartwatches [ 144 ]. As with other diseases, PD is also being studied via social media (e.g., [ 16 , 145 ]) and EHR [ 146 , 147 ] analyses. See Table 4 and Additional file 1 : Table S4 or the review articles in [ 148 , 149 , 150 , 151 , 152 , 153 , 154 ] for further examples of Big Data research in PD.

Figure 3

Cumulative number of papers on Big Data over time for different areas, as per PubMed. The panels illustrate when Big Data started to impact each area and allow a comparison across areas. As the graphs were created simply by using the keywords “Big Data” AND “area”, with “area” being “Parkinson’s Disease”, “Addiction”, etc., rather than the multiple keywords that may describe each field, the actual numbers are likely underestimates
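A Fig. 3-style count can be reproduced with the NCBI E-utilities esearch endpoint. The following is a minimal sketch in Python using only the standard library; the exact query string and date range are assumptions for illustration, following the simple keyword strategy described in the legend.

```python
# Minimal sketch: cumulative PubMed counts via the NCBI E-utilities esearch
# endpoint. Query string and years are illustrative assumptions.
import json
import time
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(term: str, up_to_year: int) -> int:
    """Number of PubMed records matching `term` published up to a given year."""
    params = urllib.parse.urlencode({
        "db": "pubmed", "term": term, "retmode": "json",
        "datetype": "pdat", "mindate": "1900", "maxdate": str(up_to_year),
    })
    with urllib.request.urlopen(f"{ESEARCH}?{params}") as resp:
        return int(json.load(resp)["esearchresult"]["count"])

term = '"big data" AND "Parkinson\'s disease"'
for year in range(2012, 2023):
    print(year, pubmed_count(term, year))
    time.sleep(0.4)  # stay under NCBI's request-rate limit
```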

SUD and Opioid Use Disorder (OUD)

The economic and social burden associated with SUDs is enormous. OUD is the leading cause of overdose deaths among substance use disorders; death rates have drastically increased, with over 68,000 deaths in 2020 [ 155 ]. The US economic costs of OUD alone and of fatal opioid overdoses were $471 billion and $550 billion, respectively, in 2017 [ 156 ]. Treatments focus on replacement (e.g., nicotine and opioid replacement) and abstinence and are often combined with self-help groups or psychotherapy [ 157 , 158 ].

Like PD, the increasing impact of Big Data in SUD and OUD research and patient care can be measured by the rising number of papers published in PubMed over the past decade (Fig.  3 ). Several national initiatives have been aimed at building public databases to facilitate SUD research. For example, since 2009, the ENIGMA project has included a working group specifically focused on addiction, which, as of 2020, had gathered genetic, epigenetic, and/or imaging data from thousands of SUD subjects across 33 sites [ 37 ]. As part of this research, Mackey et al. have been investigating the association between dependence and regional brain volumes, both substance-specific and general [ 159 ]. Similarly, datasets from the UK Biobank and 23andMe (representing  > 140,000 subjects), together with the Alcohol Use Disorder Identification Test (AUDIT), have been used to identify the genetic basis of alcohol consumption and alcohol use disorder [ 160 ]. Big Data is also being used to devise strategies for retaining patients on medication for OUD, as roughly 50% of persons discontinue OUD therapy within a year [ 158 ]. The Veterans Health Administration is spearheading such an initiative based on data (including clinical, insurance claim, imaging, and genetic data) from > 9 M veterans [ 158 ]. Social media is also emerging as a method to monitor substance abuse and related behaviors. For example, Cuomo et al. analyzed geo-localized Big Data collected in 2015 from 10 M tweets, regressed against Indiana State Department of Health data on non-fatal opioid-related hospitalizations and new HIV cases from the US Centers for Disease Control and Prevention, to examine the transition from “opioid prescription abuse to heroin injection and HIV transmission risk” [ 161 ]. Leveraging Big Data from online content is likely to aid public health practitioners in monitoring SUD. Table 5 and Additional file 1 : Table S5 summarize Big Data research in SUD and OUD.
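Analyses of the Cuomo et al. type boil down to regressing area-level health outcomes on a social-media signal. The following is a minimal sketch of such a Poisson rate regression, assuming Python with NumPy and the third-party statsmodels package; all data are synthetic, and the model specification is illustrative rather than the one used in [ 161 ].

```python
# Minimal sketch: relating a geo-localized social-media signal to health
# outcomes via Poisson regression. Data are synthetic, not from [161].
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_areas = 80
tweet_rate = rng.gamma(shape=2.0, scale=1.0, size=n_areas)  # per 1k residents
population_k = rng.uniform(5, 500, size=n_areas)            # thousands

# Synthetic ground truth: hospitalizations rise with the tweet signal.
lam = np.exp(-1.0 + 0.4 * tweet_rate) * population_k / 100
hospitalizations = rng.poisson(lam)

# Poisson rate model with log-population offset.
X = sm.add_constant(tweet_rate)
model = sm.GLM(hospitalizations, X,
               family=sm.families.Poisson(),
               offset=np.log(population_k)).fit()
print(model.summary().tables[1])
```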

Pain

Chronic pain is a widespread condition that affects a significant portion of the global population, with an estimated 20% of adults suffering from it and 10% newly diagnosed each year [ 162 ]. In the US, the condition is highly prevalent, affecting over 50 million adults. The most common pain locations are the back, hip, knee, or foot [ 163 ], chiefly due to neural entrapment syndromes (e.g., Carpal Tunnel Syndrome (CTS)), peripheral neuropathy (such as from diabetes), or unknown causes (such as non-specific chronic Lower Back Pain (LBP)). Pain treatment remains challenging and includes physical therapy, pharmacological approaches, and neuromodulation [ 164 ]. As in other areas of Neurology, the Big Data revolution has been impacting pain research and management strategies. As reviewed by Zaslansky et al., multiple databases have been created to monitor pain; for example, the international acute pain registry PAIN OUT, established in 2009 with EU funds, aims to improve the management of postoperative pain [ 165 , 166 ]. Besides risk factors [ 167 ], such as those based on genetic data (e.g., see [ 168 , 169 ]), pain studies using Big Data mainly focus on managing symptoms and improving therapy outcomes. Large-scale studies aimed at comparing different treatments [ 170 , 171 ] or at identifying phenotypes in order to classify and diagnose patients (see, for example, [ 172 ]) are particularly common. Table 6 and Additional file 1 : Table S6 summarize Big Data research in Pain, while Fig.  3 shows the increasing number of published papers in the field.

Example of Big Data impact on treatments and diagnostics: brain stimulation

In the last twenty years, neurostimulation methods have seen a substantial rise in application for the treatment of neurological disease [ 36 , 138 , 173 ]. Among the most used approaches are invasive techniques like DBS [ 173 , 174 , 175 , 176 ], which utilize implanted devices to apply electrical currents directly into the neural tissue and modulate neural activity. Noninvasive techniques, such as those applied transcranially, offer stimulation without the risks associated with surgical procedures (such as bleeding or infection) [ 36 ]. Both invasive and noninvasive approaches have been used to treat psychiatric and neurological disorders, including depression, PD, addiction, and pain. While High Performance Computing has been used in the field for some time (see Fig.  4 ), Big Data applications have only recently started to be explored in brain stimulation. For example, structural and functional connectome studies have yielded new insights into potential stimulation targets in the quest to enhance stimulation effectiveness. Although DTI has been used to optimize the definition of targets for DBS and noninvasive stimulation technologies since the mid-2000s [ 177 , 178 , 179 ], Big Data and advances in computational methods have opened new avenues for DTI to further improve stimulation, which have enhanced clinical results. For example, in 2017, Horn et al. utilized structural and functional connectivity data from open-source connectome databases (including the healthy-subjects connectome from the Brain Genomics Superstruct Project, the HCP, and the PD connectome from the PPMI) to build a computational model to predict outcomes following subthalamic nucleus modulation with DBS in PD. Big Data allowed the identification of a distinct pattern of functional and structural connectivity that independently and accurately predicted DBS response. Additionally, the findings held external validity, as connectivity profiles obtained from one cohort were able to predict clinical outcomes in a separate DBS center’s independent cohort. This work also demonstrated the prospective use of Big Data in Precision Medicine by illustrating how connectivity profiles can be utilized to predict individual patient outcomes [ 180 ]. For a more comprehensive review of the application of functional connectome studies to DBS, the reader is referred to [ 142 ], where Wong et al. discuss the application of structural and functional connectivity to phenotyping of patients undergoing DBS treatment and prediction of DBS treatment response. Big Data is also expected to augment current efforts in the pursuit of genetic markers to optimize DBS in PD (e.g., [ 148 , 181 , 182 ]).

Figure 4

High Performance Computing solutions for modeling brain stimulation dosing have been explored for well over a decade. The above figure is adapted from [ 183 ], where sinusoidal steady-state solutions of the electromagnetic fields during TMS and DBS were determined from MRI-derived Finite Element Models based on frequency-specific electromagnetic properties of head and brain tissue. The sinusoidal steady-state solutions were then transformed into the time domain to rebuild the transient solution for the stimulation dose in the targeted brain tissues. These solutions were then coupled with single-cell conductance-based models of human motor neurons to explore the electrophysiological response to stimulation. Today, high-resolution patient-specific models are being developed (see below), implementing more complicated biophysical modeling (e.g., coupled electromechanical field models), and are being explored as part of large heterogeneous data sets (e.g., clinical, imaging, and movement kinematics) to optimize/tune therapy
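The map-similarity logic behind connectivity-based outcome prediction can be sketched as follows, assuming Python with NumPy and SciPy: each patient's stimulation-site connectivity map is correlated with a group-level "optimal" map, and that similarity serves as the outcome predictor. This is a schematic with synthetic data, not the implementation of Horn et al. [ 180 ].

```python
# Minimal sketch of map-similarity outcome prediction (synthetic data only).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n_patients, n_voxels = 40, 5000

# Group-level "optimal" connectivity map, and per-patient maps whose
# resemblance to it varies across patients.
optimal_map = rng.normal(size=n_voxels)
weights = rng.uniform(0.0, 1.0, size=(n_patients, 1))
maps = weights * optimal_map + rng.normal(size=(n_patients, n_voxels))

# Predictor: spatial correlation of each patient's map with the optimal map.
similarity = np.array([pearsonr(m, optimal_map)[0] for m in maps])

# Synthetic clinical improvement driven by that similarity (ground truth here).
improvement = 0.8 * similarity + rng.normal(scale=0.05, size=n_patients)

r, p = pearsonr(similarity, improvement)
print(f"map-similarity vs. improvement: r={r:.2f}, p={p:.1e}")
```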

Compared to DBS, studies on NIBS have been sparser. However, Big Data methodologies have facilitated the improvement and standardization of established TMS techniques (i.e., single and paired pulse), which had large inter-subject variability, by identifying factors that affect responses to this stimulation in a multicentric sample [ 184 ]. A similar paradigm was followed to characterize theta-burst stimulation [ 185 ]. Regarding disease, a large multisite TMS study (n = 1188) showed that resting-state connectivity in limbic and frontostriatal networks can be used for neurophysiological subtype classification in depression. Moreover, individual connectivity evaluations predicted TMS therapy responsiveness better than isolated symptomatology in a subset of patients (n = 154) [ 17 ].

Proposed solutions

As reviewed above, Big Data has been improving the care of patients with neurological diseases in multiple ways. It has elevated the value of diverse and often incomplete data sources, enhanced data sharing and multicentric studies, streamlined multidisciplinary collaboration, and improved the understanding of neurological disease (diagnosis, prognosis, optimizing current treatment, and helping develop novel therapies). Nevertheless, existing methodologies suffer from several limitations, which have prevented the full realization of Big Data’s potential in Neuroscience and Neurology. Below, we discuss the limitations of current approaches and propose possible solutions.

Full exploitation of available resources

Many purported “Big Data” studies in Neuroscience and Neurology do not fully implement the classic 3 V's (i.e., “Volume, Variety, and Velocity”) or 5 V’s (i.e., “Volume, Variety, Velocity, Veracity and Value”) and/or interpret the V’s with high heterogeneity. For example, in “Big Data” Neuroscience and Neurology studies, Volume sometimes refers to studies with hundreds of thousands of patients’ multidimensional datasets and other times to studies with tens of patients’ unidimensional datasets. Value, a characteristic of Big Data typically defined in financial terms in other Big Data fields, is not usually considered in Big Data studies in Neuroscience and Neurology. In this paper, across studies and databases, we adopted a measure of clinical or preclinical Value where financial information was not given (see Tables 2 – 6 and Additional file 1 : Tables S2–S6). Data Veracity is not standardized in Neuroscience or Neurology; thus, we focused our analysis on both typical data Veracity measures and potential experimental sources of error in the data sets from the studies reviewed above. In terms of Variety, few clinical studies make use of large multimodal data sets, and even fewer acquire and process data at a rapid Velocity. Data Velocity information is sparsely reported throughout the literature, but its clear reporting would enable a better understanding and refinement of methodologies across the research community.

While these limitations may be dismissed as mere semantics, we believe that these deficits often result in Big Data analytics being underexploited, which limits the potential impact of a study and possibly increases its cost. Thus, aligning studies in Neuroscience and Neurology with the V’s represents an opportunity to leverage the knowledge, technology, analytics, and principles established in fields that have been using Big Data more extensively, thereby improving Big Data studies in Neurology and Neuroscience. Identifying whether a study is suitable for Big Data approaches makes it easier to choose the best tools for the study and to exploit the plethora of resources (databases, software, models, data management strategies) that are already available (part of which we have reviewed herein; see, for example, Tables 1 – 2 and Additional file 1 : Tables S1, S2).

Tools for data harmonization

The overall lack of tools for data harmonization (particularly for multimodal datasets used in clinical research and care) is a significant issue in current Big Data studies. Creation of methods for sharing data and open-access databases has been a priority of Big Data initiatives since their inception. Data sharing is required by many funding agencies and scientific journals, and publicly available repositories have been established. While these repositories have become more common and organized (see Sect. “ Existing Solutions ”), there has been less emphasis on the development of tools for quality control, standardization of data acquisition, visualization, pre-processing, and analysis. With the proliferation of initiatives promoting data sharing and the pooling of existing resources, the need for better tools in these areas is becoming increasingly urgent. Despite efforts by the US Department of Health and Human Services to establish standardized libraries of outcome measures in various areas, such as depression [ 186 , 187 ], and by the NIH, which has spearheaded Clinical Trials Network (CTN)-recommended Common Data Elements (CDEs) for use in RCTs and EHRs [ 188 ], more work is needed to ensure data harmonization not only across clinical endpoints but across all data types that typically comprise Big Data in Neuroscience and Neurology. For example, in neuroimaging, quality control of acquired images is a long-standing problem. Traditionally, this is performed visually, but in Big Data sets, the large volumes make this approach exceedingly expensive and impractical; methods for automatic quality control are therefore in high demand [ 189 ]. Quality control issues are compounded in collaborative datasets, where variability may stem from multiple sources. In multisite studies, a typical source of variability is the use of different MRI scanners (i.e., from different manufacturers, with different field strengths or hardware drifts [ 190 , 191 ]). Variability can also arise from data pre-processing techniques and pipelines. For example, the pre-processing pipeline of MRI data involves a variety of steps (such as correcting field inhomogeneity and motion, segmentation, and registration) and continues to undergo refinement through algorithm development, ultimately affecting the reproducibility/Veracity of study results. As an additional example, while working on data harmonization methods for genome-wide association studies, Chen et al. noted similar problems, where an “aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors and the use of different genotyping platforms” [ 192 ].
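As one example of the automated quality control called for above, outlier scans can be flagged from image-level summary metrics. The following is a minimal sketch, assuming Python with NumPy; the metric (SNR) and the flagging threshold are hypothetical choices, not a validated QC protocol.

```python
# Minimal sketch: automated QC by flagging outlier scans on an image-level
# summary metric (synthetic SNR values; threshold is a hypothetical choice).
import numpy as np

def robust_z(x: np.ndarray) -> np.ndarray:
    """Z-scores based on median/MAD, less sensitive to the outliers we seek."""
    med = np.median(x)
    mad = np.median(np.abs(x - med)) * 1.4826  # consistency factor (normal data)
    return (x - med) / mad

rng = np.random.default_rng(4)
snr = rng.normal(loc=20, scale=2, size=200)
snr[[3, 57]] = [6, 41]  # two corrupted scans injected for illustration

flagged = np.where(np.abs(robust_z(snr)) > 4)[0]
print("scans flagged for visual review:", flagged)
```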

Some progress towards harmonization of data and analysis procedures [ 193 ] has been enabled by the availability of free software packages that incorporate widely accepted sets of best practices; see, e.g., Statistical Parametric Mapping (SPM), FreeSurfer, FMRIB Software Library (FSL), Analysis of Functional NeuroImages (AFNI), or their combination (such as Fusion of Neuroimaging Processing (FuNP) [ 194 ]). In addition, open-access pre-processed datasets have been made available (see Table 2 and Additional file 1 : Table S2); for example, the Preprocessed Connectome Project has been systematically pre-processing data from the International Neuroimaging Data-sharing Initiative and the 1000 Functional Connectomes Project [ 195 , 196 ], and GWAS Central (Genome-wide association study Central) “provides a centralized compilation of summary level findings from genetic association studies” [ 197 ]. As another example, the EU-funded neuGRID and neuGRID4You projects included a set of analysis tools and services for neuroimaging analysis [ 106 ]. Development of software like ComBat (initially created to eliminate batch effects in genomic data [ 198 ] and subsequently adapted to handle DTI, cortical thickness measurements [ 199 ], and functional connectivity matrices [ 200 ]) can also help researchers harmonize data from various types of study, regardless of whether they are analyzing newly collected data or retrospective data gathered under older standards. For more detailed discussions of efforts to address data harmonization challenges in neuroimaging, the reader is directed to the review papers of Li et al. [ 12 ], Pinto et al. [ 201 ], and Jovicich et al. [ 202 ]. In clinical studies using data other than neuroimaging (and/or biospecimen sources), standardization of clinical assessments and outcome measures across multiple sites has also proven challenging. For example, as shown by the ENIGMA study group, multi-center addiction studies face notable methodological challenges due to the heterogeneity of measurements of substance consumption in the context of genomic studies [ 203 ].
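The core idea behind ComBat-style harmonization is to align per-batch feature locations and scales. The following is a deliberately simplified sketch in Python with NumPy; it omits ComBat's empirical-Bayes shrinkage and covariate preservation [ 198 ], so, unlike the full method, it would also remove real biological differences confounded with site.

```python
# Minimal sketch of location/scale batch harmonization: the core idea behind
# ComBat, WITHOUT its empirical-Bayes shrinkage or covariate preservation.
import numpy as np

def harmonize(features: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Align per-batch feature means/variances to the pooled mean/variance.

    features: (n_subjects, n_features); batch: (n_subjects,) site labels.
    """
    out = features.astype(float).copy()
    grand_mean = features.mean(axis=0)
    grand_std = features.std(axis=0, ddof=1)
    for b in np.unique(batch):
        idx = batch == b
        m = features[idx].mean(axis=0)
        s = features[idx].std(axis=0, ddof=1)
        out[idx] = (features[idx] - m) / s * grand_std + grand_mean
    return out

# Example: two scanners with a systematic offset in one feature (synthetic).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
site = np.repeat([0, 1], 50)
X[site == 1, 0] += 2.0  # scanner-specific shift
X_h = harmonize(X, site)
print(X_h[site == 0, 0].mean(), X_h[site == 1, 0].mean())  # ~equal afterwards
```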

Developing tools to harmonize datasets across different sources and data types (e.g., based on machine learning [ 191 ]) for Neurology-based clinical studies would allow researchers to exploit Big Data to its full potential. Tools for complex data visualization and interactive manipulation are also needed to allow researchers from different backgrounds to fully understand the significance of their data [ 204 ]. For studies in the design phase, identifying whether tools for data harmonization are available, or developing such tools in an early phase of the study, will allow researchers to enhance the Veracity, and ultimately the impact, of the study while cutting costs.

New technologies for augmented study design and patient data collection

Traditional clinical studies are associated with several recognized limitations. However, a few recent Big Data studies have shown potential in mitigating some of these limitations.

First, traditional clinical studies, particularly RCTs which serve as the standard in clinical trials, are often expensive and inefficient. The integration of Big Data, particularly in the form of diverse data types or multicenter trials, can further amplify these issues and lead to exponential increases in costs. Thus, there is a pressing need for tools that can optimize resources and contain expenses. Virtual trials are a promising but underutilized approach that can potentially enhance study design and address cost-related challenges. To achieve this, health economics methods could be used to compare different scenarios, such as recruitment strategies or inclusion criteria, and select the most effective one prior to initiating an actual clinical study. These methods can also assign quantitative values to data sets or methods [ 205 ]. For studies testing interventions, virtual experiments that use simulations can be performed. For example, in the area of brain stimulation, virtual DBS is being explored [ 206 ] to supplement existing study design. Similarly, for NIBS, our group and others are building biophysics-based models that can be used to personalize interventions [ 58 ].
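One way such a virtual-trial tool can compare design scenarios before enrollment is by Monte Carlo simulation of power against cost. The following is a minimal sketch, assuming Python with NumPy and SciPy; the effect size, per-patient cost, and significance level are hypothetical placeholders.

```python
# Minimal sketch: comparing trial designs by simulated power and cost.
# Effect size, costs, and alpha are hypothetical placeholders.
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(n_per_arm, effect_size, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of simulated two-arm trials whose t-test reaches significance."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        ctrl = rng.normal(0.0, 1.0, n_per_arm)
        trt = rng.normal(effect_size, 1.0, n_per_arm)
        if ttest_ind(trt, ctrl).pvalue < alpha:
            hits += 1
    return hits / n_sims

cost_per_patient = 8_000  # hypothetical, in USD
for n in (30, 60, 120):
    power = simulated_power(n, effect_size=0.5)
    print(f"n/arm={n:4d}  power={power:.2f}  cost=${2 * n * cost_per_patient:,}")
```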

Second, traditional clinical studies, including RCTs, often suffer from limited data and limited generalizability of conclusions. The collected data are often too limited to fully account for highly multidimensional and heterogeneous neurological conditions. PD is an example of this: patients’ clinical presentation, progression, and response to different treatment strategies can vary significantly, even within a single day [ 153 ]. Limited external validity due to discrepancies between the study design (patient inclusion criteria) and real-world clinical scenarios, as well as limited generalizability of findings to time points beyond those assessed during the study, are other known limitations. Relaxing study criteria and increasing timepoints could provide more data, but often at the expense of increased patient burden and study cost. Mobile applications can potentially help overcome some of these limitations while offering other advantages. For example, by allowing relatively close monitoring of patients, mobile applications may help capture features of symptoms not easily observable during hospital visits. This richer dataset could be used to design algorithms for patient classification/phenotyping or medication tuning. However, data collected via mobile technology are often limited to questionnaires or to the types of data that can be collected with sensors embedded in mobile/wearable devices (typically accelerometers in motor disorder studies). Leveraging Big Data in this context would require developing technology to monitor patients outside the time and space constraints of a traditional clinical study/RCT (e.g., at home or in other unstructured environments); such technology should be sufficiently inexpensive to be useful at scale, while still providing reliable and clinically valuable data. Other related approaches include additional nontraditional data sources, such as information gathered from payer databases, EHRs, or disease- and treatment-specific social media, to support conventional findings. For example, the FDA is poised to pursue Big Data approaches to continue to assess products through their life cycle to “fill knowledge gaps and inform FDA regulatory decision-making” [ 207 ].

Finally, clinical studies might be subject to bias due to missing important clinical information. This is particularly true for studies that rely on databases built for billing or claims purposes, some of which we have reviewed herein, as they use data that were not collected primarily for research (see Additional file 1 : Tables S4–S6). A possible way to overcome this limitation is to couple payer data more directly with clinical data and to correlate the results. This approach is still mostly theoretical: modern patient tracking systems like Epic are beginning to offer billing code data within the EHR, but the system was not designed for population-based analysis. Ideally, information such as payer data can be used for exploration, and the results of the analysis can guide the design of more rigorous studies aimed at testing specific clinical hypotheses.

Tools for facilitating interdisciplinary research

As the use of Big Data continues to expand across various fields, there is a growing need for better tools that can facilitate collaborations among professionals with different backgrounds. A project that exemplifies this need is the American Heart Association (AHA) Precision Medicine Platform [ 208 ]. This platform aims to "realize precision cardiovascular and stroke medicine" by merging large, varying datasets and providing analytical tools and tutorials for clinicians and researchers. Despite the strong technological and community-based support of this platform, major challenges related to scalability, security, privacy, and ease of use have prevented it from being integrated into mainstream medicine, subsequently obstructing its full exploitation.

Creating tools to visualize and interactively manipulate multidimensional data (e.g., borrowing from fields such as virtual or augmented reality that already use these tools [ 209 ]) might help overcome this type of issue.

Future directions

We have identified current limitations in the application of Big Data to Neuroscience and Neurology and have proposed general solutions to overcome them. One area where the limitations of Big Data, as currently defined and implemented, could be addressed to major effect is the development of personalized therapies and Precision Medicine. In this field, the acceleration that Big Data could enable has not yet occurred [ 210 ]. Unlike a traditional one-size-fits-all approach, Precision Medicine seeks to optimize patient care based on individual patient characteristics, including genetic makeup, environmental factors, and lifestyle. This approach can help in preventing, diagnosing, or treating diseases. Precision oncology has been a driver of Precision Medicine for approximately two decades [ 211 ] and has exploited the availability of big, multi-omics data to develop data-driven approaches to predict the risk of developing a disease, help diagnosis, identify patient phenotypes, and identify new therapeutic targets. In Neurology, the availability of large neuroimaging, connectivity, and genetics datasets has opened the possibility for data-driven approaches in Precision Medicine. However, these approaches have not yet been fully integrated with clinical decision making and personalized care, and diagnosis and treatment are still often guided only by clinical symptoms. Currently, there are no widely used platforms, systems, or projects that analytically combine personalized data, either to generate personalized treatment plans or to assist physicians with diagnostics. The AHA Precision Medicine Platform [ 208 ] aims to address this gap by providing a means to supplement treatment plans with personalized analytics; however, despite the platform's strong technological and community-based support, integration of the software into mainstream medicine has been challenging, as discussed above (see SubSect. “ Tools for facilitating interdisciplinary research ” in Sect. “ Proposed Solutions ”).

As a potential way to acquire large real-time multimodal data sets for use in personalized care in the movement disorder, pain, and rehabilitation spaces, we have been developing an Integrated Motion Analysis Suite (IMAS), which combines motion capture technology, inertial sensors (gyroscopes/accelerometers), and force sensors to assess patient movement kinematics across multiple body joints as well as kinetics. The hardware system for movement kinematic and kinetic data capture is underpinned by an AI-driven computational system with algorithms for data reduction, modeling, and prediction of clinical scales, prognostic potential for motor recovery (e.g., in the case of injury such as stroke), and response to treatment. Ultimately, the low-cost hardware package is coupled to computational packages to holistically aid clinicians in motor symptom assessments. The system is currently being investigated as part of a stroke study [ 212 ] and is supporting other studies in the movement disorder [ 213 ] and chronic pain [ 214 , 215 ] spaces. As for the Big Data component, the system has been designed so that different data streams and systems can be networked and interconnected. As a result, data such as multiple patients’ kinematics/kinetics, imaging, EHR, payer database, and clinical data can be longitudinally assessed and analyzed to develop a continually improving model of patient disease progression. This approach also serves as a method to personalize and optimize therapy delivery and/or predict response to therapy (see below).

Our group is also developing a new form of NIBS, electrosonic stimulation (ESStim™) [ 138 ], and testing it in multiple areas (e.g., diabetic neuropathic pain [ 215 ], LBP, CTS pain [ 214 ], PD [ 138 ], and OUD [ 216 ]). While the RCTs being conducted for the device are based on classic safety and efficacy endpoints, several of our studies are also focused on developing models of stimulation efficacy through combined imaging data, clinical data, kinematic data, and/or patient-specific biophysical models of stimulation dose at the targeted brain sites to identify the best responders to therapy (e.g., in PD, OUD, and pain). These computational models are being developed with the goal of not only identifying the best responders but also serving as a future means to personalize therapy based on the unique characteristics of individual patients [ 58 ] and multimodal disease models. It is further planned that the IMAS system, with its Big Data backbone, will be integrated with the ESStim™ system to further aid in personalizing patient stimulation dose in certain indications (e.g., PD, CTS pain).

Finally, our group is working on a trial optimization tool based on health economics modeling (e.g., Cost-Effectiveness Analysis (CEA)) [ 205 , 217 ]. The software we are generating allows for virtual trial design and prediction of a trial's cost-effectiveness. We anticipate that the software could also be implemented to quantify the value of data sets in health economic terms, or to quantify non-traditional data for use in RCT design or assessment (e.g., for the OUD patient population, CEA methodologies could be used to quantify the impact of stigma on the patient, caregiver, or society with traditional (e.g., biospecimen) and non-traditional (e.g., EHR, social media) data sets). Ultimately, we see all these systems being combined into a personalized treatment suite, based on a Big Data infrastructure, whereby multimodal data sets (e.g., imaging, biophysical field-tissue interaction models, clinical, and biospecimen data) are coupled rapidly to personalize brain stimulation-based treatments in diverse and expansive patient cohorts (see Fig.  5 ).

Figure 5

Schematic of our suite under development for delivering personalized treatments based on a Big Data infrastructure, whereby multimodal data sets (e.g., imaging, biophysical field-tissue interaction models, clinical, biospecimen data) can be coupled to deliver personalized brain stimulation-based treatments in a diverse and expansive patient cohort. Each integrated step can be computationally intensive (e.g., see Fig.  4 for simplified dosing example for exemplary electromagnetic brain stimulation devices)
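As a minimal illustration of the CEA-style output such a trial-optimization tool might report, the following sketch computes an incremental cost-effectiveness ratio (ICER) in Python; all costs, QALY values, and the willingness-to-pay threshold are hypothetical.

```python
# Minimal sketch: an incremental cost-effectiveness ratio (ICER) of the kind
# a CEA-based design tool might report. All values are hypothetical.
def icer(cost_new, cost_std, qaly_new, qaly_std):
    """ICER = incremental cost per incremental QALY gained."""
    return (cost_new - cost_std) / (qaly_new - qaly_std)

ratio = icer(cost_new=24_000, cost_std=15_000, qaly_new=6.1, qaly_std=5.6)
threshold = 50_000  # hypothetical willingness-to-pay per QALY
print(f"ICER = ${ratio:,.0f}/QALY ->",
      "cost-effective" if ratio < threshold else "not cost-effective",
      "at the assumed threshold")
```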

Elaboration

The Section “ Existing Solutions ” has reviewed the influence of Big Data on Neuroscience and Neurology, specifically in the context of advancing treatments for neurological diseases. Our analysis spans the last few decades and includes a diverse selection of cutting-edge projects in Neuroscience and Neurology that illustrate the continuing shift towards a Big Data-driven paradigm; it also reveals that certain areas of neurological treatment development have not fully embraced the potential of the Big Data revolution, as demonstrated through our comprehensive review of the clinical literature in Sect. “ Proposed Solutions ”.

One sign of this gap is the variation in the definition of Big Data, and in the use of the 3 V's or 5 V’s, across studies that are considered “Big Data” studies in the Neuroscience and Neurology literature. Several definitions can be found in the literature from these fields. For example, van den Heuvel et al. noted that the term “Big Data” includes many data types, such as “observational study data, large datasets, technology-generated outcomes (e.g., from wearable sensors), passively collected data, and machine-learning generated algorithms” [ 153 ]; Muller-Wirtz and Volk stated that “Big Data can be defined as extremely large datasets to be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions” [ 166 ]; and Eckardt et al. referred to Big Data science as the “application of mathematical techniques to large data sets to infer probabilities for prediction and find novel patterns to enable data driven decisions” [ 218 ]. Other definitions also include the techniques required for data analysis. For example, van den Heuvel et al. stated that “these information assets (characterized by high Volume, Velocity, and Variety) require specific technology and analytical methods for its transformation into Value” [ 153 ]; and according to Banik and Bandyopadhyay, the term “Big Data encompassed massive data sets having large, more varied, and complex structure with the difficulties of storing, analyzing, and visualizing for further processes or results” [ 219 ]. Thus, what constitutes Big Data in Neuroscience and Neurology is neither established nor always aligned with the definition of Big Data outside of these fields.

In addition, in the fields of Neuroscience and Neurology, some V’s are often incompletely considered or even dismissed. At present, data from “Big Data” Neuroscience studies are often just big and sometimes multimodal, while Neurology “Big Data” studies are often characterized by small multimodal datasets. Incorporating all the V’s into studies might spur innovation. The area of research focused on OUD treatments is a particularly salient example. Adding “Volume” to OUD studies by integrating OUD patient databases, as has been done for other diseases, could lead to better use of Big Data techniques and ultimately help in understanding the underlying disease and developing new treatments (e.g., see the work of Slade et al. discussed above [ 122 ]). Similarly, adding “Velocity” to OUD studies by developing technology for increasing dataflow (e.g., integrating clinical data collected during hospital visits with home monitoring signals collected with mobile apps) might allow Big Data techniques to uncover data patterns that could ultimately translate into new, personalized OUD treatments. In this vein, Variety in OUD studies could significantly add to the clinical toolbox of caregivers and of researchers developing new technologies. For example, infoveillance of social media combined with machine learning algorithms, such as those developed for use during the COVID-19 pandemic [ 220 ], could be used to assess the stigma associated with potential treatment options for OUD patients and to quantify potential methods to lower patient treatment hesitancy. As for data Veracity, additional metrics of veracity could be garnered from clinical data sets to further the assessment of the internal and external validity of trial results. For example, in OUD, Big Data sets could be used to assess the validity of self-reported opioid use, such as data gathered from drug diaries, against other components of the data set (e.g., social media presence, sleep patterns, biospecimens, etc.). Finally, while we characterized Value herein as direct or indirect in terms of clinical utility, one could assign economic value to Neuroscience and Neurology data sets through health economics methods. For example, in the OUD patient population, CEA or cost-benefit analysis methodologies could be used to quantify the value of the data in health economics terms and guide policy makers in the design of studies or programs for aiding OUD treatment.

Finally, the rapid growth of Big Data in Neuroscience and Neurology has brought to the forefront ethical considerations that must be addressed [ 221 , 222 ]. For example, a perennial concern is data security and how best to manage patient confidentiality [ 223 ]. In the US, current laws and regulations require that SUD treatment information be kept separate from a patient’s EHR, which can limit Big Data approaches to improving OUD treatment [ 158 ]. The cost versus benefit of making this information more accessible poses ethical challenges, as there are risks in acquiring such sensitive protected health information (PHI). As of November 28, 2022, the US Health and Human Services Department, through the Office for Civil Rights (OCR) and the Substance Abuse and Mental Health Services Administration (SAMHSA), has put forth proposed rule modifications and requested public comments on the issue [ 224 ]. Ultimately, as the use of Big Data in the treatment of neurological patients progresses, such challenges will need to be addressed in a manner that provides the most benefit to the patient with minimal risk [ 225 , 226 ].

This paper has provided a comprehensive analysis of how Big Data has influenced Neuroscience and Neurology, with an emphasis on the clinical treatment of a broad sample of neurological disorders. It has highlighted emerging trends, identified limitations of current approaches, and proposed possible methodologies to overcome these limitations. Such a comprehensive review can foster further innovation by enabling readers to identify unmet needs and fill them with a Mendeleyevization-based approach; to compare how different (but related) areas have been advancing and assess whether a solution from one area can be applied to another (Cross-disciplinarization); or to use Big Data to enhance traditional solutions to a problem (Implantation) [ 227 ]. This paper has also tackled the application of the classic 5 V’s or 3 V’s definitions of Big Data to Neuroscience and Neurology, an aspect that has been overlooked in previous literature. Reviewing the literature from this perspective has helped highlight the limitations of current Big Data studies, which, as a result, rarely take advantage of the AI methods typical of Big Data analytics. This particularly affects the treatment of neurological disorders, which are highly heterogeneous in both symptom presentation and etiology and would benefit significantly from the application of these methods. At the same time, assessing the missing V’s of Big Data can provide a basis for improving study design. In light of our findings, we recommend that future research focus on the following areas:

Augment and standardize the way the 5 V’s are currently defined and implemented, since not all “Big Data” studies are truly “Big Data” studies.

Encourage collaborative, multi-center studies: especially in clinical research, adding Volume might help overcome the limitations of classical RCTs (e.g., type II error).

Leverage new technologies for real-time data collection: for diseases characterized by time-varying patterns of symptoms, higher data Velocity, such as that achieved with home monitoring or wearables, might help personalize treatments and/or improve treatment effectiveness.

Diversify the data types collected in the clinic and/or home: data Variety can help uncover patterns in patient subtypes or treatment responses.

Enforce protocols for data harmonization to improve Veracity.

Consider each V in terms of Value, and identify ways to categorize and increase the Value of a study, since adding V’s might amplify study costs (and not all data is preclinically or clinically meaningful).

Funding agencies should encourage initiatives aimed at educating junior and established scientists on the methods, tools, and resources that Big Data challenges require.

When new methods, techniques, or technologies are developed, or simply capture the attention of researchers in a field, that field often changes trajectory. In Neuroscience and Neurology, the use of Big Data has been an evolving trend, as evident from our review of over 300 papers and 120 databases. We discussed how Big Data is altering the course of these fields by leveraging computational tools to develop innovative treatments for neurological diseases, a major global health concern. While our analysis has identified significant advancements made in the fields, we also note that the use of Big Data remains fragmented. Nevertheless, we view this as an opportunity for progress in these rapidly developing fields, which can ultimately benefit patients with improved diagnosis and treatment options.

Availability of data and materials

Data sharing is not applicable to this survey article as no primary research datasets were generated during the survey (further, all data survey material is included in the manuscript and/or Additional file 1 ).

Change history

28 July 2023.

The clean version of ESM has been updated.

Abbreviations

Artificial Intelligence

Multiple Sclerosis

United States

National Institutes of Health

Volume, Variety, Velocity, Veracity, and Value

Alzheimer’s Disease

Parkinson’s Disease

Substance Use Disorder

Brain Mapping by Integrated Neurotechnologies for Disease Studies

Human Connectome Project

Massachusetts General Hospital

University of California Los Angeles

Brain Activity Map Project

Alzheimer’s Disease Neuroimaging Initiative

Enhancing Neuroimaging Genetics through Meta-Analysis

Electron Microscopy

Two-photon Fluorescence Microscopy

Magnetic Resonance Imaging

Diffusion Tensor Imaging

Functional Magnetic Resonance Imaging

Resting State Magnetic Resonance Imaging

Task Functional Magnetic Resonance Imaging

Diffusion Magnetic Resonance Imaging

Magnetoencephalography

Electroencephalography

Positron Emission Tomography

Cerebrospinal Fluid

Major Depressive Disorder

Transcranial Magnetic Stimulation

Randomized Controlled Trial

Inflammatory Bowel Disease

Anti-Tumor Necrosis Factor

Attention Deficit Hyperactivity Disorder

Food and Drug Administration

Electronic Health Records

High Performance Computing

Real World Evidence

Deep Brain Stimulation

Non-Invasive Brain Stimulation

Parkinson’s Progression Markers Initiative

European Union

Opioid Use Disorder

Alcohol Use Disorder Identification Test

Carpal Tunnel Syndrome

Lower Back Pain

Volume, Variety, and Velocity

Clinical Trials Network

Common Data Elements

Statistical Parametric Mapping

Analysis of Functional NeuroImages

FMRIB Software Library (FSL)

Fusion of Neuroimaging Processing

Genome-Wide Association Study

A grid-based e-Infrastructure for neuroimaging research

American Heart Association

Integrated Motion Analysis Suite

Electrosonic Stimulation

Cost-Effectiveness Analysis

Protected Health Information

Office for Civil Rights

Substance Abuse and Mental Health Services Administration

Switzerland

United Kingdom

South Korea

Healthy and Pathology

Chronic Back Pain

Fibromyalgia

Irritable Bowel Syndrome

Neurodegenerative Disease

Cerebral Palsy

Computed Tomography

Single-Photon Emission Computerized Tomography

Second Capture

Spinal Muscular Atrophy

Structural Magnetic Resonance Imaging

Alzheimer’s Disease and Related Dementias

Electro-Corticography

Event Related Potential

Intracranial Electroencephalography

Electromyography

Central Nervous System

Autism Spectrum Disorder

Arterial Spin Labeling

In Situ Hybridization

Intensive Care Unit

National Science Foundation

Fixed Studies

Fixed (Updates Anticipated)

Open/Closed To Uploads

Pre-Clinical

Pre-Clinical and Clinical

Mushroom Body

Data verified through automated analytical process (AI, statistical methods)

Manual Verification

Dependent on Methodological Limitations

Mobile App Realtime Dependent

Social Media Dependent

Hospital upload Dependent

Spinal Cord Stimulation

Osteoarthritis

Interstitial Cystitis

Bladder Pain Syndrome

Visual Analog Scale

National Institute of Aging

National Institute of Diabetes and Digestive and Kidney Diseases

National Institute of Neurological Disorders and Stroke

National Institute of Arthritis and Musculoskeletal and Skin Diseases

National Institute on Drug Abuse

Massachusetts Institute of Technology DoDC. Neurosciences Research Program Records, AC-0107, box X (Schmitt, Francis Otto). 1986 [Available from: https://archivesspace.mit.edu/repositories/2/resources/83 ].

Trappenberg TP. Fundamentals of Computational Neuroscience. United States: Oxford University Press; 2010.

Reed JL, Kaas JH. Statistical analysis of large-scale neuronal recording data. Neural Netw. 2010;23(6):673–84.

Ikegaya Y, Aaron G, Cossart R, Aronov D, Lampl I, Ferster D, et al. Synfire chains and cortical songs: temporal modules of cortical activity. Science. 2004;304(5670):559–64.

Chung JE, Sellers KK, Leonard MK, Gwilliams L, Xu D, Dougherty ME, et al. High-density single-unit human cortical recordings using the Neuropixels probe. Neuron. 2022;110(15):2409–21.

Pnevmatikakis EA, Soudry D, Gao Y, Machado TA, Merel J, Pfau D, et al. Simultaneous denoising, deconvolution, and demixing of calcium imaging data. Neuron. 2016;89(2):285–99.

Scheffer LK, Xu CS, Januszewski M, Lu Z, Takemura SY, Hayworth KJ, et al. A connectome and analysis of the adult Drosophila central brain. Elife. 2020 Sep 7;9:e57443.

Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, et al. The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage. 2013;80:105–24.

Elam JS, Glasser MF, Harms MP, Sotiropoulos SN, Andersson JLR, Burgess GC, et al. The human connectome project: a retrospective. Neuroimage. 2021;244: 118543.

Kumar DR, Aslinia F, Yale SH, Mazza JJ. Jean-Martin Charcot: the father of neurology. Clin Med Res. 2011;9(1):46–9.

Didi-Huberman G. Invention of Hysteria: Charcot and the Photographic Iconography of the Salpêtrière. Cambridge, MA: MIT Press; 2003. p. 373.

Li X, Guo N, Li Q. Functional neuroimaging in the New Era of Big Data. Genomics Proteomics Bioinform. 2019;17(4):393–401.

Bethlehem RAI, Seidlitz J, White SR, Vogel JW, Anderson KM, Adamson C, et al. Brain charts for the human lifespan. Nature. 2022;604(7906):525–33.

Veitch DP, Weiner MW, Aisen PS, Beckett LA, DeCarli C, Green RC, et al. Using the Alzheimer’s Disease neuroimaging initiative to improve early detection, diagnosis, and treatment of Alzheimer’s disease. Alzheimers Dement. 2022;18(4):824–57.

Demro C, Mueller BA, Kent JS, Burton PC, Olman CA, Schallmo MP, et al. The psychosis human connectome project: an overview. Neuroimage. 2021;241: 118439.

Kim SJ, Marsch LA, Hancock JT, Das AK. Scaling Up research on drug abuse and addiction through social media Big Data. J Med Internet Res. 2017;19(10): e353.

Drysdale AT, Grosenick L, Downar J, Dunlop K, Mansouri F, Meng Y, et al. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat Med. 2017;23(1):28–38.

Xia M, Liu J, Mechelli A, Sun X, Ma Q, Wang X, et al. Connectome gradient dysfunction in major depression and its association with gene expression profiles and treatment outcomes. Mol Psychiatry. 2022;27(3):1384–93.

Wheatley M. Google’s latest AI tools help doctors read medical records faster. 2020 [cited 2022]. Available from: https://siliconangle.com/2020/11/10/googles-latest-ai-tools-help-doctors-read-medical-records-faster/ .

Nasralah T, El-Gayar O, Wang Y. Social media text mining framework for drug abuse: development and validation study with an opioid crisis case analysis. J Med Internet Res. 2020;22(8): e18350.

Elements of this image (Figure 1) and Figure 5 were developed from images sourced under Public Domain, Creative Commons, Wikimedia Commons, and/or GNU Free Documentation License from Public Domain, Wikipedia, Wikimedia Commons, and Freesvg.org sources.

Glickstein M. Golgi and Cajal: the neuron doctrine and the 100th anniversary of the 1906 Nobel Prize. Curr Biol. 2006;16(5):R147–51.

Schwiening CJ. A brief historical perspective: Hodgkin and Huxley. J Physiol. 2012;590(11):2571–5.

McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biol. 1990;52(1–2):99–115.

Fornito A, Zalesky A, Breakspear M. The connectomics of brain disorders. Nat Rev Neurosci. 2015;16(3):159–72.

Galenus. Galeni Opera Librorum Sexta Classis De Cucurbitulis, Scarificationibus, Hirudinibus, & Phlebotomia praecipuo artis remedio tradit. Iunta; 1586; 6.

Tremblay P, Dick AS. Broca and Wernicke are dead, or moving past the classic model of language neurobiology. Brain Lang. 2016;162:60–71.

Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts H. Artificial intelligence in radiology. Nat Rev Cancer. 2018;18(8):500–10.

Nadgir R, Yousem DM. Neuroradiology: The Requisites. 4th ed. Amsterdam: Elsevier; 2016.

Van Essen DC, Ugurbil K. The future of the human connectome. Neuroimage. 2012;62(2):1299–310.

Bota M, Dong HW, Swanson LW. From gene networks to brain networks. Nat Neurosci. 2003;6(8):795–9.

Connectome Coordination Facility. Human Connectome Project: What is the Connectome Coordination Facility? 2011 [cited 2022]. Available from: https://www.humanconnectome.org .

Zheng Z, Lauritzen JS, Perlman E, Robinson CG, Nichols M, Milkie D, et al. A complete electron microscopy volume of the brain of adult drosophila melanogaster. Cell. 2018;174(3):730–43.

Damasio H, Grabowski T, Frank R, Galaburda AM, Damasio AR. The return of Phineas Gage: clues about the brain from the skull of a famous patient. Science. 1994;264(5162):1102–5.

Lewis J. Something hidden : a Biography of Wilder Penfield. 1st ed. Toronto, Ont. Garden City, N.Y.: Doubleday Canada; 1981. xiv, 311.

Wagner T, Valero-Cabre A, Pascual-Leone A. Noninvasive human brain stimulation. Annu Rev Biomed Eng. 2007. https://doi.org/10.1146/annurev.bioeng.9.061206.133100 .

Thompson PM, Jahanshad N, Ching CRK, Salminen LE, Thomopoulos SI, Bright J, et al. ENIGMA and global neuroscience: a decade of large-scale studies of the brain in health and disease across more than 40 countries. Transl Psychiatry. 2020;10(1):100.

NIH. Hope Through Research [cited 2022]. https://www.ninds.nih.gov/health-information/patient-caregiver-education/hope-through-research .

Feigin VL. The evolution of neuroepidemiology: marking the 40-year anniversary of publishing studies on epidemiology of neurological disorders. Neuroepidemiology. 2022;56(1):2–3.

Fregnac Y. Big data and the industrialization of neuroscience: a safe roadmap for understanding the brain? Science. 2017;358(6362):470–7.

Landhuis E. Neuroscience: Big brain, big data. Nature. 2017;541(7638):559–61.

Chen S, He Z, Han X, He X, Li R, Zhu H, et al. How Big Data and high-performance computing drive brain science. Genomics Proteomics Bioinform. 2019;17(4):381–92.

Van Horn JD. Bridging the brain and data sciences. Big Data. 2021;9(3):153–87.

Bassett DS, Sporns O. Network neuroscience. Nat Neurosci. 2017;20(3):353–64.

Liu Y, Luo Y, Naidech AM. Big Data in stroke: how to use big data to make the next management decision. Neurotherapeutics. 2023. https://doi.org/10.1007/s13311-023-01358-4 .

Helwegen K, Libedinsky I, van den Heuvel MP. Statistical power in network neuroscience. Trends Cogn Sci. 2023;27(3):282–301.

Tang Y, Chen D, Li X. Dimensionality reduction methods for brain imaging data analysis. ACM Comput Surveys. 2021;54(4):1–36.

Choudhury S, Fishman JR, McGowan ML, Juengst ET. Big data, open science and the brain: lessons learned from genomics. Front Hum Neurosci. 2014;8:239.

Ferguson AR, Nielson JL, Cragin MH, Bandrowski AE, Martone ME. Big data from small data: data-sharing in the ‘long tail’ of neuroscience. Nat Neurosci. 2014;17(11):1442–7.

The impact of the NIH BRAIN Initiative. Nat Methods. 2018;15(11):839.

Rethinking the brain. Nature. 2015;519(7544):389.

Mahfoud T. Visions of unification and integration: building brains and communities in the European human brain project. New Media Soc. 2021;23(2):322–43.

Okano H, Sasaki E, Yamamori T, Iriki A, Shimogori T, Yamaguchi Y, et al. Brain/MINDS: a japanese national brain project for marmoset neuroscience. Neuron. 2016;92(3):582–90.

Auger SD, Jacobs BM, Dobson R, Marshall CR, Noyce AJ. Big data, machine learning and artificial intelligence: a neurologist’s guide. Pract Neurol. 2020;21(1):4–11.

Vu MT, Adali T, Ba D, Buzsaki G, Carlson D, Heller K, et al. A shared vision for machine learning in neuroscience. J Neurosci. 2018;38(7):1601–7.

Nenning KH, Langs G. Machine learning in neuroimaging: from research to clinical practice. Radiologie. 2022;62(Suppl 1):1–10.

Dinsdale NK, Bluemke E, Sundaresan V, Jenkinson M, Smith SM, Namburete AIL. Challenges for machine learning in clinical translation of big data imaging studies. Neuron. 2022;110(23):3866–81.

Dipietro L, Elkin-Frankston S, Ramos-Estebanez C, Wagner T. Supercomputing in the Study and Stimulation of the Brain. In: Milutinović V, Kotlar M, editors. Handbook of Research on Methodologies and Applications of Supercomputing. Pennsylvania: IGI Global; 2021.

Briscoe J, Marin O. Looking at neurodevelopment through a big data lens. Science. 2020. https://doi.org/10.1126/science.aaz8627 .

Sporns O, Tononi G, Kotter R. The human connectome: a structural description of the human brain. PLoS Comput Biol. 2005;1(4): e42.

Abbott A. How the world’s biggest brain maps could transform neuroscience. Nature. 2021;598(7879):22–5.

Sporns O. The human connectome: a complex network. Ann N Y Acad Sci. 2011;1224:109–25.

Connectome NP. Connectome. Proc Natl Acad Sci USA. 2013;110(15):5739.

Alivisatos AP, Chun M, Church GM, Greenspan RJ, Roukes ML, Yuste R. The brain activity map project and the challenge of functional connectomics. Neuron. 2012;74(6):970–4.

Alivisatos AP, Chun M, Church GM, Deisseroth K, Donoghue JP, Greenspan RJ, et al. Neuroscience. Brain Activity Map Sci. 2013;339(6125):1284–5.

White JG, Southgate E, Thomson JN, Brenner S. The structure of the nervous system of the nematode Caenorhabditis elegans. Philos Trans R Soc Lond B Biol Sci. 1986;314(1165):1–340.

Scannell JW, Blakemore C, Young MP. Analysis of connectivity in the cat cerebral cortex. J Neurosci. 1995;15(2):1463–83.

Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb Cortex. 1991;1(1):1–47.

Young MP. Objective analysis of the topological organization of the primate cortical visual system. Nature. 1992;358(6382):152–5.

Wanner AA, Friedrich RW. Whitening of odor representations by the wiring diagram of the olfactory bulb. Nat Neurosci. 2020;23(3):433–42.

Ohyama T, Schneider-Mizell CM, Fetter RD, Aleman JV, Franconville R, Rivera-Alba M, et al. A multilevel multimodal circuit enhances action selection in Drosophila. Nature. 2015;520(7549):633–9.

van den Heuvel MP, Sporns O. Rich-club organization of the human connectome. J Neurosci. 2011;31(44):15775–86.

Connectome Coordination Facility. HCP Lifespan Studies [cited 2022. Available from: https://www.humanconnectome.org/lifespan-studies ].

Van Essen DC, Donahue CJ, Coalson TS, Kennedy H, Hayashi T, Glasser MF. Cerebral cortical folding, parcellation, and connectivity in humans, nonhuman primates, and mice. Proc Natl Acad Sci USA. 2019;116(52):26173–80.

Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W, et al. The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin N Am. 2005;15(4):869–77.

Weiner MW, Aisen PS, Jack CR Jr, Jagust WJ, Trojanowski JQ, Shaw L, et al. The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimers Dement. 2010;6(3):202–11.

Alzheimer’s Disease Neuroimaging Initiative. About ADNI 2017 [cited 2022. Available from: https://adni.loni.usc.edu/about/ ].

Toga AW, Crawford KL. The Alzheimer’s disease neuroimaging initiative informatics core: a decade in review. Alzheimers Dement. 2015;11(7):832–9.

Weiner MW, Veitch DP. Introduction to special issue: overview of Alzheimer’s disease neuroimaging initiative. Alzheimers Dement. 2015;11(7):730–3.

Association As. Alzheimer’s Association Takes On Leadership Role In Landmark Alzheimer’s Biomarker Study—Known As ADNI4—To Convene Private Partner Scientific Board Chicago: Alzheimer’s Association 2022 https://www.alz.org/news/2022/alzheimers-association-takes-on-leadership-role-in [Accessed from 14 Oct 2022].

(NCIRE) NCIfRaE. Major study of Alzheimer’s disease to focus on including people from underrepresented communities 2022 https://www.ncire.org/research-news .

Thompson PM, Jahanshad N, Schmaal L, Turner JA, Winkler AM, Thomopoulos SI, et al. The enhancing neuroimaging genetics through meta-analysis consortium: 10 years of global collaborations in human brain mapping. Hum Brain Mapp. 2022;43(1):15–22.

Bearden CE, Thompson PM. Emerging global initiatives in neurogenetics: the enhancing neuroimaging genetics through meta-analysis (ENIGMA) consortium. Neuron. 2017;94(2):232–6.

Stein JL, Medland SE, Vasquez AA, Hibar DP, Senstad RE, Winkler AM, et al. Identification of common variants associated with human hippocampal and intracranial volumes. Nat Genet. 2012;44(5):552–61.

Hibar DP, Adams HHH, Jahanshad N, Chauhan G, Stein JL, Hofer E, et al. Novel genetic loci associated with hippocampal volume. Nat Commun. 2017;8:13624.

Schmaal L, Hibar DP, Samann PG, Hall GB, Baune BT, Jahanshad N, et al. Cortical abnormalities in adults and adolescents with major depression based on brain scans from 20 cohorts worldwide in the ENIGMA major depressive disorder working group. Mol Psychiatry. 2017;22(6):900–9.

Hibar DP, Westlye LT, Doan NT, Jahanshad N, Cheung JW, Ching CRK, et al. Cortical abnormalities in bipolar disorder: an MRI analysis of 6503 individuals from the ENIGMA bipolar disorder working group. Mol Psychiatry. 2018;23(4):932–42.

Sun BB, Loomis SJ, Pizzagalli F, Shatokhina N, Painter JN, Foley CN, et al. Genetic map of regional sulcal morphology in the human brain from UK biobank data. Nat Commun. 2022;13(1):6071.

Zhao B, Luo T, Li T, Li Y, Zhang J, Shan Y, et al. Genome-wide association analysis of 19,629 individuals identifies variants influencing regional brain volumes and refines their genetic co-architecture with cognitive and mental health traits. Nat Genet. 2019;51(11):1637–44.

Smith SM, Douaud G, Chen W, Hanayik T, Alfaro-Almagro F, Sharp K, et al. An expanded set of genome-wide association studies of brain imaging phenotypes in UK Biobank. Nat Neurosci. 2021;24(5):737–45.

Brainstorm C, Anttila V, Bulik-Sullivan B, Finucane HK, Walters RK, Bras J, et al. Analysis of shared heritability in common disorders of the brain. Science. 2018. https://doi.org/10.1126/science.aap8757 .

Cao M, Wang Z, He Y. Connectomics in psychiatric research: advances and applications. Neuropsychiatr Dis Treat. 2015;11:2801–10.

Bullmore E, Sporns O. Complex brain networks: graph theoretical analysis of structural and functional systems. Nat Rev Neurosci. 2009;10(3):186–98.

He Y, Evans A. Graph theoretical modeling of brain connectivity. Curr Opin Neurol. 2010;23(4):341–50.

Chong CD, Schwedt TJ, Hougaard A. Brain functional connectivity in headache disorders: a narrative review of MRI investigations. J Cereb Blood Flow Metab. 2019;39(4):650–69.

Yang J, Gohel S, Vachha B. Current methods and new directions in resting state fMRI. Clin Imaging. 2020;65:47–53.

Alyass A, Turcotte M, Meyre D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics. 2015;8:33.

Lozano AM, Lipsman N. Probing and regulating dysfunctional circuits using deep brain stimulation. Neuron. 2013;77(3):406–24.

Sun R, Sohrabpour A, Worrell GA, He B. Deep neural networks constrained by neural mass models improve electrophysiological source imaging of spatiotemporal brain dynamics. Proc Natl Acad Sci USA. 2022;119(31): e2201128119.

Xiao M, Li Q, Feng H, Zhang L, Chen Y. Neural vascular mechanism for the cerebral blood flow autoregulation after hemorrhagic stroke. Neural Plast. 2017;2017:5819514.

Field D, Ammouche Y, Peña J-M, Jérusalem A. Machine learning based multiscale calibration of mesoscopic constitutive models for composite materials: application to brain white matter. Comput Mech. 2021;67(6):1629–43.

Article   MathSciNet   MATH   Google Scholar  

Tamura H, Prokott KE, Fleming RW. Distinguishing mirror from glass: a “Big Data” approach to material perception. J Vis. 2022;22(4):4.

Tian Y-h, Chen X-l, Xiong H-k, Li H-l, Dai L-r, Chen J, et al. Towards human-like and transhuman perception in AI 2.0: a review. Front Informa Technol Elec Eng. 2017;18(1):58–67.

Santuz A, Ekizos A, Janshen L, Mersmann F, Bohm S, Baltzopoulos V, et al. Modular control of human movement during running: an open access data set. Front Physiol. 2018;9:1509.

Levey DF, Stein MB, Wendt FR, Pathak GA, Zhou H, Aslan M, et al. Bi-ancestral depression GWAS in the Million Veteran Program and meta-analysis in >1.2 million individuals highlight new therapeutic directions. Nat Neurosci. 2021;24(7):954–63.

Munir K, Ramón-Fernández Ad, Iqbal S, Javaid N. Neuroscience patient identification using big data and fuzzy logic–an Alzheimer’s disease case study. Expert Syst Appl. 2019;136:410–25.

Eshaghi A, Young AL, Wijeratne PA, Prados F, Arnold DL, Narayanan S, et al. Identifying multiple sclerosis subtypes using unsupervised machine learning and MRI data. Nat Commun. 2021;12(1):2078.

Mitelpunkt A, Galili T, Kozlovski T, Bregman N, Shachar N, Markus-Kalish M, et al. Novel Alzheimer’s disease subtypes identified using a data and knowledge driven strategy. Sci Rep. 2020;10(1):1327.

Wu J, Gao Y, Malik V, Gao X, Shan R, Lv J, et al. Prevalence and risk factors of MRI-defined brain infarcts among Chinese adults. Front Neurol. 2022;13: 967077.

Ma C, Zhang W, Mao L, Zhang G, Shen Y, Chang H, et al. Hyperhomocysteinemia and intracranial aneurysm: a mendelian randomization study. Front Neurol. 2022;13: 948989.

Wu W, Zhang Y, Jiang J, Lucas MV, Fonzo GA, Rolle CE, et al. An electroencephalographic signature predicts antidepressant response in major depression. Nat Biotechnol. 2020;38(4):439–47.

Barbanti P, Egeo G, Aurilia C, Fiorentini G, Proietti S, Tomino C, et al. The first report of the Italian Migraine Registry (I-GRAINE). Neurol Sci. 2022;43(9):5725–8.

McCarthy A. The biomarker future is digital. Inside Prec Med. 2020. https://doi.org/10.1089/clinomi.07.01.19 .

Kiral-Kornek I, Roy S, Nurse E, Mashford B, Karoly P, Carroll T, et al. Epileptic seizure prediction using big data and deep learning: toward a mobile system. EBioMedicine. 2018;27:103–11.

Bot BM, Suver C, Neto EC, Kellen M, Klein A, Bare C, et al. The mPower study, Parkinson disease mobile data collected using ResearchKit. Sci Data. 2016;3: 160011.

Prince J, Arora S, de Vos M. Big data in Parkinson’s disease: using smartphones to remotely detect longitudinal disease phenotypes. Physiol Meas. 2018;39(4): 044005.

Mayo CS, Matuszak MM, Schipper MJ, Jolly S, Hayman JA, Ten Haken RK. Big Data in designing clinical trials: opportunities and challenges. Front Oncol. 2017;7:187.

Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn’t. BMJ. 1996;312(7023):71–2.

Hemphill JC 3rd. Pro: neurocritical care Big Data and AI: it’s about expertise. Neurocrit Care. 2022;37(Suppl 2):160–2.

Peter I, Dubinsky M, Bressman S, Park A, Lu C, Chen N, et al. Anti-Tumor necrosis factor therapy and incidence of Parkinson disease among patients with inflammatory bowel disease. JAMA Neurol. 2018;75(8):939–46.

Olsen AL, Riise T, Scherzer CR. Discovering new benefits from old drugs with Big Data-promise for parkinson disease. JAMA Neurol. 2018;75(8):917–20.

Slade E, Dwoskin LP, Zhang GQ, Talbert JC, Chen J, Freeman PR, et al. Integrating data science into the translational science research spectrum: a substance use disorder case study. J Clin Transl Sci. 2020;5(1): e29.

Yu YX, Yu XD, Cheng QZ, Tang L, Shen MQ. The association of serum vitamin K2 levels with Parkinson’s disease: from basic case-control study to big data mining analysis. Aging. 2020;12(16):16410–9.

FDA. Unleashing the Power of Data Washington D.C. [updated 9/6/22. Available from: https://www.fda.gov/science-research/focus-areas-regulatory-science-report/unleashing-power-data ].

Mikailov M, Weizhe L, Petrick N, Guo Y, Xu L, Weaver J, et al. High Performance Computing Techniques for Big Data Processing: FDA; 2021 [cited 2022. Available from: https://www.fda.gov/science-research/fda-science-forum/high-performance-computing-techniques-big-data-processing ].

Desai RJ, Matheny ME, Johnson K, Marsolo K, Curtis LH, Nelson JC, et al. Broadening the reach of the FDA Sentinel system: a roadmap for integrating electronic health record data in a causal analysis framework. NPJ Digit Med. 2021;4(1):170.

FDA. Sentinel Initative 2022 https://www.fda.gov/safety/fdas-sentinel-initiative .

Warby SC, Wendt SL, Welinder P, Munk EG, Carrillo O, Sorensen HB, et al. Sleep-spindle detection: crowdsourcing and evaluating performance of experts, non-experts and automated methods. Nat Methods. 2014;11(4):385–92.

Doubal FN, Ali M, Batty GD, Charidimou A, Eriksdotter M, Hofmann-Apitius M, et al. Big data and data repurposing—using existing data to answer new questions in vascular dementia research. BMC Neurol. 2017;17(1):72.

Agoston DV, Langford D. Big Data in traumatic brain injury; promise and challenges. Concussion. 2017. https://doi.org/10.2217/cnc-2016-0013 .

Vrenken H, Jenkinson M, Pham DL, Guttmann CRG, Pareto D, Paardekooper M, et al. Opportunities for understanding MS mechanisms and progression with MRI using large-scale data sharing and artificial intelligence. Neurology. 2021;97(21):989–99.

Rodger JA. Discovery of medical Big Data analytics: improving the prediction of traumatic brain injury survival rates by data mining patient informatics processing software hybrid hadoop hive. Inform Med Unlocked. 2015. https://doi.org/10.1016/j.imu.2016.01.002 .

Hamza TH, Chen H, Hill-Burns EM, Rhodes SL, Montimurro J, Kay DM, et al. Genome-wide gene-environment study identifies glutamate receptor gene GRIN2A as a Parkinson’s disease modifier gene via interaction with coffee. PLoS Genet. 2011;7(8): e1002237.

de Lau LM, Breteler MM. Epidemiology of Parkinson’s disease. Lancet Neurol. 2006;5(6):525–35.

Parkinson’s Foundation. Parkinson’s Foundation: Better Live’s Together Available from: https://www.parkinson.org .

Tysnes OB, Storstein A. Epidemiology of Parkinson’s disease. J Neural Transm. 2017;124(8):901–5.

Fox SH, Katzenschlager R, Lim SY, Barton B, de Bie RMA, Seppi K, et al. International Parkinson and movement disorder society evidence-based medicine review: update on treatments for the motor symptoms of Parkinson’s disease. Mov Disord. 2018;33(8):1248–66.

Wagner T, Dipietro L. Novel Methods of Transcranial Stimulation: Electrosonic Stimulation. In: Neuromodulation: Comprehensive Textbook of Principles, Technologies, and Therapies. Editors: Krames P, Peckham H, Rezai A. Elsevier; 2018. p. 1619–26.

LONI, MJ Fox Foundation. Parkinson’s Progression Markers Initiative Available from: https://www.ppmi-info.org/ .

Dinov ID, Heavner B, Tang M, Glusman G, Chard K, Darcy M, et al. Predictive Big Data analytics: a study of Parkinson’s disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations. PLoS ONE. 2016;11(8): e0157077.

Nalls MA, Pankratz N, Lill CM, Do CB, Hernandez DG, Saad M, et al. Large-scale meta-analysis of genome-wide association data identifies six new risk loci for Parkinson’s disease. Nat Genet. 2014;46(9):989–93.

Wong JK, Middlebrooks EH, Grewal SS, Almeida L, Hess CW, Okun MS. A Comprehensive review of brain connectomics and imaging to improve deep brain stimulation outcomes. Mov Disord. 2020;35(5):741–51.

Hansen C, Sanchez-Ferro A, Maetzler W. How mobile health technology and electronic health records will change care of patients with Parkinson’s disease. J Parkinsons Dis. 2018;8(s1):S41–5.

Burton A. Smartphones versus Parkinson’s disease: i-PROGNOSIS. Lancet Neurol. 2020;19(5):385–6.

Zhao M, Yang CC. Drug repositioning to accelerate drug development using social media data: computational study on parkinson disease. J Med Internet Res. 2018;20(10): e271.

Article   MathSciNet   Google Scholar  

Kuusimaki T, Sainio J, Kurki S, Vahlberg T, Kaasinen V. Prediagnostic expressions in health records predict mortality in Parkinson’s disease: a proof-of-concept study. Parkinsonism Relat Disord. 2022;95:35–9.

Harrison PJ, Luciano S. Incidence of Parkinson’s disease, dementia, cerebrovascular disease and stroke in bipolar disorder compared to other psychiatric disorders: an electronic health records network study of 66 million people. Bipolar Disord. 2021;23(5):454–62.

Chen W, Kirkby L, Kotzev M, Song P, Gilron R, Pepin B. The role of large-scale data infrastructure in developing next-generation deep brain stimulation therapies. Front Hum Neurosci. 2021;15: 717401.

Wardell K, Nordin T, Vogel D, Zsigmond P, Westin CF, Hariz M, et al. Deep Brain stimulation: emerging tools for simulation, data analysis, and visualization. Front Neurosci. 2022;16: 834026.

Hallett M, de Haan W, Deco G, Dengler R, Di Iorio R, Gallea C, et al. Human brain connectivity: Clinical applications for clinical neurophysiology. Clin Neurophysiol. 2020;131(7):1621–51.

Tinaz S. Functional connectome in Parkinson’s disease and Parkinsonism. Curr Neurol Neurosci Rep. 2021;21(6):24.

Buckley C, Alcock L, McArdle R, Rehman RZU, Del Din S, Mazza C, et al. The role of movement analysis in diagnosing and monitoring neurodegenerative conditions: insights from gait and postural control. Brain Sci. 2019. https://doi.org/10.3390/brainsci9020034 .

van den Heuvel L, Dorsey RR, Prainsack B, Post B, Stiggelbout AM, Meinders MJ, et al. Quadruple decision making for Parkinson’s disease patients: combining expert opinion, patient preferences, scientific evidence, and Big Data approaches to reach precision medicine. J Parkinsons Dis. 2020;10(1):223–31.

Shen B, Lin Y, Bi C, Zhou S, Bai Z, Zheng G, et al. Translational informatics for parkinson’s disease: from big biomedical data to small actionable alterations. Genomics Proteomics Bioinform. 2019;17(4):415–29.

NIDA. Overdose Death Rates 2022 https://nida.nih.gov/research-topics/trends-statistics/overdose-death-rates .

Luo F, Li M, Florence C. State-Level economic costs of opioid use disorder and fatal opioid overdose—United States, 2017. Morb Mortal Weekly Rep (MMWR). 2021;70(15):541–6.

Volkow ND, Jones EB, Einstein EB, Wargo EM. Prevention and treatment of opioid misuse and addiction: a review. JAMA Psychiat. 2019;76(2):208–16.

Hayes CJ, Cucciare MA, Martin BC, Hudson TJ, Bush K, Lo-Ciganic W, et al. Using data science to improve outcomes for persons with opioid use disorder. Subst Abus. 2022;43(1):956–63.

Mackey S, Allgaier N, Chaarani B, Spechler P, Orr C, Bunn J, et al. Mega-Analysis of gray matter volume in substance dependence: general and substance-specific regional effects. Am J Psychiatry. 2019;176(2):119–28.

Sanchez-Roige S, Palmer AA, Fontanillas P, Elson SL, Adams MJ, et al. Genome-wide association study meta-analysis of the alcohol use disorders identification test (AUDIT) in two population-based cohorts. Am J Psychiatry. 2019;176(2):107–18.

Cuomo RE, Cai M, Shah N, Li J, Chen WH, Obradovich N, et al. Characterising communities impacted by the 2015 Indiana HIV outbreak: a Big Data analysis of social media messages associated with HIV and substance abuse. Drug Alcohol Rev. 2020;39(7):908–13.

Goldberg DS, McGee SJ. Pain as a global public health priority. BMC Public Health. 2011;11:770.

Yong RJ, Mullins PM, Bhattacharyya N. Prevalence of chronic pain among adults in the United States. Pain. 2022;163(2):e328–32.

Nijs J, Malfliet A, Ickmans K, Baert I, Meeus M. Treatment of central sensitization in patients with ‘unexplained’ chronic pain: an update. Expert Opin Pharmacother. 2014;15(12):1671–83.

Zaslansky R, Rothaug J, Chapman CR, Backstrom R, Brill S, Fletcher D, et al. PAIN OUT: the making of an international acute pain registry. Eur J Pain. 2015;19(4):490–502.

Muller-Wirtz LM, Volk T. Big Data in studying acute pain and regional anesthesia. J Clin Med. 2021. https://doi.org/10.3390/jcm10071425 .

Mukasa D, Sung J. A prediction model of low back pain risk: a population based cohort study in Korea. Korean J Pain. 2020;33(2):153–65.

Lotsch J, Lippmann C, Kringel D, Ultsch A. Integrated computational analysis of genes associated with human hereditary insensitivity to pain a drug repurposing perspective. Front Mol Neurosci. 2017. https://doi.org/10.3389/fnmol.2017.00252/full .

Ultsch A, Kringel D, Kalso E, Mogil JS, Lotsch J. A data science approach to candidate gene selection of pain regarded as a process of learning and neural plasticity. Pain. 2016;157(12):2747–57.

Wu J, Zhang J, Xu T, Pan Y, Cui B, Wei W, et al. The necessity or not of the addition of fusion to decompression for lumbar degenerative spondylolisthesis patients: a PRISMA compliant meta-analysis. Medicine. 2021;100(14): e24775.

Lin Z, He L. Intra-Articular injection of PRP in the treatment of knee osteoarthritis using Big Data. J Healthc Eng. 2021;2021:4504155.

Rossi-deVries J, Pedoia V, Samaan MA, Ferguson AR, Souza RB, Majumdar S. Using multidimensional topological data analysis to identify traits of hip osteoarthritis. J Magn Reson Imaging. 2018;48(4):1046–58.

Perlmutter JS, Mink JW. Deep brain stimulation. Annu Rev Neurosci. 2006. https://doi.org/10.1146/annurev.neuro.29.051605.112824 .

Tehovnik EJ. Electrical stimulation of neural tissue to evoke behavioral responses. J Neurosci Methods. 1996;65(1):1–17.

Yeomans JS. Principles of Brain Stimulation. London: Oxford University Press; 1990. p. 182.

McIntyre CC, Mori S, Sherman DL, Thakor NV, Vitek JL. Electric field and stimulating influence generated by deep brain stimulation of the subthalamic nucleus. Clin Neurophysiol. 2004;115(3):589–95.

Wagner T, Zahn M, Wedeen VJ, Grodzinsky A, Pascual-Leone A. Transcranial Magnetic Stimulation: High Resolution Tracking of the Induced Current Density in the Individual Human Brain. 12th Annual Meeting of Human Brain mapping; 2006; Florence, Italy: OHBM.

Sillery E, Bittar RG, Robson MD, Behrens TE, Stein J, Aziz TZ, et al. Connectivity of the human periventricular-periaqueductal gray region. J Neurosurg. 2005;103(6):1030–4.

Riva-Posse P, Choi KS, Holtzheimer PE, McIntyre CC, Gross RE, Chaturvedi A, et al. Defining critical white matter pathways mediating successful subcallosal cingulate deep brain stimulation for treatment-resistant depression. Biol Psychiatry. 2014;76(12):963–9.

Horn A, Reich M, Vorwerk J, Li N, Wenzel G, Fang Q, et al. Connectivity predicts deep brain stimulation outcome in Parkinson disease. Ann Neurol. 2017;82(1):67–78.

Weiss D, Landoulsi Z, May P, Sharma M, Schupbach M, You H, et al. Genetic stratification of motor and QoL outcomes in Parkinson’s disease in the EARLYSTIM study. Parkinsonism Relat Disord. 2022;103:169–74.

Artusi CA, Dwivedi AK, Romagnolo A, Pal G, Kauffman M, Mata I, et al. Association of subthalamic deep brain stimulation with motor, functional, and pharmacologic outcomes in patients with monogenic Parkinson disease: a systematic review and meta-analysis. JAMA Netw Open. 2019;2(2): e187800.

Wagner T, Eden U, Rushmore J, Russo CJ, Dipietro L, Fregni F, et al. Impact of brain tissue filtering on neurostimulation fields: a modeling study. Neuroimage. 2014;85(Pt 3):1048–57.

Corp DT, Bereznicki HGK, Clark GM, Youssef GJ, Fried PJ, Jannati A, et al. Large-scale analysis of interindividual variability in single and paired-pulse TMS data. Clin Neurophysiol. 2021;132(10):2639–53.

Corp DT, Bereznicki HGK, Clark GM, Youssef GJ, Fried PJ, Jannati A, et al. Large-scale analysis of interindividual variability in theta-burst stimulation data: results from the ‘Big TMS Data Collaboration.’ Brain Stimul. 2020;13(5):1476–88.

Quality AfHRa. Development of Harmonized Outcome Measures for Use in Patient Registries and Clinical Practice: Methods and Lessons Learned. U.S. Department of Health and Human Services; 2020 2020.

ASPE. Harmonization of Clinical Data Element Definitions for Outcome Measures in Registries https://aspe.hhs.gov/harmonization-clinical-data-element-definitions-outcome-measures-registries .

NIH. Data Harmonization Projects 2014 https://nida.nih.gov/research/research-data-measures-resources/data-harmonization-projects .

Esteban O, Birman D, Schaer M, Koyejo OO, Poldrack RA, Gorgolewski KJ. MRIQC: Advancing the automatic prediction of image quality in MRI from unseen sites. PLoS ONE. 2017;12(9): e0184661.

Takao H, Hayashi N, Ohtomo K. Effect of scanner in longitudinal studies of brain volume changes. J Magn Reson Imaging. 2011;34(2):438–44.

Monte-Rubio GC, Segura B, Strafella AP, van Eimeren T, Ibarretxe-Bilbao N, Diez-Cirarda M, et al. Parameters from site classification to harmonize MRI clinical studies: application to a multi-site Parkinson’s disease dataset. Hum Brain Mapp. 2022;43(10):3130–42.

Chen D, Tashman K, Palmer DS, Neale B, Roeder K, Bloemendal A, et al. A data harmonization pipeline to leverage external controls and boost power in GWAS. Hum Mol Genet. 2022;31(3):481–9.

Gliklich RE, Leavy MB, Dreyer NA, editors. Tools and Technologies for Registry Interoperability, Registries for Evaluating Patient Outcomes: A User’s Guide, 3rd Edition, Addendum 2 ( https://www.ncbi.nlm.nih.gov/books/NBK551879/ ). AHRQ Methods for Effective Health Care. Rockville (MD) 2019.

Park BY, Byeon K, Park H. FuNP (Fusion of Neuroimaging Preprocessing) pipelines: a fully automated preprocessing software for functional magnetic resonance imaging. Front Neuroinform. 2019;13:5.

Biswal BB, Mennes M, Zuo XN, Gohel S, Kelly C, Smith SM, et al. Toward discovery science of human brain function. Proc Natl Acad Sci USA. 2010;107(10):4734–9.

Mennes M, Biswal BB, Castellanos FX, Milham MP. Making data sharing work: the FCP/INDI experience. Neuroimage. 2013;82:683–91.

GWAS. GWAS Central [cited 2022. Available from: https://www.gwascentral.org ].

Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

Fortin JP, Cullen N, Sheline YI, Taylor WD, Aselcioglu I, Cook PA, et al. Harmonization of cortical thickness measurements across scanners and sites. Neuroimage. 2018;167:104–20.

Yu M, Linn KA, Cook PA, Phillips ML, McInnis M, Fava M, et al. Statistical harmonization corrects site effects in functional connectivity measurements from multi-site fMRI data. Hum Brain Mapp. 2018;39(11):4213–27.

Pinto MS, Paolella R, Billiet T, Van Dyck P, Guns PJ, Jeurissen B, et al. Harmonization of brain diffusion MRI: concepts and methods. Front Neurosci. 2020;14:396.

Jovicich J, Barkhof F, Babiloni C, Herholz K, Mulert C, van Berckel BNM, et al. Harmonization of neuroimaging biomarkers for neurodegenerative diseases: a survey in the imaging community of perceived barriers and suggested actions. Alzheimers Dement. 2019;11:69–73.

Mackey S, Kan KJ, Chaarani B, Alia-Klein N, Batalla A, Brooks S, et al. Genetic imaging consortium for addiction medicine: from neuroimaging to genes. Prog Brain Res. 2016;224:203–23.

Dash S, Shakyawar SK, Sharma M, Kaushik S. Big data in healthcare: management, analysis and future prospects. J Big Data. 2019;6(1):54.

Rafferty H, Rocha E, Gonzalez-Mego P, Ramos CL, El-Hagrassy MM, Gunduz ME, et al. Cost-Effectiveness analysis to inform randomized controlled trial design in chronic pain research: methods for guiding decisions on the addition of a run-in period. Princ Pract Clin Res. 2022;8(2):31–42.

Meier JM, Perdikis D, Blickensdorfer A, Stefanovski L, Liu Q, Maith O, et al. Virtual deep brain stimulation: multiscale co-simulation of a spiking basal ganglia model and a whole-brain mean-field model with The Virtual Brain. Exp Neurol. 2022;354: 114111.

Kass-Hout TA, Stevens LM, Hall JL. American heart association precision medicine platform. Circulation. 2018;137(7):647–9.

Olshannikova E, Ometov A, Koucheryavy Y, Olsson T. Visualizing Big Data with augmented and virtual reality: challenges and research agenda. Journal of Big Data. 2015;2(1):22.

Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793–5.

Subbiah V, Kurzrock R. Debunking the delusion that precision oncology is an illusion. Oncologist. 2017;22(8):881–2.

Clinicaltrials.gov. IMAS Optimization and Applicability in an Acute Stroke Setting 2022 [cited 2022. Available from: https://clinicaltrials.gov/ct2/show/NCT05469438 ].

Clinicaltrials.gov. Parkinson's Disease: Enhancing Physical Therapy With Brain Stimulation for Treating Postural Instability 2022 [cited 2022. Available from: https://clinicaltrials.gov/ct2/show/NCT03981055 ].

Clinicaltrials.gov. Noninvasive Brain Stimulation for Treating Carpal Tunnel Syndrome 2022 [cited 2022. Available from: https://clinicaltrials.gov/ct2/show/NCT04206215 ].

Sukpornchairak P, Shah Aka Khandelwal K, Hayek S, Connor C, Gonzalez-Mego P, Chitturu G, et al. Non-Invasive Brain Stimulation For Diabetic Neuropathic Pain. American Academy of Neurology Annual Meeting; 2022; Seattle.

Clinicaltrials.gov. Optimization of NIBS for Treatment of Addiction 2022 [cited 2022. Available from: https://clinicaltrials.gov/ct2/show/NCT04379115 .]

Wagner T, Ramos-Estebanez C, Hayek S, Parran T, Sukpornchairak P, Gonzalez-Mego P, et al. Noninvasive Brain Stimulation for Treating Chronic Pain and Addiction. Third Annual NIH HEAL Initiative Investigator Meeting; 2022 4/11/2022; Virtual NIH Conference.

Eckardt P, Bailey D, DeVon HA, Dougherty C, Ginex P, Krause-Parello CA, et al. Opioid use disorder research and the council for the advancement of nursing science priority areas. Nurs Outlook. 2020;68(4):406–16.

Banik A, Bandyopadhyay SK. Big-Data—a review on analysing 3Vs. J Sci Eng Res. 2016;3(1):21–4.

Mackey T, Purushothaman V, Li J, Shah N, Nali M, Bardier C, et al. Machine learning to detect self-reporting of symptoms, testing access, and recovery associated With COVID-19 on Twitter: retrospective big data infoveillance study. JMIR Public Health Surveill. 2020;6(2): e19509.

Ramos KM, Grady C, Greely HT, Chiong W, Eberwine J, Farahany NA, et al. The NIH BRAIN initiative: integrating neuroethics and neuroscience. Neuron. 2019;101(3):394–8.

Ienca M, Ferretti A, Hurst S, Puhan M, Lovis C, Vayena E. Considerations for ethics review of big data health research: a scoping review. PLoS ONE. 2018;13(10): e0204937.

Ferretti A, Ienca M, Sheehan M, Blasimme A, Dove ES, Farsides B, et al. Ethics review of big data research: what should stay and what should be reformed? BMC Med Ethics. 2021;22(1):51.

HHS Proposes New Protections to Increase Care Coordination and Confidentiality for Patients With Substance Use Challenges [press release]. November 28, 2022 2022.

Emerging Issues Task Force INS. Neuroethics at 15: the current and future environment for neuroethics. AJOB Neurosci. 2019;10(3):104–10.

Fothergill BT, Knight W, Stahl BC, Ulnicane I. Responsible data governance of neuroscience Big Data. Front Neuroinform. 2019;13:28.

Blagojević V, Bojić D, Bojović M, Cvetanović M, Đorđević J, Đurđević Đ, et al. Chapter One—A Systematic Approach to Generation of New Ideas for PhD Research in Computing. In: Hurson AR, Milutinović V, editors., et al., Advances in Computers. Amsterdam: Elsevier; 2017.

Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack CR, Jagust W, et al. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s disease Neuroimaging Initiative (ADNI). Alzheimers Dement. 2005;1(1):55–66.

Markram H. The blue brain project. Nat Rev Neurosci. 2006;7(2):153–60.

Glasser MF, Smith SM, Marcus DS, Andersson JL, Auerbach EJ, Behrens TE, et al. The human connectome project’s neuroimaging approach. Nat Neurosci. 2016;19(9):1175–87.

Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K, et al. The WU-Minn Human Connectome Project: an overview. Neuroimage. 2013;80:62–79.

Van Essen DC, Ugurbil K, Auerbach E, Barch D, Behrens TE, Bucholz R, et al. The human connectome project: a data acquisition perspective. Neuroimage. 2012;62(4):2222–31.

Jabalpurwala I. Brain Canada: one brain one community. Neuron. 2016;92(3):601–6.

Insel TR, Landis SC, Collins FS. Research priorities. NIH BRAIN Initiative Sci. 2013;340(6133):687–8.

Normile D. China’s big brain project is finally gathering steam. Science. 2022;377(6613):1368–9.

Jeong SJ, Lee H, Hur EM, Choe Y, Koo JW, Rah JC, et al. Korea brain initiative: integration and control of brain functions. Neuron. 2016;92(3):607–11.

Richards LR, Michie PT, Badcock DR, Bartlett PF, Bekkers JM, Bourne JA, Castles A, Egan GF, Fornito A, Hannan AJ, Hickie IB, Mattingley JB, Schofield PR. Australian Brain Alliance. Neuron. 2016;92(3):597–600.

Menard C, Siddiqui TJ, Sargin D, Lawson A, De Koninck Y, Illes J. The canadian brain research strategy: a focus on early career researchers. Can J Neurol Sci. 2022;49(2):168–70.

The LN. The international brain initiative: collaboration in progress. Lancet Neurol. 2021;20(12):969.

Ngai J. BRAIN 2.0: transforming neuroscience. Cell. 2022;185(1):4–8.

Appukuttan S, Bologna LL, Schurmann F, Migliore M, Davison AP. EBRAINS Live papers—interactive resource sheets for computational studies in neuroscience. Neuroinformatics. 2022. https://doi.org/10.1007/s12021-022-09598-z .

Young MP. The organization of neural systems in the primate cerebral cortex. Proc Biol Sci. 1993;252(1333):13–8.

Stephan KE, Kamper L, Bozkurt A, Burns GA, Young MP, Kotter R. Advanced database methodology for the collation of connectivity data on the macaque brain (CoCoMac). Philos Trans R Soc Lond B Biol Sci. 2001;356(1412):1159–86.

Bota M, Dong HW, Swanson LW. Combining collation and annotation efforts toward completion of the rat and mouse connectomes in BAMS. Front Neuroinform. 2012;6:2.

Modha DS, Singh R. Network architecture of the long-distance pathways in the macaque brain. Proc Natl Acad Sci USA. 2010;107(30):13485–90.

Bock DD, Lee WC, Kerlin AM, Andermann ML, Hood G, Wetzel AW, et al. Network anatomy and in vivo physiology of visual cortical neurons. Nature. 2011;471(7337):177–82.

Briggman KL, Helmstaedter M, Denk W. Wiring specificity in the direction-selectivity circuit of the retina. Nature. 2011;471(7337):183–8.

Harriger L, van den Heuvel MP, Sporns O. Rich club organization of macaque cerebral cortex and its role in network communication. PLoS ONE. 2012;7(9): e46497.

Jarrell TA, Wang Y, Bloniarz AE, Brittin CA, Xu M, Thomson JN, et al. The connectome of a decision-making neural network. Science. 2012;337(6093):437–44.

Takemura SY, Bharioke A, Lu Z, Nern A, Vitaladevuni S, Rivlin PK, et al. A visual motion detection circuit suggested by Drosophila connectomics. Nature. 2013;500(7461):175–81.

Markov NT, Ercsey-Ravasz MM, Ribeiro Gomes AR, Lamy C, Magrou L, Vezoli J, et al. A weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cereb Cortex. 2014;24(1):17–36.

Ingalhalikar M, Smith A, Parker D, Satterthwaite TD, Elliott MA, Ruparel K, et al. Sex differences in the structural connectome of the human brain. Proc Natl Acad Sci USA. 2014;111(2):823–8.

Deligianni F, Centeno M, Carmichael DW, Clayden JD. Relating resting-state fMRI and EEG whole-brain connectomes across frequency bands. Front Neurosci. 2014;8:258.

Bota M, Sporns O, Swanson LW. Architecture of the cerebral cortical association connectome underlying cognition. Proc Natl Acad Sci USA. 2015;112(16):E2093–101.

Ryan K, Lu Z, Meinertzhagen IA. The CNS connectome of a tadpole larva of Ciona intestinalis (L.) highlights sidedness in the brain of a chordate sibling. Elife. 2016. https://doi.org/10.7554/eLife.16962 .

Hildebrand DGC, Cicconet M, Torres RM, Choi W, Quan TM, Moon J, et al. Whole-brain serial-section electron microscopy in larval zebrafish. Nature. 2017;545(7654):345–9.

Vishwanathan A, Daie K, Ramirez AD, Lichtman JW, Aksay ERF, Seung HS. Electron microscopic reconstruction of functionally identified cells in a neural integrator. Curr Biol. 2017;27(14):2137–47.

Ardesch DJ, Scholtens LH, Li L, Preuss TM, Rilling JK, van den Heuvel MP. Evolutionary expansion of connectivity between multimodal association areas in the human brain compared with chimpanzees. Proc Natl Acad Sci USA. 2019;116(14):7101–6.

Ashaber M, Tomina Y, Kassraian P, Bushong EA, Kristan WB, Ellisman MH, et al. Anatomy and activity patterns in a multifunctional motor neuron and its surrounding circuits. Elife. 2021. https://doi.org/10.7554/eLife.61881 .

Scholl B, Thomas CI, Ryan MA, Kamasawa N, Fitzpatrick D. Cortical response selectivity derives from strength in numbers of synapses. Nature. 2021;590(7844):111–4.

Brittin CA, Cook SJ, Hall DH, Emmons SW, Cohen N. A multi-scale brain map derived from whole-brain volumetric reconstructions. Nature. 2021;591(7848):105–10.

Sorrentino P, Seguin C, Rucco R, Liparoti M, Troisi Lopez E, Bonavita S, et al. The structural connectome constrains fast brain dynamics. Elife. 2021. https://doi.org/10.7554/eLife.67400 .

Scholl B, Tepohl C, Ryan MA, Thomas CI, Kamasawa N, Fitzpatrick D. A binocular synaptic network supports interocular response alignment in visual cortical neurons. Neuron. 2022;110(9):1573–84.

Chen Z, Zhang R, Huo H, Liu P, Zhang C, Feng T. Functional connectome of human cerebellum. Neuroimage. 2022;251: 119015.

Rosenthal LS, Drake D, Alcalay RN, Babcock D, Bowman FD, Chen-Plotkin A, et al. The NINDS Parkinson’s disease biomarkers program. Mov Disord. 2016;31(6):915–23.

Ofori E, Du G, Babcock D, Huang X, Vaillancourt DE. Parkinson’s disease biomarkers program brain imaging repository. Neuroimage. 2016;124(Pt B):1120–4.

https://pdbp.ninds.nih.gov/ .

Cohen S, Bataille LR, Martig AK. Enabling breakthroughs in Parkinson’s disease with wearable technologies and big data analytics. Mhealth. 2016;2:20.

https://www.age-platform.eu/project/i-prognosis .

Hadjidimitriou S, Charisis V, Kyritsis K, Konstantinidis E, Delopoulos, A, Bamidis P, Bostantjopoulou S, Rizos A, Trivedi D, Chaudhuri R, Klingelhoefer L, Reichmann H, Wadoux J, De Craecker N, Karayiannis F, Fagerberg P, Ioakeimidis I, Stadtschnitzer M, Esser A, Grammalidis N, Dimitropoulos K, Dias SB, Diniz JA, da Silva HP, Lyberopoulos G, Theodoropoulou E, Hadjileontiadis LJ. Active and healthy ageing for Parkinson’s disease patients’ support: a user’s perspective within the i-PROGNOSIS framework. 1st International Conference on Technology and Innovation in Sports, Health and Wellbeing (TISHW). 2016. p. 1–8.

Hadjidimitriou SI, Charisis D, Hadjileontiadis LJ. On Capturing Older Adults’ Smartphone Keyboard Interaction as a Means for Behavioral Change Under Emotional Stimuli Within i-PROGNOSIS Framework. In: Antona M, Stephanidis C, editors. Universal Access in Human-Computer Interaction Design and Development Approaches and Methods. Cham: Springer International Publishing; 2017.

European Commission. Intelligent Parkinson eaRly detectiOn Guiding NOvel Supportive InterventionS [cited 2022. Available from: https://cordis.europa.eu/project/id/690494 ].

Suo X, Lei D, Li N, Cheng L, Chen F, Wang M, et al. Functional brain connectome and its relation to hoehn and yahr stage in Parkinson disease. Radiology. 2017;285(3):904–13.

SenthilarumugamVeilukandammal MN, S.; Ganapathysubramanian, B.; Anantharam, V.; Kanthasamy, A.; Willette, A.A., editor Big Data and Parkinson’s Disease: exploration, analyses, and data challenges. Proceedings of the 51st Hawaii International Conference on System Sciences; 2018.

Sreenivasan K, Mishra V, Bird C, Zhuang X, Yang Z, Cordes D, et al. Altered functional network topology correlates with clinical measures in very early-stage, drug-naive Parkinson’s disease. Parkinsonism Relat Disord. 2019;62:3–9.

Wu C, Nagel SJ, Agarwal R, Potter-Nerger M, Hamel W, Sharan AD, et al. Reduced risk of reoperations with modern deep brain stimulator systems: big data analysis from a united states claims database. Front Neurol. 2021;12: 785280.

Zhang H, Meng F, Li X, Ning Y, Cai M. Social listening—revealing Parkinson’s disease over day and night. BMC Neurol. 2021;21(1):2.

De Micco R, Agosta F, Basaia S, Siciliano M, Cividini C, Tedeschi G, et al. Functional connectomics and disease progression in drug-naive Parkinson’s disease patients. Mov Disord. 2021;36(7):1603–16.

Loh A, Boutet A, Germann J, Al-Fatly B, Elias GJB, Neudorfer C, et al. A functional connectome of Parkinson’s disease patients prior to deep brain stimulation: a tool for disease-specific connectivity analyses. Front Neurosci. 2022;16: 804125.

Kohno M, Okita K, Morales AM, Robertson CL, Dean AC, Ghahremani DG, et al. Midbrain functional connectivity and ventral striatal dopamine D2-type receptors: link to impulsivity in methamphetamine users. Mol Psychiatry. 2016;21(11):1554–60.

Ipser JC, Uhlmann A, Taylor P, Harvey BH, Wilson D, Stein DJ. Distinct intrinsic functional brain network abnormalities in methamphetamine-dependent patients with and without a history of psychosis. Addict Biol. 2018;23(1):347–58.

Lisdahl KM, Sher KJ, Conway KP, Gonzalez R, Feldstein Ewing SW, Nixon SJ, et al. Adolescent brain cognitive development (ABCD) study: Overview of substance use assessment methods. Dev Cogn Neurosci. 2018;32:80–96.

NIMH. ABCD Data Repository: NIMH; [2022]. Available from: https://nda.nih.gov/abcd/ .

Sun Y, Zhang Y, Zhang D, Chang S, Jing R, Yue W, et al. GABRA2 rs279858-linked variants are associated with disrupted structural connectome of reward circuits in heroin abusers. Transl Psychiatry. 2018;8(1):138.

Yip SW, Scheinost D, Potenza MN, Carroll KM. Connectome-based prediction of cocaine abstinence. Am J Psychiatry. 2019;176(2):156–64.

Young SD, Padwa H, Bonar EE. Social big data as a tool for understanding and predicting the impact of cannabis legalization. Front Public Health. 2019;7:274.

Segal Z, Radinsky K, Elad G, Marom G, Beladev M, Lewis M, et al. Development of a machine learning algorithm for early detection of opioid use disorder. Pharmacol Res Perspect. 2020;8(6): e00669.

Zhou H, Rentsch CT, Cheng Z, Kember RL, Nunez YZ, Sherva RM, et al. Association of OPRM1 functional coding variant with opioid use disorder: a genome-wide association study. JAMA Psychiat. 2020;77(10):1072–80.

Flores L, Young SD. Regional variation in discussion of opioids on social media. J Addict Dis. 2021;39(3):316–21.

Gelernter J, Polimanti R. Genetics of substance use disorders in the era of big data. Nat Rev Genet. 2021;22(11):712–29.

Liu S, Wang S, Zhang M, Xu Y, Shao Z, Chen L, et al. Brain responses to drug cues predict craving changes in abstinent heroin users: a preliminary study. Neuroimage. 2021;237: 118169.

Purushothaman V, Li J, Mackey TK. Detecting suicide and self-harm discussions among opioid substance users on instagram using machine learning. Front Psychiatry. 2021;12: 551296.

Rossetti MG, Patalay P, Mackey S, Allen NB, Batalla A, Bellani M, et al. Gender-related neuroanatomical differences in alcohol dependence: findings from the ENIGMA Addiction Working Group. NeuroImage Clinical. 2021;30: 102636.

Tretter F, Loeffler-Stastka H. How does the ‘environment’ come to the person? The ‘ecology of the person’ and addiction. World J Psychiatry. 2021;11(11):915–36.

Li Y, Cheng P, Liang L, Dong H, Liu H, Shen W, et al. Abnormal resting-state functional connectome in methamphetamine-dependent patients and its application in machine-learning-based classification. Front Neurosci. 2022;16:1014539.

Ottino-Gonzalez J, Uhlmann A, Hahn S, Cao Z, Cupertino RB, Schwab N, et al. White matter microstructure differences in individuals with dependence on cocaine, methamphetamine, and nicotine: findings from the ENIGMA-Addiction working group. Drug Alcohol Depend. 2022;230: 109185.

https://nda.nih.gov/abcd/ .

Kim CH, Chung CK, Park CS, Choi B, Kim MJ, Park BJ. Reoperation rate after surgery for lumbar herniated intervertebral disc disease: nationwide cohort study. Spine. 2013;38(7):581–90.

European Commission. Improvement in Postoperative PAIN OUTcome [cited 2022. Available from: https://cordis.europa.eu/project/id/223590/reporting ].

Pain-OUT. About Pain-OUT [cited 2022. Available from: http://pain-out.med.uni-jena.de/about-pain-out ].

Taghva A, Karst E, Underwood P. Clinical paresthesia atlas illustrates likelihood of coverage based on spinal cord stimulator electrode location. Neuromodulation. 2017;20(6):582–8.

Nijs J, Clark J, Malfliet A, Ickmans K, Voogt L, Don S, et al. In the spine or in the brain? Recent advances in pain neuroscience applied in the intervention for low back pain. Clin Exp Rheumatol. 2017;35(5):108–15.

Nomura ATG, de Abreu AM, Pruinelli L. Information model on pain management: an analysis of Big Data. J Nurs Scholarsh. 2021;53(3):270–7.

Min J, Osborne V, Kowalski A, Prosperi M. Reported adverse events with painkillers: data mining of the US food and drug administration adverse events reporting system. Drug Saf. 2018;41(3):313–20.

Bomberg H, Wetjen L, Wagenpfeil S, Schope J, Kessler P, Wulf H, et al. Risks and benefits of ultrasound, nerve stimulation, and their combination for guiding peripheral nerve blocks: a retrospective registry analysis. Anesth Analg. 2018;127(4):1035–43.

Kwon JW, Ha JW, Lee TS, Moon SH, Lee HM, Park Y. Comparison of the prevalence of low back pain and related spinal diseases among smokers and nonsmokers: using korean national health insurance database. Clin Orthop Surg. 2020;12(2):200–8.

Schnabel A, Yahiaoui-Doktor M, Meissner W, Zahn PK, Pogatzki-Zahn EM. Predicting poor postoperative acute pain outcome in adults: an international, multicentre database analysis of risk factors in 50,005 patients. Pain Rep. 2020;5(4): e831.

Yu Y, Cui L, Qian L, Lei M, Bao Q, Zeng Q, et al. Efficacy of perioperative intercostal analgesia via a multimodal analgesic regimen for chronic post-thoracotomy pain during postoperative follow-up: a big-data, intelligence platform-based analysis. J Pain Res. 2021;14:2021–8.

Huie JR, Ferguson AR, Kyritsis N, Pan JZ, Irvine KA, Nielson JL, et al. Machine intelligence identifies soluble TNFa as a therapeutic target for spinal cord injury. Sci Rep. 2021;11(1):3442.

Kringel D, Ultsch A, Zimmermann M, Jansen JP, Ilias W, Freynhagen R, et al. Emergent biomarker derived from next-generation sequencing to identify pain patients requiring uncommonly high opioid doses. Pharmacogenomics J. 2017;17(5):419–26.

Anis O, Kridin K, Cohen AD, Levmore M, Yaron S, Valdman-Grinshpoun Y, et al. Chronic spontaneous urticaria in patients with interstitial cystitis/bladder pain syndrome: insights from big data analyses. Urology. 2022. https://doi.org/10.1016/j.urology.2022.10.017 .

Download references

Acknowledgements

Not applicable.

Funding

Work reported in this publication was supported in part by the National Institutes of Health: NIA (Award Number R44AG055360), NIDDK (Award Number DK117710), NINDS (Award Numbers 1R44NS110237, R43NS113737, and R01NS125307), NIAMS (Award Number 1R44AR076885), and NIDA (Award Number 4R44DA049685). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Highland Instruments, Cambridge, MA, USA

Laura Dipietro & Timothy Wagner

Spaulding Rehabilitation/Neuromodulation Lab, Harvard Medical School, Cambridge, MA, USA

Paola Gonzalez-Mego

University of Illinois Chicago, Chicago, IL, USA

Ciro Ramos-Estebanez

Case Western Reserve University, Cleveland, OH, USA

Lauren Hana Zukowski & Rahul Mikkilineni

Boston University, Boston, MA, USA

Richard Jarrett Rushmore

Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA, USA

Timothy Wagner


Contributions

LD crafted the idea for the manuscript and wrote the initial manuscript. TW, PGM, JR, LHZ, and CR contributed to multiple sections of the manuscript. RM contributed to the ethics component. LD, TW, JR, PGM, LHZ, and RM helped compile and/or review the tabular material. TW, LHZ, and LD generated the graphics. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laura Dipietro.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

TW and LD are officers at Highland Instruments, a medical device company. They have patents pending or issued, personally or as officers in the company, related to imaging, brain stimulation, diagnostics, modeling, and simulation.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1. Sample of national projects that spurred on the big data revolution. Table S2. Sample of neurology and neuroscience databases. Table S3. Sample of connectome studies and evolving big data use. Table S4. Sample of PD "Big Data" studies. Table S5. Sample of SUD and OUD "Big Data" studies. Table S6. Sample of pain "Big Data" studies.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Dipietro, L., Gonzalez-Mego, P., Ramos-Estebanez, C. et al. The evolution of Big Data in neuroscience and neurology. J Big Data 10, 116 (2023). https://doi.org/10.1186/s40537-023-00751-2


Received: 28 January 2023

Accepted: 08 May 2023

Published: 10 July 2023

DOI: https://doi.org/10.1186/s40537-023-00751-2


Keywords

  • Neuroscience
  • Brain Stimulation
  • Alzheimer's

case study for data scientist

IMAGES

  1. The Life of a Data Scientist [Infographic]

    case study for data scientist

  2. What a Successful Data Scientist Needs to Know?

    case study for data scientist

  3. How To Become A Data Scientist- Step By Step Guide to Follow

    case study for data scientist

  4. Your 12-Step Guide on How to Become a Data Scientist

    case study for data scientist

  5. Data in Action: 7 Data Science Case Studies Worth Reading

    case study for data scientist

  6. Anatomy Of A Data Scientist Infographic

    case study for data scientist

VIDEO

  1. Data Science Research Showcase

  2. (Mastering JMP) Visualizing and Exploring Data

  3. Common Mistakes Data Science and Engineering Candidates Make in Interviews

  4. Data Science Interview

  5. Difference between Data Analytics and Data Science . #shorts #short

  6. Lecture_55: Capstone Project on Data Analysis and Visualizations

COMMENTS

  1. 10 Real-World Data Science Case Studies Worth Reading

    Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives. These case studies reflect the complexities data ...

  2. 10 Real World Data Science Case Studies Projects with Example

    A case study in data science is an in-depth analysis of a real-world problem using data-driven approaches. It involves collecting, cleaning, and analyzing data to extract insights and solve challenges, offering practical insights into how data science techniques can address complex issues across various industries.

  3. Data in Action: 7 Data Science Case Studies Worth Reading

    Data science case studies highlight the work done by practitioners, and they can be used to educate new and existing data scientists on how to approach problems. Case studies also help companies determine which type of data science teams they should create and how those teams should be structured. By providing valuable information about what ...

  4. Top 12 Data Science Case Studies: Across Various Industries

    Examples of Data Science Case Studies. Hospitality: Airbnb focuses on growth by analyzing customer voice using data science. Qantas uses predictive analytics to mitigate losses. Healthcare: Novo Nordisk is Driving innovation with NLP. AstraZeneca harnesses data for innovation in medicine. Covid 19: Johnson and Johnson uses data science to fight ...

  5. Data Science Case Studies: Solved and Explained

    Feb 21, 2021. 1. Solving a Data Science case study means analyzing and solving a problem statement intensively. Solving case studies will help you show unique and amazing data science use cases in ...

  6. Case Studies

    Optimizing deep learning trading bots using state-of-the-art techniques: the author (Adam King, Jun 4, 2019) teaches deep RL agents to make even more money using feature engineering and Bayesian optimization, one of several data science and machine learning case studies from this Medium publication.

  7. Case Study: Applying a Data Science Process Model to a Real-World

    Through this case study, we demonstrated how data science techniques can be used to develop predictive models that help businesses make informed decisions, e.g., in the supply chain. While this case study focuses on demand planning, the process model can be used in other ways, such as building personalized recommendations in e-commerce. (A toy demand-forecasting sketch appears after this list.)

  8. 6 of my favorite case studies in Data Science!

    The goal of this case study is to share some of the high-level issues Airbnb encountered as it grew, and details about how its data science team solved them. To do so, the author (Airbnb's first data scientist) breaks things down into three areas, including how Airbnb characterizes data science and how data science is involved in decision-making at Airbnb.

  9. 7 Case Studies of Data Science and ML

    In this blog, the author explores 7 inspiring case studies of how data science and machine learning are used at well-known companies to achieve remarkable results, alongside a guide to machine learning with Explainable AI (XAI) and Python, a way to build ML models that are explainable, interpretable, and trustworthy.

  10. Case Study

    Writing tagged "Case Study" in Towards Data Science, a Medium publication sharing concepts, ideas, and code.

  11. Doing Data Science: A Framework and Case Study

    A data science framework has emerged and is presented in the remainder of this article, along with a case study to illustrate the steps. This framework warrants refining scientific practices around data ethics and data acumen (literacy); a short discussion of these topics concludes the article.

  12. Data Science Interview Case Studies: How to Prepare and Excel

    Excelling in data science interview case studies requires a combination of technical proficiency, analytical thinking, and effective communication. By mastering case study preparation and problem-solving, you can showcase your data science skills and secure coveted job opportunities in the field.

  13. Data Science Case Study Interview: Your Guide to Success

    This section discusses what you can expect during the interview process and how to approach case study questions. Step 1, the problem statement: you'll be presented with a problem or scenario, either a hypothetical situation or a real-world challenge, emphasizing the need for data-driven solutions.

  14. Top Data Science Case Studies For Inspiration

    A data science case study is a practical business problem on which data scientists work to develop machine learning or deep learning algorithms and programs that lead to an optimal solution. Working on one involves analyzing and solving that problem.

  15. Data science case interviews (what to expect & how to prepare)

    Approaching data science cases with a repeatable framework not only adds structure to your answer but also helps you manage your time and think clearly under the stress of interview conditions. The guide walks through a framework you can use in interviews, then breaks it down with an example.

  16. Open Case Studies: Statistics and Data Science Education through Real-World Applications

    Keywords: applied statistics, data science, statistical thinking, case studies, education, computing. A major challenge in the practice of teaching data science and statistics is the limited availability of courses and course materials that provide meaningful opportunities for students to practice and apply statistical thinking.

  17. 20+ Data Science Case Study Interview Questions (with Solutions)

    Product Case Studies - This type of case study tackles a specific product or feature offering, often tied to the interviewing company. Interviewers are generally looking for business sense geared toward product metrics. Data Analytics Case Study Questions - Data analytics case studies ask you to propose possible metrics to investigate an analytics problem.

  18. Problem Solving as Data Scientist: a Case Study

    So the author decomposes the problem into two steps: (1) calculate the purchase likelihood of a customer given the vehicle, P(C|V); (2) based on that likelihood, attribute each car to the most likely customer in the batch, then work out a solution for each step. (A small sketch of this attribution logic appears after this list.)

  19. Data science approaches to confronting the COVID-19 pandemic: a narrative review

    The use of data science methodologies in medicine and public health has been enabled by the wide availability of big data on human mobility, contact tracing, medical imaging, virology, drug screening, bioinformatics, electronic health records, and scientific literature, along with ever-growing computing power [1-4].

  20. Top 10 Data Science Case Study Interview Questions for 2024

    A data science case study is an in-depth, detailed examination of a particular case (or cases) within a real-world context: a business problem that you would work on as a data scientist, building machine learning or deep learning algorithms and programs to construct an optimal solution to that problem.

  21. The evolution of Big Data in neuroscience and neurology

    Neurological diseases are on the rise worldwide, leading to increased healthcare costs and diminished quality of life in patients. In recent years, Big Data has started to transform the fields of Neuroscience and Neurology. Scientists and clinicians are collaborating in global alliances, combining diverse datasets on a massive scale, and solving complex computational problems.

  22. Biodiversity reporting: standardization, materiality, and assurance

    Reporting data on biodiversity-related impacts and dependencies of company operations, and on financial risks and opportunities. ... As a real-world case study, the article considers Forico, a forest management company operating in Tasmania. ... Science, 366 (2019), Article eaax3100.

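The collect-clean-analyze workflow described in item 2 can be made concrete in a few lines of Python. This is a minimal sketch, not code from any of the cited case studies; the file name (orders.csv) and column names (region, revenue) are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical input: one row per order, with 'region' and 'revenue' columns.
df = pd.read_csv("orders.csv")

# Clean: drop incomplete rows and obviously invalid revenue values.
df = df.dropna(subset=["region", "revenue"])
df = df[df["revenue"] > 0]

# Analyze: aggregate revenue by region to surface a simple business insight.
summary = (
    df.groupby("region")["revenue"]
      .agg(["count", "mean", "sum"])
      .sort_values("sum", ascending=False)
)
print(summary)
```

Real projects add steps around this core, such as validating data quality before analysis and framing the aggregation around an actual business question, but the collect-clean-analyze skeleton stays the same.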
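Item 7's demand-planning case builds a predictive model from historical demand. One common baseline is a tree-ensemble regressor on lagged demand features; the sketch below uses synthetic weekly demand, so the seasonality, lag choices, and model are illustrative assumptions rather than that paper's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic weekly demand with yearly seasonality plus noise (stand-in data).
rng = np.random.default_rng(0)
weeks = 200
demand = 100 + 10 * np.sin(np.arange(weeks) * 2 * np.pi / 52) + rng.normal(0, 5, weeks)
df = pd.DataFrame({"demand": demand})

# Feature engineering: lagged demand values as predictors.
for lag in (1, 2, 52):
    df[f"lag_{lag}"] = df["demand"].shift(lag)
df = df.dropna()

# Chronological split: train on the past, evaluate on the most recent weeks.
split = int(len(df) * 0.8)
X, y = df.drop(columns="demand"), df["demand"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X.iloc[:split], y.iloc[:split])
print("MAE:", mean_absolute_error(y.iloc[split:], model.predict(X.iloc[split:])))
```

The chronological split matters here: shuffling time-series data before splitting would leak future information into training and overstate accuracy.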
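Finally, item 18's two-step decomposition, score P(C|V) and then attribute each car to the most likely customer, reduces to an argmax over a likelihood matrix. The matrix below is random stand-in data; in the original post the scores would come from a trained purchase-likelihood model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_customers, n_vehicles = 5, 3

# Step 1 (stand-in): purchase likelihood P(C|V) for each customer-vehicle pair.
# Rows are customers, columns are vehicles; real scores would come from a model.
likelihood = rng.random((n_customers, n_vehicles))
likelihood /= likelihood.sum(axis=0, keepdims=True)  # normalize per vehicle

# Step 2: attribute each vehicle to the most likely customer in the batch.
assignment = likelihood.argmax(axis=0)
for v, c in enumerate(assignment):
    print(f"vehicle {v} -> customer {c} (P = {likelihood[c, v]:.2f})")
```

Note that a plain argmax can attribute several vehicles to the same customer; if each customer should receive at most one car, a one-to-one variant could instead solve the assignment with scipy.optimize.linear_sum_assignment on the likelihood matrix.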