Winning Space Race with Data Science

November 1, 2022

This is the presentation for the capstone project of the IBM Data Science Professional Certificate.

Note that this presentation is much more detailed and technical than the high-level, abstracted presentations typically prepared for executive teams.

I assume the role of a data scientist working for a startup intending to compete with SpaceX, and in the process follow the data science methodology: data collection, data wrangling, exploratory data analysis, data visualization, model development, model evaluation, and reporting results to stakeholders.

In this capstone, we predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 launches on its website at a cost of 62 million dollars, while other providers charge upward of 165 million dollars each; much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch. This information can be used by an alternate company that wants to bid against SpaceX for a rocket launch.
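
The bidding logic above can be sketched as a toy cost model. Only the 62- and 165-million-dollar figures come from the text; the expected-cost formula and the break-even bid rule are illustrative assumptions, not part of the capstone's actual analysis.

```python
# Hypothetical back-of-envelope model (an assumption, not the project's method):
# if the first stage lands with probability p, estimate SpaceX's expected
# launch cost and decide whether a competitor's bid can undercut it.

def expected_launch_cost(p_landing, reusable_cost=62.0, expendable_cost=165.0):
    """Expected cost in $M: reusable-class price when the stage lands,
    expendable-class price when it does not (illustrative assumption)."""
    return p_landing * reusable_cost + (1 - p_landing) * expendable_cost

def is_competitive_bid(bid, p_landing):
    """In this toy model, a bid is competitive if it undercuts
    SpaceX's expected cost at the given landing probability."""
    return bid < expected_launch_cost(p_landing)
```

For example, at a 90% landing probability the expected cost is 0.9 × 62 + 0.1 × 165 = 72.3 million dollars, so a 70-million-dollar bid would be competitive under these assumptions while a 100-million-dollar bid would not.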

Executive Summary

Introduction

Methodology

Data collection API notebook

Web scraping notebook

Data wrangling notebook

EDA with Visualization notebook

EDA with SQL notebook

Launch Sites Locations Analysis with Folium notebook

Interactive Dashboard with Plotly Dash

Machine Learning Prediction notebook

Insights Drawn from EDA

Launch Sites Proximities Analysis

Build a Dashboard with Plotly Dash

Predictive Analysis (Classification)

Conclusions

For notebooks, datasets and scripts, follow this GitHub repository link: Applied Data Science Capstone


Data Science Capstone 2022

Starting the Project

About the Project ¶

The main purpose of this project is to create a recommender system that recommends properties to users based on previous user interactions and property features. In particular, our focus is recommendations to new users, in addition to existing users. We will explore and compare various approaches to recommender systems and apply them to our data. Our final goal is to integrate our model into the Santa Barbara property website created by UC Santa Barbara's AppFolio-sponsored computer science capstone team.

This project showcases the power of machine learning on the user interaction data collected from AppFolio’s property management software. AppFolio will be able to analyze the results from our recommender system to assess the effect of utilizing interaction data to better their products.

About the Data ¶

Our dataset is derived from LISA, AppFolio's award-winning AI leasing assistant chatbot, which is designed for property companies to streamline the process of receiving inquiries about properties from users. Our data is divided into three components: user ratings, item features, and images. It consists of over four million users, ten thousand properties, and five million user interactions.

What have we done so far ¶

We are currently working in teams to clean the data and test different recommender system models.

For data cleaning, we've communicated with the sponsor about what the different attributes in the data represent and how each can be used as a feature in our model. We've mapped the current INTERACTION_TYPE values, or ratings, to numbers so they can be used in the experimentation library we are using. In addition, we will soon begin experimenting with different rating ranges (e.g., 1-4 or 0-3).
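
The encoding step described above can be sketched as a simple mapping. The interaction-type names and the 1-4 scale below are hypothetical stand-ins, not AppFolio's actual INTERACTION_TYPE values.

```python
# Hypothetical interaction categories (stand-ins for the real
# INTERACTION_TYPE values) mapped to an ordinal rating scale.
INTERACTION_TO_RATING = {
    "viewed_listing": 1,
    "saved_listing": 2,
    "messaged_agent": 3,
    "scheduled_tour": 4,
}

def encode_interactions(rows, offset=0):
    """Map categorical interaction types to numeric ratings.
    `offset` lets us try the different rating ranges mentioned above
    (e.g. offset=-1 turns the 1-4 scale into 0-3)."""
    return [
        (user, prop, INTERACTION_TO_RATING[kind] + offset)
        for user, prop, kind in rows
    ]

rows = [("u1", "p9", "viewed_listing"), ("u2", "p9", "scheduled_tour")]
```

With `offset=-1`, the same rows come back on a 0-3 scale, which is how the two rating ranges mentioned above could be compared without touching the raw data.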

For model testing, we've taken the time to understand the libraries and recommender-system concepts we will be using. Our models are tested using Cornac, which makes it easy to test and experiment with multiple multimodal recommender systems. In addition, this team is focused on applying the Bilateral Variational Autoencoder recommender system to our final project. Right now, we are building a simple model on a smaller dataset with no images.

We are currently meeting weekly with AppFolio's CS capstone team to help them with the current recommendations on their website.

Obstacles ¶

Data Struggles ¶

Our data consisted of uninformative labels and unintuitive value names, which raised questions to research and ask our sponsors about. We created a documentation page for the entire dataset so we can use it for quick reference and help future collaborators understand our data. The item-features dataset includes many null observations, and we are unsure how best to deal with them. One option is to drop the observations that contain null values, increasing our precision on the data we do include, but we would then miss out on valuable information about recommending the properties we left out. The other option is to work around the null observations, perhaps by deriving default values for them.
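
The two options can be sketched on a toy item-features table; the field names and default values here are hypothetical, not the project's actual schema.

```python
# Two null-handling strategies from the paragraph above, on a toy
# "item features" table (field names are hypothetical).

def drop_nulls(items):
    """Option 1: keep only fully observed rows (loses those properties)."""
    return [it for it in items if all(v is not None for v in it.values())]

def fill_defaults(items, defaults):
    """Option 2: substitute a derived default for each missing value."""
    return [
        {k: (defaults[k] if v is None else v) for k, v in it.items()}
        for it in items
    ]

items = [
    {"bedrooms": 2, "sqft": 900},
    {"bedrooms": None, "sqft": 1200},
]
defaults = {"bedrooms": 1, "sqft": 1000}
```

The trade-off in the text is visible here: `drop_nulls` discards the second property entirely, while `fill_defaults` keeps it at the cost of an imputed bedroom count.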

Model Complexity ¶


It has been a learning curve to understand how recommender systems work. Our chosen model consists of four encoders and one decoder to make item recommendations, and it has been a puzzle deciphering the research paper for our model and getting it to train. Lots of data and a complex model make training time increase tremendously: it currently takes over thirty minutes for one epoch, and our model needs a considerable number of epochs to be accurate. We talked with our sponsors and set up a notebook on Google Cloud that enables us to use GPUs to train more efficiently.

Future Tasks ¶

Continue learning about the Bilateral Variational Autoencoder

Clean the larger dataset that includes the properties of the CS capstone team’s database

Consider other ways to deal with missing data without removing it

Find default values to substitute for missing data

Update our model to suit the CS capstone team's website

Figure out a way to map between our dataset and the CS capstone team’s dataset

Implement our prediction model on the CS capstone team's website


For next quarter:

Upgrade our model with complex features including:

Latitude and longitude of property

Creating a Simple Model


Environmental Monitoring, remote sensing, cyber-physical systems, Engineers for Exploration

E4E Microfaune project.

  • Group members: Jinsong Yang, Qiaochen Sun

Abstract: Nowadays, human activities such as wildfires and hunting have become the largest factors with serious negative effects on biodiversity. To understand deeply how anthropogenic activities affect wildlife populations, field biologists utilize automated image classification driven by neural networks to extract relevant biodiversity information from images. However, for small animals such as insects or birds, cameras do not work well: it is extremely hard for them to capture the movement and activities of such small animals. To solve this problem, passive acoustic monitoring (PAM) has become one of the most popular methods. We can use the sounds collected through PAM to train machine learning models that tell us about fluctuations in the biodiversity of these small animals. The goal of the whole program is to assess the biodiversity of these small animals (most of them birds), but the program can be divided into many small parts. Jinsong and I focus on an intermediate step: generating subsets of audio recordings that have a higher probability of containing a vocalization of interest, which helps our labeling volunteers save time and energy. This reduces the time and resources required to obtain enough training data for species-level classifiers. We perform the same procedure as AID_NeurIPS_2021; only the data differs between the two GitHub repositories. Here, we use the Peru data instead of the Coastal_Reserve data.
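
The subset-generation step this abstract describes can be sketched as a simple filter over detector scores. The clip names, probabilities, and threshold below are illustrative assumptions, not the project's actual pipeline.

```python
# Given per-clip probabilities of a vocalization (from some upstream
# detector), keep only the clips worth a labeling volunteer's time.

def select_clips(scored_clips, threshold=0.5, max_clips=None):
    """Return clip ids whose vocalization probability exceeds `threshold`,
    highest-scoring first, optionally capped at `max_clips`."""
    keep = sorted(
        (c for c in scored_clips if c[1] > threshold),
        key=lambda c: c[1],
        reverse=True,
    )
    if max_clips is not None:
        keep = keep[:max_clips]
    return [clip_id for clip_id, _ in keep]

clips = [("a.wav", 0.91), ("b.wav", 0.12), ("c.wav", 0.64)]
```

Raising the threshold or lowering `max_clips` trades recall of vocalizations for labeling effort, which is exactly the time-and-resources trade-off the abstract mentions.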

  • Group members: Harsha Jagarlamudi, Kelly Kong

Eco-Acoustic Event Detection: Classifying temporal presence of birds in recorded bird vocalization audio

  • Group members: Alan Arce, Edmundo Zamora

Abstract: Leveraging deep learning methods to classify the temporal presence of birds in recorded bird vocalization audio, using a hybrid CNN-RNN model trained on audio data, in the interest of benefiting wildlife monitoring and preservation.

Pyrenote - User Profile Design & Accessible Data

  • Group members: Dylan Nelson

Abstract: Pyrenote is a project in development by a growing group of student researchers here at UCSD. Its primary purpose is to allow anyone to contribute to research by labeling data in an intuitive and accessible way. It is currently being used to develop a form of voice recognition for birds: the goal is an algorithm that can strongly label data (say, where in a clip a bird is calling and what bird is making the call). To do this, a very large labeled dataset is needed. I worked mostly on the user-experience side, allowing users to interact with their labeling in new ways, such as keeping tabs on their progress and reaching goals. Developing a user profile page was the primary means of surfacing this data, and it was developed iteratively as a whole new page for the site.

Pyrenote Web Developer

  • Group members: Wesley Zhen

Abstract: The website, Pyrenote, is helping scientists track bird populations by identifying them using machine learning classifiers on publicly annotated audio recordings. I have implemented three features over the course of two academic quarters aimed at streamlining user experience and improving scalability. The added scalability will be useful for future projects as we start becoming more ambitious with the number of users we bring to the site.

Spread of Misinformation Online

Who is spreading misinformation and worries on Twitter.

  • Group members: Lehan Li, Ruojia Tao

Abstract: The spread of misinformation over social media poses challenges to daily information intake and exchange. Especially under the current COVID-19 pandemic, misinformation regarding COVID-19 disease and vaccination poses threats to individuals' well-being and general public health, and people's worries also increase with misinformation such as claimed shortages of food and water. This project seeks to investigate the spread of misinformation over social media (Twitter) during the COVID-19 pandemic. Two main directions are investigated. The first is the analysis of the effect of bot users on the spread of misinformation: what role do bot users play in spreading misinformation, and where are they located in the social network? The second is sentiment analysis examining users' attitudes toward misinformation: we want to see how sentiment spreads through different places in the social network. We also mix the two directions: what is the relationship between bot users and positive and negative emotions? Since online social media users form social networks, the project also investigates the effect of the social network on the two topics above. Moreover, the project explores changes in the proportion of bot users and in users' attitudes toward misinformation as the social network becomes more concentrated and tightly connected.

Misinformation on Reddit

  • Group members: Samuel Huang, David Aminifard

Abstract: As social media has grown in popularity, notably Reddit, its use for rapidly sharing information organized by categories or topics (subreddits) has had massive implications for how people are exposed to information and the quality of the information they interact with. While Reddit has its benefits, e.g., providing instant access to nearly real-time, categorized information, it has possibly played a role in worsening divisions and the spread of misinformation. Our results showed that subreddits with the highest proportions of misinformation posts tend to lean more toward politics and news. In addition, we found that regardless of the frequency of misinformation per subreddit, the average upvote ratio per submission seemed consistently high, which indicates that subreddits tend to be ideologically homogeneous.
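
The two per-subreddit statistics this abstract reports (misinformation share and mean upvote ratio) can be sketched as a small aggregation. The post records below are hypothetical, not the study's data.

```python
from collections import defaultdict

def subreddit_stats(posts):
    """posts: (subreddit, is_misinfo, upvote_ratio) tuples.
    Returns {subreddit: (misinfo_share, mean_upvote_ratio)}."""
    grouped = defaultdict(list)
    for sub, is_misinfo, ratio in posts:
        grouped[sub].append((is_misinfo, ratio))
    return {
        sub: (
            sum(m for m, _ in rows) / len(rows),   # fraction flagged misinfo
            sum(r for _, r in rows) / len(rows),   # mean upvote ratio
        )
        for sub, rows in grouped.items()
    }

posts = [
    ("news", True, 0.95), ("news", False, 0.90),
    ("aww", False, 0.99),
]
```

The abstract's finding corresponds to the pattern where `misinfo_share` varies widely across subreddits while `mean_upvote_ratio` stays uniformly high.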

The Spread of YouTube Misinformation Through Twitter

  • Group members: Alisha Sehgal, Anamika Gupta

Abstract: In our Capstone Project, we explore the spread of misinformation online. More specifically, we look at the spread of misinformation across Twitter and YouTube because of the large role these two social media platforms play in the dissemination of news and information. Our main objectives are to understand how YouTube videos contribute to spreading misinformation on Twitter, evaluate how effectively YouTube is removing misinformation and if these policies also prevent users from engaging with misinformation. We take a novel approach of analyzing tweets, YouTube video captions, and other metadata using NLP to determine the presence of misinformation and investigate how individuals interact or spread misinformation. Our research focuses on the domain of public health as this is the subject of many conspiracies, varying opinions, and fake news.

Particle Physics

Understanding Higgs boson particle jets with graph neural networks.

  • Group members: Charul Sharma, Rui Lu, Bryan Ambriz

Abstract: Extending last quarter's work on deep sets neural networks, a fully connected neural network classifier, an adversarial deep sets model, and a designed decorrelated tagger (DDT), we went further this quarter by trying different neural network layers such as GENConv and EdgeConv. GENConv and EdgeConv play incredibly important roles here in boosting the performance of our basic GNN model. We also evaluated the performance of our model using ROC (receiver operating characteristic) curves and the AUC (area under the curve). Meanwhile, based on experience from project one and past projects in the particle physics domain, we decided to add one more section, exploratory data analysis, to establish basic theory, bootstrapping, and common-sense understanding of our dataset. We have not yet produced all the optimal outcomes, although we finished the EdgeConv part; in the following weeks we would like to finish GENConv and may try other layers to find further potential to increase the performance of our model.

Predicting a Particle's True Mass

  • Group members: Jayden Lee, Dan Ngo, Isac Lee

Abstract: The Large Hadron Collider (LHC) collides protons traveling near light speed to generate high-energy collisions. These collisions produce new particles and have led to the discovery of new elementary particles (e.g., the Higgs boson). One key piece of information to collect from a collision event is the structure of the particle jet, a collimated spray of decaying particles that travel in the same direction; accurately identifying the type of these jets - QCD or signal - plays a crucial role in the discovery of high-energy elementary particles like the Higgs. Several properties determine jet type, with jet mass being one of the strongest indicators in jet type classification. A previous approach to jet mass estimation, called "soft drop declustering," has been one of the most effective methods for making rough estimates of jet mass. With this in mind, we aim to apply machine learning to jet mass estimation through various neural network architectures. With data collected and processed by CERN, we implemented a model capable of improving jet mass prediction using jet features.

Mathematical Signal Processing (compression of deep nets, or optimization for data-science/ML)

Graph neural networks, graph neural network-based recommender systems for Spotify playlists.

  • Group members: Benjamin Becze, Jiayun Wang, Shone Patil

Abstract: With the rise of music streaming services on the internet in the 2010s, many have moved away from radio stations to streaming services like Spotify and Apple Music. This shift offers more specificity and personalization in users' listening experiences, especially with the ability to create playlists of whatever songs they wish. Oftentimes user playlists share a genre or theme across songs, and some streaming services like Spotify offer recommendations to expand a user's existing playlist based on the songs in it. Using the Node2vec and GraphSAGE graph neural network methods, we set out to create a recommender system for songs to add to an existing playlist by drawing information from a vast graph of songs we built from playlist co-occurrences. The result is a personalized song recommender based not only on Spotify's community of playlist creators, but also the specific features within a song.
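
The playlist co-occurrence graph described above can be sketched as follows; the playlists are toy data, and methods like Node2vec or GraphSAGE would then be run on a graph of this shape.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(playlists):
    """Songs are nodes; an edge's weight counts how many playlists
    contain both songs. Returns {frozenset({a, b}): count}."""
    edges = Counter()
    for playlist in playlists:
        # sorted() makes pair enumeration deterministic; set() ignores
        # duplicate tracks within one playlist.
        for a, b in combinations(sorted(set(playlist)), 2):
            edges[frozenset((a, b))] += 1
    return edges

playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_a", "song_b"],
]
```

Edge weights then serve as similarity signals: songs that co-occur in many playlists end up close together in the learned embedding space.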

Dynamic Stock Industry Classification

  • Group members: Sheng Yang

Abstract: We use graph-based analysis to re-classify stocks in the China A-share market and improve Markowitz portfolio optimization.

NLP, Misinformation

HDSI faculty exploration tool.

  • Group members: Martha Yanez, Sijie Liu, Siddhi Patel, Brian Qian

Abstract: The Halıcıoğlu Data Science Institute (HDSI) at the University of California, San Diego is dedicated to the discovery of new methods and the training of students and faculty to use data science to solve problems in the current world. The HDSI has several industry partners that are often searching for assistance with their daily activities and need experts in different domain areas. Currently, around 55 professors are affiliated with HDSI. They all have diverse research interests and have written numerous papers in their own fields. Our goal was to create a tool that allows HDSI to select the best fit from its faculty, based on their published work, to aid industry partners in their specific endeavors. We did this with Natural Language Processing (NLP), collecting the abstracts of the faculty's published work and organizing them by topic. We then obtained the proportion of each faculty member's papers associated with each topic and drew a relationship between researchers and their most-published topics. This allows HDSI to personalize recommendations of faculty candidates for an industry partner's particular job.
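
The final aggregation step above can be sketched as follows, assuming each abstract has already been assigned a topic by the NLP pipeline; the faculty and topic names are hypothetical.

```python
from collections import Counter

def topic_proportions(papers):
    """papers: (faculty, topic) pairs -> {faculty: {topic: share}}."""
    by_faculty = {}
    for faculty, topic in papers:
        by_faculty.setdefault(faculty, Counter())[topic] += 1
    return {
        faculty: {t: n / sum(counts.values()) for t, n in counts.items()}
        for faculty, counts in by_faculty.items()
    }

papers = [("prof_a", "nlp"), ("prof_a", "nlp"), ("prof_a", "vision"),
          ("prof_b", "ml_theory")]
```

Matching a partner's request then reduces to ranking faculty by their share of the requested topic.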

  • Group members: Du Xiang

AI in Healthcare, Deep Reinforcement Learning, Trustworthy Machine Learning

Improving robustness in deep fusion modeling against adversarial attacks.

  • Group members: Ayush More, Amy Nguyen

Abstract: Autonomous vehicles rely heavily on deep fusion modeling, which utilizes multiple inputs for its inferences and decision making. By using data from these inputs, the deep fusion model benefits from shared information, which primarily contributes robustness, as the input sources can face different levels of corruption. Thus, it is highly important that the deep fusion models used in autonomous vehicles are robust to corruption, especially in input sources that are weighted more heavily under different conditions. We explore a different approach to training the robustness of a deep fusion model through adversarial training. We fine-tune the model on adversarial examples and evaluate its robustness against single-source noise and other forms of corruption. Our experimental results show that adversarial training was effective in improving the robustness of a deep fusion object detector against adversarial noise and Gaussian noise while maintaining performance on clean data. The results also highlight the lack of robustness of models that are not trained to handle adversarial examples. We believe this is relevant given the risks that autonomous vehicles pose to pedestrians: it is important to ensure the inferences and decisions made by the model are robust against corruption, especially when it is intentional and comes from outside threats.

Healthcare: Adversarial Defense In Medical Deep Learning Systems

  • Group members: Rakesh Senthilvelan, Madeline Tjoa

Abstract: To combat such adversarial instances, models need robust training to best protect against the methods these attacks use on deep learning systems. In the scope of this paper, we look into the fast gradient sign method and projected gradient descent, two methods used in adversarial attacks to maximize loss functions and cause the affected system to make opposing predictions, in order to train our models against them and allow for stronger accuracy when faced with adversarial examples.

Satellite image analysis

ML for finance, ML for healthcare, fair ML, ML for science, actionable recourse.

  • Group members: Shweta Kumar, Trevor Tuttle, Takashi Yabuta, Mizuki Kadowaki, Jeffrey Feng

Abstract: In American society today there is a constant encouraged reliance on credit, despite it not being available to everyone as a legal right. Currently, countless methods of evaluating an individual's creditworthiness are in practice. In an effort to regulate the selection criteria of different financial institutions, the Equal Credit Opportunity Act (ECOA) requires that applicants denied a loan receive an Adverse Action notice, a statement from the creditor explaining the reason for the denial. However, these adverse action notices are frequently unactionable and ineffective in providing feedback that gives an individual recourse, i.e., the ability to act upon a reason for denial to raise one's odds of being accepted for a loan. In our project, we explore whether it is possible to create an interactive interface that personalizes adverse action notices in alignment with personal preferences so that individuals can gain recourse.

Social media; online communities; text analysis; ethics

Finding commonalities in misinformative articles across topics.

  • Group members: Hwang Yu, Maximilian Halvax, Lucas Nguyen

Abstract: To combat the large-scale distribution of misinformation online, we wanted to develop a way to flag news articles that are misinformative and could potentially mislead the general public. In addition to flagging news articles, we also wanted to find commonalities among the misinformation we found. Did some topics in particular contain more misleading information than others? How much overlap do these articles have when we break their content down with TF-IDF and see which words carry the most importance when put into various misinformation-detection models? We narrowed our models down to four topics: economics, politics, science, and general, the last being a dataset encompassing the three previous topics. We found that general included the most overlap overall, while the specific topics, though mostly distinct from one another, had models that still put emphasis on similar words, indicating a possible pattern of misinformative language in these articles. We believe, from these results, that we can find a pattern that could direct further investigation into how misinformation is written and distributed online.
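
The TF-IDF representation the abstract leans on can be sketched from scratch; the two toy "articles" below are illustrative stand-ins for the study's corpora.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight}.
    Plain tf * log(N/df), with no smoothing."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return out

docs = [
    "markets crash amid fake cure claims".split(),
    "markets rally after earnings".split(),
]
```

Terms appearing in every document (here "markets") get zero weight, while corpus-distinctive terms (here "fake") get positive weight, which is what lets a downstream model key on topic-specific misinformative language.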

The Effect of Twitter Cancel Culture on the Music Industry

  • Group members: Peter Wu, Nikitha Gopal, Abigail Velasquez

Abstract: Musicians often trend on social media for various reasons, but in recent years there has been a rise in musicians being "canceled" for committing offensive or socially unacceptable behavior. Due to the wide accessibility of social media, the masses are able to hold musicians accountable for their actions through "cancel culture," a form of modern ostracism. Twitter has become a well-known platform for "cancel culture" as users can easily spread hashtags and see what's trending, which also has the potential to facilitate the spread of toxicity. We analyze how public sentiment towards canceled musicians on Twitter changes with respect to the type of issue they were canceled for, their background, and the strength of their parasocial relationship with their fans. Through our research, we aim to determine whether "cancel culture" leads to an increase in toxicity and negative sentiment towards a canceled individual.

Analyzing single cell multimodality data via (coupled) autoencoder neural networks

Coupled autoencoders for single-cell data analysis.

  • Group members: Alex Nguyen, Brian Vi

Abstract: Historically, analysis of single-cell data has been difficult to perform, because data collection methods often destroy the cell in the process of collecting information. However, an ongoing endeavor of biological data science has recently been to analyze different modalities, or forms, of the genetic information within a cell. Doing so will give modern medicine a greater understanding of cellular functions and how cells behave in the context of illness. Information on the three modalities of DNA, RNA, and protein can now be collected safely, and because they are known to be the same information in different forms, analysis done on one can be extrapolated to understand the cell as a whole. Previous research by Gala, R., Budzillo, A., Baftizadeh, F. et al. captured gene expression in neuron cells with a neural network called a coupled autoencoder. This autoencoder framework is able to reconstruct its inputs, allowing prediction from one input to another, as well as align the multiple inputs in the same low-dimensional representation. In our paper, we build upon this coupled autoencoder with a dataset of cells taken from several sites of the human body, predicting protein information from RNA information. We find that the autoencoder adequately clusters the cell types in its lower-dimensional representation and performs decently at the prediction task. We show that the coupled autoencoder is a powerful tool and may prove to be a valuable asset in single-cell data analysis.

Machine Learning, Natural Language Processing

On evaluating the robustness of language models with tuning.

  • Group members: Lechuan Wang, Colin Wang, Yutong Luo

Abstract: Prompt tuning and prefix tuning are two effective mechanisms for leveraging frozen language models to perform downstream tasks. Robustness reflects the resilience of a model's output under changes or noise in the input. In this project, we analyze the robustness of natural language models under various tuning methods with respect to a domain shift (i.e., training on one domain but evaluating on out-of-domain data). We apply both prompt tuning and prefix tuning to T5 models for reading comprehension (i.e., question answering) and GPT-2 models for table-to-text generation.

Activity Based Travel Models and Feature Selection

A tree-based model for activity based travel models and feature selection.

  • Group members: Lisa Kuwahara, Ruiqin Li, Sophia Lau

Abstract: In a previous study, Deloitte Consulting LLP developed a method of creating city simulations through cellular location and geospatial data. Using these simulations of human activity and traffic patterns, better decisions can be made regarding modes of transportation or road construction. However, the current commonly used method of estimating transportation mode choice is a utility model that involves many features and coefficients that may not necessarily be important but still make the model more complex. Instead, we used a tree-based approach - in particular, XGBoost - to identify just the features that are important for determining mode choice so that we can create a model that is simpler, robust, and easily deployable, in addition to performing better than the original utility model on both the full dataset and population subsets.

Explainable AI, Causal Inference

Explainable AI.

  • Group members: Jerry Chan, Apoorv Pochiraju, Zhendong Wang, Yujie Zhang

Abstract: Nowadays, algorithmic decision-making systems have become very common in people's daily lives. Gradually, some algorithms become too complex for humans to interpret, such as black-box machine learning models and deep neural networks. In order to assess the fairness of these models and make them better tools for different parties, we need explainable AI (XAI) to uncover the reasoning behind the predictions made by black-box models. In our project, we focus on using different techniques from causal inference and explainable AI to interpret classification models across various domains. In particular, we are interested in three domains: healthcare, finance, and the housing market. Within each domain, we first train four binary classification models, and we have four goals in general: 1) explaining black-box models both globally and locally with various XAI methods; 2) assessing the fairness of each learning algorithm with regard to different sensitive attributes; 3) generating recourse for individuals - a set of minimal actions that changes the prediction of those black-box models; and 4) evaluating the explanations from those XAI methods using domain knowledge.

AutoML Platforms

Deep learning transformer models for feature type inference.

  • Group members: Andrew Shen, Tanveer Mittal

Abstract: The first step AutoML software must take after loading the data is to identify the feature types of individual columns in the input data. This information then allows the software to understand the data and preprocess it so machine learning algorithms can run on it. Project SortingHat of the ADA lab at UCSD frames this task of feature type inference as a machine learning multiclass classification problem. The machine learning models defined in the original SortingHat feature type inference paper use three sets of features as input: (1) the name of the given column, (2) five non-null sample values, and (3) descriptive numeric statistics about the column. The textual features are easy to access; however, the descriptive statistics previous models rely on require a full pass through the data, which makes preprocessing less scalable. Our goal is to produce models that rely less on these statistics by better leveraging the textual features. As an extension of Project SortingHat, we experimented with deep learning transformer models and varied the sample sizes used by random forest models. We found that our transformer models achieved state-of-the-art results on this task, outperforming all existing tools and ML models that have been benchmarked against SortingHat's ML Data Prep Zoo. Our best model uses a pretrained Bidirectional Encoder Representations from Transformers (BERT) language model to produce word embeddings, which are then processed by a Convolutional Neural Network (CNN) model. As a result of this project, we have published two BERT-CNN models via the PyTorch Hub API, to allow software engineers to easily integrate our models or train similar ones for use in AutoML platforms or other automated data preparation applications. Our best model uses all the features defined above, while the other uses only column names and sample values, offering comparable performance and much better scalability for all input data.
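
As a contrast to the learned models described above, a naive rule-based baseline for feature type inference from a column name and a few sample values might look like the sketch below. The three-way label set is a simplification I've assumed for illustration, not SortingHat's actual label vocabulary.

```python
# Toy heuristic baseline: guess a column's feature type from its name
# and a handful of sample values (labels are simplified).

def infer_feature_type(column_name, samples):
    """Return 'numeric', 'datetime', or 'categorical'."""
    name = column_name.lower()
    # Name-based hint first, mirroring how column names carry signal.
    if any(hint in name for hint in ("date", "time", "year")):
        return "datetime"

    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    # Value-based fallback: all-parseable samples look numeric.
    if samples and all(is_number(v) for v in samples):
        return "numeric"
    return "categorical"
```

A learned model (like the BERT-CNN above) effectively replaces these hand-written rules with patterns induced from labeled columns, which is why it can generalize past keyword lists like `("date", "time", "year")`.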

Exploring Noise in Data: Applications to ML Models

  • Group members: Cheolmin Hwang, Amelia Kawasaki, Robert Dunn

Abstract: In machine learning, models are commonly built to avoid what is known as overfitting. As generally understood, overfitting occurs when a model fits the training data exactly, causing poor performance on new examples: the model has low accuracy on unseen data precisely because it has memorized the training set. Therefore, to generalize beyond the examples in a given training set, models are built with techniques that avoid fitting the data exactly. However, overfitting does not always behave the way one might expect, as we show by fitting models to data with a controlled level of noise. Specifically, some models fit exactly to data with high levels of noise still produce highly accurate results, whereas others are more prone to overfitting.
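As a toy illustration of this point (not the project's actual experiments), a 1-nearest-neighbor classifier interpolates noisy training labels exactly, yet can still score well on clean test data:

```python
import random

def true_label(x):
    # Ground-truth rule the noisy training labels are generated from
    return int(x > 0.5)

def nn1_predict(train, x):
    # 1-nearest-neighbor: return the label of the closest training point
    return min(train, key=lambda point: abs(point[0] - x))[1]

random.seed(0)

# Training set with 10% of labels flipped (label noise)
train = []
for _ in range(200):
    x = random.random()
    y = true_label(x)
    if random.random() < 0.10:
        y = 1 - y
    train.append((x, y))

# 1-NN interpolates: it reproduces every noisy training label exactly ...
train_acc = sum(nn1_predict(train, x) == y for x, y in train) / len(train)

# ... yet accuracy on fresh, clean test points remains high
test_xs = [random.random() for _ in range(500)]
test_acc = sum(nn1_predict(train, x) == true_label(x) for x in test_xs) / len(test_xs)
```

Here `train_acc` is 1.0 by construction (perfect "overfit"), while `test_acc` stays far above chance because each flipped label only corrupts a small neighborhood.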

Group Testing for Optimizing COVID-19 Testing

COVID-19 group testing optimization strategies.

  • Group members: Mengfan Chen, Jeffrey Chu, Vincent Lee, Ethan Dinh-Luong

Abstract: The COVID-19 pandemic, which has persisted for more than two years, has been combated by efficient testing strategies that reliably identify positive individuals to slow its spread. In contrast to other pooling strategies in this domain, the methods described in this paper prioritize true negative samples over overall accuracy. In our Monte Carlo simulations, both nonadaptive and adaptive testing strategies with random pool sampling achieved accuracy of at least 95% across varying pool sizes and population sizes while decreasing the number of tests administered. A split tensor rank-2 method attempts to identify all infected samples among 961 samples, with the number of tests converging to 99 as the prevalence of infection converges to 1%.
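For context, the classic baseline such strategies improve on is Dorfman two-stage pooling; a quick expected-cost calculation (our illustration, not the paper's tensor method) shows why pooling pays off at low prevalence:

```python
# Dorfman two-stage pooling: test each pool of k samples once; if the pool
# is positive, retest all k members individually.

def expected_tests_per_person(prevalence, pool_size):
    """Expected tests per person, assuming independent infections."""
    p_pool_positive = 1 - (1 - prevalence) ** pool_size
    return 1 / pool_size + p_pool_positive

# At 1% prevalence, pools of 10 cut testing roughly 5x vs. individual testing
cost_low = expected_tests_per_person(0.01, 10)

# At 50% prevalence, pooling is actually worse than testing individually
cost_high = expected_tests_per_person(0.50, 10)
```

This dependence on prevalence is why the paper's test count converges toward 99 (out of 961 samples) only as prevalence approaches 1%.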

Causal Discovery

Patterns of fairness in machine learning.

  • Group members: Daniel Tong, Anne Xu, Praveen Nair

Abstract: Machine learning tools are increasingly used for decision-making in contexts with crucial ramifications. However, a growing body of research has established that machine learning models are not immune to bias, especially with respect to protected characteristics. This has led to efforts to create mathematical definitions of fairness that can be used to estimate whether, given a prediction task and a certain protected attribute, an algorithm is fair to members of all classes. But just as philosophical definitions of fairness vary widely, so do mathematical ones, and fairness conditions can in fact be mutually exclusive. In addition, the choice of model with which to optimize fairness is a difficult decision we have little intuition for. Consequently, our capstone project centers on an empirical analysis of the relationships between machine learning models, datasets, and various fairness metrics. We produce a three-dimensional matrix of the performance of a given machine learning model, under a given definition of fairness, on a given dataset. Using this matrix on a sample of 8 datasets, 7 classification models, and 9 fairness metrics, we discover empirical relationships between model type and performance on specific metrics, as well as correlations between metric values across different dataset-model pairs. We also offer a website and command-line interface for users to perform this experimentation on their own datasets.
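To give the flavor of one such metric (a minimal sketch, not the project's implementation), demographic parity compares positive-prediction rates across groups defined by a protected attribute:

```python
# Demographic parity difference: the gap in positive-prediction rates
# between two groups; 0 means the metric considers the predictions fair.

def demographic_parity_diff(y_pred, groups, a="A", b="B"):
    def positive_rate(g):
        members = [p for p, grp in zip(y_pred, groups) if grp == g]
        return sum(members) / len(members)
    return abs(positive_rate(a) - positive_rate(b))

preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_diff(preds, groups)  # 0.75 vs 0.25 -> 0.5
```

Other metrics (equalized odds, predictive parity, and so on) condition on the true label as well, which is exactly why they can be mutually exclusive with this one.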

Causal Effects of Socioeconomic and Political Factors on Life Expectancy in 166 Different Countries

  • Group members: Adam Kreitzman, Maxwell Levitt, Emily Ramond

Abstract: This project examines causal relationships between various socioeconomic variables and life expectancy outcomes in 166 countries. It accommodates new, unseen data and variables through an intuitive data pipeline with detailed instructions, and uses the PC algorithm with code updated to account for missingness in the data. With access to this model and pipeline, we hope that questions such as “does authoritarian governance directly relate to life expectancy?” or “how does women's representation in government affect perceived social support?” can now be answered and understood. Through our own analysis, we found intriguing results: for example, a higher Perception of Corruption is distinctly related to a lower Life Ladder score, and higher quality-of-life perceptions are related to lower economic inequality. These results aim to educate not only the general public but government officials as well.

Time series analysis in health

Time series analysis on the effect of light exposure on sleep quality.

  • Group members: Shubham Kaushal, Yuxiang Hu, Alex Liu

Abstract: The increase in artificial light exposure driven by the growing prevalence of technology has an effect on the human sleep cycle and circadian rhythm. The goal of this project is to determine how different colors and intensities of light exposure prior to sleep affect sleep quality, through the classification of time series data.

Sleep Stage Classification for Patients With Sleep Apnea

  • Group members: Kevin Chin, Yilan Guo, Shaheen Daneshvar

Abstract: Sleep is not uniform and consists of four stages: N1, N2, N3, and REM sleep. The analysis of sleep stages is essential for understanding and diagnosing sleep-related diseases such as insomnia, narcolepsy, and sleep apnea; however, sleep stage classification often does not generalize to patients with sleep apnea. The goal of our project is to build a sleep stage classifier specifically for people with sleep apnea and to understand how it differs from classifiers for normal sleep. We will then explore whether the inclusion and featurization of ECG data improves the performance of our model.

Environmental health exposures & pollution modeling & land-use change dynamics

Supervised classification approach to wildfire mapping in Northern California.

  • Group members: Alice Lu, Oscar Jimenez, Anthony Chi, Jaskaranpal Singh

Abstract: Burn severity maps are an important tool for understanding fire damage and managing forest recovery. We have identified several issues with current mapping methods used by federal agencies that affect the completeness, consistency, and efficiency of their burn severity maps. To address these issues, we demonstrate the use of machine learning as an alternative to traditional methods of producing severity maps, which rely on in-situ data and spectral indices derived from image algebra. We have trained several supervised classifiers on sample data collected from 17 wildfires across Northern California and evaluated their performance at mapping fire severity.
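The "spectral indices derived from image algebra" referred to above are most commonly the Normalized Burn Ratio (NBR) and its pre/post-fire difference (dNBR); per pixel, the traditional baseline looks like this (reflectance values below are illustrative, not from the project's data):

```python
# Per-pixel Normalized Burn Ratio and its pre/post-fire difference (dNBR),
# the traditional image-algebra baseline for burn severity mapping.

def nbr(nir, swir):
    """NBR = (NIR - SWIR) / (NIR + SWIR)."""
    return (nir - swir) / (nir + swir)

def dnbr(pre_nir, pre_swir, post_nir, post_swir):
    """dNBR = prefire NBR - postfire NBR; larger values mean more severe burn."""
    return nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)

# Healthy vegetation (high NIR) burned down to exposed soil (high SWIR)
severity = dnbr(0.45, 0.15, 0.20, 0.35)
```

Agencies then threshold dNBR into severity classes, which is one source of the consistency issues a learned classifier can avoid.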

Network Performance Classification

Network signal anomaly detection.

  • Group members: Laura Diao, Benjamin Sam, Jenna Yang

Abstract: Network degradation occurs in many forms, and our project focuses on two common factors: packet loss and latency. Packet loss occurs when one or more data packets transmitted across a computer network fail to reach their destination. Latency is a measure of the delay for data to transmit across a network. For internet users, high rates of packet loss and significant latency can manifest as jitter or lag, which are indicators of overall poor network performance as perceived by the end user. Thus, when issues arise in these two factors, it would be beneficial for internet service providers to know exactly when the user is experiencing problems in real time. In real-world scenarios, conditions such as poor port quality, overloaded ports, and network congestion can impact overall network performance. To detect some of these issues in network transmission data, we built an anomaly detection system that predicts the estimated packet loss and latency of a connection and detects whether there is a significant degradation of network quality over the duration of the connection.

Real Time Anomaly Detection in Networks

  • Group members: Justin Harsono, Charlie Tran, Tatum Maston

Abstract: Internet companies are expected to deliver the speed their customers have paid for. However, for various reasons such as congestion or connectivity issues, degradations in network quality are inevitable. To ensure the customer remains satisfied, monitoring systems must be built to inspect the quality of the connection. Our goal is to build a model that can detect these regions of network degradation in real time, so that appropriate recovery actions can be taken to offset them. Our solution combines two anomaly detection methods that successfully detect shifts in the data, based on a rolling window of observations.
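A minimal version of such a rolling-window detector (our sketch, not the group's implementation) flags samples that deviate by more than three standard deviations from the recent window:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Flag each sample whose z-score vs. a rolling window exceeds threshold."""
    recent = deque(maxlen=window)
    flags = []
    for x in samples:
        if len(recent) >= 2 and stdev(recent) > 0:
            z = (x - mean(recent)) / stdev(recent)
            flags.append(abs(z) > threshold)
        else:
            flags.append(False)  # not enough history to judge
        recent.append(x)
    return flags

# Steady ~10 ms latency, then a sudden degradation to 100 ms
latencies = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 10.0, 100.0]
flags = detect_anomalies(latencies)
```

Because the window only contains recent history, the detector adapts to gradual drift while still catching abrupt shifts like the final sample.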

System Usage Reporting

Intel telemetry: data collection & time-series prediction of app usage.

  • Group members: Srikar Prayaga, Andrew Chin, Arjun Sawhney

Abstract: Despite advancements in hardware technology, PC users continue to face frustrating app launch times, especially on lower-end Windows machines. The desktop experience differs vastly from the instantaneous app launches and optimized experience we have come to expect even from low-end smartphones. We propose a solution that preemptively runs Windows apps in the background based on the user's app usage patterns. Our solution has two steps. First, we built telemetry collector modules in C/C++ to collect real-world app usage data from two of our personal Windows 10 devices. Next, we developed neural network models in Python, trained on the collected data, to predict app usage times and corresponding launch sequences. We achieved impressive results on selected evaluation metrics across different user profiles.

Predicting Application Use to Reduce User Wait Time

  • Group members: Sasami Scott, Timothy Tran, Andy Do

Abstract: Our goal for this project was to lower user wait time when loading programs by predicting the next application to be used. To obtain the needed data, we created data collection libraries. Using this data, we built a Hidden Markov Model (HMM) and a Long Short-Term Memory (LSTM) model; the latter proved better. Using the LSTM, we can predict application usage times and expand this approach to more applications. We created multiple LSTM models with varying results, and ultimately chose the one that reported 90% accuracy.

INTELlinext: A Fully Integrated LSTM and HMM-Based Solution for Next-App Prediction With Intel SUR SDK Data Collection

  • Group members: Jared Thach, Hiroki Hoshida, Cyril Gorlla

Abstract: As the power of modern computing devices increases, so too do user expectations for them. Despite advancements in technology, computer users are often faced with the dreaded spinning icon while waiting for an application to load. Building upon our previous work developing data collectors with the Intel System Usage Reporting (SUR) SDK, we introduce INTELlinext, a comprehensive next-app prediction solution for application preloading, intended to improve perceived system fluidity. We develop a Hidden Markov Model (HMM) to predict the k most likely next apps, achieving an accuracy of 64% when k = 3. We then implement a long short-term memory (LSTM) model to predict the total duration for which applications will be used. After hyperparameter optimization leading to an optimal lookback of 5 previous applications, we are able to predict the usage time of a given application with a mean absolute error of ~45 seconds. Our work constitutes a promising comprehensive application preload solution, with data collection based on the Intel SUR SDK and prediction with machine learning.
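Stripped to its core, predicting the k most likely next apps can be sketched as a first-order Markov model over the launch log (our illustration, with a made-up log; the HMM above generalizes this idea with hidden states):

```python
from collections import Counter, defaultdict

def fit_transitions(launches):
    """Count app-to-app transitions in a chronological launch log."""
    counts = defaultdict(Counter)
    for current, nxt in zip(launches, launches[1:]):
        counts[current][nxt] += 1
    return counts

def top_k_next(counts, current_app, k=3):
    """Return the k most frequently observed successors of current_app."""
    return [app for app, _ in counts[current_app].most_common(k)]

log = ["chrome", "word", "chrome", "excel", "chrome", "word", "chrome", "spotify"]
model = fit_transitions(log)
preds = top_k_next(model, "chrome", k=2)
```

A preloader would then warm the top-k candidates in the background; accuracy at k = 3, as reported above, measures how often the truly-launched app is among them.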
