A Novel Approach for Cross-Selling Insurance Products Using Positive Unlabelled Learning


Int J Environ Res Public Health

Machine Learning-Based Regression Framework to Predict Health Insurance Premiums

Keshav Kaushik

1 School of Computer Science, University of Petroleum and Energy Studies, Dehradun 248007, India; officialkeshavkaushik@gmail.com (K.K.); bhrdwh@yahoo.com (A.B.)

Akashdeep Bhardwaj

Ashutosh Dhar Dwivedi

2 Centre for Business Data Analytics, Department of Digitalization, Copenhagen Business School, 2000 Frederiksberg, Denmark; [email protected]

Rajani Singh

Associated Data

The dataset used in this research is publicly available at https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction (accessed on 20 June 2022).

Artificial intelligence (AI) and machine learning (ML) in healthcare aim to make people's lives easier by anticipating and diagnosing diseases more swiftly than most medical experts can. When technology, especially digital health insurance, reduces the distance between an insurance business and the consumer to zero, a direct link is created between the insurer and the policyholder. In comparison with traditional insurance, AI and machine learning have altered the way insurers create health insurance policies and helped consumers receive services faster. Insurance businesses use ML to provide clients with accurate, quick, and efficient health insurance coverage. This research trained and evaluated an artificial neural network-based regression model to predict the health insurance cost incurred by individuals on the basis of their features: age, gender, body mass index, number of children, smoking habits, and geolocation. The experimental results displayed an accuracy of 92.72%, and the authors analyzed the model's performance using key performance metrics.

1. Introduction

We live in a world filled with dangers and uncertainties. People, homes, businesses, buildings, and property are all vulnerable to various types of risk, including the risk of death, illness, and the loss of property or possessions. People's lives revolve around their health and happiness. However, because risks cannot always be avoided, the financial sector has devised a number of products that use financial resources to compensate individuals and organisations for losses. Insurance, as a result, is a policy that reduces or eliminates the costs of various risks. A policy that covers medical bills is known as health insurance. An individual who has purchased a health insurance policy receives coverage after paying a certain premium. Because numerous factors influence the insurance premium of a health insurance policy [1], the premium amount varies from person to person. Consider age: a young individual is far less likely than an older person to suffer serious health issues, so treating an elderly person is more expensive than treating a young one, and an older individual must therefore pay a higher premium than a younger person.

In healthcare, artificial intelligence is capable of completing many medical-related activities at a much quicker rate in order to forecast or diagnose illnesses/injuries effectively and deliver the best medical therapy to the patient. AI may gather data, process it, and offer the appropriate result to the user. This reduces the time it takes to detect diseases and mistakes, allowing the diagnosis–treatment–recovery cycle to be dramatically shortened. For example, if you choose an online consultation with a doctor, chatbots are used by healthcare professionals or organisations to obtain basic information prior to an appointment with the doctor. This assists the doctor in comprehending the problem before beginning the consultation procedure. As a result, both the doctor and the patient save time.

AI and ML play various roles in the health insurance market, some of which are listed below:

  • Chatbots: The use of chatbots has become an increasingly important aspect of any firm, and healthcare organisations are also embracing the technology. Because almost everyone has access to the Internet and a smartphone, interacting with physicians, hospitals, and insurance companies is much easier using chat applications. Chatbots are available 24 hours a day, seven days a week, making them more responsive than human interaction. They employ emotional analysis and natural language processing to better comprehend consumers' requests and respond to a variety of queries about insurance claims and product choices.
  • Faster Claim Settlements: The time it takes for health insurance claims to be settled is one of the main difficulties for both policyholders and insurers. This might be due to lengthy manual processes or bogus claims. It takes time and effort to manually identify valid claims. However, AI has the potential to significantly lower claim processing times in the future. AI can detect fraudulent claims and learn from previous data to improve efficiency significantly.
  • Personalised Health Insurance Policies: On the basis of an individual’s past data and current health circumstances, insurers can identify and develop a health insurance plan for them. This assists the insurer in providing a proper health insurance plan rather than a health insurance package that clients may or may not utilise efficiently. Customers will also be urged to select a plan that meets their requirements rather than paying for services they may not use.
  • Cost-effectiveness: Insurers are utilising AI to recommend good habits and behaviours to clients, such as exercise and diet, lowering the cost of avoidable healthcare expenditures caused by bad habits.
  • Fraud Detection: Researchers are working on building machines that can evaluate health insurance claims and anticipate fraud. This also aids insurers in resolving legitimate claims more quickly.
  • Faster Underwriting: The health insurance underwriting procedure is lengthy and time-consuming. Fitness trackers, for example, can now collect and analyse vast amounts of data and share it with insurance companies thanks to technological breakthroughs, such as smart wearable technologies. Insurers can find innovative methods to underwrite consumers differently by employing these data. By adopting AI-based predictive analysis, health insurance firms may save time and money.

Even as the healthcare business quickly digitises, enormous amounts of data will inevitably be created and gathered. This will simply increase the workload for healthcare providers since more raw data means more effort. For healthcare professionals and patients, AI can interpret these data and deliver insights based on them. It is a more efficient way to diagnose ailments. Some of the advantages of AI and ML in healthcare are:

  • Clinical Observation-Based Decisions: AI and machine learning can process vast volumes of data in real time and give critical information that can aid in patient diagnosis and treatment recommendations. This translates to improved healthcare services at a reduced cost by evaluating patient data and delivering findings in a couple of minutes. Diabetes or blood sugar devices, for example, may analyse data rather than merely reading raw data and alert you to patterns depending on the information presented, allowing you to take immediate or corrective action.
  • Increased Accessibility: While affluent countries can offer healthcare to the majority of their citizens, underdeveloped countries may struggle. This is owing to a technological gap in healthcare, which results in a drop in the respective country’s health index. Reaching out to individuals in the farthest reaches of the globe is an important task, and the risk of healthcare deprivation is growing. By establishing an efficient healthcare system, AI can assist to alleviate this problem. Digital healthcare will help bridge the gap between poor and wealthy countries by allowing people to better understand their symptoms and obtain treatment as soon as possible.
  • Helps Reveal Early Illness Risks: AI can evaluate enormous amounts of patient medical data and compile it all in one location, which can help reveal early illness risks. It may examine prior and current health issues using the information. Doctors may compare the data and make an accurate diagnosis, allowing them to deliver the best therapy possible. With a large amount of data in one location, AI-powered healthcare applications can assess a wide range of symptoms, diagnose ailments, and potentially forecast future illnesses.
  • Early Detection of Illness: Artificial intelligence can learn from data, such as diagnoses, medical reports, and photographs. This helps detect the beginning of ailments over time as well as implement preventative and mitigation measures.
  • Saving Time and Money: Artificial intelligence reduces the time and effort required to evaluate and diagnose an ailment. Instead of waiting for a doctor's consultation to diagnose a sickness, AI can analyse the case and offer correct inputs to the doctor, allowing the doctor to make the best decision possible and minimising the time it takes to deliver early treatment. People may not need to visit many laboratories for diagnosis if AI can read and evaluate the condition.
  • Expediting Processes: By streamlining visits, interpreting clinical notes, and recording patient notes and treatment plans, AI can assist clinicians in decreasing their administrative load. The benefits of AI in healthcare are numerous since it simplifies operations and offers reliable data in less time.
  • Improve Drug Development: Drug development can take a long time and sometimes miss deadlines for pharmaceutical companies to deliver the proper formula. On the other hand, drug development has never been faster than it is now, thanks to AI. AI allows scientists to concentrate on creating treatments that are both promising and relevant to the needs of patients. It saves time and money when creating medications that might save lives in an emergency.

When it comes to evaluating data, healthcare in India is incredibly complicated and difficult to grasp, and patients often pay the price. Artificial intelligence (AI) in healthcare can boost efficiency and treatment effectiveness. It can also assist healthcare personnel in spending more time delivering appropriate treatment, lowering burnout among medical experts. Here are a few examples of how AI affects healthcare:

  • Expanding healthcare access in underdeveloped or underserved nations.
  • Making electronic health records less burdensome.
  • Reducing the threat of antibiotic resistance.
  • Processing insurance claims faster.
  • Personalising individual health insurance plans.

The highlights of this research are:

  • This domain of insurance prediction is not fully explored and requires thorough research. From the proposed machine learning model, patients, hospitals, physicians, and insurance providers could benefit and accomplish their tasks faster and more efficiently.
  • The authors trained an ANN-based regression model to predict health insurance premiums.
  • The model was evaluated against key performance metrics: RMSE, MSE, MAE, R², and adjusted R².
  • The overall accuracy of the proposed model was 92.72%.
  • The correlation matrix was plotted to visualise the relationship between the various factors and the charges.

This paper is organised as follows: Section 1 introduces the topic and concept; Section 2 highlights the latest related work in this domain. Section 3 discusses the working methodology followed in the implementation, and Section 4 presents the results and discussion. Finally, Section 5 contains the conclusion of the entire paper.

2. Related Work

The authors identified 245 published research papers from 2017 to the present date from IEEE, ACM, Inderscience, Elsevier, and other highly referenced journals. On the basis of similar research using artificial intelligence and machine learning models to predict the health insurance premium amounts for subscribers, the authors divided and classified these publications with a four-stage selection method. The literature survey for all the selected papers and the categorisation breakdown are illustrated in Table 1 .

Table 1. Research literature classification.

This helped to shortlist 46 papers that were closely matched and relevant research, as illustrated in Figure 1 .

Figure 1. Research papers selection process.

In the healthcare business, predicting health insurance premiums using machine learning (ML) algorithms is still a subject that has to be investigated and improved. The work of [2] presented a computational intelligence technique for estimating healthcare insurance expenditures using a set of machine learning techniques. One study [3] began by looking at the potential ramifications of using predictive algorithms to calculate insurance prices: would this jeopardise the idea of risk mutualisation, resulting in new kinds of prejudice and insurance exclusion? In the second stage, the authors looked at how the connection between the company and the insured was altered when the customer realised that the firm had a great deal of constantly updated data about her actual behaviour.

The goal of the study proposed by van den Broek-Altenburg and Atherly [ 4 ] was to find out what customers think about medical insurance by looking at what they talk about on Twitter. The goal was to utilise sentiment classification to find out how people feel about health insurance and doctors. During the 2016–2017 healthcare insurance registration period in the United States, the authors utilised an Application Program Interface (API) to collect tweets on Twitter with the phrases “health insurance” or “health plan”. A policy that decreases or negates the costs of losses caused by different hazards is known as insurance. Several [ 5 ] variables impact the cost of insurance. These elements have an impact on the development of insurance plans. Machine learning (ML) can help the insurance industry enhance the efficiency of policy wording.

An article by Nidhi Bhardwaj and Rishabh Anand [ 6 ] used individuals’ health data to forecast their insurance premiums. To assess and evaluate the performance of various algorithms, regression was utilised. The dataset was used to train the models, and the results of that training were utilised to make predictions. The model was then tested and verified by comparing the anticipated quantity to the actual data. The accuracy of these models was later compared. Multiple linear regression and gradient boosting algorithms outperformed linear regression and decision trees, according to the findings. Gradient boosting was suitable in this scenario since it required far less computing time to attain the same performance measure, although its performance was equivalent to that of multiple regression.

In the life insurance sector, risk assessment is critical for classifying applicants. Companies utilise screening methodology to produce application decisions and determine the pricing of insurance products. The vetting process may be computerised to speed up applications or programs thanks to the expansion of data and advances in business intelligence. The goal of the study in [ 7 ] was to find ways to use predictive analytics to improve risk assessment for life insurance companies. The research was conducted using a real-world dataset with over a hundred characteristics (anonymised). Dimensionality reduction was performed to choose salient features that could increase the models’ prediction potential.

Actuaries utilise a variety of numerical procedures to forecast yearly medical claims expenditure in an insurance business. This sum must be accounted for in the annual financial budgets. Inaccurate estimation usually has a detrimental impact on a company’s overall success. Goundar et al. [ 8 ] explained how to build an artificial neural network (ANN) that can predict yearly medical claims. The aim was to lower the mean absolute percentage error by changing factors of the configuration, such as the epoch, learning rate, and neurons, in various layers once the neural network models were constructed. Feed forward and recurrent neural networks were utilised to forecast the yearly claim amounts.

Joseph Ejiyi et al. [9] investigated an insurance dataset from the Zindi Africa competition, stated to be from Olusola Insurance Company in Lagos, Nigeria, to demonstrate the efficacy of each of the ML algorithms they employed. Insurance authorities, shareholders, administration, finance professionals, banks, accountants, insurers, and customers have all expressed worry about insurance company insolvency. This worry stemmed from a perceived requirement to shield the general public from the repercussions of insurer insolvencies while also lowering management and auditing duties. The work in [10] offered a strategy for predicting insurance company insolvency. In the past, insolvency prediction approaches, such as multiple regression, logit analysis, and recursive partitioning algorithms, were applied.

Fauzan and Murfi [11] used XGBoost to solve the issue of claim prediction and evaluated its accuracy. They also compared XGBoost's performance against that of other ensemble learning methods, such as AdaBoost, Stochastic GB, Random Forest, and Neural Network, as well as online learning methods. In terms of normalised Gini, their simulations suggest that XGBoost outperforms the other techniques. People are increasingly investing in such insurance, allowing con artists to defraud them. Insurance fraud is a crime that can be committed by either the customer or the vendor of the insurance contract. Client-side insurance fraud includes unrealistic claims and post-dated policies, among other things; vendor-side fraud includes implementing policies from non-existent firms and failing to submit premiums. The study in [12] compared and contrasted several categorisation methods.

Kumar Sharma and Sharma [13] aimed to develop mathematical models for predicting future premiums and validated the findings using regression models. Azzone et al. [14] employed the random forest approach to anticipate policyholders' decisions to lapse life insurance contracts; even when factoring in feature interactions, the technique beat the logistic model. To study how the model works, they employed global and local classification tools. The findings suggest that existing models, such as the logistic regression model, are unable to account for the variety of financial decisions.

Understanding [ 15 ] the elements that influence a user’s health insurance premium is critical for insurance firms to generate proper charges. Premium should always be a user’s first concern when making suitable selections. The majority of characteristics that contribute to the cost of health care premiums are BMI, smoking status, age, and kids, according to the output, which revealed that these four parameters have a strong correlating effect on health insurance rates.

Premiums are determined by health insurance companies’ private statistical procedures and complicated models, which are kept concealed from the public. The goal of this study [ 16 ] is to see if machine learning algorithms can be used to anticipate the pricing of yearly health insurance premiums on the basis of contract parameters and business characteristics. The goal of this article [ 17 ] is to use a strong machine learning model to estimate the future medical costs of patients on the basis of specific parameters. Using the simulation results, the elements that influence individuals’ medical expenditures were determined.

The Japanese government has mandated that insurers develop a population health management strategy, and assessing the strategy [18] requires a cost estimate. A standard linear model is not suited for the prediction, since one insured patient might have several conditions; the authors therefore created a medical cost forecasting model using a quantitative machine learning technique. Yang et al. [19] investigated the historical uniformity of health care expenses in a major state Medicaid programme, forecasting expenses with predictive machine learning algorithms, particularly for high-cost, high-need (HCHN) patients. Their findings indicated a strong temporal link and showed potential for using machine learning to forecast future health care spending. HCHN patients had a stronger temporal association, so their expenditures could be anticipated more accurately, and including additional historical periods improved forecasting accuracy.

Some individuals who are economically disadvantaged are unable to cover treatment-related fees, and the necessity for health insurance varies with behaviour, genetics, and age. Health insurance is becoming increasingly important as people's lifestyles and ailments change. Because a medical problem can strike anybody at any moment and have a significant psychological and economic impact on the individual, it is difficult to predict when one will occur. With this background in mind, this research [20] aimed to forecast the cost of health insurance using the following inputs: age, gender, region, smoking, BMI, and children.

The K-means algorithm [21] and the Elbow technique were used in one study to group people into an appropriate number of clusters on the basis of similarities; the health insurance premium quotation was then predicted for each group using the specified criteria. Predicting the cost of people's health insurance is a valuable way to increase healthcare accountability. To forecast insurance premiums for individuals [22], Sailaja et al. employed several regression models to assess personal health information. Many factors impact insurance rates, and using a stacking regression model to anticipate insurance prices might help health insurers. Dutta et al. [23] estimated the cost of health insurance that the patient was responsible for paying. To accomplish the best prediction analysis, several data mining regression methods, such as decision tree, random forest, polynomial regression, and linear regression, were used. The actual and expected premiums were compared and plotted, allowing the authors to select the best-suited regression technique for insurance policy forecasting.

3. Research Methodology

In this paper, the authors used the Python programming language for the implementation and trained a machine learning-based model to predict health insurance premiums. Initially, the dataset and the necessary Python libraries and packages were imported. The dataset consisted of over 1300 entries and seven columns: charges, smoking, region, children, BMI, sex, and age. Thereafter, an exploratory data analysis was performed: the dataset was checked for null values, and since there were none, the statistical summary of the dataset was analysed. The statistical summary included the count, mean, standard deviation, and various other statistics for the numeric columns: age, BMI, number of children, and health insurance charges. The dataset link is given in the Data Availability Statement at the end of the paper. The entire methodology followed in this paper is shown in Figure 2.
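The loading and null-check steps described above can be sketched as follows. The paper's own code is not published, so this is a minimal illustration: the handful of rows below are invented stand-ins for the ~1,300-row Kaggle dataset, with the seven columns the paper names.

```python
import pandas as pd

# Illustrative records standing in for the Kaggle insurance-premium dataset
df = pd.DataFrame({
    "age":      [19, 33, 62, 27],
    "sex":      ["female", "male", "female", "male"],
    "bmi":      [27.9, 22.7, 26.3, 42.1],
    "children": [0, 1, 0, 3],
    "smoker":   ["yes", "no", "no", "yes"],
    "region":   ["southwest", "southeast", "northwest", "southeast"],
    "charges":  [16884.92, 1725.55, 27808.73, 39611.76],
})

# Exploratory data analysis: check for null values per column
print(df.isnull().sum())

# Statistical summary (count, mean, std, quartiles) of the numeric columns
print(df.describe())
```

On the real dataset, `df.isnull().sum()` returning all zeros corresponds to the paper's observation that no null values were present.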

Figure 2. Machine learning-based regression framework.

3.1. Step 1: Performing the Data Analysis and Feature Engineering

In this step, the dataset was analysed to check the relationship between the various columns. As shown in Table 2 , it was observed that the southeast region had the highest charges and body mass index. The dataset was grouped by age, and then the relationship between age and charges was analysed.

Table 2. Relationship between the region and charges.

In this step, the unique values in the sex, smoking, and region columns were checked, and the categorical variables were converted to numerical variables.
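The paper does not say which encoding was used for the categorical columns; one common sketch, inspecting the unique values and then converting sex, smoker, and region to numeric indicator variables with pandas:

```python
import pandas as pd

# Illustrative rows standing in for the real dataset
df = pd.DataFrame({
    "age": [19, 33, 62],
    "sex": ["female", "male", "female"],
    "bmi": [27.9, 22.7, 26.3],
    "children": [0, 1, 0],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [16884.92, 1725.55, 27808.73],
})

# Check the unique values in each categorical column
for col in ["sex", "smoker", "region"]:
    print(col, df[col].unique())

# One-hot encode the categorical columns into 0/1 indicator variables
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"])
print(encoded.columns.tolist())
```

Indicator columns such as `smoker_yes` and the four region dummies are what later allow the correlation matrix to relate smoking and region to charges.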

3.2. Step 2: Data Visualisation

In the previous step, the dataset was cleaned so that the model could be trained and visualised. In this step, the data was visualised to obtain useful information. In Figure 3 , the histogram is plotted for all the columns present in the dataset for a visual glimpse.

Figure 3. Histogram plots for columns.

After that, the pairplot diagram was plotted, as illustrated in Figure 4 . Pairplot diagrams are used to figure out which attributes best explain the connection between two variables or form the most separated groups. Drawing basic lines or making a linear distinction in our dataset also aided in the formation of some simple categorisation models.

Figure 4. Pairplot diagram of the entire dataset.

The pairplot diagram showed the relationship between the various columns present in the dataset. A pairplot is a grid that shows all the different scatter plots with all the different combinations in our data. After plotting the pairplot diagram, the regplot was plotted, as shown in Figure 5 . We can see that as age increased, charges tended to increase as well. Therefore, there is a linear relationship between the charges and age.

Figure 5. Regplot of charges vs. age.

Regplot plots the data and fits a linear regression model; several mutually exclusive options exist for estimating the regression. In Figure 6, a straight line passes through the data, suggesting that as body mass index (BMI) increases, the charges also tend to increase slightly.

Figure 6. Regplot of charges vs. BMI.
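The histogram and regplot views described in this section were most likely produced with seaborn; a dependency-light sketch of the same idea, using matplotlib and a least-squares fit on invented age/charges data (the positive slope is an assumption mimicking the trend the paper reports):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Invented age/charges data standing in for the Kaggle dataset
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 250.0 * age + rng.normal(0, 2000, size=200) + 2000

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of one column (Figure 3 shows one histogram per column)
ax1.hist(age, bins=20)
ax1.set_xlabel("age")

# Regplot-style scatter with a fitted least-squares line (cf. Figure 5)
slope, intercept = np.polyfit(age, charges, deg=1)
ax2.scatter(age, charges, s=8)
ax2.plot(age, slope * age + intercept, color="red")
ax2.set_xlabel("age")
ax2.set_ylabel("charges")

fig.savefig("eda_plots.png")
```

A positive fitted slope is the visual cue behind the paper's statement that charges tend to rise with age.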

3.3. Step 3: Training and Evaluating a Linear Regression Model

In this step, the authors trained the linear regression model, but before training, the dataset was cleaned: only the numerical values were taken, and the data were scaled using a standard scaler. Scaling the data is important before feeding it to the model. Once the data were scaled, the linear regression model was trained; its accuracy came out to be 75.09%. After that, the linear regression model was evaluated by finding the Root Mean Square Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), and adjusted R² score. The formulas used for the calculation of these metrics are given below.
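The standard definitions of these metrics, for n observations with true values y_i, predictions ŷ_i, mean ȳ, and p predictors, are:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}.
```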

The output of the evaluation is shown in Table 3 .

Table 3. Evaluation metrics for the linear regression model.
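This step can be sketched end to end with scikit-learn: scale with a standard scaler, fit a linear regression, and compute the metrics. The synthetic features below are assumptions standing in for the encoded insurance data, so the printed numbers will not match Table 3.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic numeric features standing in for the encoded insurance data
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ np.array([4.0, -2.0, 1.0, 0.5, 3.0]) + rng.normal(0, 0.5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale the features before fitting, as in the paper
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)
pred = model.predict(scaler.transform(X_test))

# Evaluation metrics: MSE, RMSE, MAE, R², adjusted R²
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(rmse, mse, mae, r2, adj_r2)
```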

4. Results and Discussion

The final step, i.e., training and evaluating an ANN-based regression model, is discussed in this section. Initially, the entire dataset was split into 80% training data and 20% testing data. To train the ANN, the authors used a Keras Sequential model with five dense layers and the ReLU activation function; the Adam optimiser was used to optimise the model's performance. Table 4 shows the model summary: 38,351 trainable parameters and 0 non-trainable parameters.

Table 4. ANN model summary.

The model was trained for 100 epochs with a batch size of 20 and a validation split of 0.2. The accuracy of this model came out to be 92.72%, and the validation and training losses are plotted in Figure 7.
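The paper's model is a Keras Sequential network (five dense layers, ReLU, Adam, 100 epochs, batch size 20, 0.2 validation split). As a runnable approximation that does not require TensorFlow, the same recipe can be sketched with scikit-learn's MLPRegressor. The layer widths here are assumptions, since the paper reports only the 38,351-parameter total, and the synthetic data is invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance features and charges
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 400)

# 80/20 train/test split, as in the paper
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)

# Multi-layer perceptron: hidden-layer sizes are illustrative assumptions
ann = MLPRegressor(
    hidden_layer_sizes=(100, 100, 100, 100),  # four hidden layers + output
    activation="relu",
    solver="adam",
    batch_size=20,
    max_iter=100,  # loosely mirrors the paper's 100 epochs
    random_state=0,
)
ann.fit(scaler.transform(X_train), y_train)
score = ann.score(scaler.transform(X_test), y_test)  # R² on held-out data
print(score)
```

In Keras the equivalent would stack `Dense(..., activation="relu")` layers in a `Sequential` model and call `fit(..., epochs=100, batch_size=20, validation_split=0.2)`.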

Figure 7. Training loss vs. validation loss.

Moreover, the model predictions and true values were also plotted to see the relationship between them. Figure 8 shows the plot of model predictions vs. true values, whereas Figure 9 shows the inverse transform plot of model predictions vs. true values.

Figure 8. Model predictions vs. true values.

Figure 9. Inverse transform of model predictions vs. true values.

Once the ANN model was trained and its accuracy calculated, the model's performance was evaluated using the same performance metrics, i.e., RMSE, MSE, MAE, R², and adjusted R². Table 5 shows the comparison between the evaluation metrics of the trained ANN model and the linear regression model; the comparison makes clear that the trained model performed better.

Table 5. Comparison of the evaluation metrics for the trained ANN model vs. the linear regression model.

Comparing the metrics in Table 5, one can conclude that the evaluation metrics of the trained ANN model are better than those of the linear regression model. Finally, the correlation matrix was plotted to see the positive and negative relationships among the multiple factors. From the correlation matrix in Figure 10, charges are positively related to smoking and age, whereas the southwest and northwest regions are negatively related to charges.
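A correlation matrix like Figure 10 comes directly from the encoded dataframe. A sketch with invented data whose correlations are constructed to mimic the reported relationships (charges rising with age and smoking); on the real dataset one would call `.corr()` on the encoded columns instead:

```python
import numpy as np
import pandas as pd

# Invented encoded data: charges rise with age and smoking by construction
rng = np.random.default_rng(7)
n = 200
age = rng.integers(18, 65, n)
smoker = rng.integers(0, 2, n)
charges = 200 * age + 15000 * smoker + rng.normal(0, 1000, n)
df = pd.DataFrame({"age": age, "smoker_yes": smoker, "charges": charges})

# Pearson correlation matrix across all columns
corr = df.corr()
print(corr.round(2))
# A heatmap, e.g. seaborn.heatmap(corr, annot=True), gives the Figure 10 view
```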

Figure 10. Correlation matrix.

5. Conclusions

In the field of health insurance, machine learning is well suited to tasks that are often performed by people at a slower speed. AI and machine learning are capable of analysing and evaluating large volumes of data in order to streamline and simplify health insurance operations. The impact of machine learning on health insurance will save time and money for both policyholders and insurers. AI will handle repetitive activities, allowing insurance experts to focus on processes that improve the policyholder's experience. Patients, hospitals, physicians, and insurance providers will benefit from ML's ability to accomplish jobs, currently performed by people, much faster and less expensively. When it comes to exploiting historical data, machine learning is one component of cognitive computing that may address various challenges in a broad array of applications and systems. Forecasting health insurance premiums is still a topic that has to be researched and addressed in the healthcare business. In this study, the authors trained an ANN-based regression model to predict health insurance premiums. The model was then evaluated using key performance metrics, i.e., RMSE, MSE, MAE, R², and adjusted R². The accuracy of the model was 92.72%. Moreover, the correlation matrix was plotted to see the relationship between the various factors and the charges. This domain of insurance prediction has not been fully explored and requires thorough research.

Funding Statement

This research received no external funding.

Author Contributions

Conceptualisation, K.K. and A.B.; methodology, K.K. and A.B.; software, K.K. and A.B.; validation, K.K. and A.B.; formal analysis, K.K. and A.B.; investigation, K.K. and A.B.; resources, A.D.D. and R.S.; data curation, K.K. and A.B.; writing—original draft preparation, K.K. and A.B.; writing—review and editing, K.K. and A.B.; visualisation, K.K. and A.B.; supervision, A.D.D. and R.S.; project administration, A.D.D. and R.S.; funding acquisition, A.D.D. and R.S. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this research is publicly available at https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction (accessed on 20 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Proceedings of International Conference on Computing and Communication Networks, pp. 29–38

Insurance Sales Forecast Using Machine Learning Algorithms

  • Zuhal Kurt 13 ,
  • Emrecan Varyok 14 ,
  • Ege Baran Ayhan 14 ,
  • Mehmet Turhan Bilgin 13 &
  • Duygu Duru 15  
  • Conference paper
  • First Online: 09 July 2022


Part of the Lecture Notes in Networks and Systems book series (LNNS,volume 394)

Car accidents and the possible resulting loss of assets or life are issues that every car owner must contend with at some point in their driving life. Driving is an inherently dangerous act, even if it does not seem so at first, resulting in more than 33,000 fatal vehicle crashes in the USA in 2019 alone. However, the loss of life and possible damages can be reduced with the help of insurance. Insurance is an arrangement under which a person or agency receives financial security or reimbursement from an insurance provider in the form of a policy. Insurance helps limit customers' losses when an undesirable event occurs, such as a car crash or a heart attack. Vehicle insurance provides customers monetary compensation after unfortunate accidents, provided they first pay annual premium fees to the companies. Our goal is to develop a machine learning algorithm that predicts which customers are interested in getting or renewing their vehicle insurance with the help of personal, vehicle, contact, and previous insurance data. The insurance sales forecast is helpful to companies, since they can then plan their communication strategy accordingly to reach out to those customers and optimize their business model and revenue, while also being beneficial to customers, who can get through the process and the aftermath of car accidents more easily thanks to their monetary compensation. In this paper, the Health Insurance Cross-Sell Prediction dataset is used. The proposed model trains itself on train and test splits of the data and produces a categorical response feature based on the aforementioned data with the aid of well-known machine learning algorithms: k-nearest neighbors, random forest, support vector machines, Naive Bayes, and logistic regression.

  • Insurance prediction
  • Data analysis
  • Machine learning algorithm



Author information

Authors and affiliations.

Department of Computer Engineering, Atılım University, Ankara, Turkey

Zuhal Kurt & Mehmet Turhan Bilgin

Department of Automotive Engineering, Atılım University, Ankara, Turkey

Emrecan Varyok & Ege Baran Ayhan

Department of Chemical Engineering, Atılım University, Ankara, Turkey

Duygu Duru

Corresponding author

Correspondence to Zuhal Kurt .

Editor information

Editors and affiliations.

Manchester Metropolitan University, Manchester, UK

Ali Kashif Bashir

University of Calabria, Rende, Italy

Giancarlo Fortino

Maharaja Agrasen Institute of Technology, New Delhi, Delhi, India

Ashish Khanna

Deepak Gupta


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Kurt, Z., Varyok, E., Ayhan, E.B., Bilgin, M.T., Duru, D. (2022). Insurance Sales Forecast Using Machine Learning Algorithms. In: Bashir, A.K., Fortino, G., Khanna, A., Gupta, D. (eds) Proceedings of International Conference on Computing and Communication Networks. Lecture Notes in Networks and Systems, vol 394. Springer, Singapore. https://doi.org/10.1007/978-981-19-0604-6_3

Download citation

DOI : https://doi.org/10.1007/978-981-19-0604-6_3

Published : 09 July 2022

Publisher Name : Springer, Singapore

Print ISBN : 978-981-19-0603-9

Online ISBN : 978-981-19-0604-6

eBook Packages : Engineering Engineering (R0)




Predicting whether the customer would be interested in buying Vehicle insurance or not.


Health Insurance Cross Sell Prediction

Predicting whether a customer would be interested in buying Vehicle Insurance so that the company can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Table of Contents

  • Problem Statement
  • Data Description

  • 1 Data Wrangling
  • 2 Normalization
  • 4 Encoding categorical values
  • 5 Feature Selection
  • 6 Model Fitting
  • 8 Hyper-parameter Tuning
  • 9 Metrics Evaluation

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. There are multiple factors that play a major role in capturing customers for any insurance policy. Here we have information about demographics such as age, gender, region code, as well as vehicle damage, vehicle age, annual premium, and policy sourcing channel. Based on previous trends, this data analysis and prediction with machine learning models can help us understand which factors drive customers' interest in vehicle insurance and obtain the best classification model.

Our client is an Insurance company that has provided Health Insurance to its customers. Now they need the help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company.

A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.

We have a dataset which contains information about demographics (gender, age, region code type), Vehicles (Vehicle Age, Damage), Policy (Premium, sourcing channel) etc. related to a person who is interested in vehicle insurance. We have 381109 data points available.

Project Outline

1. Data Wrangling


Outlier Treatment

2. Normalization

After outlier treatment, we observed that the values in the numeric columns were of different scales, so we applied the min-max scaler technique for feature scaling and normalization of data.
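The min-max step described above can be sketched in a few lines; in the project itself this was done with scikit-learn's `MinMaxScaler`, so the function below is an illustrative stand-in.

```python
import numpy as np

def min_max_scale(column):
    """Map a numeric column into [0, 1] via (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    return (column - lo) / (hi - lo)
```

After scaling, every numeric column shares the same [0, 1] range, so no single feature dominates purely because of its units.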

3. Exploratory Data Analysis

In Exploratory Data Analysis, we first explored the 4 numerical features: Age, Policy_Sales_Channel, Region_Code, and Vintage. We then categorized age as youngAge, middleAge, and oldAge, and similarly categorized policy_sales_channel and region_code. From this we observed that customers in the youngAge group are less interested in taking vehicle insurance. Similarly, Region_C and Channel_A have the highest number of customers who are not interested in insurance. From the Vehicle_Damage feature, we concluded that customers with vehicle damage are more likely to take vehicle insurance. Likewise, the Annual Premium for customers with a vehicle damage history is higher.

4. Encoding categorical values

We used one-hot encoding for converting the categorical columns such as 'Gender', 'Previously_Insured','Vehicle_Age','Vehicle_Damage', 'Age_Group', 'Policy_Sales_Channel_Categorical', 'Region_Code_Categorical' into numerical values so that our model can understand and extract valuable information from these columns.
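A small sketch of this one-hot encoding with pandas, using two of the columns named above (the toy values are illustrative, not rows from the dataset):

```python
import pandas as pd

# Toy frame with two of the project's categorical columns
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Each category becomes its own 0/1 indicator column,
# e.g. Gender_Male, Gender_Female, Vehicle_Damage_Yes, Vehicle_Damage_No.
encoded = pd.get_dummies(df, columns=["Gender", "Vehicle_Damage"])
```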

5. Feature Selection

At first, we obtained the correlation between numeric features through Kendall’s Rank Correlation to understand their relation. We had two numerical features, i.e. Annual_Premium and Vintage. For categorical features, we tried to see the feature importance through Mutual Information. It measures how much one random variable tells us about another.
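The two checks described above can be sketched as follows, on toy data rather than the real dataset: Kendall's rank correlation via pandas, and mutual information against the Response target via scikit-learn.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Illustrative values only -- not rows from the actual dataset
df = pd.DataFrame({
    "Annual_Premium": [30000, 42000, 28000, 55000, 31000, 47000, 29500, 52000],
    "Vintage":        [100, 210, 150, 90, 120, 200, 110, 95],
    "Response":       [0, 1, 0, 1, 0, 1, 0, 1],
})

# Kendall's rank correlation between the numeric features
kendall = df[["Annual_Premium", "Vintage"]].corr(method="kendall")

# Mutual information of each feature with the target
mi = mutual_info_classif(df[["Annual_Premium", "Vintage"]], df["Response"],
                         random_state=0)
```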

6. Model Fitting

For modeling, we tried various classification algorithms:

i. Decision Tree

Decision Trees are non-parametric supervised learning methods, capable of finding complex non-linear relationships in the data. Decision trees are a type of algorithm that uses a tree-like system of conditional control statements to create the machine learning model. A decision tree observes features of an object and trains a model in the structure of a tree to predict data in the future to produce output. For classification trees, it is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

ii. Gaussian Naive Bayes

Gaussian Naive Bayes is based on Bayes’ Theorem and makes the strong assumption that predictors are independent of each other. For example, should we approve a loan for an applicant based on the applicant’s income, age, previous loans, location, and transaction history? In real-life scenarios it is very unlikely that data points do not interact with each other, but surprisingly Gaussian Naive Bayes often performs well in such situations. This assumption is called class conditional independence.

iii. AdaBoost Classifier

Boosting is a class of ensemble machine learning algorithms that involve combining the predictions from many weak learners. A weak learner is a very simple model, although has some skill on the dataset. Boosting was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost (adaptive boosting) algorithm was the first successful approach for the idea. The AdaBoost algorithm involves using very short (one-level) decision trees as weak learners that are added sequentially to the ensemble. Each subsequent model attempts to correct the predictions made by the model before it in the sequence. This is achieved by weighing the training dataset to put more focus on training examples on which prior models made prediction errors.

iv. Bagging Classifier

A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator is typically used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.

v. LightGBM

Light GBM is a gradient boosting framework that uses tree-based learning algorithms. Light GBM grows trees vertically while other algorithms grow trees horizontally meaning that Light GBM grows trees leaf-wise while other algorithms grow level-wise. It will choose the leaf with max delta loss to grow. When growing the same leaf, a Leaf-wise algorithm can reduce more loss than a level-wise algorithm. Light GBM is prefixed as ‘Light’ because of its high speed. Light GBM can handle the large size of data and takes lower memory to run.

vi. Logistic Regression

Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
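The model-fitting step for the classifiers described above can be sketched as below, on a synthetic dataset rather than the project's data. LightGBM is omitted here because it lives in the separate lightgbm package (as `LGBMClassifier`); everything shown is plain scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the insurance data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "GaussianNB": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its held-out accuracy
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
          for name, model in models.items()}
```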

7. Hyperparameter Tuning

Tuning of hyperparameters is necessary for modeling to obtain better accuracy and to avoid overfitting. In our project, we used the following techniques:

- GridSearchCV

- RandomizedSearchCV

- HalvingRandomSearchCV
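A minimal GridSearchCV sketch (the estimator, toy data, and parameter grid below are illustrative, not the project's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustively evaluate every parameter combination with 3-fold CV
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)
best = search.best_params_  # combination with the highest CV ROC-AUC
```

RandomizedSearchCV follows the same pattern but samples a fixed number of combinations, which is why it is preferred when the grid is large or, as noted under Challenges Faced, when full grid search takes too long.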


8. Metrics Evaluation

To evaluate our models and obtain their accuracy and error rates before and after hyperparameter tuning, we used the following evaluation metrics:

i. Confusion Matrix

ii. Accuracy

iii. Precision

iv. Recall

v. F1-Score

vi. ROC-AUC Score

vii. Log Loss
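The metrics listed above can be computed with scikit-learn as sketched below, on a small hand-made example (the labels and probabilities are illustrative):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, precision_score, recall_score,
                             roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # threshold the probabilities

cm = confusion_matrix(y_true, y_pred)
results = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),  # uses probabilities, not labels
    "log_loss": log_loss(y_true, y_prob),      # uses probabilities, not labels
}
```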

Evaluation Before Tuning


Evaluation After Tuning


Challenges Faced

  • Handling Large dataset.
  • Already available methods of hyper-parameter tuning were taking a huge amount of time to process.
  • Memory Optimization during hyperparameter tuning.

Starting from loading our dataset, we first performed data cleaning and refactoring through outlier detection and normalization of the data. We then covered EDA, feature selection, algorithm selection, and hyperparameter tuning. The accuracy scores obtained for all models were in the range of 68% to 85% before tuning; after tuning, we were able to reach an accuracy of approximately 87%. However, we selected as our best model the one with an accuracy score of 85%, considering precision and recall, because we have an unequal number of observations in each class in our dataset, so accuracy alone can be misleading.

Here is a glimpse of a few of the graphs we plotted; there are many more in the notebook, so please have a look.

  • https://towardsdatascience.com
  • https://www.analyticsvidhya.com
  • https://machinelearningmastery.com

Meet The Team:

Bhwesh Gaur: https://github.com/bhweshgaur
S Sravya Sri : https://github.com/SSravyaSri
Priyanka Dodeja : https://github.com/PriyankaDodeja
Ankit Bansal : https://github.com/ankit986

Health Insurance Cross Sell Prediction


An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called the ‘sum assured’) to the customer.

For this project, we are using a dataset found on Kaggle: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction . The data is about an insurance company that has provided health insurance to its customers in the past year and is now interested in providing vehicle insurance to its policyholders.

The goal is to predict whether an insurance policyholder would be interested in buying vehicle insurance as well. Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company, because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.

The aim of this project is to leverage the machine learning algorithms such as Logistic Regression and Random Forest to create a predictive model using statistically significant variables from the given data set.

Model accuracy will be assessed using different techniques such as ROC (Receiver operating characteristic), AUC (Area under the ROC curve) and Confusion Matrix.


Data Manipulation

  • From the above data, we see that most of the Insurance related details are outreached through policy sales channel 152.
  • From the above data, we can observe that the mean annual premium for damaged and undamaged vehicles is similar.

Data Visualizations

  • There is no significant relation between Age and Annual Premium as per the above graph.

Vehicle damage based on Age

  • The proportion of damaged vs. non-damaged vehicles is lower in the 20-30 age group, whereas for the 45-80 age group the proportion of damaged vehicles is higher than that of non-damaged ones.

Vehicle damage based on Gender

  • The proportion of females who have damaged vehicles is less when compared to that of men.

Response count vs Vehicle Damage

  • From the above graph we can observe that those having a vehicle damage tend to subscribe more for a vehicle insurance.

Response count vs Vehicle Age

  • From the above graph we can observe that the majority of new vehicle owners (vehicle age < 1 year) are not interested in vehicle insurance.
  • Also, a significant number of vehicle owners with vehicle age 1-2 years are not interested in vehicle insurance.
  • Of the group who are interested in vehicle insurance (Response=1), the majority have a vehicle age of 1-2 years.

Creating a dataframe using SQLDF package

Correlation matrix.

  • From the above correlation matrix, we get to know that Response variable has better relationship with Vehicle Damage and Vehicle Age.

Logistic Regression

Data division into train and test; fitting logistic regression on all the variables.

The p-values for id, Region Code, and Vintage are not statistically significant, so we remove these variables and then refit the logistic regression.

Histogram of predicted probability

  • We can see from the above histogram that the predicted probability threshold for predicting whether an individual will buy a vehicle insurance needs to be less than 0.50, as no predicted probability is greater than ~0.45. For this case study, we will predict that any individual with a predicted response probability greater than 0.25 is predicted as a buyer.

Confusion Matrix

Checking the accuracy.

  • The accuracy of the logistic regression model we used came out to be 78.33%.

We explored other classification algorithms to compare the accuracy of our logistic regression model. Therefore, we will create another classification model using Random Forest and then compare these 2 models in order to select the better one for our problem statement.

Random Forest

Confusion Matrix for Random Forest

Accuracy for Random Forest

  • The Random Forest model accuracy came out to be 87.66%, which is more than the 78.33% obtained from Logistic Regression.

ROC and AUC Statistics

In the above logistic regression, we found 0.25 to be the probability threshold for determining the classification by mere visual inspection. To decide the probability threshold in a more principled way, we additionally used ROC and AUC curves.

  • On studying the ROC curve, we could strike a trade-off between the TP rate and FP rate at the part of the curve where the TP rate is between 80 and 100.
  • On investigating the ROC + AUC augmented graph, we observed that the AUC is higher for Logistic Regression, implying that it outperformed Random Forest on the given sample for this classification problem. However, this does not mean that Logistic Regression is the optimal model for our problem, since our dataset is highly imbalanced.

Hence, we are using SMOTE, which can overcome the challenge of class imbalance, and then we will assess the models again on the ROC AUC curve.

SMOTE (Synthetic Minority Oversampling Technique) is used in the above data set to balance out the highly skewed proportions of 1s and 0s. Proportion for 1s was less than 15% in our dataset, hence we have up-sampled the # of 1s and down-sampled the # of 0s in order to achieve a balanced dataset on which classification can be performed.

Post application of SMOTE, we achieved a 60:40 ratio of 0s and 1s respectively.

Logistic regression model for balanced dataset

  • Now we will use the updated balanced dataset, which we obtained from running SMOTE on our original dataset, to create test and train sets. These updated train and test datasets will be further utilized in creating updated classification models.
  • To check the probability threshold using which we can classify our data, we will create a prediction using type = “response”, which will create the probabilities instead of actual predicted values. Then we create a histogram and check which threshold we can take. As observed from the below histogram, we have taken 0.5 as the probability threshold.
  • Now we will create a confusion matrix to assess the accuracy of our updated logistic regression model.

Random Forest after applying SMOTE

  • To assess whether our updated model gives better results, we create the ROC AUC curve and check the area under the curve. As we can observe, the AUC for Random Forest is greater than the AUC for Logistic Regression. Therefore, we can conclude that the Random Forest model classifies our dataset more accurately.
  • We will now export the results to an external file i.e. the predicted values whether the customer would take vehicle insurance or not.

The aim of our project was to classify whether a customer will take vehicle insurance or not. We used a dataset from Kaggle and performed exploratory data analysis for descriptive statistics.

For classification, we created 2 classification models i.e. Logistic Regression and Random Forest and calculated accuracy and AUC statistics. We observed that Logistic Regression outperformed Random Forest. However, this result was due to the fact that our dataset was highly imbalanced with proportion of 1s in the response variable less than 15%.

To mitigate the impact of imbalanced dataset, we used SMOTE technique to achieve a 60:40 ratio in the response variable. We again created both classification models and assessed their accuracy. In this case, Random Forest outperformed Logistic Regression based on its accuracy and AUC score.

  • The above table summarizes the model statistics before and after SMOTE for both the classification models. We can conclude that overall, Random Forest is a better model at classifying our dataset.

ROC and AUC for assessing classification model performance at different probability thresholds

ROC (Receiver Operating Characteristic Curve)

An ROC curve (receiver operating characteristic curve) is a graph, which shows the performance of a classification model at all classification thresholds (probability at which dependent variable is classified).

ROC curve plots two parameters:

  • True Positive Rate
  • False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

\[TPR = \frac{TP}{TP + FN}\]

False Positive Rate (FPR) is defined as follows:

\[FPR = \frac{FP}{FP + TN}\]

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve -
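The TPR and FPR formulas above can be computed directly from counts at a chosen threshold; the plain-Python sketch below (independent of the R implementation discussed later) produces one (TPR, FPR) point of the ROC curve per threshold.

```python
def roc_point(y_true, y_prob, threshold):
    """Return (TPR, FPR) for binary labels y_true at the given probability threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    tn = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p < threshold)
    tpr = tp / (tp + fn)  # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn)  # FPR = FP / (FP + TN)
    return tpr, fpr
```

Sweeping the threshold from 1 down to 0 and collecting these points traces the ROC curve; at threshold 0 every item is classified positive, giving the (1, 1) corner, which matches the statement above that lowering the threshold increases both true and false positives.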

AUC (Area under curve)

AUC represents the measure of separability, which tells how much the classification model is capable of distinguishing between classes. Generally, higher the AUC, the better the model at predicting 1s as 1s and 0s as 0s.

An excellent model has AUC near to the 1 which means it has a good measure of separability. A poor model has AUC near to the 0 which means it has the worst measure of separability. For example - When AUC is 0.9, it means there is a 90% chance that the model will be able to distinguish between positive class and negative class.

To visually represent AUC, it is the area under the ROC curve. We can plot multiple ROC & AUC curve on same axis to compare the performance of different models.

Implementation of ROC and AUC in R

To implement ROC & AUC in R, we can use roc function in R which is present in the pROC library (part of pROC package).

roc function in R outputs the value of AUC as well and plots the curve in case plot attribute is equal to TRUE. We can also print the value of AUC on the curve itself and also plot multiple models on a single curve to compare the model performance.

Example of R code for ROC -

Decision Trees


  • Decision Tree is a predictive modelling tool that is constructed via an algorithmic approach and identifies ways to split a data set based on different conditions. It is one of the most widely used and practical methods for supervised learning. Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Deciding nodes for Decision Trees:

For generating best possible decision trees, the variables which act as nodes in the decision trees are decided by calculating the Information Gain of each independent variable of the dataset.

To define Information Gain precisely, we need to define a measure commonly used in information theory called entropy that measures the level of impurity in a group of examples. Mathematically, it is defined as:

\[Entropy(S) = \sum_{i = 1}^{n} -p_i \log_2(p_i)\]

where \(p_i\) is the probability of class \(i\). The following figure shows the form of the entropy function relative to a boolean classification as \(p_+\) varies between 0 and 1.

  • Now, given entropy as a measure of the impurity in a sample of training examples, we can now define information gain as a measure of the effectiveness of an attribute in classifying the training data. Information gain, Gain (S, A) of an attribute A, relative to a sample of examples S, is defined as:

\[Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)\]

where Values(A) is the set of all possible values for attribute A, and \(S_v\) is the subset of S for which attribute A has value v. Note that the first term in the equation is just the entropy of the original sample S, and the second term is the expected value of the entropy after S is partitioned using attribute A, i.e. the entropy of its children. The expected entropy described by this second term is simply the sum of the entropies of each subset \(S_v\), weighted by the fraction of examples \(\frac{|S_v|}{|S|}\) that belong to \(S_v\). Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.

In short: \[\text{Information Gain} = \text{Entropy(parent node)} - \text{Average Entropy(children)}\]
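The entropy and information-gain definitions above translate directly into code; the sketch below is a minimal Python rendering of those two formulas.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    expected_child_entropy = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        expected_child_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - expected_child_entropy
```

An attribute that splits the sample into pure subsets yields the maximum gain (the full parent entropy), while an attribute that leaves each subset as mixed as the parent yields a gain of zero.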

Random Forests

  • Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction).

While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction. So the prerequisites for random forest to perform well are:

  • There needs to be some actual signal in our features so that models built using those features do better than random guessing.
  • The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.

R code example

RFModel <- randomForest(Response ~ ., data = dataset)

Here the ~ . regresses the Response variable on all the predictor variables in the dataset.

SMOTE for imbalanced classification

  • SMOTE stands for Synthetic Minority Oversampling Technique

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

This procedure can be used to create as many synthetic examples for the minority class as are required. A common recipe is to first use random undersampling to trim the number of examples in the majority class, and then use SMOTE to oversample the minority class to balance the class distribution.

Example of SMOTE in R:

BalancedData <- SMOTE(Response ~ ., trainSplit, perc.over = 100, perc.under = 100)
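The interpolation step described above (pick a minority point a, pick one of its k nearest minority neighbours b, and emit a convex combination of a and b) can be sketched in pure NumPy. This is an illustrative sketch of the mechanism, not a replacement for a full SMOTE implementation such as the one in R's DMwR/smotefamily packages or Python's imbalanced-learn.

```python
import numpy as np

def smote_samples(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        a = minority[i]
        dist = np.linalg.norm(minority - a, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # skip a itself
        b = minority[rng.choice(neighbours)]
        lam = rng.random()                       # position along the segment a-b
        synthetic.append(a + lam * (b - a))      # convex combination of a and b
    return np.array(synthetic)
```

Because each synthetic point lies on the line segment between two existing minority points, the new samples stay inside the minority class's region of the feature space rather than being arbitrary noise.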

Cellular Automata (CA) Papers

by Alvy Ray Smith

Cellular Automata

Simple Nontrivial Self-Reproducing Machines

Pattern Recognition by Finite Cellular Automata

The Queen Bee Theorem: A Desynchronization Theorem for Cellular Automata

Introduction and Survey of Cellular Automata and Polyautomata Theory

Real-Time Language Recognition by One-Dimensional Cellular Automata

Existence of Finite Automata Capable of Being Synchronized although Arbitrarily Connected and Numerous

Two-Dimensional Formal Languages and Pattern Recognition by Cellular Automata

Simple Computation-Universal Cellular Spaces

Cellular Automata Complexity Trade-Offs

General Shift-Register Sequences of Arbitrary Cycle Length

Cellular Automata and Formal Languages

Cellular Automata Theory

Simple Computation-Universal Cellular Spaces and Self-Reproduction

IEEE FOCS = the Symposium on the Foundations of Computer Science of the Institute of Electrical and Electronics Engineers, formerly known as the Switching and Automata Theory Symposium (SWAT). My proudest contribution to this proceedings is a cover design used for decades.

Encyclopedia of Computer Science , 4th edition, edited by Anthony Ralston and Edwin D Reilly, International Thomson Publishing, Jan 1993. I wrote the original entry for the 1st edition and have updated it for each subsequent edition, 1976, 1983, 1993.


Accepted for publication by Journal of Computer and System Sciences , c1974, but never completed (I changed careers to computer graphics instead, and this paper depended on the Queen Bee paper which was rejected for publication). The principal results of these two papers eventually appeared in the book Picture Languages: Formal Models for Picture Recognition , edited by Azriel Rosenfeld, Chapters 3 and 5, Academic Press, NY, 1979.

Abstract. The class of pattern sets accepted by cellular automata (CA, finite, connected subsets of cells in a 2-dimensional cellular space) is shown to be precisely the class of languages generated by the monotonic array grammars, a generalization of context-sensitive grammars to the 2-dimensional integer grid. The relationships of rectangular CA and simply-connected CA to this larger class of CA are established. The sets of rectangular patterns and simply-connected patterns are shown to be recognized within perimeter time by deterministic CA. The set of simple closed curves, the set of well-nested simple closed curves, the majority predicate, and the set of patterns with positive Euler number are shown to be recognized within diameter time by deterministic rectangular CA. The Queen Bee Theorem is used to show the equivalence of the isotonic and the monotonic array grammars and to prove the equivalence of two modes of pattern recognition. A standard derivation form, the northeast derivation form, is established for monotonic array grammars.

Abstract. Any cellular automaton (CA) can desynchronize. That is, any finite connected collection of identical finite-state machines (cells), each lying at a point in 2-dimensional integer space and interconnected with at most four of its nearest neighbors, can desynchronize in the sense that there is a cell design such that a CA of these cells, all of which are initially in the same state (synchronized), can find and distinguish, by local operations only, one and only one of its members. Furthermore, the desynchronization occurs within perimeter time, and the cell design is independent of the size or shape of the CA.

Rejected for publishing (but see above)

Artificial Life II, Santa Fe Institute Studies in the Sciences of Complexity, Vol X, edited by C G Langton, C Taylor, J D Farmer, and S Rasmussen, Addison-Wesley, 1991, 709-725 (Proceedings of the ALIFE2 (Artificial Life 2) Conference, Santa Fe, New Mexico, Feb 1990)

Abstract. A simple and brief proof of the existence of nontrivial self-reproducing machines, as cellular automata (CA) configurations, is presented, which relies only on computation universality. Earlier proofs are book length and rely on "construction universality." Furthermore, simple CA are shown to support nontrivial self-reproduction, and hence to be simultaneously simple and nontrivial. Nontriviality is guaranteed by the requirement that the machine which reproduces itself is also a universal computer. Biological relevance (or non-relevance) is also briefly discussed, as is trivial self-reproduction, called self-replication.

in Automata, Languages, Development , edited by A Lindenmayer and G Rozenberg, North-Holland Publishing Company, 1976, 405-422.

This is an extensive survey and bibliography of the field of CA (I was calling them "polyautomata" at the time) up to 1975. I added the words "Cellular Automata and" to the original title when I transcribed the original article into online form.

It was originally written as the introduction to the German edition of Theory of Self-Reproducing Automata, by John von Neumann, edited (posthumously) by Arthur W Burks, University of Illinois Press, Urbana, 1968. Von Neumann performed the work, completed by Burks, in 1952-53. The German publishers never published the German edition, but gave me permission to publish my survey in the book listed above, the proceedings of a conference held in Noordwijkerhout, The Netherlands, Apr 1975. I like to call this conference ALife0 - for Artificial Life 0 conference - since it was the first attempt I know of at cross-fertilization between biologists and computer scientists. Many of the players at this conference were present for ALife1, ALife2, etc - cf Simple Nontrivial Self-Reproducing Machines. Other participants at ALife0 were Karel Culik, Pauline Hogeweg, John Holland, Aristid Lindenmayer, and Stanislaw Ulam.

Journal of Computer and System Sciences , Vol 6, No 3, 233-253, Jun 1972

Abstract. Pattern recognition by parallel devices is investigated by studying the formal language recognition capabilities of 1-dimensional cellular automata (CA). The precise relationships of CA to iterative automata and to Turing machines are established: In both cases, CA are inherently faster. The relationship of context-free languages to the languages recognized in real time by bounded CA is detailed. In particular, nondeterministic bounded CA can recognize the context-free languages in real time. The deterministic case remains open, but many partial results are derived. Finally, closure properties and CA transformation lemmas are presented.

By Pierre Rosenstiehl, translated from the French by Alvy Ray Smith at New York University, Nov 1971 (unauthorized). A freeform translation of the title might be The Firing Squad Synchronization Problem for Cellular Automata on a Graph .

Paper originally published in French as Existence d'automates finis capables de s'accorder bien qu'arbitrairement connectes et nombreux, ICC Bulletin, 1966, Vol 5, 245-261.

Abstract. A collection of n finite, identical automata are considered, where each one, at each unit time step, takes a new state as a function of the state taken at the preceding step by itself and by certain other automata in the collection, called its neighbors, arbitrarily chosen, but limited in number. The neighborhood relation is assumed symmetric. One is asked to determine for a given d , and independently of n , the number of states and the state-transition function of an automaton of the type which has the following property: All automata in the collection are put in the resting state except for one of them, A' , arbitrarily chosen, put into an initial state distinct from the resting state; then at the end of a finite length of time all automata in the collection depending, by the neighborhood relation, directly or indirectly on A' , are put simultaneously and for the first time into a final state agreed upon in advance. It is said then that our arbitrarily numerous automata are connected into an arbitrary network of automata of degree d , and that those in the component connected to A' have been synchronized.

12th IEEE FOCS Conference Record, 144-152, Oct 1971

Abstract. A formal study of pattern recognition capabilities of cellular automata (CA) is undertaken based on a class of recently introduced grammars for two dimensions, the array grammars, which can be thought of as the 2-dimensional generalization of context-sensitive grammars. The class of languages (patterns) generated by array grammars is shown to be precisely the class of languages accepted by CA forming rook-connected finite subsets of the plane. Thus the usual generalization to rectangular array-bounded CA is a special case of this class of machines. The concept of perimeter time is introduced as a natural measure of computing speeds for 2-dimensional CA, and connectedness and convexity are related to this measure. The class of patterns with positive Euler number is shown to be linear-time recognizable by rectangular array-bounded CA, thus solving an open problem of Beyer.

See also Pattern Recognition by Finite Cellular Automata

Journal of the Association for Computing Machinery , Vol 18, No 3, 339-353, Jul 1971

Abstract. The specialization of the theory of cellular spaces (cellular automata (CA)) to those spaces which compute partial recursive functions is presented. Neighborhood reduction and state-set reduction are shown to be particularly simple in this special theory, and one dimension is proved to be sufficient for computation universality. Several computation-universal CA (CUCA) are exhibited which are simple in the sense that each cell has only a small number q of states and a small number p of neighbors. For example, a 1-dimensional CUCA with pq = 36 is presented. Two quite different proofs of the existence of a 1-dimensional CUCA with only two neighbors are given. Finally, one of the theorems derived is used to settle three open decidability questions.

Information and Control , Vol 18, No 5, pp 466-482, Jun 1971

Abstract. The general theory of cellular automata (CA) is investigated with special attention to structural complexity. In particular, simulation of CA by CA is used to make explicit trade-off relationships between neighborhood size and state-set cardinality. A minimum neighborhood template with d +1 elements is established for the class of d -dimensional CA. The minimum state set for this class is shown to be the binary state set. The temporal costs, if any, of structural complexity trade-offs are also studied. It is demonstrated that any linear time cost can be eliminated and, in fact, a speed-up by arbitrary positive integer factor k can be attained at an increased structural cost.

IEEE Transactions on Computers , Vol C-20, No 4, 456-459, Apr 1971

Abstract. An r -ary shift-register sequence is desired that has arbitrary cycle length L <= r k for arbitrary r and k , where k is the number of stages (degree) of the shift register. The existence of such sequences is established for "almost all" cycle lengths L . Furthermore, existence of such sequences which are "zero free" for almost all cycle lengths L is proved.

11th IEEE FOCS Conference Record, 216-224, Oct 1970

Abstract. A set of equivalences is established among cellular automata (CA), iterative acceptors, and linear-bounded automata. However, CA are shown to be inherently faster than iterative acceptors. Many positive results are presented to indicate that the context-free languages can, perhaps, be accepted in time n and space n by CA, where n is the length of the CA.

See also Real-Time Language Recognition by One-Dimensional Cellular Automata

Technical Report No 2, Digital Systems Laboratory, Stanford Electronics Laboratories, Stanford University, Stanford, CA, Dec 1969

My PhD dissertation, supervised by Prof Michael Arbib at Stanford. There are several bugs in this report, fixed in the derivative publications Cellular Automata Complexity Trade-Offs , Simple Computation-Universal Cellular Spaces , Simple Nontrivial Self-Reproducing Machines , General Shift-Register Sequences of Arbitrary Cycle Length , and Real-Time Language Recognition by One-Dimensional Cellular Automata

9th IEEE FOCS Conference Record, 269-277, Oct 1968

Abstract. Cellular spaces (aka cellular automata (CA)) computationally equivalent to any given Turing machine are exhibited which are simple in the sense that each cell has only a small number of states and a small neighborhood. Neighborhood reduction theorems are derived in this interest, and several simple computation-universal CA are presented. Conditions for computation-universality of a CA are investigated, and, in particular, the conjecture that unbounded but boundable propagation in a space is a sufficient condition is refuted. Finally, the computation-universal CA derived in the study are used to introduce, via recursive function theory, examples of simple self-reproducing universal Turing machine configurations in one and two dimensions.

See also Cellular Automata Complexity Trade-Offs , Simple Computation-Universal Cellular Spaces , Simple Nontrivial Self-Reproducing Machines , and Real-Time Language Recognition by One-Dimensional Cellular Automata

  • O18 - Urban, Rural, Regional, and Transportation Analysis; Housing; Infrastructure
  • O19 - International Linkages to Development; Role of International Organizations
  • Browse content in O2 - Development Planning and Policy
  • O20 - General
  • O21 - Planning Models; Planning Policy
  • O23 - Fiscal and Monetary Policy in Development
  • O25 - Industrial Policy
  • Browse content in O3 - Innovation; Research and Development; Technological Change; Intellectual Property Rights
  • O30 - General
  • O31 - Innovation and Invention: Processes and Incentives
  • O32 - Management of Technological Innovation and R&D
  • O33 - Technological Change: Choices and Consequences; Diffusion Processes
  • O34 - Intellectual Property and Intellectual Capital
  • O35 - Social Innovation
  • O38 - Government Policy
  • O39 - Other
  • Browse content in O4 - Economic Growth and Aggregate Productivity
  • O40 - General
  • O43 - Institutions and Growth
  • O44 - Environment and Growth
  • O47 - Empirical Studies of Economic Growth; Aggregate Productivity; Cross-Country Output Convergence
  • O49 - Other
  • Browse content in O5 - Economywide Country Studies
  • O51 - U.S.; Canada
  • O52 - Europe
  • O53 - Asia including Middle East
  • O54 - Latin America; Caribbean
  • O55 - Africa
  • O57 - Comparative Studies of Countries
  • Browse content in P - Economic Systems
  • Browse content in P0 - General
  • P00 - General
  • Browse content in P1 - Capitalist Systems
  • P10 - General
  • P14 - Property Rights
  • P16 - Political Economy
  • Browse content in P2 - Socialist Systems and Transitional Economies
  • P21 - Planning, Coordination, and Reform
  • P23 - Factor and Product Markets; Industry Studies; Population
  • Browse content in P3 - Socialist Institutions and Their Transitions
  • P31 - Socialist Enterprises and Their Transitions
  • Browse content in P5 - Comparative Economic Systems
  • P51 - Comparative Analysis of Economic Systems
  • Browse content in Q - Agricultural and Natural Resource Economics; Environmental and Ecological Economics
  • Browse content in Q0 - General
  • Q00 - General
  • Browse content in Q1 - Agriculture
  • Q16 - R&D; Agricultural Technology; Biofuels; Agricultural Extension Services
  • Browse content in Q2 - Renewable Resources and Conservation
  • Q20 - General
  • Q21 - Demand and Supply; Prices
  • Browse content in Q3 - Nonrenewable Resources and Conservation
  • Q39 - Other
  • Browse content in Q4 - Energy
  • Q40 - General
  • Q42 - Alternative Energy Sources
  • Q48 - Government Policy
  • Browse content in Q5 - Environmental Economics
  • Q50 - General
  • Q52 - Pollution Control Adoption Costs; Distributional Effects; Employment Effects
  • Q53 - Air Pollution; Water Pollution; Noise; Hazardous Waste; Solid Waste; Recycling
  • Q55 - Technological Innovation
  • Q56 - Environment and Development; Environment and Trade; Sustainability; Environmental Accounts and Accounting; Environmental Equity; Population Growth
  • Q58 - Government Policy
  • Browse content in R - Urban, Rural, Regional, Real Estate, and Transportation Economics
  • Browse content in R0 - General
  • R00 - General
  • Browse content in R1 - General Regional Economics
  • R10 - General
  • R11 - Regional Economic Activity: Growth, Development, Environmental Issues, and Changes
  • R12 - Size and Spatial Distributions of Regional Economic Activity
  • R13 - General Equilibrium and Welfare Economic Analysis of Regional Economies
  • R15 - Econometric and Input-Output Models; Other Models
  • R19 - Other
  • Browse content in R2 - Household Analysis
  • R21 - Housing Demand
  • R23 - Regional Migration; Regional Labor Markets; Population; Neighborhood Characteristics
  • Browse content in R3 - Real Estate Markets, Spatial Production Analysis, and Firm Location
  • R31 - Housing Supply and Markets
  • R32 - Other Spatial Production and Pricing Analysis
  • R38 - Government Policy
  • Browse content in R4 - Transportation Economics
  • R41 - Transportation: Demand, Supply, and Congestion; Travel Time; Safety and Accidents; Transportation Noise
  • Browse content in R5 - Regional Government Analysis
  • R58 - Regional Development Planning and Policy
  • Browse content in Z - Other Special Topics
  • Browse content in Z1 - Cultural Economics; Economic Sociology; Economic Anthropology
  • Z11 - Economics of the Arts and Literature
  • Z13 - Economic Sociology; Economic Anthropology; Social and Economic Stratification
  • Advance articles
  • Author Guidelines
  • Submission Online
  • Submission Site
  • Open Access
  • Announcements
  • About Industrial and Corporate Change
  • Editorial Board
  • Advertising and Corporate Services
  • Journals Career Network
  • Self-Archiving Policy
  • Dispatch Dates
  • Journals on Oxford Academic
  • Books on Oxford Academic

The Silicon Valley–Hsinchu Connection: Technical Communities and Industrial Upgrading

Annalee Saxenian, Jinn‐Yuh Hsu, The Silicon Valley–Hsinchu Connection: Technical Communities and Industrial Upgrading, Industrial and Corporate Change , Volume 10, Issue 4, 1 December 2001, Pages 893–920, https://doi.org/10.1093/icc/10.4.893


Silicon Valley in California and the Hsinchu‐Taipei region of Taiwan are among the most frequently cited ‘miracles’ of the information technology era. The dominant accounts of these successes treat them in isolation, focusing either on free markets, multinationals or the role of the state. This paper argues that the dynamism of these regional economies is attributable to their increasing interdependencies. A community of US‐educated Taiwanese engineers has coordinated a decentralized process of reciprocal industrial upgrading by transferring capital, skill and know‐how to Taiwan, and by facilitating collaborations between specialist producers in the two regions. This case underscores the significance of technical communities and their institutions in transferring technology and organizing production at the global as well as the local level.



  • Online ISSN 1464-3650
  • Print ISSN 0960-6491
  • Copyright © 2024 Oxford University Press and the Associazione ICC

  1. GitHub

    health insurance cross sell prediction research paper

  2. Predicting Health Insurance Cross-Selling




  1. Health Insurance Cost Prediction using Machine Learning

    Abstract: This paper presents a machine learning-based health insurance prediction system. Health insurance has become one of the most prominent areas of research since the COVID-19 pandemic, and many recent attempts have been made to solve this problem. The study uses the US Medical Cost Personal dataset from Kaggle, which has 1338 entries.

  2. A Novel Approach for Cross-Selling Insurance Products Using Positive

    Abstract: Successful cross-selling of products is a key goal of companies operating within the insurance industry. Choosing the right customer to approach for cross-purchase opportunities has a direct effect on both decreasing customer churn rate and increasing revenue.

  3. Machine Learning-Based Regression Framework to Predict Health Insurance

    This research trained and evaluated an artificial neural network-based regression model to predict health insurance premiums, estimating the cost incurred by individuals on the basis of their features.
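The snippet above does not reproduce the paper's architecture; as a rough, hedged illustration of the regression framing, the sketch below fits an ordinary-least-squares model on synthetic data with hypothetical features (age, BMI, smoker flag) loosely mirroring the Kaggle insurance premium dataset. The coefficients and noise level are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features loosely mirroring the Kaggle insurance dataset
age = rng.integers(18, 65, n).astype(float)
bmi = rng.normal(28, 5, n)
smoker = rng.integers(0, 2, n).astype(float)

# Synthetic premiums: linear in the features plus noise (invented coefficients)
charges = 250 * age + 320 * bmi + 23000 * smoker + rng.normal(0, 2000, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), age, bmi, smoker])
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)

# Goodness of fit on the training data
pred = X @ coef
r2 = 1 - np.sum((charges - pred) ** 2) / np.sum((charges - charges.mean()) ** 2)
print(f"smoker coefficient ~ {coef[3]:.0f}, R^2 = {r2:.3f}")
```

On real data one would hold out a test split and compare several regressors (linear models, tree ensembles, neural networks), but the premium-as-regression-target framing is the same.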

  4. Analysing Health Insurance Customer Dataset to Determine Cross-Selling

    The paper proceeds by first providing a review of related literature in health insurance cross-selling prediction. It then outlines the research methodology employed which includes data collection, data preprocessing, exploratory data analysis, and feature selection.

  5. Predict Health Insurance Cost by using Machine Learning and DNN

    Predict Health Insurance Cost by using Machine Learning and DNN Regression Models. International Journal of Innovative Technology and Exploring Engineering. Authors: Mohamed Hanafy (Assiut ...)

  6. PDF A Novel Approach for Cross-Selling Insurance Products Using Positive

    Modelling of past customer data can often be used to optimise the cross-sell recommendation process. This is rarely a straightforward process, however. One issue is that much of the data is in the form of question-response text, which needs to be processed in an interpretable and effective manner.

  7. Insurance Sales Forecast Using Machine Learning Algorithms

    In this paper, the Health Insurance Cross-Sell Prediction dataset is used.

  8. Fairness metrics for the Health Insurance cross-sell prediction dataset

    Fairness metrics for the Health Insurance cross-sell prediction dataset (scientific diagram, available for download).

  9. Insurance Sales Forecast Using Machine Learning Algorithms

    Introduction: The objective of this paper is to classify customers by their probability and desire to buy insurance, based on their personal information, personal preferences, and data on their owned cars.

  10. Health Insurance Cross Sell Prediction

    Health Insurance Cross Sell Prediction


  11. Health Insurance Cross Sell Prediction

    Predicting whether a customer would be interested in buying vehicle insurance, so that the company can plan its communication strategy to reach out to those customers and optimise its business model and revenue. Table of contents: Abstract, Problem Statement, Data Description, Project Outline.

  12. Prediction of Health Insurance Cross-Selling Through Binary Logical

    Abstract Cross-selling is typically a marketing strategy used to identify a customer's various wants and then sell a range of related services or products by attending to those demands. Cross-marketing refers to a variety of tactics and procedures used to encourage cross-selling.
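Binary logistic regression of the kind this title suggests can be sketched in a few lines of plain NumPy. The features below (standardized age, a prior-vehicle-damage flag, standardized annual premium) are hypothetical stand-ins for the cross-sell dataset's fields, and the labels are synthetic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 1000

# Hypothetical cross-sell features: intercept, standardized age,
# prior vehicle damage flag, standardized annual premium
X = np.column_stack([
    np.ones(n),
    rng.normal(0, 1, n),
    rng.integers(0, 2, n).astype(float),
    rng.normal(0, 1, n),
])
true_w = np.array([-1.0, 0.8, 2.0, -0.5])
y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)  # synthetic "interested" label

# Fit by plain gradient descent on the logistic loss
w = np.zeros(4)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w -= 0.5 * grad

acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"training accuracy = {acc:.2f}")
```

The fitted weight on the vehicle-damage flag recovers its positive effect on purchase interest; in practice one would evaluate on held-out data and consider class imbalance, which is pronounced in the cross-sell dataset.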

  13. Clustering Application for Data-Driven Prediction of Health Insurance

    Clustering Application for Data-Driven Prediction of Health Insurance Premiums for People of Different Ages. January 2021. Authors: Tallal Omar, Mohamed Zohdy (Oakland ...)

  14. Health Insurance Cross Sell Prediction

    Health Insurance Cross Sell Prediction 27/02/2021. Introduction. An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an ...

  15. PDF Health Insurance Cross Sell Prediction

    Health Insurance Cross Sell Prediction. Lucky Jain, Debashish Das, Vivek Katolkar (Alma Better). Abstract: The dataset contains data on demographics (gender, age, region code type), vehicles (vehicle age, damage), policies (premium, sourcing channel), and so on. We predicted that the customer who has medical insurance ...

  16. Health Insurance Cross Sell Prediction-XGBOOST

    A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee. For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if God forbid, you fall ill and need to be hospitalized in that year, the insurance provider company will bear the cost of ...


  17. PDF Book Review (case study research design)

    ... research questions, the propositions, the analyses of the data that will take place, the logic plan which links the data to the propositions, and the criteria for analysis. Yin points out that a major goal at this stage of the case study is to identify the theory that will be examined in the case study, and to list rival ...

  18. Health Insurance Cross Sell Prediction

    Predict Health Insurance Owners' who will be interested in Vehicle Insurance

  19. PDF Book Review Creswell, J. W. (2014). Research Design: Qualitative ...

    of research for better understanding of this book. There are two parts of the book. Part 1 (chapter 1-4) consists of steps for developing research proposal and part II (chapter 5-10) explains how to develop a research proposal or write a research report. A summary is given at the end of every chapter that helps the reader to recapitulate the ideas.

  20. Alvy Ray Smith Cellular Automata Papers

    The principal results of these two papers eventually appeared in the book Picture Languages: Formal Models for Picture Recognition, edited by Azriel Rosenfeld, Chapters 3 and 5 ... since it was the first attempt I know to cross-fertilization between biologists and computer scientists. Many of the players at this conference were ...

  21. Silicon Valley-Hsinchu Connection: Technical Communities and Industrial

    This paper argues that the dynamism of these regional economies is attributable to their increasing interdependencies. A community of US‐educated Taiwanese engineers has coordinated a decentralized process of reciprocal industrial upgrading by transferring capital, skill and know‐how to Taiwan, and by facilitating collaborations between ...

  22. Cross-Sell Prediction Using Machine Learning in Python

    Shipra Saxena, updated September 10, 2022. Objective: understand what cross-selling is, using vehicle insurance data, and learn how to build a model for cross-sell prediction.

  23. Machine Learning-Based Regression Framework to Predict Health Insurance

    Artificial intelligence (AI) and machine learning (ML) in healthcare aim to make people's lives easier by anticipating and diagnosing diseases more swiftly than most medical experts. There is a direct link between the insurer and the policyholder when the distance between an insurance business and the consumer is reduced to zero through the use of technology, especially digital ...