5 Data Science Applications in the Life Insurance Industry

How data science offers opportunities to the life insurance value chain

Jin Cui
Towards AI


Photo by dominik hofbauer on Unsplash

Background

After a bit of searching, I was surprised at first by how little has been written about data science applications in the life insurance industry on the Medium platform. On reflection, though, this makes sense: there are two main challenges facing life insurers in embedding data science applications, namely:

  1. The data collected has been scarce and may not support credible analysis
  2. Low buy-in from senior management (i.e. little ‘top-down’ drive)

With respect to 1), it is worth clarifying that the term Data Science encompasses Data Transformation, Data Visualisation, Predictive Analytics, Machine Learning, and other AI disciplines. That is, Data Science is not restricted to advanced machine learning techniques such as Deep Learning, which require large amounts of data to perform well.

With respect to 2), one of the best ways to obtain buy-in from senior management is to showcase use cases where data science has demonstrably added value to specific business problems. This article aims to do exactly that, by setting out 5 data science applications in the life insurance industry.

1. Spatial Visualisation & Clustering

The socio-economic status of a region may have implications for life insurance outcomes. For example, there are hypotheses that:

  • Policyholders who live in a relatively more affluent area have a lower propensity to claim due to healthier lifestyles.
  • Policyholders who live in a relatively less affluent area have a higher propensity to discontinue their policies due to lower affordability.

Data science techniques enable life insurers to quantify as well as visualise the socio-economic status of a particular geographic area. By way of example:

  • Socio-Economic Indexes for Areas (“SEIFA”) is a score calibrated by the Australian Bureau of Statistics (“ABS”) to rank areas in Australia by relative socio-economic advantage and disadvantage, using the dimension-reduction technique Principal Component Analysis (“PCA”). At a high level, SEIFA takes in information sourced from population Census data representing the characteristics of a geographic area. Examples include Education variables such as the % of people with a degree, Occupation variables such as the % of occupations classified as professional, and Housing variables such as the % of properties with fewer than one bedroom. These variables (along with others covering Employment and Infrastructure) are combined into principal components, and the first principal component is collapsed into a single score. This is discussed in detail in this technical paper published by the ABS, and a minimal sketch of the approach is given after the list below.
  • This article shows a visualisation use case where discontinuance rates by geographic area are plotted using the Python library Folium. The short video below demonstrates how the discontinuance rate can be shown interactively at a locality of interest.
Video 1: Interactive map demo. Video by author
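To make the SEIFA mechanics concrete, here is a minimal sketch of deriving a SEIFA-style score with scikit-learn. The census-style variables and area names are hypothetical, and the actual ABS methodology involves considerably more variable selection and validation than shown here.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical census-style variables by geographic area (illustrative only)
census = pd.DataFrame({
    "pct_degree": [0.35, 0.12, 0.28, 0.08],
    "pct_professional": [0.40, 0.15, 0.30, 0.10],
    "pct_low_income": [0.10, 0.45, 0.20, 0.55],
}, index=["Area A", "Area B", "Area C", "Area D"])

# Standardise so each variable contributes on a comparable scale
scaled = StandardScaler().fit_transform(census)

# The first principal component captures the dominant socio-economic gradient;
# note its sign is arbitrary and may need flipping to read as 'advantage'
scores = PCA(n_components=1).fit_transform(scaled).ravel()

# Rescale to a SEIFA-like index (the ABS convention is mean 1000, sd 100)
seifa_like = 1000 + 100 * (scores - scores.mean()) / scores.std()
print(pd.Series(seifa_like, index=census.index).sort_values(ascending=False))
```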

2. Natural Language Processing

There are numerous applications of Natural Language Processing (“NLP”) across the life insurance value chain. One of these, discussed in this article, relates to automatically grouping free-text claim causes or occupation descriptions.
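As a flavour of how such grouping could work, below is a minimal sketch using TF-IDF features and k-means clustering from scikit-learn on hypothetical free-text claim causes. The referenced article may use a different technique; this is an illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical free-text claim causes as captured by claims assessors
causes = [
    "lower back injury lifting boxes",
    "back pain from lifting at work",
    "major depressive disorder",
    "anxiety and depression",
    "fractured wrist in car accident",
    "motor vehicle accident broken arm",
]

# Represent each description as a TF-IDF vector
vectors = TfidfVectorizer().fit_transform(causes)

# Group similar descriptions; the number of clusters is a judgement call
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cause, label in zip(causes, labels):
    print(label, cause)
```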

A more unorthodox NLP application in life insurance is the prediction of “churn”, or discontinuance, based on conversation data with a policyholder. The technical aspects of this application are discussed in this article. Essentially, with the help of NLP, conversation data emanating from chatbots, emails, or audio recordings can be embedded into topics or features which may add predictive power to a churn prediction model.
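A minimal sketch of this idea follows, turning hypothetical conversation snippets into latent topic features via NMF and feeding them to a churn classifier. The referenced article may use a different embedding approach; the conversations and outcomes below are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical policyholder conversations and churn outcomes (1 = discontinued)
conversations = [
    "premium too expensive thinking of cancelling",
    "how do I update my bank details",
    "can I reduce my cover to lower the premium",
    "please send my annual statement",
    "cancel my policy the price keeps going up",
    "what does my policy cover for trauma",
]
churned = np.array([1, 0, 1, 0, 1, 0])

# Embed each conversation into a small number of latent topics
tfidf = TfidfVectorizer().fit_transform(conversations)
topics = NMF(n_components=2, random_state=0).fit_transform(tfidf)

# Topic weights become features in a churn prediction model
model = LogisticRegression().fit(topics, churned)
print(model.predict_proba(topics)[:, 1].round(2))
```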

More broadly, there has been increasing interest from regulators in the use of advanced analytics. For example, NLP has been used to assess compliance with responsible lending obligations for banks in Australia. In particular, the Australian Securities and Investments Commission (“ASIC”) hosted a problem-solving event aimed at detecting whether banks under its supervision were providing a suitable amount of credit to their customers. In one of the solutions published by ASIC, NLP was used to group customers’ actual transactions, based on their descriptions, into income and expense categories. These were then compared against the income and expenses declared on the loan application forms. The screenshot below shows an example of non-compliance for a customer.

Image 2: Non-compliance with responsible lending. Skillful Analytics, ASIC Regtech Responsible Lending Demonstration Webinar, August 20, 2020
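The underlying idea can be sketched in a few lines: categorise transaction descriptions, then compare the implied totals against the declared figures. The keyword rule, transactions, and declared amounts below are hypothetical; the actual solution would have used a trained NLP classifier rather than a single keyword.

```python
import pandas as pd

# Hypothetical bank transactions for one loan applicant
transactions = pd.DataFrame({
    "description": ["SALARY ACME PTY LTD", "WOOLWORTHS", "RENT PAYMENT",
                    "SALARY ACME PTY LTD", "BETTING DEPOSIT"],
    "amount": [4000, -250, -1800, 4000, -300],
})

# A simple keyword rule stands in for a trained NLP classifier
def categorise(description):
    return "income" if "SALARY" in description.upper() else "expense"

transactions["category"] = transactions["description"].map(categorise)
actuals = transactions.groupby("category")["amount"].sum().abs()

# Compare against the amounts declared on the loan application form
declared = pd.Series({"income": 8000, "expense": 1500})
print((actuals - declared).rename("actual minus declared"))
```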

3. Claims Modelling

Understanding claims is one of the most critical tasks in the life insurance value chain. It allows the insurer to charge appropriate premiums, set aside sufficient claims reserves, and ultimately provide an indication of profitability.

For countries with credible claims data, such as the US and Australia, life insurers collectively contribute towards industry studies aimed at identifying claims drivers, by providing individual company data on risk attributes such as Gender, Smoker Status, Occupation, and Claims Cause.

Generally speaking, the methodology for identifying potential claims drivers could be to first fit a Gradient Boosting Machine (“GBM”) model to the data to understand the order as well as the significance of features (i.e. the risk attributes that influence claims costs), and then fit a Generalised Linear Model (“GLM”) using the significant features previously identified (allowing for interactions between the features). A sketch of this two-step workflow follows the list below. The use of both the GBM and the GLM ensures that:

  • The effects of correlated risk attributes are disentangled. For example, the gender of a policyholder and smoker status may be highly correlated (e.g. Male and Smoker). Which attribute is the true underlying claims driver? If both are, what is the marginal impact of each on claims costs? Fitting the features jointly in a GLM helps answer these questions.
  • The results of the model can be clearly explained, by examining the size and sign of the GLM coefficients relative to a baseline scenario.
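Here is a minimal sketch of that workflow on synthetic policy data, using scikit-learn for the GBM and statsmodels for a Poisson GLM. The data-generating assumptions, feature names, and choice of which features count as significant are all hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic policy data: claim counts driven mainly by age and smoker status
rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "age": rng.integers(25, 65, n),
    "smoker": rng.integers(0, 2, n),
    "male": rng.integers(0, 2, n),
})
claims = rng.poisson(np.exp(-6 + 0.04 * X["age"] + 0.7 * X["smoker"]))

# Step 1: GBM ranks candidate claims drivers by importance
gbm = GradientBoostingRegressor(random_state=0).fit(X, claims)
importance = pd.Series(gbm.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))

# Step 2: a Poisson GLM on the top features gives interpretable coefficients
features = sm.add_constant(X[["age", "smoker"]])
glm = sm.GLM(claims, features, family=sm.families.Poisson()).fit()
print(glm.summary())
```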

4. Computer Vision

In a recent presentation by an underwriter and an actuary from Reinsurance Group of America at the 2022 Actuaries Summit in Australia, it was discussed that Optical Character Recognition (“OCR”) techniques can be deployed at the underwriting stage in order to collect more data from policy applicants.

Underwriting refers to the stage where the life insurer decides whether to accept an applicant’s policy, and what conditions or pricing will apply, based on the medical and lifestyle information provided by the applicant. On the application form, which typically comes in PDF format, the applicant discloses a lot of risk-related data such as Gender, Occupation, Income, and Pastimes. OCR can then be used to ‘scrape’ these data from the PDF forms and store them as structured data, as opposed to underwriters manually keying them in. Whilst OCR functionality is typically available in subscription-based software such as Adobe’s, it can be carried out free of charge using Python libraries such as pytesseract and PyPDF2, as sketched below.
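A minimal sketch of this pipeline follows: PyPDF2 extracts text from PDFs that have an embedded text layer, while pytesseract (which requires the Tesseract engine to be installed) handles scanned page images. The file names and the field-parsing rule are hypothetical.

```python
import re
from PIL import Image
from PyPDF2 import PdfReader
import pytesseract

def extract_text(pdf_path, scan_path=None):
    """Pull raw text from a digital PDF, falling back to OCR on a scan."""
    text = "".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    if not text.strip() and scan_path:
        # No embedded text layer, so OCR the scanned page image instead
        text = pytesseract.image_to_string(Image.open(scan_path))
    return text

# Naive parse of 'Field: value' pairs into structured data (illustrative only)
text = extract_text("application_form.pdf")  # hypothetical file name
fields = dict(re.findall(r"(Gender|Occupation|Income)\s*:\s*(.+)", text))
print(fields)
```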

Moreover, predictive modelling can be used to improve the accuracy and speed of the risk assessment process at the underwriting stage. In particular, a senior data scientist at one of my employer’s reinsurers noted that predictive models can be used by underwriters as a decision-support tool to further classify the manual cases (i.e. the non-automatic acceptance cases) into more granular risk groups.

5. Discontinuance Modelling

In the event that a policyholder discontinues the policy shortly after the sale (say in a year or two), the life insurance company would incur a loss as it would not be able to recoup the high upfront acquisition costs associated with selling the policy, such as the initial commission paid to the intermediaries. In fact, it normally takes 5–7 years for a life insurance company to start making money on the policy. This is why discontinuance matters and most companies make significant investments in retaining existing policyholders.

Similar to Claims Modelling, a GBM or GLM can be fitted to a company’s discontinuance data to understand discontinuance drivers. Depending on the capacity of the company’s retention operations, Recall may need to be traded off for Precision (i.e. the accuracy of each prediction is prioritised over identifying every at-risk policyholder), as the sketch below illustrates.
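Here is a minimal sketch of that trade-off on synthetic data: raising the decision threshold of a fitted churn model improves Precision at the expense of Recall, which suits a small retention team that can only contact its highest-risk policyholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic discontinuance data: 1 = policy discontinued (about 10% of cases)
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Raising the threshold trades Recall (coverage) for Precision (accuracy)
for threshold in (0.3, 0.5, 0.7):
    preds = probs >= threshold
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, preds):.2f}, "
          f"recall={recall_score(y, preds):.2f}")
```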

Closing Remarks

A lot of data science projects fail in an organisation when practitioners prioritise the technology over the use case. For example, practitioners shouldn’t bring fancy data science techniques such as Deep Learning to senior management and ask what they can do for the organisation. Instead, data science techniques should be applied, where appropriate, in response to a potential business problem whose resolution closely aligns with the needs of the organisation.

In addition, it is the writer’s view that data science applications such as Data Transformation and Visualisation are often overlooked. Investing in these applications alone would already lead to better insights and operational efficiencies for an organisation.

Like what you are reading? Be sure to follow the writer for more!


A qualified actuary who uses data science to build decision support tools, a data scientist powered by curiosity. https://github.com/gundamp