
Harmonizing Data and Tackling Quality Challenges for a Leading Insurance Firm

5 Scenarios
7 Hours 30 Minutes
Industry
insurance
Skills
batch-etl
cloud-management
data-wrangling
data-storage
data-quality
Tools
databricks

Learning Objectives

Parse and ingest data in Databricks
Author reusable code to tackle schema and data inconsistencies
Implement Delta Lake as part of a medallion architecture
Implement schema evolution in Delta Lake (see the sketch below)
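
As a minimal, hedged sketch of the last objective: Delta Lake's mergeSchema write option lets an append pick up columns that earlier batches lacked. The storage path and file name below are illustrative assumptions, not part of the project.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks notebooks

    # Hypothetical bronze-layer location; substitute your own ADLS path
    bronze_path = "abfss://lake@storageacct.dfs.core.windows.net/bronze/customers"

    # A new batch arrives with an extra column that older batches lacked
    new_batch = spark.read.option("header", True).csv("/mnt/raw/customers_2024.csv")

    # mergeSchema tells Delta to evolve the table schema instead of failing the write
    (new_batch.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(bronze_path))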

Overview

"PrimerInsurance a MediLife brand faces challenges with data inaccuracies, schema inconsistencies, and a lack of trust in data systems from stakeholders. What measures are necessary to address and resolve these issues?"

MediLife, a global leader in insurance, financial services, and employee benefits, stands as one of the world's largest and most respected insurance companies. With operations in over 40 countries and serving approximately 100 million customers, MediLife provides a wide range of services including life, accident, and health insurance, annuities, and retirement and savings products.

PrimerInsurance is a subsidiary of MediLife. Since being acquired by the insurance giant, it has embraced data-driven decision-making to enhance its operations, from underwriting and risk assessment to customer service and product development. To report accurate, up-to-date figures to stakeholders, the data from PrimerInsurance's and MediLife's systems must speak with one voice.

Unfortunately, this has become a major challenge and bottleneck for the insurance giant and its subsidiary. The journey that began as a way to enhance operational efficiency and decision-making is now creating considerable friction between stakeholders. Common issues include:
  • Columns missing across files in the same set: a column present in one customer file may be absent from another
  • Mislabeled or inconsistent column headers
  • Inconsistent values in the same column across customer files; for instance, the education column in one file contains "tertiary" while another contains "terto"
  • Duplicate records, and more
These issues have eroded trust in the data systems, rendering them effectively useless.
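
One remediation approach, sketched below in PySpark: conform every incoming file to a single expected schema and normalize known bad values. The column list and value map are illustrative assumptions; the real ones come from the project data.

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    # Illustrative expected schema for customer files
    EXPECTED_COLUMNS = ["customer_id", "age", "education", "policy_type"]

    # Illustrative canonical-value map, e.g. fixing "terto" -> "tertiary"
    EDUCATION_FIXES = {"terto": "tertiary"}

    def conform(df: DataFrame) -> DataFrame:
        """Align a customer file to the expected schema and clean known bad values."""
        # Add any missing columns as nulls so every file shares one schema
        for col in EXPECTED_COLUMNS:
            if col not in df.columns:
                df = df.withColumn(col, F.lit(None).cast("string"))
        # Rewrite misspelled education values to their canonical form
        mapping = F.create_map([F.lit(x) for pair in EDUCATION_FIXES.items() for x in pair])
        df = df.withColumn("education", F.coalesce(mapping[F.col("education")], F.col("education")))
        # Drop exact duplicates and keep only the expected columns, in order
        return df.dropDuplicates().select(*EXPECTED_COLUMNS)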

The head of data practice has decided to solve this problem once and for all by designing a single idempotent batch-processing pipeline (shown in the architecture below) to harmonize the data, enforce data quality, and report business data in the form stakeholders need.

[Architecture diagram]
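
A hedged sketch of what the idempotent step at the heart of such a pipeline could look like, using Delta Lake's MERGE so that re-running a batch updates rows instead of duplicating them. The table names and key column are assumptions for illustration only.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical silver-layer table and cleaned bronze batch
    silver = DeltaTable.forName(spark, "silver.customers")
    batch = spark.table("bronze.customers_cleaned")

    # MERGE keyed on customer_id makes a re-run a no-op rather than a source
    # of duplicates -- the property that makes the pipeline idempotent
    (silver.alias("t")
        .merge(batch.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())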

Prerequisites

  • Knowledge of how Databricks and ADLS work
  • Proficiency in data analysis, processing, and cleaning using PySpark
  • Understanding of Delta table capabilities
  • Knowledge of ETL using the medallion architecture