
Harmonizing Data and Tackling Quality Challenges for a Leading Insurance Firm

5 Scenarios
7 Hours 30 Minutes
Industry
insurance
Skills
batch-etl
cloud-management
data-wrangling
data-storage
data-quality
Tools
databricks

Learning Objectives

Parse and ingest data in Databricks
Author reusable code to tackle schema and data inconsistencies
Implement Delta Lake as part of a medallion architecture
Implement schema evolution in Delta Lake (see the sketch below)
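
As a minimal, hedged sketch of the last objective: Delta Lake's mergeSchema write option lets an append pick up columns that earlier batches lacked. The storage path and file name below are illustrative assumptions, not part of the project.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks notebooks

    # Hypothetical bronze-layer location; substitute your own ADLS path
    bronze_path = "abfss://lake@storageacct.dfs.core.windows.net/bronze/customers"

    # A new batch arrives with an extra column that older batches lacked
    new_batch = spark.read.option("header", True).csv("/mnt/raw/customers_2024.csv")

    # mergeSchema tells Delta to evolve the table schema instead of failing the write
    (new_batch.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(bronze_path))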

Overview

"PrimerInsurance a MediLife brand faces challenges with data inaccuracies, schema inconsistencies, and a lack of trust in data systems from stakeholders. What measures are necessary to address and resolve these issues?"

MediLife, a global leader in insurance, financial services, and employee benefits, stands as one of the world's largest and most respected insurance companies. With operations in over 40 countries and serving approximately 100 million customers, MediLife provides a wide range of services including life, accident, and health insurance, annuities, and retirement and savings products.

PrimerInsurance is a subsidiary of MediLife. Since being acquired by the insurance giant, it has embraced data-driven decision-making to enhance its operations, from underwriting and risk assessment to customer service and product development. To report accurate, up-to-date figures to stakeholders, the data from PrimerInsurance's and MediLife's systems must speak with one voice.

Unfortunately, this has become a major challenge and bottleneck for the insurance giant and its subsidiary. The journey that began as a way to enhance operational efficiency and decision-making is now creating considerable friction between stakeholders. Common issues include:
  • Columns missing across files in the same set: a column present in one customer file may be absent from another
  • Mislabeled or inconsistent column headers
  • Inconsistent values in the same column across customer files; for instance, the education column in one file contains "tertiary" while another contains "terto"
  • Duplicate records, and more
These issues have eroded trust in the data systems, rendering them effectively useless.
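
One remediation approach, sketched below in PySpark: conform every incoming file to a single expected schema and normalize known bad values. The column list and value map are illustrative assumptions; the real ones come from the project data.

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    # Illustrative expected schema for customer files
    EXPECTED_COLUMNS = ["customer_id", "age", "education", "policy_type"]

    # Illustrative canonical-value map, e.g. fixing "terto" -> "tertiary"
    EDUCATION_FIXES = {"terto": "tertiary"}

    def conform(df: DataFrame) -> DataFrame:
        """Align a customer file to the expected schema and clean known bad values."""
        # Add any missing columns as nulls so every file shares one schema
        for col in EXPECTED_COLUMNS:
            if col not in df.columns:
                df = df.withColumn(col, F.lit(None).cast("string"))
        # Rewrite misspelled education values to their canonical form
        mapping = F.create_map([F.lit(x) for pair in EDUCATION_FIXES.items() for x in pair])
        df = df.withColumn("education", F.coalesce(mapping[F.col("education")], F.col("education")))
        # Drop exact duplicates and keep only the expected columns, in order
        return df.dropDuplicates().select(*EXPECTED_COLUMNS)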

The head of data practice has decided to solve this problem once and for all by designing a single idempotent batch-processing pipeline (shown in the architecture below) to harmonize the data, enforce data quality, and report business data in the form stakeholders need.

[Architecture diagram]
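
A hedged sketch of what the idempotent step at the heart of such a pipeline could look like, using Delta Lake's MERGE so that re-running a batch updates rows instead of duplicating them. The table names and key column are assumptions for illustration only.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical silver-layer table and cleaned bronze batch
    silver = DeltaTable.forName(spark, "silver.customers")
    batch = spark.table("bronze.customers_cleaned")

    # MERGE keyed on customer_id makes a re-run a no-op rather than a source
    # of duplicates -- the property that makes the pipeline idempotent
    (silver.alias("t")
        .merge(batch.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())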

Prerequisites

  • Knowledge of how Databricks and ADLS work
  • Proficiency in data analysis, processing, and cleaning using PySpark
  • Understanding of Delta table capabilities
  • Knowledge of ETL using the medallion architecture