This file contains the relevant code used to generate the examples in chapter 7 of the CAS Data Quality Management in the P&C Insurance Sector Monograph. The following files should be downloaded from the CAS website and stored in the same file directory as this script before compiling:
The first two datasets represent the original data with data quality issues, including duplicated records. For some packages we use the smaller dataset with 5,000 claims to improve processing time. The final dataset represents an improved dataset where the data quality issues have been largely remediated.
This script uses the RecordLinkage package to identify potentially duplicate records in a dataset. The script generates figure 11 and figure 12 from Chapter 7.
library(RecordLinkage)
wcdata5k <- read.csv(file="wc_claims_problems_5k.csv")
wcdata_modified5k <- wcdata5k %>%
select(-c("Claim_Number","Incd_Med_Loss"))
wcdata_identity5k <- wcdata5k %>%
select("Claim_Number")
wcdata_identity5k <- as.numeric(unlist(wcdata_identity5k))
wcpairs5k <- compare.dedup(dataset = wcdata_modified5k, identity = wcdata_identity5k)
wcweightpairs5k <- epiWeights(wcpairs5k)
#To generate figure 11:
summary(wcweightpairs5k)
##
## Deduplication Data Set
##
## 5542 records
## 15354111 record pairs
##
## 542 matches
## 15353569 non-matches
## 0 pairs with unknown status
##
##
## Weight distribution:
##
## [0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9]
## 2588969 3834146 3586894 2261369 1334293 1102876 570468 8728 20869 12890 9445 7980 3035 3060 1239 2553 2525 541
## (0.9,0.95] (0.95,1]
## 1194 1037
#To generate figure 12:
getPairs(wcweightpairs5k,0.52,0.5)
## id Accident_State Accident_Date Report_Date Claim_Type Claim_Status Body_Part Injury_Type Weight
## 1 4 TX 5/22/2017 5/24/2017 MED O Back Strain
## 2 1136 CA 5/21/2017 5/24/2017 MED O Wrist Strain 0.5082398
## 3
## 4 7 CA 5/2/2017 5/4/2017 MED O Wrist Laceration
## 5 4125 TX 5/3/2017 5/4/2017 MED O Hand Laceration 0.5082398
## 6
## 7 7 CA 5/2/2017 5/4/2017 MED O Wrist Laceration
## 8 4913 TX 5/3/2017 5/4/2017 MED O Wrist Strain 0.5082398
## 9
## 10 7 CA 5/2/2017 5/4/2017 MED O Wrist Laceration
## 11 5016 TX 5/4/2017 5/4/2017 MED O Wrist Strain 0.5082398
## 12
## 13 9 CA 1/22/2017 1/25/2017 MED O Hand Laceration
## 14 1965 TX 1/24/2017 1/25/2017 MED O Back Laceration 0.5082398
## 15
## 16 9 CA 1/22/2017 1/25/2017 MED O Hand Laceration
## 17 4665 TX 1/23/2017 1/25/2017 MED O Wrist Laceration 0.5082398
## 18
## 19 10 TX 6/17/2017 6/20/2017 MED O Hand Strain
## 20 87 NY 6/18/2017 6/20/2017 MED O Hand Laceration 0.5082398
## 21
## 22 10 TX 6/17/2017 6/20/2017 MED O Hand Strain
## 23 2872 CA 6/18/2017 6/20/2017 MED O Back Strain 0.5082398
## 24
## 25 10 TX 6/17/2017 6/20/2017 MED O Hand Strain
## 26 4335 NY 6/16/2017 6/20/2017 MED O Back Strain 0.5082398
## 27
## 28 10 TX 6/17/2017 6/20/2017 MED O Hand Strain
## 29 5473 CA 6/18/2017 6/20/2017 MED O Hand Laceration 0.5082398
## 30
## 31 13 TX 2/10/2017 2/10/2017 MED O Hand Laceration
## 32 2509 CA 2/9/2017 2/10/2017 MED O Wrist Laceration 0.5082398
## 33
## 34 13 TX 2/10/2017 2/10/2017 MED O Hand Laceration
## 35 5006 CA 2/8/2017 2/10/2017 MED O Hand Strain 0.5082398
## 36
## 37 13 TX 2/10/2017 2/10/2017 MED O Hand Laceration
## 38 5057 NY 2/8/2017 2/10/2017 MED O <NA> Laceration 0.5082398
## 39
## 40 14 NY 8/23/2017 8/26/2017 MED O Hand Laceration
## 41 566 TX 8/25/2017 8/26/2017 MED O Wrist Laceration 0.5082398
## 42
## 43 14 NY 8/23/2017 8/26/2017 MED O Hand Laceration
## 44 2210 TX 8/25/2017 8/26/2017 MED O Hand Strain 0.5082398
## 45
## 46 15 NY 9/18/2017 9/19/2017 MED O Wrist Laceration
## 47 226 TX 9/17/2017 9/19/2017 MED O Hand Laceration 0.5082398
## 48
## 49 15 NY 9/18/2017 9/19/2017 MED O Wrist Laceration
## 50 2413 TX 9/17/2017 9/19/2017 MED O Hand Laceration 0.5082398
## 51
## 52 15 NY 9/18/2017 9/19/2017 MED O Wrist Laceration
## 53 2917 CA 9/16/2017 9/19/2017 MED O Hand Laceration 0.5082398
## 54
## 55 15 NY 9/18/2017 9/19/2017 MED O Wrist Laceration
## [ reached getOption("max.print") -- omitted 19679 rows ]
The following script uses the dlookr package to generate a “Data Quality Diagnosis Report”. The full report can only be outputted in an “HTML” or “PDF” format, and thus cannot be shown within this markdown file. To see the report, we recommend evaluating the script chunk outside of this file.
One of the tables provided in the full report is the “Data Quality Overview” table, included in the Monograph as figure 13. An unformatted version of this table can be outputted directly in R markdown. The script chunks to generate both the full report and the unformatted version of figure 13 are provided below.
library(dlookr)
#To generate figure 13:
wcdata %>%
select(-"Claim_Number") %>%
diagnose()
## # A tibble: 8 x 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Accident_State character 0 0 3 0.0000545
## 2 Accident_Date character 0 0 351 0.00638
## 3 Report_Date character 0 0 356 0.00647
## 4 Claim_Type character 0 0 2 0.0000363
## 5 Claim_Status character 0 0 2 0.0000363
## 6 Body_Part character 1691 3.07 5 0.0000908
## 7 Injury_Type character 123 0.223 5 0.0000908
## 8 Incd_Med_Loss numeric 0 0 53512 0.972
#To generate the full report (run this code chunk outside of R markdown):
wcdata %>%
select(-"Claim_Number") %>%
diagnose_report(output_format = "html",
output_file = "cas_Monograph_dq.html")
The following script uses the VIM package to further analyze missing data. This package allows for visualization of missing values across multiple data fields. The below script is used to generate figures 14, 15 and 16 in the Monograph.
library(VIM)
#To generate figure 14:
aggr(wcdata,
props=TRUE,
sort.vars=TRUE,
only.miss=TRUE,
combined=TRUE,
numbers = TRUE,
varheight=FALSE,
col="red",
cex.axis=.75,
labels=c('Claim#','State','LossDate','ReptDate','ClmType','ClmStat','Body_Part','Inj_Type','Incd_Loss')
)
#To generate figure 15:
spineMiss(wcdata[, c("Injury_Type", "Body_Part")],
col="red") #generates figure 15
#To generate figure 16:
wcdata$Accident_Date<-as.Date(wcdata$Accident_Date)
wcdata$Accident_State<-as.factor(wcdata$Accident_State)
wcdata$Claim_Type<-as.factor(wcdata$Claim_Type)
wcdata$Body_Part<-as.factor(wcdata$Body_Part)
wcdata$Injury_Type<-as.factor(wcdata$Injury_Type)
wcdata %>%
select(c("Accident_Date","Accident_State","Claim_Type","Body_Part","Injury_Type")) %>%
parcoordMiss(highlight = 'Injury_Type',
alpha = 1.0,
col = c("grey", "grey40","gold", "firebrick2","firebrick4","gold4"),
labels=c("LossDate",'State','ClmTyp','Body_Part','Inj_Type'))
The following script uses the validate package to demonstrate the application of data quality rules. This script is used to generate figure 17 in the Monograph.
library(validate)
wcdata$Report_Date<-as.Date(wcdata$Report_Date)
validatereport<-check_that(wcdata,
Incd_Med_Loss<1000000,
Incd_Med_Loss>1,
Report_Date-Accident_Date>=0,
if(Injury_Type=='Concussion') Body_Part=='Head'
)
summary(validatereport)
## name items passes fails nNA error warning expression
## 1 V1 55056 55052 4 0 FALSE FALSE Incd_Med_Loss < 1e+06
## 2 V2 55056 49977 5079 0 FALSE FALSE Incd_Med_Loss > 1
## 3 V3 55056 54788 268 0 FALSE FALSE Report_Date - Accident_Date >= 0
## 4 V4 55056 54401 336 319 FALSE FALSE !(Injury_Type == "Concussion") | (Body_Part == "Head")
#To generate figure 17:
barplot(validatereport,
main='Validate on wc_claims',
colors=c(fails='firebrick2',passes='grey',nNA='gold4'),
xlab="Items",
mgp=c(3,.6,0))
The following script demonstrates the imputation of missing values as discussed in Example 7.5 of the Monograph. We use the nnet package to fit a multinomial log-linear model via a neural network to carry out classification mode imputation. This script is used to generate figure 18 in the Monograph.
library(nnet)
#There is no mode function in Base R so we define one:
mode <- function(x){
ux <- unique(x)
ux[which.max(tabulate(match(x,ux)))]
}
missing<-is.na(wcdata$Body_Part)|is.na(wcdata$Injury_Type!='NA')
wcdata_nomiss<-wcdata[!missing,]
wcdata_miss<-wcdata[missing,]
#Body Part mode imputation
wcdata_miss$Mode_BP_Imp <- mode(wcdata_nomiss$Body_Part)
#Fitting a classification model to impute the missing values
bp_model <- multinom(Body_Part ~ Injury_Type+Claim_Type+Incd_Med_Loss, data = wcdata_nomiss)
## # weights: 28 (18 variable)
## initial value 73813.243258
## iter 10 value 52265.241914
## iter 20 value 45788.279413
## iter 30 value 43337.314542
## iter 40 value 43317.892233
## iter 50 value 43317.845947
## final value 43317.844274
## converged
wcdata_miss$Classifaction_BP_Imp <- predict(bp_model, wcdata_miss)
wcdata_miss_mod <- wcdata_miss%>%
select(-c("Accident_Date","Report_Date","Claim_Status"))
wcdata_miss_mod[1:10,]
## Claim_Number Accident_State Claim_Type Body_Part Injury_Type Incd_Med_Loss Mode_BP_Imp Classifaction_BP_Imp
## 62 62 NY IND <NA> RSI 305.1800000 Hand Back
## 68 68 TX MED <NA> Laceration 2078.7800000 Hand Hand
## 93 93 TX MED <NA> Strain 572.6300000 Hand Back
## 108 108 TX MED <NA> Laceration 454.2100000 Hand Hand
## 122 122 TX MED <NA> Concussion 4630.9600000 Hand Head
## 134 134 NY MED <NA> Strain 0.3365156 Hand Back
## 164 164 TX MED <NA> Concussion 3441.4800000 Hand Head
## 228 228 CA MED <NA> Laceration 3299.8200000 Hand Hand
## 239 239 CA MED <NA> Laceration 0.2817514 Hand Hand
## 273 273 TX MED <NA> Strain 464.9400000 Hand Back
The following script demonstrates the imputation of missing values as discussed in Example 7.6 of the Monograph. We use the nnet package to fit a multinomial log-linear model via a neural network to generate the probabilities of anomalies. This script is used to generate figure 19 in the Monograph.
library(nnet)
missing<-is.na(wcdata$Body_Part)|is.na(wcdata$Injury_Type!='NA')
wcdata_nomiss<-wcdata[!missing,]
wcdata_miss<-wcdata[missing,]
bp_model <- multinom(Body_Part ~ Injury_Type+Claim_Type+Incd_Med_Loss, data = wcdata_nomiss)
## # weights: 28 (18 variable)
## initial value 73813.243258
## iter 10 value 52265.241914
## iter 20 value 45788.279413
## iter 30 value 43337.314542
## iter 40 value 43317.892233
## iter 50 value 43317.845947
## final value 43317.844274
## converged
summary(bp_model)
## Call:
## multinom(formula = Body_Part ~ Injury_Type + Claim_Type + Incd_Med_Loss,
## data = wcdata_nomiss)
##
## Coefficients:
## (Intercept) Injury_TypeLaceration Injury_TypeRSI Injury_TypeStrain Claim_TypeMED Incd_Med_Loss
## Hand 8.966529 -7.391632 -9.119106 -12.1631121 0.7121227 -0.000008950721
## Head 14.329091 -12.612010 -24.607763 -26.0959304 -2.0540311 0.000004304243
## Wrist -2.520651 2.606998 2.482866 0.7019022 1.1473849 -0.000004918727
##
## Std. Errors:
## (Intercept) Injury_TypeLaceration Injury_TypeRSI Injury_TypeStrain Claim_TypeMED Incd_Med_Loss
## Hand 0.000000000014739239 0.000000000006593870 0.00000000001021641486 0.0000000000018899604516 0.000000000008174309 0.0000007186854
## Head 0.000000000018547234 0.000000000015810924 0.00000000000002385952 0.0000000000000009244619 0.000000000007829503 0.0000010956044
## Wrist 0.000000000008831431 0.000000000007677564 0.00000000000408122073 0.0000000000042709580837 0.000000000005596047 0.0000005507955
##
## Residual Deviance: 86635.69
## AIC: 86671.69
preds<-data.frame(fitted(bp_model))
wcdata_nomiss$Probability_Body_Part<-ifelse(wcdata_nomiss$Body_Part=='Back',preds$Back,
ifelse(wcdata_nomiss$Body_Part=='Hand',preds$Hand,
ifelse(wcdata_nomiss$Body_Part=='Head',preds$Head,preds$Wrist)))
anomalies<-wcdata_nomiss[wcdata_nomiss$Probability_Body_Part<.025,]
anomalies_mod <- anomalies%>%
select(-c("Accident_Date","Report_Date","Claim_Status"))
anomalies_mod[1:10,]
## Claim_Number Accident_State Claim_Type Body_Part Injury_Type Incd_Med_Loss Probability_Body_Part
## 174 174 TX IND Hand Concussion 18423.13 0.003659508
## 613 613 NY IND Hand Concussion 55669.27 0.002236830
## 1387 1387 NY IND Hand Concussion 40781.51 0.002723474
## 1507 1507 CA IND Hand Concussion 3809.06 0.004438231
## 1530 1530 TX IND Wrist Laceration 337653.84 0.008199359
## 1846 1846 NY IND Wrist Laceration 431219.41 0.003546244
## 2391 2391 TX MED Hand Concussion 175221.58 0.007253680
## 4666 4666 CA IND Hand Concussion 15096.05 0.003823874
## 5086 5086 CA IND Hand Concussion 10068.35 0.004086311
## 5424 5424 NY IND Hand Concussion 5132.59 0.004361385
The following script demonstrates the impact of improving data quality on Actuarial Analyses, as detailed in Example 7.7. Specifically this involves comparing the fit of glm models to the original dataset versus the datasets with improved data quality. This script is used to generate figures 20 and figure 21 in the Monograph.
wcdata__orig<-read.csv('wc_claims_problems.csv')
wcdata_cleaned<-read.csv('wc_claims_cleaned.csv')
wcdata__orig$Report_Lag<-difftime(wcdata__orig$Report_Date ,wcdata__orig$Accident_Date , units = c("days"))
wcdata_cleaned$Report_Lag<-difftime(wcdata_cleaned$Report_Date ,wcdata_cleaned$Accident_Date , units = c("days"))
WeightedGini <- function(solution, weights, submission){
df = data.frame(solution, weights, submission)
n <- nrow(df)
df <- df[order(df$submission, decreasing = TRUE),]
df$random = cumsum(df$weights/sum(df$weights))
df$cumPosFound <- cumsum(df$solution * df$weights)
df$Lorentz <- df$cumPosFound / df$cumPosFound[n]
sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) }
NormGini <- function(solution, weights, submission) {
WeightedGini(solution, weights, submission) / WeightedGini(solution, weights, solution)
}
glm_orig<-glm(Incd_Med_Loss~Accident_State+Report_Lag+Claim_Type+Claim_Status+Body_Part+Injury_Type,
data=wcdata__orig,
family=Gamma(link='log'))
glm_cleaned<-glm(Incd_Med_Loss~Accident_State+Report_Lag+Claim_Type+Claim_Status+Body_Part+Injury_Type,
data=wcdata_cleaned,
family=Gamma(link='log'))
weights_orig<-rep(1,nrow(wcdata__orig))
weights_cleaned<-rep(1,nrow(wcdata_cleaned))
summary(glm_orig)
##
## Call:
## glm(formula = Incd_Med_Loss ~ Accident_State + Report_Lag + Claim_Type +
## Claim_Status + Body_Part + Injury_Type, family = Gamma(link = "log"),
## data = wcdata__orig)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.8070 -1.7484 -0.8905 0.0549 10.9908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.900682 0.063962 170.425 < 0.0000000000000002 ***
## Accident_StateNY 0.011595 0.025650 0.452 0.65123
## Accident_StateTX -0.187219 0.020471 -9.146 < 0.0000000000000002 ***
## Report_Lag -0.010195 0.006506 -1.567 0.11714
## Claim_TypeMED -1.127081 0.028066 -40.158 < 0.0000000000000002 ***
## Claim_StatusO -0.238651 0.025519 -9.352 < 0.0000000000000002 ***
## Body_PartHand -0.539720 0.030787 -17.531 < 0.0000000000000002 ***
## Body_PartHead 0.148312 0.050495 2.937 0.00331 **
## Body_PartWrist -0.322912 0.026911 -11.999 < 0.0000000000000002 ***
## Injury_TypeLaceration -0.737664 0.046703 -15.795 < 0.0000000000000002 ***
## Injury_TypeRSI -0.095259 0.057484 -1.657 0.09750 .
## Injury_TypeStrain -0.938017 0.054552 -17.195 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 4.16853)
##
## Null deviance: 229590 on 53244 degrees of freedom
## Residual deviance: 190937 on 53233 degrees of freedom
## (1811 observations deleted due to missingness)
## AIC: 994876
##
## Number of Fisher Scoring iterations: 10
summary(glm_cleaned)
##
## Call:
## glm(formula = Incd_Med_Loss ~ Accident_State + Report_Lag + Claim_Type +
## Claim_Status + Body_Part + Injury_Type, family = Gamma(link = "log"),
## data = wcdata_cleaned)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7773 -1.5417 -0.8149 0.0750 12.0426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.503732 0.065534 160.280 < 0.0000000000000002 ***
## Accident_StateNY 0.085011 0.024653 3.448 0.000565 ***
## Accident_StateTX -0.098575 0.019679 -5.009 0.0000005482631460 ***
## Report_Lag -0.011756 0.006336 -1.855 0.063549 .
## Claim_TypeMED -1.084581 0.027343 -39.666 < 0.0000000000000002 ***
## Claim_StatusO -0.172449 0.024674 -6.989 0.0000000000027987 ***
## Body_PartHand -0.534671 0.029853 -17.910 < 0.0000000000000002 ***
## Body_PartHead 0.384790 0.050661 7.595 0.0000000000000312 ***
## Body_PartWrist -0.319363 0.025949 -12.307 < 0.0000000000000002 ***
## Injury_TypeLaceration -0.418992 0.048615 -8.619 < 0.0000000000000002 ***
## Injury_TypeRSI 0.294664 0.059392 4.961 0.0000007023677399 ***
## Injury_TypeStrain -0.605690 0.056001 -10.816 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.626746)
##
## Null deviance: 141931 on 49999 degrees of freedom
## Residual deviance: 107244 on 49988 degrees of freedom
## AIC: 980313
##
## Number of Fisher Scoring iterations: 7
NormGini(wcdata__orig$Incd_Med_Loss,weights_orig,predict(glm_orig,wcdata__orig,type='response'))
## [1] 0.5630662
NormGini(wcdata_cleaned$Incd_Med_Loss,weights_cleaned,predict(glm_cleaned,wcdata_cleaned,type='response'))
## [1] 0.6135776
#To generate figure 20:
plot(glm_orig)
#To generate figure 21:
plot(glm_cleaned)