Chapter 7 - Examples - R Script


This file contains the relevant code used to generate the examples in chapter 7 of the CAS Data Quality Management in the P&C Insurance Sector Monograph. The following files should be downloaded from the CAS website and stored in the same file directory as this script before compiling:

The first two datasets represent the original data with data quality issues, including duplicated records. For some packages we use the smaller dataset with 5,000 claims to improve processing time. The final dataset represents an improved dataset where the data quality issues have been largely remediated.

Example 7.1 (using RecordLinkage package in R)

This script uses the RecordLinkage package to identify potentially duplicate records in a dataset. The script generates figure 11 and figure 12 from Chapter 7.

library(RecordLinkage)
wcdata5k <- read.csv(file="wc_claims_problems_5k.csv")
wcdata_modified5k <- wcdata5k %>%
        select(-c("Claim_Number","Incd_Med_Loss"))
wcdata_identity5k <- wcdata5k %>%
        select("Claim_Number")
wcdata_identity5k <- as.numeric(unlist(wcdata_identity5k))
wcpairs5k <- compare.dedup(dataset = wcdata_modified5k, identity = wcdata_identity5k)
wcweightpairs5k <- epiWeights(wcpairs5k)

#To generate figure 11:
summary(wcweightpairs5k) 
## 
## Deduplication Data Set
## 
## 5542 records 
## 15354111 record pairs 
## 
## 542 matches
## 15353569 non-matches
## 0 pairs with unknown status
## 
## 
## Weight distribution:
## 
##   [0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] 
##    2588969    3834146    3586894    2261369    1334293    1102876     570468       8728      20869      12890       9445       7980       3035       3060       1239       2553       2525        541 
## (0.9,0.95]   (0.95,1] 
##       1194       1037
#To generate figure 12:
getPairs(wcweightpairs5k,0.52,0.5) 
##         id Accident_State Accident_Date Report_Date Claim_Type Claim_Status Body_Part Injury_Type    Weight
## 1        4             TX     5/22/2017   5/24/2017        MED            O      Back      Strain          
## 2     1136             CA     5/21/2017   5/24/2017        MED            O     Wrist      Strain 0.5082398
## 3                                                                                                          
## 4        7             CA      5/2/2017    5/4/2017        MED            O     Wrist  Laceration          
## 5     4125             TX      5/3/2017    5/4/2017        MED            O      Hand  Laceration 0.5082398
## 6                                                                                                          
## 7        7             CA      5/2/2017    5/4/2017        MED            O     Wrist  Laceration          
## 8     4913             TX      5/3/2017    5/4/2017        MED            O     Wrist      Strain 0.5082398
## 9                                                                                                          
## 10       7             CA      5/2/2017    5/4/2017        MED            O     Wrist  Laceration          
## 11    5016             TX      5/4/2017    5/4/2017        MED            O     Wrist      Strain 0.5082398
## 12                                                                                                         
## 13       9             CA     1/22/2017   1/25/2017        MED            O      Hand  Laceration          
## 14    1965             TX     1/24/2017   1/25/2017        MED            O      Back  Laceration 0.5082398
## 15                                                                                                         
## 16       9             CA     1/22/2017   1/25/2017        MED            O      Hand  Laceration          
## 17    4665             TX     1/23/2017   1/25/2017        MED            O     Wrist  Laceration 0.5082398
## 18                                                                                                         
## 19      10             TX     6/17/2017   6/20/2017        MED            O      Hand      Strain          
## 20      87             NY     6/18/2017   6/20/2017        MED            O      Hand  Laceration 0.5082398
## 21                                                                                                         
## 22      10             TX     6/17/2017   6/20/2017        MED            O      Hand      Strain          
## 23    2872             CA     6/18/2017   6/20/2017        MED            O      Back      Strain 0.5082398
## 24                                                                                                         
## 25      10             TX     6/17/2017   6/20/2017        MED            O      Hand      Strain          
## 26    4335             NY     6/16/2017   6/20/2017        MED            O      Back      Strain 0.5082398
## 27                                                                                                         
## 28      10             TX     6/17/2017   6/20/2017        MED            O      Hand      Strain          
## 29    5473             CA     6/18/2017   6/20/2017        MED            O      Hand  Laceration 0.5082398
## 30                                                                                                         
## 31      13             TX     2/10/2017   2/10/2017        MED            O      Hand  Laceration          
## 32    2509             CA      2/9/2017   2/10/2017        MED            O     Wrist  Laceration 0.5082398
## 33                                                                                                         
## 34      13             TX     2/10/2017   2/10/2017        MED            O      Hand  Laceration          
## 35    5006             CA      2/8/2017   2/10/2017        MED            O      Hand      Strain 0.5082398
## 36                                                                                                         
## 37      13             TX     2/10/2017   2/10/2017        MED            O      Hand  Laceration          
## 38    5057             NY      2/8/2017   2/10/2017        MED            O      <NA>  Laceration 0.5082398
## 39                                                                                                         
## 40      14             NY     8/23/2017   8/26/2017        MED            O      Hand  Laceration          
## 41     566             TX     8/25/2017   8/26/2017        MED            O     Wrist  Laceration 0.5082398
## 42                                                                                                         
## 43      14             NY     8/23/2017   8/26/2017        MED            O      Hand  Laceration          
## 44    2210             TX     8/25/2017   8/26/2017        MED            O      Hand      Strain 0.5082398
## 45                                                                                                         
## 46      15             NY     9/18/2017   9/19/2017        MED            O     Wrist  Laceration          
## 47     226             TX     9/17/2017   9/19/2017        MED            O      Hand  Laceration 0.5082398
## 48                                                                                                         
## 49      15             NY     9/18/2017   9/19/2017        MED            O     Wrist  Laceration          
## 50    2413             TX     9/17/2017   9/19/2017        MED            O      Hand  Laceration 0.5082398
## 51                                                                                                         
## 52      15             NY     9/18/2017   9/19/2017        MED            O     Wrist  Laceration          
## 53    2917             CA     9/16/2017   9/19/2017        MED            O      Hand  Laceration 0.5082398
## 54                                                                                                         
## 55      15             NY     9/18/2017   9/19/2017        MED            O     Wrist  Laceration          
##  [ reached getOption("max.print") -- omitted 19679 rows ]

Example 7.2 (using dlookr package in R)

The following script uses the dlookr package to generate a “Data Quality Diagnosis Report”. The full report can only be outputted in an “HTML” or “PDF” format, and thus cannot be shown within this markdown file. To see the report, we recommend evaluating the script chunk outside of this file.

One of the tables provided in the full report is the “Data Quality Overview” table, included in the Monograph as figure 13. An unformatted version of this table can be outputted directly in R markdown. The script chunks to generate both the full report and the unformatted version of figure 13 are provided below.

library(dlookr)
#To generate figure 13:
wcdata %>%
        select(-"Claim_Number") %>%
        diagnose()
## # A tibble: 8 x 6
##   variables      types     missing_count missing_percent unique_count unique_rate
##   <chr>          <chr>             <int>           <dbl>        <int>       <dbl>
## 1 Accident_State character             0           0                3   0.0000545
## 2 Accident_Date  character             0           0              351   0.00638  
## 3 Report_Date    character             0           0              356   0.00647  
## 4 Claim_Type     character             0           0                2   0.0000363
## 5 Claim_Status   character             0           0                2   0.0000363
## 6 Body_Part      character          1691           3.07             5   0.0000908
## 7 Injury_Type    character           123           0.223            5   0.0000908
## 8 Incd_Med_Loss  numeric               0           0            53512   0.972
#To generate the full report (run this code chunk outside of R markdown):
wcdata %>%
        select(-"Claim_Number") %>%
        diagnose_report(output_format = "html",
                   output_file = "cas_Monograph_dq.html")

Example 7.3 (using VIM package in R)

The following script uses the VIM package to further analyze missing data. This package allows for visualization of missing values across multiple data fields. The below script is used to generate figures 14, 15 and 16 in the Monograph.

library(VIM)
#To generate figure 14:
aggr(wcdata,
     props=TRUE,
     sort.vars=TRUE,
     only.miss=TRUE,
     combined=TRUE,
     numbers = TRUE,
     varheight=FALSE,
     col="red",
     cex.axis=.75,
     labels=c('Claim#','State','LossDate','ReptDate','ClmType','ClmStat','Body_Part','Inj_Type','Incd_Loss')
     ) 

#To generate figure 15:
spineMiss(wcdata[, c("Injury_Type", "Body_Part")],
          col="red") #generates figure 15

#To generate figure 16:
wcdata$Accident_Date<-as.Date(wcdata$Accident_Date)
wcdata$Accident_State<-as.factor(wcdata$Accident_State)
wcdata$Claim_Type<-as.factor(wcdata$Claim_Type)
wcdata$Body_Part<-as.factor(wcdata$Body_Part)
wcdata$Injury_Type<-as.factor(wcdata$Injury_Type)

wcdata %>%
  select(c("Accident_Date","Accident_State","Claim_Type","Body_Part","Injury_Type")) %>%
  parcoordMiss(highlight = 'Injury_Type', 
               alpha = 1.0,
               col = c("grey", "grey40","gold", "firebrick2","firebrick4","gold4"),
               labels=c("LossDate",'State','ClmTyp','Body_Part','Inj_Type'))

Example 7.4 (using validate package in R)

The following script uses the validate package to demonstrate the application of data quality rules. This script is used to generate figure 17 in the Monograph.

library(validate)
wcdata$Report_Date<-as.Date(wcdata$Report_Date)
validatereport<-check_that(wcdata,
               Incd_Med_Loss<1000000,
               Incd_Med_Loss>1,
               Report_Date-Accident_Date>=0,
               if(Injury_Type=='Concussion')  Body_Part=='Head'
               )

summary(validatereport)
##   name items passes fails nNA error warning                                             expression
## 1   V1 55056  55052     4   0 FALSE   FALSE                                  Incd_Med_Loss < 1e+06
## 2   V2 55056  49977  5079   0 FALSE   FALSE                                      Incd_Med_Loss > 1
## 3   V3 55056  54788   268   0 FALSE   FALSE                       Report_Date - Accident_Date >= 0
## 4   V4 55056  54401   336 319 FALSE   FALSE !(Injury_Type == "Concussion") | (Body_Part == "Head")
#To generate figure 17:
barplot(validatereport,
        main='Validate on wc_claims',
        colors=c(fails='firebrick2',passes='grey',nNA='gold4'),
        xlab="Items",
        mgp=c(3,.6,0)) 

Example 7.5 (Imputation of Missing Values)

The following script demonstrates the imputation of missing values as discussed in Example 7.5 of the Monograph. We use the nnet package to fit a multinomial log-linear model via a neural network to carry out classification mode imputation. This script is used to generate figure 18 in the Monograph.

library(nnet)
#There is no mode function in Base R so we define one:
mode <- function(x){
        ux <- unique(x)
        ux[which.max(tabulate(match(x,ux)))]
}

missing<-is.na(wcdata$Body_Part)|is.na(wcdata$Injury_Type!='NA')
wcdata_nomiss<-wcdata[!missing,]
wcdata_miss<-wcdata[missing,]

#Body Part mode imputation
wcdata_miss$Mode_BP_Imp <- mode(wcdata_nomiss$Body_Part)

#Fitting a classification model to impute the missing values
bp_model <- multinom(Body_Part ~ Injury_Type+Claim_Type+Incd_Med_Loss, data = wcdata_nomiss)
## # weights:  28 (18 variable)
## initial  value 73813.243258 
## iter  10 value 52265.241914
## iter  20 value 45788.279413
## iter  30 value 43337.314542
## iter  40 value 43317.892233
## iter  50 value 43317.845947
## final  value 43317.844274 
## converged
wcdata_miss$Classifaction_BP_Imp <- predict(bp_model, wcdata_miss)
wcdata_miss_mod <- wcdata_miss%>%
        select(-c("Accident_Date","Report_Date","Claim_Status"))
wcdata_miss_mod[1:10,]
##     Claim_Number Accident_State Claim_Type Body_Part Injury_Type Incd_Med_Loss Mode_BP_Imp Classifaction_BP_Imp
## 62            62             NY        IND      <NA>         RSI   305.1800000        Hand                 Back
## 68            68             TX        MED      <NA>  Laceration  2078.7800000        Hand                 Hand
## 93            93             TX        MED      <NA>      Strain   572.6300000        Hand                 Back
## 108          108             TX        MED      <NA>  Laceration   454.2100000        Hand                 Hand
## 122          122             TX        MED      <NA>  Concussion  4630.9600000        Hand                 Head
## 134          134             NY        MED      <NA>      Strain     0.3365156        Hand                 Back
## 164          164             TX        MED      <NA>  Concussion  3441.4800000        Hand                 Head
## 228          228             CA        MED      <NA>  Laceration  3299.8200000        Hand                 Hand
## 239          239             CA        MED      <NA>  Laceration     0.2817514        Hand                 Hand
## 273          273             TX        MED      <NA>      Strain   464.9400000        Hand                 Back

Example 7.6 (Anomaly Scoring on Non-missing Data)

The following script demonstrates the imputation of missing values as discussed in Example 7.6 of the Monograph. We use the nnet package to fit a multinomial log-linear model via a neural network to generate the probabilities of anomalies. This script is used to generate figure 19 in the Monograph.

library(nnet)
missing<-is.na(wcdata$Body_Part)|is.na(wcdata$Injury_Type!='NA')

wcdata_nomiss<-wcdata[!missing,]
wcdata_miss<-wcdata[missing,]

bp_model <- multinom(Body_Part ~ Injury_Type+Claim_Type+Incd_Med_Loss, data = wcdata_nomiss)
## # weights:  28 (18 variable)
## initial  value 73813.243258 
## iter  10 value 52265.241914
## iter  20 value 45788.279413
## iter  30 value 43337.314542
## iter  40 value 43317.892233
## iter  50 value 43317.845947
## final  value 43317.844274 
## converged
summary(bp_model)
## Call:
## multinom(formula = Body_Part ~ Injury_Type + Claim_Type + Incd_Med_Loss, 
##     data = wcdata_nomiss)
## 
## Coefficients:
##       (Intercept) Injury_TypeLaceration Injury_TypeRSI Injury_TypeStrain Claim_TypeMED   Incd_Med_Loss
## Hand     8.966529             -7.391632      -9.119106       -12.1631121     0.7121227 -0.000008950721
## Head    14.329091            -12.612010     -24.607763       -26.0959304    -2.0540311  0.000004304243
## Wrist   -2.520651              2.606998       2.482866         0.7019022     1.1473849 -0.000004918727
## 
## Std. Errors:
##                (Intercept) Injury_TypeLaceration         Injury_TypeRSI        Injury_TypeStrain        Claim_TypeMED   Incd_Med_Loss
## Hand  0.000000000014739239  0.000000000006593870 0.00000000001021641486 0.0000000000018899604516 0.000000000008174309 0.0000007186854
## Head  0.000000000018547234  0.000000000015810924 0.00000000000002385952 0.0000000000000009244619 0.000000000007829503 0.0000010956044
## Wrist 0.000000000008831431  0.000000000007677564 0.00000000000408122073 0.0000000000042709580837 0.000000000005596047 0.0000005507955
## 
## Residual Deviance: 86635.69 
## AIC: 86671.69
preds<-data.frame(fitted(bp_model))

wcdata_nomiss$Probability_Body_Part<-ifelse(wcdata_nomiss$Body_Part=='Back',preds$Back,
                    ifelse(wcdata_nomiss$Body_Part=='Hand',preds$Hand,
                      ifelse(wcdata_nomiss$Body_Part=='Head',preds$Head,preds$Wrist)))

anomalies<-wcdata_nomiss[wcdata_nomiss$Probability_Body_Part<.025,]
anomalies_mod <- anomalies%>%
        select(-c("Accident_Date","Report_Date","Claim_Status"))
anomalies_mod[1:10,]
##      Claim_Number Accident_State Claim_Type Body_Part Injury_Type Incd_Med_Loss Probability_Body_Part
## 174           174             TX        IND      Hand  Concussion      18423.13           0.003659508
## 613           613             NY        IND      Hand  Concussion      55669.27           0.002236830
## 1387         1387             NY        IND      Hand  Concussion      40781.51           0.002723474
## 1507         1507             CA        IND      Hand  Concussion       3809.06           0.004438231
## 1530         1530             TX        IND     Wrist  Laceration     337653.84           0.008199359
## 1846         1846             NY        IND     Wrist  Laceration     431219.41           0.003546244
## 2391         2391             TX        MED      Hand  Concussion     175221.58           0.007253680
## 4666         4666             CA        IND      Hand  Concussion      15096.05           0.003823874
## 5086         5086             CA        IND      Hand  Concussion      10068.35           0.004086311
## 5424         5424             NY        IND      Hand  Concussion       5132.59           0.004361385

Example 7.7 (Impact of Improving Data Quality on an Actuarial Analysis)

The following script demonstrates the impact of improving data quality on Actuarial Analyses, as detailed in Example 7.7. Specifically this involves comparing the fit of glm models to the original dataset versus the datasets with improved data quality. This script is used to generate figures 20 and figure 21 in the Monograph.

wcdata__orig<-read.csv('wc_claims_problems.csv')
wcdata_cleaned<-read.csv('wc_claims_cleaned.csv')

wcdata__orig$Report_Lag<-difftime(wcdata__orig$Report_Date ,wcdata__orig$Accident_Date , units = c("days"))
wcdata_cleaned$Report_Lag<-difftime(wcdata_cleaned$Report_Date ,wcdata_cleaned$Accident_Date , units = c("days"))

WeightedGini <- function(solution, weights, submission){
   df = data.frame(solution, weights, submission)
   n <- nrow(df)
   df <- df[order(df$submission, decreasing = TRUE),]
   df$random = cumsum(df$weights/sum(df$weights))
   df$cumPosFound <- cumsum(df$solution * df$weights)
   df$Lorentz <- df$cumPosFound / df$cumPosFound[n]
   sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) }

NormGini <- function(solution, weights, submission) {
   WeightedGini(solution, weights, submission) / WeightedGini(solution, weights, solution)
                }

glm_orig<-glm(Incd_Med_Loss~Accident_State+Report_Lag+Claim_Type+Claim_Status+Body_Part+Injury_Type,
            data=wcdata__orig,
            family=Gamma(link='log'))
glm_cleaned<-glm(Incd_Med_Loss~Accident_State+Report_Lag+Claim_Type+Claim_Status+Body_Part+Injury_Type,
            data=wcdata_cleaned,
            family=Gamma(link='log'))

weights_orig<-rep(1,nrow(wcdata__orig))
weights_cleaned<-rep(1,nrow(wcdata_cleaned))

summary(glm_orig)
## 
## Call:
## glm(formula = Incd_Med_Loss ~ Accident_State + Report_Lag + Claim_Type + 
##     Claim_Status + Body_Part + Injury_Type, family = Gamma(link = "log"), 
##     data = wcdata__orig)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8070  -1.7484  -0.8905   0.0549  10.9908  
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           10.900682   0.063962 170.425 < 0.0000000000000002 ***
## Accident_StateNY       0.011595   0.025650   0.452              0.65123    
## Accident_StateTX      -0.187219   0.020471  -9.146 < 0.0000000000000002 ***
## Report_Lag            -0.010195   0.006506  -1.567              0.11714    
## Claim_TypeMED         -1.127081   0.028066 -40.158 < 0.0000000000000002 ***
## Claim_StatusO         -0.238651   0.025519  -9.352 < 0.0000000000000002 ***
## Body_PartHand         -0.539720   0.030787 -17.531 < 0.0000000000000002 ***
## Body_PartHead          0.148312   0.050495   2.937              0.00331 ** 
## Body_PartWrist        -0.322912   0.026911 -11.999 < 0.0000000000000002 ***
## Injury_TypeLaceration -0.737664   0.046703 -15.795 < 0.0000000000000002 ***
## Injury_TypeRSI        -0.095259   0.057484  -1.657              0.09750 .  
## Injury_TypeStrain     -0.938017   0.054552 -17.195 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 4.16853)
## 
##     Null deviance: 229590  on 53244  degrees of freedom
## Residual deviance: 190937  on 53233  degrees of freedom
##   (1811 observations deleted due to missingness)
## AIC: 994876
## 
## Number of Fisher Scoring iterations: 10
summary(glm_cleaned)
## 
## Call:
## glm(formula = Incd_Med_Loss ~ Accident_State + Report_Lag + Claim_Type + 
##     Claim_Status + Body_Part + Injury_Type, family = Gamma(link = "log"), 
##     data = wcdata_cleaned)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7773  -1.5417  -0.8149   0.0750  12.0426  
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           10.503732   0.065534 160.280 < 0.0000000000000002 ***
## Accident_StateNY       0.085011   0.024653   3.448             0.000565 ***
## Accident_StateTX      -0.098575   0.019679  -5.009   0.0000005482631460 ***
## Report_Lag            -0.011756   0.006336  -1.855             0.063549 .  
## Claim_TypeMED         -1.084581   0.027343 -39.666 < 0.0000000000000002 ***
## Claim_StatusO         -0.172449   0.024674  -6.989   0.0000000000027987 ***
## Body_PartHand         -0.534671   0.029853 -17.910 < 0.0000000000000002 ***
## Body_PartHead          0.384790   0.050661   7.595   0.0000000000000312 ***
## Body_PartWrist        -0.319363   0.025949 -12.307 < 0.0000000000000002 ***
## Injury_TypeLaceration -0.418992   0.048615  -8.619 < 0.0000000000000002 ***
## Injury_TypeRSI         0.294664   0.059392   4.961   0.0000007023677399 ***
## Injury_TypeStrain     -0.605690   0.056001 -10.816 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.626746)
## 
##     Null deviance: 141931  on 49999  degrees of freedom
## Residual deviance: 107244  on 49988  degrees of freedom
## AIC: 980313
## 
## Number of Fisher Scoring iterations: 7
NormGini(wcdata__orig$Incd_Med_Loss,weights_orig,predict(glm_orig,wcdata__orig,type='response'))
## [1] 0.5630662
NormGini(wcdata_cleaned$Incd_Med_Loss,weights_cleaned,predict(glm_cleaned,wcdata_cleaned,type='response'))
## [1] 0.6135776
#To generate figure 20:
plot(glm_orig) 

#To generate figure 21:
plot(glm_cleaned)