The R Programming Language—My “Go To” Computational Software
My involvement in a number of predictive modeling projects in the past few years has given me the opportunity to work with professional statisticians. These statisticians introduced me to something that I believe will be useful to many actuaries.
R is a programming language and software environment for statistical computing and graphics, widely used for statistical software development and data analysis. R’s source code is free and available at the Web Site www.r-project.org, where precompiled binary versions are provided for Microsoft Windows as well as for Mac OS X and other UNIX-like operating systems.
R is the result of a collaborative effort with contributions from all over the world. It was initially written by Robert Gentleman and Ross Ihaka, also known as “R & R,” of the Statistics Department of the University of Auckland. Since mid-1997 a core group has had write access to the R source code.
R supports a wide variety of statistical and numerical techniques. R is also highly extensible through the use of packages, which are user-submitted libraries for specific functions or areas of study. A core set of packages is included with the installation of R, with over 700 more available at the Comprehensive R Archive Network (CRAN) as of 2006.
The models you can fit with R include generalized linear models, various tree-based models, and neural nets. This is a very incomplete list. One of my favorites is the generalized additive model, which is similar to the generalized linear model except that it replaces the linear terms with smooth functions, so the response can depend non-linearly on the independent variables.
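As a rough illustration of what fitting one of these models looks like, here is a minimal sketch of a Poisson generalized linear model using the base function glm. The data are simulated, and the names rating_var and claim_count are hypothetical, chosen only for this example.

```r
# Simulated data (hypothetical): claim counts whose mean depends
# log-linearly on a single rating variable.
set.seed(1)
n <- 500
rating_var <- runif(n)
claim_count <- rpois(n, lambda = exp(0.5 + 1.2 * rating_var))

# Fit a Poisson GLM with a log link. The coefficients are on the log scale.
fit <- glm(claim_count ~ rating_var, family = poisson(link = "log"))
print(coef(fit))
```

A generalized additive model is fit in much the same way. With the mgcv package, for instance, the call would use gam and wrap the variable in a smooth term, as in s(rating_var).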
R also allows you to build your own functions, and it even has functions that operate on functions. For example, I once wanted to find the limited average severity for a log-t distribution. The limited average severity is the integral, from zero to the limit, of 1 minus the cumulative distribution function. So I wrote a function for 1 minus the cumulative distribution function for the log-t, and used a function called “integrate” that takes a function and the limits of integration as input.
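The log-t calculation might be sketched as follows. This is not the original code. It assumes the log-t is parameterized so that (log(x) - mu) / sigma follows a t distribution with df degrees of freedom, and the function names and parameter values are made up for illustration.

```r
# Survival function (1 minus the CDF) of a log-t distribution, assuming
# (log(x) - mu) / sigma follows a t distribution with df degrees of freedom.
log_t_survival <- function(x, mu, sigma, df) {
  pt((log(x) - mu) / sigma, df = df, lower.tail = FALSE)
}

# Limited average severity E[min(X, limit)], computed as the integral of the
# survival function from 0 to the limit. integrate() passes the extra
# arguments (mu, sigma, df) through to log_t_survival.
limited_average_severity <- function(limit, mu, sigma, df) {
  integrate(log_t_survival, lower = 0, upper = limit,
            mu = mu, sigma = sigma, df = df)$value
}

# Hypothetical parameters, chosen only to show the call.
las_100k <- limited_average_severity(1e5, mu = 8, sigma = 1.5, df = 5)
print(las_100k)
```

Because integrate takes the function itself as an argument, the same wrapper works for any severity distribution once its survival function has been written.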
R also has an all-purpose optimizer function that I use to calculate maximum likelihood estimates in fitting claim severity distributions and loss reserving models. To illustrate these applications, I placed R code on the CAS Web Site that is connected with my submission to last year’s COTOR Challenge and my recent CAS Forum paper on loss reserving.
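For readers who have not seen this done, here is a minimal sketch of a maximum likelihood fit of a claim severity distribution. It assumes the optimizer in question is the base function optim, uses a lognormal distribution as a stand-in, and runs on simulated claims. It is not the code posted on the CAS Web Site.

```r
# Simulated claim amounts (hypothetical data for illustration).
set.seed(42)
claims <- rlnorm(1000, meanlog = 8, sdlog = 1.5)

# Negative log-likelihood of a lognormal severity distribution.
# The second parameter is log(sigma), which keeps sigma positive
# without needing a constrained optimizer.
neg_log_lik <- function(par, x) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dlnorm(x, meanlog = mu, sdlog = sigma, log = TRUE))
}

# optim() minimizes by default, so minimizing the negative log-likelihood
# yields the maximum likelihood estimates.
fit <- optim(par = c(log(median(claims)), 0), fn = neg_log_lik, x = claims)
mle <- c(mu = fit$par[1], sigma = exp(fit$par[2]))
print(mle)
```

The lognormal has closed-form maximum likelihood estimates, so it makes a convenient check on the optimizer. The same pattern carries over to distributions without closed-form estimates by swapping in a different density.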
If you look at the material in these links, you will see another strong feature of R: graphics. For example, Figure 5 in the paper shows a matrix plot that illustrates how fitted loss development factors vary by insurer size. Statistical computing has placed a strong emphasis on data visualization in recent years and R includes many of these new tools.
I found the learning curve for R rather steep at first. I went about learning it by selecting a project (last year’s COTOR Challenge) and forcing myself to do it with R. After doing that and some other predictive modeling projects, R replaced Excel as my personal “go to” computational software.
While the software itself is free, I have found it worthwhile to buy some books for reference. Here are three that will help you to get started.
R Reference Manual, Base Package (both volumes) by the R Development Core Team. This is simply a print-out of the help menus, arranged by subject and alphabetically within each subject. I found both volumes a helpful reference for the names of commands. While looking for some commands, I frequently stumbled across others that proved to be very useful.
R Graphics by Paul Murrell. This book focuses on drawing neat graphs. It also has a good general introduction to R.
Modern Applied Statistics with S by W. N. Venables and B. D. Ripley, Fourth Edition. This book shows how to use R for a wide variety of statistical methods. Don’t let the “S” in the title fool you—R and S code are very similar. The fourth edition of this book addresses both software environments.
One additional comment: R code can be written in Notepad, but there are other text editors specifically designed for R that make writing code easier. The one I use is called Tinn-R and you can download it for free from the Web Site.
I am not going to argue that R is the single best package for actuaries to use in their statistical analysis. It is important to keep current with other statistical software packages. However, I use R because there are many others in our profession who also use it. Many students are learning it, and a recent CAS Limited Attendance Seminar on Predictive Modeling also used it. Given its “open source” philosophy, I agree with the assessment currently offered in Wikipedia that R “has become a de facto standard among statisticians for the development of statistical software.”