Tuesday, June 25, 2013

Options now expire Friday night in the US

Stock options in the US used to stop trading on Friday and expire on Saturday. They now expire Friday night. So now you know!

Monday, March 11, 2013

RExcel doesn't

RExcel is a commercial product (free for educational use) that is supposed to allow you to use R from within Excel. The website is here. Unfortunately, it's a pain to install, and in the end I couldn't get it to work with R 2.15.3 and Excel 2010. I get the error message "RServer seems to have been disconnected or closed." If you Google this phrase, you'll find that others have had the same problem.

There are undoubtedly commercial uses for R within Excel, but R has far better programmability, better graphics, and better reproducibility. (It's a lot easier to share R code with someone than to share a spreadsheet formula.) R can make Excel better, but you shouldn't think that using Excel as a crutch will make it easier to learn R.

RExcel might work, but it also might just waste your time. Even if it works, you'll end up with an Excel/R hybrid that might not work for your colleagues and coworkers, and that has separate licenses distinguishing noncommercial, commercial, and educational uses. (You'll get a headache just trying to figure out the differences between the licenses.)

If you have a business need for RExcel, buy a license (approximately $450 per computer) and get support. However, if you want to learn R, learn R. Don't worry about Excel integration. Excel is platform-dependent and clumsy to work with if you're doing anything non-trivial. If you need to pass data back and forth between R and Excel, use CSV files. It's an old technology, but at least it works.
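As a minimal sketch of that CSV workflow (the file name and columns here are hypothetical examples, not anything RExcel-specific):

```r
## Write a small data frame to a CSV file that Excel can open directly
## ("prices.csv" and its columns are made-up examples)
prices <- data.frame(date  = as.Date("2013-01-01") + 0:2,
                     close = c(100.5, 101.2, 99.8))
write.csv(prices, "prices.csv", row.names = FALSE)

## Read it back into R (or read a file saved from Excel via Save As CSV)
prices2 <- read.csv("prices.csv", stringsAsFactors = FALSE)
```

Note that read.csv returns the date column as character; convert it back with as.Date if you need real dates.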

Wednesday, February 20, 2013

Why I Love R

I'm updating the volatility portion of my Derivatives notes for class. I wanted to freshen a plot of the VIX index. Here is the R code to grab the complete VIX time series from Yahoo and create a pdf file:

 ## Download the full VIX history from Yahoo, keeping the Date and
 ## Adjusted Close columns (columns 1 and 7)
 vix <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^VIX",
         stringsAsFactors=FALSE)[, c(1, 7)]
 pdf("vixplot.pdf")   ## open a pdf graphics device
 plot(as.Date(vix[, 1]), vix[, 2], main="VIX, 1990 to Present",
    xlab='Date', ylab='VIX', cex=0.3, type='p', ylim=c(0, 80))
 dev.off()   ## close the device, writing the file

And it's a decent looking plot:


Sunday, January 20, 2013

Open Source vs Commercial Software and R

Will your investment in R become worthless? Will R still exist and be maintained ten years from now?  Is it foolish to rely on the whims of the open source community, instead of the profit-maximizing  motives of commercial software companies and developers? My money is on R.

Recognize that in order to survive, a commercial software company must get you to pay every year or two, typically by inducing you to upgrade. These induced upgrades sometimes occur because there are new valuable features that you are willing to pay for, but often they occur because the company deliberately creates incompatibilities between old and new versions, or refuses to fix bugs in old versions. It is probably obvious to you that in many cases there is no functional need to reinvent software and operating systems every few years. Consider Microsoft Office. In terms of features, most users would find Office 97 completely adequate. Nevertheless, you must upgrade: "This new version is wonderful and has a shiny new file format and new features. (Note: It will destroy your old documents and ruin compatibility with coauthors who haven't upgraded. Too bad.)" But if Microsoft didn't create incompatibilities, users might not upgrade.

Moreover, in many arenas, open source software has been hugely successful. Linux is widely used for powering websites and datacenters, and by the research community for high-performance computing clusters (for example, the 7,000 CPU cluster at Northwestern runs Linux). Linux is also the basis for Google's Android, which has become the dominant smartphone operating system. The open source Apache and Nginx webservers run on 67% of all domains and 72% of the million busiest sites globally, compared to 11.3% and 13.3% for Microsoft. MySQL, an open-source relational database, is used by Facebook, Wikipedia, Twitter, and many other applications. Numerous important programming languages are open source and freely available, among them Perl, Python, and PHP.

Open source can be commercialized: companies can base a business on open source software by selling support or enhanced (partially closed-source) versions of the software. Red Hat, Canonical, SUSE, and Oracle offer commercial versions of Linux along with support. Oracle sells an enhanced version of MySQL. Revolution Analytics offers a commercial, enhanced version of R, and RStudio sells training and support for R. The fact that the underlying software is open source constrains bad behavior by the commercial firm. In addition, commercial companies often find they have to support successful open source projects: both Mathematica and SAS, for example, provide interfaces to R.

Will your investment in R pay off? No one can say for sure, but it's not a bad bet.

Wednesday, January 16, 2013

R and explicit loops

How much difference does it make if you use explicit or implicit loops in R? By "implicit loop", I mean passing a whole vector to a vectorized function and letting R iterate internally. Here is a code snippet illustrating three ways to draw 200,000 normal random variables. The first two are explicit loops within functions; the third is an implicit loop, simply supplying an argument to rnorm. The second function specifies the size of the vector when creating it.
 randnorm <- function(n) {
   ## start with a zero-length vector and grow it one element at a time
   y <- vector()
   for (i in 1:n) y[i] <- rnorm(1)
   return(y)
 }
 randnorm2 <- function(n) {
   ## preallocate a numeric vector of length n
   y <- vector(length=n, mode='numeric')
   for (i in 1:n) y[i] <- rnorm(1)
   return(y)
 }
 n <- 2e05
 system.time(y <- randnorm(n))   ## explicit loop, growing vector
 system.time(y <- randnorm2(n))  ## explicit loop, preallocated vector
 system.time(y <- rnorm(n))      ## implicit loop: rnorm is vectorized
Here are the results:
 > n <- 2e05  
 > system.time(y <- randnorm(n))  
   user system elapsed   
  28.170  0.352 28.618   
 > system.time(y <- randnorm2(n))  
   user system elapsed   
  0.856  0.000  0.858   
 > system.time(y <- rnorm(n))  
   user system elapsed   
  0.032  0.000  0.031      
What do we conclude?

  • The implicit loop (rnorm(n)) is much faster than the alternatives.
  • If you do have to use an explicit loop, create the vector in advance, as in randnorm2, specifying its length and mode.
  • The unsized vector in randnorm makes for a very slow function. In tests it slows down much faster than linearly as the vector grows: each assignment past the current length copies the entire vector, so the total work is roughly quadratic in n. A function written like randnorm should be a last resort.

The moral is, if possible, to avoid constructs like that in randnorm.
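When there is no single vectorized function like rnorm(n) for the job, R's apply-family functions give you a preallocated result without writing the loop yourself. A quick sketch (the one-draw function is just a stand-in for a more complicated per-iteration computation):

```r
## vapply preallocates its result based on the declared template (numeric(1)),
## so it avoids the growing-vector trap in randnorm
draw_one <- function(i) rnorm(1)   ## stand-in for a more complex simulation step
y <- vapply(1:1000, draw_one, numeric(1))
```

This won't match the speed of a single vectorized call, but it sidesteps the quadratic copying cost of an unsized vector.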

R and the year 2038 problem

I am teaching an R workshop, which has met twice. There are things I either don't get to in class, or fumble when discussing them. I'm going to occasionally discuss a few of these issues here for the benefit of students, for me (this will serve as a reference in the future), and for anyone else who may be interested.

Dates in base R can be represented using the Date class, POSIXct, or POSIXlt. We can examine date handling by considering the year 2038 problem. The year 2038 problem arises as follows: if you use a signed 32-bit integer to represent the number of seconds since January 1, 1970 (as Unix systems traditionally have), you exhaust the capacity of that integer on January 19, 2038. Using R to verify this requires first that we know what it means to use a "signed 32-bit integer."

Background: Suppose you want to represent an integer as a "signed 3-bit integer". You can do this by writing integers in binary, with the leftmost bit telling you whether the integer is positive or negative. So decimal 0 is 000, decimal 1 is 001, decimal 2 is 010, and decimal 3 is 011. What happens when we go to the next value? Binary 100 is a negative number because the leftmost bit is a 1. Thus, with a 3-bit signed integer, the largest positive value we can represent is 3 = 2^2 - 1. In general, for an n-bit representation, the largest value is 2^(n-1) - 1. With a 32-bit signed integer, the largest (positive) value you can represent is 2^31 - 1.
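You can confirm this limit in R itself, since R's own integer type is also a signed 32-bit integer:

```r
## The largest value a signed 32-bit integer can hold
.Machine$integer.max               ## 2147483647
.Machine$integer.max == 2^31 - 1   ## TRUE
```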

R's POSIXlt class provides a general-purpose representation for displaying and manipulating dates. The Unix epoch (0 seconds) occurred on January 1, 1970. When will the 32-bit time integer be exhausted? We can use POSIXlt to find out:

 > as.POSIXlt(2^31 - 1, origin = '1970-1-1', tz="GMT")  
 [1] "2038-01-19 03:14:07 GMT"  

Thus, unpatched 32-bit UNIX systems will come to a screeching and erratic halt on January 19, 2038, a few seconds after 3:14am, GMT. Note that as.POSIXlt interprets the number you supply as a count of seconds since the origin, and performs the calculation correctly.

The POSIXlt calculations in R continue to work for larger integers. You may wish to amuse yourself by discovering the largest integer for which as.POSIXlt will continue to provide a valid date. (Hint: it's between 2^45 and 2^50, and it is millions of years in the future.)

Sunday, December 30, 2012

R in the New York Times

The New York Times graphics department maintains a fascinating blog providing a behind-the-scenes look at the construction of graphics appearing in the paper and on the web site. Many of the posts deal with R, which the graphics department apparently uses a lot.

In this post they show you some graphs pertaining to political party dominance in state government, and this subsequent post shows the R code used to design the charts. The post also provides the raw data, so you can recreate the plots. What I really like about this post (and the blog in general) is that it doesn't just advertise the finished product (which is typically excellent), but it shows you the process by which the team arrived at the finished product.

Two of my favorite R-related posts are those depicting all passes made by each player on the US Women's Soccer team, and this one comparing the Facebook IPO to other IPOs (this post doesn't mention R, but I'm 99.9% sure that the graphs are produced in R --- axis variable names like test$date give it away).

Wednesday, June 27, 2012

Computing the Normal Distribution

I teach options, so I am interested in how accurately different software packages compute the cumulative normal distribution function. Not being a statistician, my usual approach is to compare different packages and look for discrepancies. A few years ago I noticed a tiny discrepancy in option prices computed using Excel and OpenOffice. I became curious, and started checking the calculation of the area under the curve of the standard normal distribution from minus infinity to -5.5. I found agreement between Matlab and Gnumeric, a free, open-source spreadsheet. Everything else differed.

I am revising my derivatives text, so I checked again. Things appear to have changed for the better. At least there is convergence among the different programs.

Recent versions of Matlab and Gnumeric return the same values as several years ago. The current version of Octave matches Gnumeric, as does R. Excel appears to have caught up with Gnumeric (wait, shouldn't that have been the other way around??), and Libreoffice 3.5 comes much closer than it did.

It's hard to generalize other than to note that both commercial and open source software had hits and misses a few years ago. Excel has long been pilloried for sloppy statistical programming, so it's good to see that Microsoft may finally be getting its act together. The percentage differences below were all computed in Excel 2010.

Blogger's table support is non-existent, so you may not be able to read the table below. Here is a pdf version.

Software            Function          Value                  % diff from Gnumeric
Gnumeric 1.6.3      Normsdist(-5.5)   1.89895624658877E-08   0.0000E+00
Matlab 2006b        Normcdf(-5.5)     1.89895624658877E-08   0.0000E+00
Octave 3.0.1        Normcdf(-5.5)     1.89895624780334E-08   6.3960E-10
Excel 2007          Normsdist(-5.5)   1.89895624780334E-08   6.3960E-10
OO 2.4.1/3.0        Normsdist(-5.5)   1.89901047664698E-08   2.8558E-05
Excel 2010          Normsdist(-5.5)   1.89895624658877E-08   -5.2272E-16
R 2.15.1            pnorm(-5.5)       1.89895624658877E-08   0.0000E+00
Octave 3.6.1        Normcdf(-5.5)     1.89895624658877E-08   0.0000E+00
Libreoffice 3.5.4   Normsdist(-5.5)   1.89895624658900E-08   1.2110E-13
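For reference, R's pnorm computes the tabled value directly. One point worth knowing when working this far out in the tail (not needed for the comparison above): for arguments deep in the left tail the probability underflows double precision, and pnorm's log.p argument keeps the computation usable:

```r
pnorm(-5.5)               ## the tail probability compared in the table above
pnorm(-40)                ## underflows to 0 in double precision
pnorm(-40, log.p = TRUE)  ## finite: the log of the tail probability
```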