Sunday, January 20, 2013

Open Source vs Commercial Software and R

Will your investment in R become worthless? Will R still exist and be maintained ten years from now? Is it foolish to rely on the whims of the open source community, instead of the profit-maximizing motives of commercial software companies and developers? My money is on R.

Recognize that in order to survive, a commercial software company must get you to pay every year or two, typically by inducing you to upgrade. Sometimes these induced upgrades offer valuable new features that you are willing to pay for, but often they occur because the company deliberately creates incompatibilities between old and new versions, or refuses to fix bugs in old versions. It is probably obvious to you that in many cases there is no functional need to reinvent software and operating systems every few years. Consider Microsoft Office: in terms of features, most users would find Office 97 completely adequate. Nevertheless, you must upgrade: "This new version is wonderful and has a shiny new file format and new features. (Note: It will destroy your old documents and ruin compatibility with coauthors who haven't upgraded. Too bad.)" But if Microsoft didn't create incompatibilities, users might not upgrade.

Moreover, open source software has been hugely successful in many arenas. Linux powers websites, datacenters, and high-performance computing clusters in the research community (for example, the 7,000-CPU cluster at Northwestern runs Linux). Linux is also the basis for Google's Android, which has become the dominant smartphone operating system. The open source Apache and Nginx web servers together run 67% of all domains and 72% of the million busiest sites globally, compared to Microsoft's 11.3% and 13.3%. MySQL, an open source relational database, is used by Facebook, Wikipedia, Twitter, and many other applications. Numerous important programming languages are open source and freely available, among them Perl, Python, and PHP.

Open source can be commercialized: companies can base a business on open source software by selling support or enhanced (partially closed-source) versions of it. Red Hat, Canonical, SUSE, and Oracle offer commercial versions of Linux along with support. Oracle sells an enhanced version of MySQL. Revolution Analytics offers a commercial, enhanced version of R, and RStudio sells training and support for R. The fact that the underlying software is open source constrains bad behavior by the commercial firm. In addition, commercial companies often find they have to support successful open source projects: R, for example, is supported by both Mathematica and SAS.

Will your investment in R pay off? No one can say for sure, but it's not a bad bet.

Wednesday, January 16, 2013

R and explicit loops

How much difference does it make whether you use explicit or implicit loops in R? By an "implicit loop" I mean a vectorized call: you hand the whole task to a single function and the looping happens inside it, in compiled code, rather than in your own R code. Here is a code snippet illustrating three ways to draw 200,000 normal random variables. The first two are explicit loops within functions; the third is an implicit loop, simply supplying the count as an argument to rnorm. The second function preallocates the vector at its full length.
 randnorm <- function(n) {
   y <- vector()                 # empty vector; grows on each assignment
   for (i in 1:n) y[i] <- rnorm(1)
   return(y)
 }
 randnorm2 <- function(n) {
   y <- vector(length = n, mode = 'numeric')  # preallocated at full length
   for (i in 1:n) y[i] <- rnorm(1)
   return(y)
 }
 n <- 2e05
 system.time(y <- randnorm(n))   # explicit loop, growing vector
 system.time(y <- randnorm2(n))  # explicit loop, preallocated vector
 system.time(y <- rnorm(n))      # implicit (vectorized) loop
Here are the results:
 > n <- 2e05  
 > system.time(y <- randnorm(n))  
   user system elapsed   
  28.170  0.352 28.618   
 > system.time(y <- randnorm2(n))  
   user system elapsed   
  0.856  0.000  0.858   
 > system.time(y <- rnorm(n))  
   user system elapsed   
  0.032  0.000  0.031      
What do we conclude?

  • The implicit loop (rnorm(n)) is much faster than the alternatives.
  • If you do have to fill a vector in an explicit loop, create the vector in advance at its full length, as in randnorm2, specifying the length and mode.
  • The unsized vector in randnorm makes for a very slow function. Each assignment past the end of the vector can force R to allocate new storage and copy the old contents, so in tests the running time grows much faster than linearly (roughly quadratically) with the size of the vector. A function written like randnorm should be a last resort.

The moral is, if possible, to avoid constructs like that in randnorm. When a fully vectorized function isn't available, there is an idiomatic middle ground, sketched below.
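The apply-family functions preallocate the result for you while still calling your code once per element. Here is a minimal sketch using base R's vapply; the anonymous function is just a stand-in for whatever per-element computation you actually need:

 y <- vapply(1:n, function(i) rnorm(1), numeric(1))  # result preallocated as numeric(n)

Expect speed comparable to randnorm2, not to rnorm(n): the n separate calls into R code remain, but the pathological copy-on-grow behavior is gone.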

R and the year 2038 problem

I am teaching an R workshop, which has met twice. There are things I either don't get to in class, or fumble when discussing them. I'm going to occasionally discuss a few of these issues here for the benefit of students, for me (this will serve as a reference in the future), and for anyone else who may be interested.

Dates in base R can be represented using the Date class, POSIXct, or POSIXlt. We can examine date handling by considering the year 2038 problem, which arises as follows: if you use a signed 32-bit integer to represent the number of seconds since January 1, 1970 (as Unix systems traditionally have), you exhaust the capacity of that integer on January 19, 2038. Verifying this in R first requires knowing what a "signed 32-bit integer" means.

Background: Suppose you want to represent an integer as a "signed 3-bit integer". You can do this by writing integers in binary, with the leftmost bit telling you whether the integer is positive or negative. So decimal 0 is 000, decimal 1 is 001, decimal 2 is 010, and decimal 3 is 011. What happens when we go to the next value? Binary 100 has a leftmost bit of 1, so it represents a negative number (-4, in the usual two's-complement scheme). Thus, with a 3-bit signed integer, the largest positive value we can represent is 3 = 2^2 - 1. In general, for an n-bit representation, the largest value is 2^(n-1) - 1. With a 32-bit signed integer, the largest (positive) value you can represent is 2^31 - 1.
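R's own native integer type happens to be a signed 32-bit integer, so we can check this arithmetic from the console (.Machine$integer.max is base R's record of the largest representable native integer):

 > .Machine$integer.max
 [1] 2147483647
 > .Machine$integer.max == 2^31 - 1
 [1] TRUE
 > .Machine$integer.max + 1L  # integer overflow: R returns NA, with a warning
 [1] NA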

R's POSIXlt class provides a general-purpose representation for displaying and manipulating dates, and as.POSIXlt converts other representations (including raw second counts) to it. The Unix epoch (0 seconds) occurred on January 1, 1970. When will the 32-bit time integer be exhausted? We can use as.POSIXlt to find out:

 > as.POSIXlt(2^31 - 1, origin = '1970-1-1', tz="GMT")  
 [1] "2038-01-19 03:14:07 GMT"  

Thus, unpatched 32-bit Unix systems will come to a screeching and erratic halt on January 19, 2038, a few seconds after 3:14 a.m. GMT. Note that as.POSIXlt interprets the bare number you supply as seconds past the origin, which is exactly the convention we need here.
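To see the seconds convention at work, convert 0 seconds, which should give back the origin itself (R's default formatting omits a time of exactly midnight):

 > as.POSIXlt(0, origin = '1970-1-1', tz = "GMT")
 [1] "1970-01-01 GMT"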

The POSIXlt calculations in R continue to work for much larger numbers. You may wish to amuse yourself by discovering the largest value for which as.POSIXlt will still provide a valid date. (Hint: it's between 2^45 and 2^50, and it is millions of years in the future.)
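If you want a head start, here is one way to bracket the answer. This is a sketch: exactly where the conversion stops producing a sensible date, and whether it fails with a warning or an error, may vary by platform, so failures are caught and reported as NA:

 probe <- function(p) tryCatch(
   format(as.POSIXlt(2^p, origin = '1970-1-1', tz = "GMT")),
   error = function(e) NA_character_,
   warning = function(w) NA_character_)
 sapply(45:50, probe)  # one formatted date (or NA) per power of 2

From there, narrow the bracket by bisection.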