Using R from C/C++

The freely available R statistical data analysis software and programming language offers a wealth of statistical methods and algorithms. This example shows how to incorporate the R interpreter into an existing C/C++ program and transfer data back and forth to do statistical calculations in R while most of your other code is in C or C++. The simplest C/C++ frontend program for R (r_frontend.c) looks like this:

#include <stdlib.h>

#include <Rembedded.h>
#include <Rinternals.h>
#include <R_ext/Parse.h>

int main()
{
  char *argv[] = {"r_frontend", "--quiet"};
  int argc = 2;
  
  const char *str = "print(pi)";
  
  SEXP code, expr = R_NilValue;

  ParseStatus status;

  
  /* This can only be called once */
  Rf_initEmbeddedR(argc, argv);

  
  PROTECT(code = Rf_allocVector(STRSXP, 1));
  SET_STRING_ELT(code, 0, Rf_mkChar(str));

  expr = PROTECT(R_ParseVector(code, -1, &status, R_NilValue));

  R_tryEval(VECTOR_ELT(expr, 0), R_GlobalEnv, NULL);

  UNPROTECT(2);
  

  Rf_endEmbeddedR(0);
  
  return 0;
}

It creates the R interpreter and asks it to print the value of its built-in variable pi. To compile the code, you must have installed the R shared libraries and headers necessary for development. To locate the headers in your file system, use the following command:

R CMD config --cppflags

It should give you an output like the following:

-I/usr/share/R/include

On my system, I use the GNU C compiler to create the executable and link it against the R library:

gcc `R CMD config --cppflags` -o r_frontend r_frontend.c `R CMD config --ldflags`

The frontend program can then be executed like this:

R CMD ./r_frontend
[1] 3.141593

Note that the call of the executable must be wrapped inside a call of R in order to have R’s environment variables set correctly.

Lines 3 through 5 of the above code include the headers that define the R symbols and functions used. Lines 9 and 10 override the normal program arguments to supply the --quiet flag to the R interpreter. This suppresses the startup message, including the version information, that is printed every time the R interpreter is started in an interactive session. Line 12 defines the R command to be executed as a C string.

Any object the R interpreter handles is represented internally as a SEXP (S expression, I’d figure), a pointer to a C data structure that holds the data of the corresponding R object. Line 14 defines two SEXPs, one to store the code to be executed as a plain R character object, and the other to receive an intermediate parsed version of the code that the R interpreter understands. The latter SEXP is initialized as R_NilValue, which is the internal representation of R’s NULL value. ParseStatus is an enumeration type defined in R_ext/Parse.h; a variable of this type receives the outcome of the parsing step.

The interpreter itself is initialized by the function Rf_initEmbeddedR. The function sets up a workspace and temporary directory and also processes all command line arguments known to R. It can only be called once: there can only be one R interpreter in any program.

To use the interpreter, we must express a command as an R character vector. In line 23, memory is allocated for an R vector of size one and type character (which, internally, is a SEXP whose SEXPTYPE is STRSXP). Any memory allocated inside the current R workspace (e.g. by a call to Rf_allocVector) must be protected from garbage collection using the PROTECT macro. The R interpreter has a garbage collector that tracks which objects can still be reached from within R, for example through a symbol the object is bound to. An object that can no longer be reached is deleted from memory. If an object is created outside the R interpreter in a different part of a program, the interpreter has no way of knowing that there is an external reference to the object and schedules it for deletion if it is not protected on the outside. Any protections must be released using UNPROTECT before the external code returns or exits. Protections are organized in a stack-like manner: line 30 in the example above releases the top two protections on the stack. The number of PROTECT and UNPROTECT calls in a code block must be balanced.

In line 24, SET_STRING_ELT sets the single element of the newly allocated character vector to the equivalent of the C string str. Next, the R character vector is parsed into an R vector of expressions (a SEXP of type EXPRSXP) in line 26. The function R_tryEval expects a SEXP that is an R expression and tries to evaluate it in the context of a given environment (most likely the workspace or global environment, R_GlobalEnv). In line 28 of the example, the function is passed the first element of the expression vector expr, which is VECTOR_ELT(expr, 0). The result is the printed value of the built-in variable pi. Note that, because the expression is evaluated directly and not at R’s interactive prompt, you have to generate the output explicitly using R’s print function. In an interactive R session, the value of every command that is not an assignment is printed to the terminal automatically. Just before the end of the program, a call to Rf_endEmbeddedR shuts down the R interpreter, cleaning up and removing all temporary files that may have been created.

A more sophisticated example

Note that R_ParseVector in the above example produces a vector of R expressions, and contrary to the assumption made above, it may have more than just one element. This happens if the user supplies a character string with commands concatenated by semicolons, e.g.:

print(pi); v <- c(1,2,3); print(v); print(summary(v))

This one line actually contains four separate commands or expressions, and so the vector of R expressions passed to R_tryEval may have more than one element. Slightly more work is necessary to provide an interface that supports this behavior. Consider the function exec_r in the following code, which also provides essential error handling:

#include <stdlib.h>
#include <stdio.h>

#include <Rembedded.h>
#include <Rinternals.h>
#include <R_ext/Parse.h>


int exec_r(const char* str)
{
  int error;
  SEXP code, expr = R_NilValue;
  ParseStatus status;

  PROTECT(code = Rf_allocVector(STRSXP, 1));
  SET_STRING_ELT(code, 0, Rf_mkChar(str));

  expr = PROTECT(R_ParseVector(code, -1, &status, R_NilValue));

  switch (status) {
  case PARSE_OK:
    break;
  case PARSE_INCOMPLETE:
    fprintf(stderr, "Expression incomplete! Missing brackets?\n");
    UNPROTECT(2);
    return -1;
  case PARSE_ERROR:
    fprintf(stderr, "Syntax error!\n");
    UNPROTECT(2);
    return -1;
  default:
    UNPROTECT(2);
    return -1;
  }
  
  /* Evaluate the expressions in order and stop at the first error */
  int i;
  error = 0;
  for (i=0; i<LENGTH(expr) && !error; i++) {
    R_tryEval(VECTOR_ELT(expr, i), R_GlobalEnv, &error);
  }

  UNPROTECT(2);

  return error;
}



int main()
{
  char *argv[] = {"exec_r", "--quiet"};
  int argc = 2;
  
  Rf_initEmbeddedR(argc, argv);

  exec_r("print(pi); v <- c(1,2,3); print(v); print(summary(v))");
  
  Rf_endEmbeddedR(0);
  
  return 0;
}

Working with variables

In the example above, a variable v is defined in the workspace using interpreted R code. However, a much faster way to shove large amounts of data into a workspace is to use compiled code, i.e. C functions provided by the R library. Establishing a workspace variable through the larger back door requires two steps:

  1. Declare a symbol in the workspace to represent the variable.
  2. Define the variable by binding its data to the symbol.

The following procedure defines an R object in the current workspace, or global environment:

void r_object(SEXP data, const char* name)
{
  SEXP symb = Rf_install(name);
  Rf_defineVar(symb, data, R_GlobalEnv);
}

The function Rf_install installs a symbol with the given name in the symbol table of the R interpreter and provides a pointer to it (more precisely, a SEXP to a SEXPREC with SEXPTYPE SYMSXP). In the next step, the variable is defined by binding some data, given as a pointer to an R object, to the symbol. The smallest possible program using this procedure would be:

int main(int argc, char *argv[])
{
  Rf_initEmbeddedR(argc, argv);

  r_object(R_NilValue, "my.var");
  
  exec_r("print(my.var)");
  
  Rf_endEmbeddedR(0);
  
  return 0;
}

with output:

NULL

Internally, NULL is represented by R_NilValue, a SEXP of type NILSXP. Note that R_NilValue points to a special R object; it is not the null pointer of C/C++. Of course it’s more interesting to handle actual common R objects, such as data frames. Suppose we have a map of strings and vectors in C++ that we want to turn into a data frame in R. The vectors form the data frame’s columns and the strings are the column names. Data frames in R are lists of column vectors. Thus, the function below returns a pointer to a SEXPREC of type VECSXP (meaning a list or generic vector):

SEXP data_frame(std::map<std::string, std::vector<double> > tab)
{
  // Allocate space for the columns (vector of vectors):
  SEXP obj;
  PROTECT( obj = Rf_allocVector(VECSXP, tab.size()) );

  // Character vector of column names:
  SEXP col_names;
  PROTECT( col_names = Rf_allocVector(STRSXP, tab.size()) );

  
  // Allocate and set columns and their names:
  
  int i, j = 0, n = 0;
  
  std::map<std::string, std::vector<double> >::iterator it;
  for (it=tab.begin(); it!=tab.end(); ++it) {

    if (it==tab.begin()) {
      n = it->second.size();
    } else {
      ; // remaining columns are assumed to have n elements (not checked)
    }
    
    SEXP col;
    PROTECT( col = Rf_allocVector(REALSXP, n) );

    for (i=0; i<n; i++) REAL(col)[i] = it->second[i];

    SET_VECTOR_ELT(obj, j, col);

    SET_STRING_ELT(col_names, j, Rf_mkChar(it->first.c_str()));

    j += 1;
  }

  
  // Allocate and set row names:
  SEXP row_names;
  PROTECT( row_names = Rf_allocVector(INTSXP, n) );

  for (i=0; i<n; i++) INTEGER(row_names)[i] = i+1;

  // Set R class, names and row names attributes:
  SEXP rclass;
  PROTECT( rclass = Rf_allocVector(STRSXP, 1) );
  SET_STRING_ELT(rclass, 0, Rf_mkChar("data.frame"));
  Rf_classgets(obj, rclass);

  Rf_setAttrib(obj, R_NamesSymbol, col_names);
  Rf_setAttrib(obj, R_RowNamesSymbol, row_names);

  // Clean up protections stack:
  UNPROTECT(4 + tab.size());
  
  return obj;
}

The SEXP’s internal structure together with the class, names and row names attributes make the returned object a valid data frame. In lines 4 to 9, memory is allocated for a generic vector of vectors (the data frame’s columns) and a vector of character strings (the column names). The for loop of lines 17 to 35 iterates through the map, allocates memory for the column vectors, copies the vector data and sets the column names. Note that the size of the column vectors is taken from the first column only (lines 19 to 23); the code should be improved to ensure columns of equal length. Although this requirement is checked when a data frame is created through interpreted R code, it is not enforced if you break in through the back door! If the column vectors in the map have different lengths, the copy loop reads past the end of the shorter vectors, which is undefined behavior in C++, and silently drops the extra elements of the longer ones. Lines 39 through 51 allocate and set the R class, names and row names attributes that must be present for a valid data frame object. We must also keep track of the number of protected memory allocations and balance the stack appropriately before returning the result. The example program below uses the functions defined above to create a data frame and print a summary of it:

int main(int argc, char *argv[])
{
  Rf_initEmbeddedR(argc, argv);

  
  // Create map of strings to vectors and set some (random) data:
  
  std::map<std::string, std::vector<double> > table;

  std::vector<double> u, v;

  std::srand(std::time(0));
  
  int i;
  for (i=1; i<=1000; i++) {
    u.push_back(i);
    v.push_back(std::rand()%1000 + 1);
  }

  table["variable.u"] = u;
  table["feature.v"]  = v;


  // Create data frame in the current workspace and print its summary
  
  SEXP data;
  PROTECT( data = data_frame(table) );

  r_object(data, "my.tab");
  
  exec_r("print(summary(my.tab))");

  
  // Clean up
  
  UNPROTECT(1);
  
  Rf_endEmbeddedR(0);
  
  return 0;
}

If you like, you can download the complete source file and try it yourself, e.g. using the GNU compiler collection on a UNIX-like system:

g++ `R CMD config --cppflags` -o r_object r_object.cc `R CMD config --ldflags`
R CMD ./r_object --quiet

   feature.v       variable.u    
 Min.   :  2.0   Min.   :   1.0  
 1st Qu.:263.0   1st Qu.: 250.8  
 Median :506.5   Median : 500.5  
 Mean   :510.8   Mean   : 500.5  
 3rd Qu.:760.5   3rd Qu.: 750.2  
 Max.   :999.0   Max.   :1000.0
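
As noted above, data_frame does not verify that all columns have the same length. A check along the following lines could be run on the map before it is handed to data_frame. This is a minimal sketch in plain C++ without any R calls; the names table_t and columns_equal_length are hypothetical and do not appear in the original source file.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical alias for the map type used by data_frame
typedef std::map<std::string, std::vector<double> > table_t;

// Returns true if every column has the same number of elements.
// An empty map is treated as consistent.
bool columns_equal_length(const table_t &tab)
{
  if (tab.empty()) return true;

  const std::size_t n = tab.begin()->second.size();

  for (table_t::const_iterator it = tab.begin(); it != tab.end(); ++it) {
    if (it->second.size() != n) return false;
  }
  return true;
}
```

If the check fails, the caller can report the problem on the C++ side before any memory is allocated in the R workspace.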

Using multiple processors

Calculations in R can be distributed among multiple processors or cores using subprocesses.