Scrypt package back on CRAN

The scrypt package is back on CRAN and I have become the maintainer. The package allows password hashing and verification using Colin Percival’s scrypt scheme. The advantage of the scrypt hashing scheme over other cryptographic hash functions such as SHA is that calculating the hash takes much more time and memory, and a random salt is always used. This makes it much more expensive and time-consuming for attackers to recover passwords from hashes obtained through database hacks.
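
As a quick illustration, hashing and verifying a password looks roughly like this (a minimal sketch using the hashPassword() and verifyPassword() functions; see the package documentation for the exact arguments and tuning parameters):

library(scrypt)
hashed <- hashPassword("correct horse battery staple")
# the hash string embeds the random salt and the scrypt cost parameters,
# so only the hash needs to be stored
verifyPassword(hashed, "correct horse battery staple")  # TRUE
verifyPassword(hashed, "wrong guess")                   # FALSE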

Thanks to RStudio, and Andy Kipp in particular, for doing all the heavy lifting of creating and writing the package and for allowing me to take over maintainership of the CRAN package and the GitHub repo. Issues and patches welcome!

CSV benchmarking 1/n

In this series of posts we will look at a number of ways to store data using R in as little space as possible, and also consider the portability of the different solutions. As an example, the New York City flights data set for 2013 is used (available on CRAN as the nycflights13 package). The first rows and columns are shown below:

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

For portability across programming languages, CSVs, optionally compressed, are a great option: any platform for doing anything with data should be able to read CSV. Alternative formats such as fst, parquet and ORC have a number of advantages, such as smaller sizes, better fidelity and built-in integrity checking; these will be examined in a later post. For now, gzipped CSV is used as the reference. Written with the R data.table package, it takes two cores a bit less than one second to produce the file containing the 2013 flights. The file size is 6.6 MB.

library(dplyr)
library(data.table)
# We drop the column with timestamps for reasons explained below.
flights <- nycflights13::flights %>% select(-time_hour)
flightsDt <- as.data.table(flights)
system.time(data.table::fwrite(flightsDt, 'flights.csv.gz'))
    user  system elapsed 
   1.886   0.027   0.984 

If you want to read or write a large amount of data to CSV using R and you want to do it rather quickly, there are two good options at the moment: data.table and vroom.
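
For reading, both packages offer a one-call interface. A minimal sketch (note that fread() needs the R.utils package installed to read gzipped files directly):

# data.table: fread() guesses the delimiter and column types
flights_dt <- data.table::fread('flights.csv.gz')
# vroom: multi-threaded, lazy reading; gzip is handled natively
flights_vr <- vroom::vroom('flights.csv.gz')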

Both data.table and the current development version of vroom also support writing gzip-compressed CSVs. Roughly, zip-like compression algorithms work by creating a mapping from longer sequences of bits to shorter ones, in such a way that the mapping plus the remapped input takes up less space than the original. Compression algorithms use clever techniques to create and maintain this mapping, but for reasons of speed and memory use the mapping cannot grow without bounds. We can help the algorithm a bit by first transposing the data.
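
As a rough illustration of why grouping equal values helps, we can compare how well gzip (via R’s memCompress()) compresses the same two columns when their values are interleaved row by row versus grouped per column; on the flights data the grouped variant should come out smaller:

# a repetitive column (month) interleaved with a varied one (dep_time),
# as in a row-oriented CSV
rowwise <- paste(flightsDt$month, flightsDt$dep_time, sep = ",", collapse = "\n")
# the same values with each column's entries grouped together,
# as in a transposed CSV
colwise <- paste(paste(flightsDt$month, collapse = ","),
                 paste(flightsDt$dep_time, collapse = ","), sep = "\n")
length(memCompress(rowwise, type = "gzip"))
length(memCompress(colwise, type = "gzip"))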

Transpose for a free lunch

There is a lot of repetition in the first three columns year, month and day.

> flightsDt[,    
  lapply(.SD, function(.) length(unique(.))),    
  .SDcols = c('year', 'month', 'day') 
]
    year month day
 1:    1    12  31

If all these are put close together, it helps the compression algorithm a bit:

TimeAndSize <- function(FUN, fileName) {
  filePath <- file.path(tempdir(), fileName)
  on.exit(unlink(filePath))
  timing <- unclass(system.time(FUN(filePath)))
  fileSize <- file.info(filePath)$size
  c(timing, FileSizeInMB = round(fileSize / 1024L / 1024L, 1))
}
library(vroom)
# fwrite compresses automatically when the file name ends in .gz, so the
# plain-CSV writer doubles as the gzip writer
dt_csv_fun <- function(fileName) fwrite(flightsDt, fileName)
dt_gz_fun <- dt_csv_fun
vroom_csv_fun <- function(.) vroom_write(flights, .)
vroom_gz_fun <- function(.) vroom_write(flights, pipe(paste('gzip >', .)))
vroom_mgz_fun <- function(.) vroom_write(flights, pipe(paste('pigz >', .)))
vroom_zstd_fun <- function(.) vroom_write(flights, pipe(paste('zstd >', .)))
dt_csv_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
dt_gz_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
experiments <- list(
  dt_csv_fun = 'flights.csv', dt_gz_fun = 'flights.csv.gz',
  vroom_csv_fun = 'flights.csv', vroom_gz_fun = 'flights.csv.gz',  
  vroom_mgz_fun = 'flights.csv.gz', vroom_zstd_fun = 'flights.csv.zstd',
  dt_csv_fun_transposed = 'flights.tcsv', 
  dt_gz_fun_transposed = 'flights.tcsv.gz'
 )
PerformExperiments <- function(experiments) {
  t(sapply(names(experiments), function(.) {
    TimeAndSize(match.fun(.), experiments[[.]])
  }))
}
PerformExperiments(experiments)
                  user.self sys.self elapsed user.child sys.child FileSizeInMB
dt_csv_fun            0.403    0.012   0.215      0.000     0.000         22.8
dt_gz_fun             1.891    0.000   0.961      0.000     0.000          7.5
vroom_csv_fun         1.327    0.019   0.360      0.000     0.000         22.9
vroom_gz_fun          1.359    0.030   1.517      1.474     0.024          7.5
vroom_mgz_fun         1.341    0.046   0.943      1.560     0.039          7.5
vroom_zstd_fun        1.364    0.037   0.600      0.181     0.030          6.6
dt_csv_fun_transposed 2.318    0.047   1.964      0.000     0.000         22.8
dt_gz_fun_transposed  3.004    0.003   1.893      0.000     0.000          5.5

Comparing the results of data.table and vroom, we see that timings and file sizes are almost equal; both are equally good at writing CSV files. However, if the table is first transposed and then written, the total process takes a bit longer but the file size is reduced by around 25%!
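
Reading such a transposed file back takes a little more work: it has to be transposed again and the column names and types restored. A minimal sketch, assuming flightsDt is still around to supply the original names (fread() needs R.utils for gzipped input):

tdt <- fread('flights.tcsv.gz', header = TRUE)
flightsBack <- transpose(tdt)
setnames(flightsBack, names(flightsDt))
# transposing coerces everything to character, so convert the columns back;
# note that whole-number doubles such as dep_delay come back as integers
flightsBack <- flightsBack[, lapply(.SD, type.convert, as.is = TRUE)]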

Warts

The time_hour column needs to be dropped because data.table’s transpose() unpacks this column, so it is saved as more data, and transposing it back does not reproduce the original values. Transposing is also slow, even with data.table, and handling very wide tables can be problematic for some software packages.
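
To give an idea of just how wide the transposed table is:

dim(flightsDt)             # 336776 rows, 18 columns
dim(transpose(flightsDt))  # 18 rows, 336776 columns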

Conclusion

Compressed CSVs are a portable data format that requires significantly less space than their uncompressed counterparts. The size of compressed CSVs can be reduced further by transposing the data before compressing. Whether this extra compression is worth the effort depends on the use case.

FastR sudokus

A few days ago @JozefHajnala pointed me to Oracle’s FastR, an implementation of R on the Java Virtual Machine (GraalVM). This implementation aims to be fully compatible with the GNU R implementation we all know and love.

In September 2018 I gave a talk (GitHub, presentation) at Amsterdam SatRday on using Rcpp to solve sudokus. The point of the talk was to show the performance benefits that can be achieved using the Rcpp package. This post describes my adventure of solving sudokus using FastR and compares the performance of the Rcpp solver with that of the plain R solver run on FastR.

Installation of FastR

The easiest way to get started is to obtain the GraalVM community edition from the GitHub releases page. Download the graalvm-ce tar.gz relevant for your platform, untar it, and use the gu tool to install and start R:

tar xzvf graal-ce-*-19.0.0.tar.gz
cd graal-ce-19.0.0/bin
gu install R
./R

If all went well, you will now see a somewhat familiar sight:

R version 3.5.1 (FastR)
Copyright (c) 2013-19, Oracle and/or its affiliates
Copyright (c) 1995-2018, The R Core Team
Copyright (c) 2018 The R Foundation for Statistical Computing
Copyright (c) 2012-4 Purdue University
Copyright (c) 1997-2002, Makoto Matsumoto and Takuji Nishimura
All rights reserved.

FastR is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information.

Type 'q()' to quit R.

A quick test drive

Installing packages is as easy as with GNU R, so running a first test doesn’t take long:

install.packages('microbenchmark')
sqr <- function(x) x * x
microbenchmark::microbenchmark(sqr(runif(1000)))
Unit: microseconds
             expr min  lq mean median  uq max neval
 sqr(runif(1000)) 145 154  179    165 178 653   100

To be fair, this is not very promising. Running the same snippet in GNU R 3.4.4 gives much better performance:

sqr <- function(x) x * x
microbenchmark::microbenchmark(sqr(runif(1000)))
Unit: microseconds
             expr min lq mean median uq   max neval
 sqr(runif(1000))  24 25  152     25 25 11919   100

Hmm, it appears they should have called it SlowR instead. But before we pass final judgement, let’s tweak the benchmark a bit (sqr.R in this gist):

library(microbenchmark)                                                                                                                                                                      
sqr <- function(x) x * x
f1 <- function() for (n in 1:1000) sqr(runif(1:n))
print(R.version)
print(microbenchmark(f1(), control = list(warmup = 100L)))

Running this snippet in both versions of R motivated me to continue my research and actually made me excited to see what FastR has to offer:

> source('~/code/FastR/sqr.R')
...
language       R
version.string R version 3.4.4 (2018-03-15)
Unit: milliseconds
 expr     min       lq     mean  median       uq      max neval
 f1() 16.3209 17.15926 18.16575 17.4746 17.73782 51.27648   100

> source('~/code/FastR/sqr.R')
...
language       R
engine         FastR
version.string FastR version 3.5.1 (2018-07-02)
Unit: milliseconds
 expr      min       lq     mean   median      uq      max neval
 f1() 6.053445 6.374638 8.226994 6.765636 6.99983 30.11883   100

Solving sudokus

The sudoku below has been claimed to be the world’s hardest sudoku:

sudokuTxt <- "
8 0 0 0 0 0 0 0 0
0 0 3 6 0 0 0 0 0
0 7 0 0 9 0 2 0 0
0 5 0 0 0 7 0 0 0
0 0 0 0 4 5 7 0 0
0 0 0 1 0 0 0 3 0
0 0 1 0 0 0 0 6 8
0 0 8 5 0 0 0 1 0
0 9 0 0 0 0 4 0 0"
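
As an aside, one way to turn this text into the 9x9 integer matrix the solvers operate on could be the following (the code in the gist may do this differently):

sudoku <- matrix(scan(text = sudokuTxt, quiet = TRUE), nrow = 9, byrow = TRUE)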

On my machine, the R/Rcpp solver finds a solution within a second. The source can be found in the gist as sudoku.cpp and sudoku.R. Microbenchmarking this solution in GNU R gives

R> source('sudoku.R')
R> microbenchmark(cpp = solve2(sudoku, findChoicesCpp))
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
  cpp 721.5412 735.6565 756.2961 748.2816 773.9773 835.9321   100

while the plain R solver run on the FastR engine is about two and a half times as fast:

$> ./R --vm.Xss5000k
R> source('plain_R.R')
R> microbenchmark::microbenchmark(
solve2(sudoku, findChoices2), control = list(warmup = 20L)
)
Unit: milliseconds
                         expr min  lq mean median  uq  max neval
 solve2(sudoku, findChoices2) 257 260  293    266 271 1217   100

Caveats

Note the command line parameter in the example above that increases the stack size. If the stack size is not increased like this, the function crashes.

Another important thing to note is that FastR will use more cores and will need more memory to run than GNU R.

Conclusion

Oracle FastR is an interesting project that can improve the runtime of R calculations by quite a bit, and I will be watching its development closely. Resource use can be a problem though: after running the benchmark, the FastR process uses almost 4 GB of RAM against less than 500 MB for the process running the Rcpp code.

Longstaff-Schwartz Method: First table

The Longstaff-Schwartz method, described in “Valuing American Options by Simulation: A Simple Least-Squares Approach” by Francis A. Longstaff and Eduardo S. Schwartz, allows valuation of non-European options to a reasonable degree of accuracy using simulations and a clever way to choose between early exercise of the option and continuation. I’ve implemented it in Python and put the code on GitHub. After installing the requirements from requirements.txt it should be easy to run the code below in a Jupyter notebook and get results quite close to those presented in the original paper.