After all my posts on efficiently computing the Hamming distance in R, two questions remain for me. First, can we compute the Hamming distance even faster when implementing it in a low-level programming language such as C/C++, and then use the great Rcpp package to directly call this low-level C/C++ function from R? Second, how much would parallelization help? I will answer the second question in a separate post, and address the first question here.

I implemented two C/C++ functions for computing the Hamming distance between all pairs of columns of a binary matrix X:

- A straightforward approach that I called “hamming__C_simple”: this function simply iterates through all combinations of columns of a matrix X to compare all elements between each pair of columns.
- A more involved approach that I called “hamming__C_bitwise”: This function exploits the ability one has in C to directly manipulate individual bits in memory, which should typically be very fast. The function first interprets the R binary vectors as sequences of unsigned integers, to then perform bitwise operations on these integers.

Note that focusing on binary matrices does not limit the applicability of the C/C++ functions, because multi-value Hamming distances can be easily computed by adding multiple binary Hamming distances.

The code for both implementations is shown below.

#include "Rcpp.h" #include "math.h" #include "stdlib.h" #include "stdio.h" #include "math.h" #include "algorithm" #include "climits" using namespace Rcpp; unsigned long long binvec2int(IntegerMatrix x) { unsigned long long u = 0; int n = x.nrow(); for (int i = 0; i < n; i++) { u += x(i, 0) * (unsigned long long) pow(2, n - 1 - i); } return u; } // [[Rcpp::export]] NumericMatrix hamming__C_bitwise(IntegerMatrix X) { int i, j, k, from, to, h; int stepsize = sizeof(unsigned long long) * CHAR_BIT; int n = ceil(((double) X.nrow()) / stepsize); IntegerMatrix x; NumericMatrix H(X.ncol(), X.ncol()); // convert X to unsigned long long unsigned long long *xx = (unsigned long long *) malloc(n * X.ncol() * sizeof(unsigned long long)); i = 0; while ((from = i * stepsize) < X.nrow()) { for (j = 0; j < X.ncol(); j++) { to = std::min(X.nrow(), (i + 1) * stepsize); x = X(Range(from, to - 1), Range(j, j)); xx[i + j * n] = binvec2int(x); } i++; } for (i = 0; i < X.ncol(); i++) { for (j = i + 1; j < X.ncol(); j++) { for (k = 0; k < n; k++) { h = __builtin_popcountll(xx[k + i * n] ^ xx[k + j * n]); H(i, j) += h; H(j, i) += h; } } } free(xx); return H; } // [[Rcpp::export]] NumericMatrix hamming__C_simple(IntegerMatrix X) { int i, j, k, h; NumericMatrix H(X.ncol(), X.ncol()); for (i = 0; i < X.ncol(); i++) { for (j = i + 1; j < X.ncol(); j++) { for (k = 0; k < n; k++) { h = (X(k, i) != X(k, j)); H(i, j) += h; H(j, i) += h; } } } return H; }

I compared the performance of the two C/C++ functions to a very fast pure R function based on matrix multiplication that I wrote about before:

Interestingly, we can see that the straightforward C++ implementation (C_simple) is *not faster* than the pure R function (pure_R). In other words, efficient use of pure R can easily be faster than fairly regular use of C++ within R. We can see that if we actually want to be faster than the pure R function, one way of doing that is exploiting C’s bitwise operations (C_bitwise).