Empty rows in sparse arrays2015-12-06
Here is an indexing trick I learned recently. It’s not clever, but I’m putting it here partly because I learned a little bit about sparse arrays, and partly so I don’t have to search for the solution again.
Sparse arrays in the Python machine learning ecosystem
By default, scikit-learn’s feature extraction functions return data in a sparse array. This is actually good. If what you plan to do with the array is sparse-friendly — which is true of scikit-learn’s own classifiers and lda — then you save a ton of memory by keeping things sparse, and many common machine learning operations are faster when data is represented this way.
But some libraries such as the neural network library keras
don’t accept sparse arrays. This introduces another processing step (usually
.toarray() method) to densify the array.
Perhaps more seriously, you lose some of the numpy feature set when working with sparse arrays. It was this problem that I ran into when I wanted to find the empty rows of a Compressed Sparse Row (CSR) array.
Finding the non-empty rows of a sparse array
Finding the non-empty rows of a regular dense array is a one-liner:
>>> import numpy as np >>> Y = np.array([[1, 2, 3], ... [0, 0, 0], # empty row ... [4, 0, 5], ... [0, 6, 7], ... [0, 0, 0]]) # empty row >>> Y.any(axis=1) array([ True, False, True, True, False], dtype=bool)
But a CSR array doesn’t have the
>>> from scipy.sparse import csr_matrix >>> X = csr_matrix(Y) >>> X <5x3 sparse matrix of type '<class 'numpy.int64'>' with 7 stored elements in Compressed Sparse Row format> >>> hasattr(X, 'any') False
np.any is apparently just the identity function when applied to a sparse
>>> np.any(X) <5x3 sparse matrix of type '<class 'numpy.int64'>' with 7 stored elements in Compressed Sparse Row format> >>> (np.any(X) != X).toarray() array([[False, False, False], [False, False, False], [False, False, False], [False, False, False], [False, False, False]], dtype=bool)
So the dense
any() approach doesn’t work.
Here’s the solution:
>>> np.diff(X.indptr) != 0 array([ True, False, True, True, False], dtype=bool)
You can stop reading here (or skip to the final section) if you don’t care why this works!
The reason this works
To understand why this works, you have to understand a little bit about the Compressed Sparse Row array representation used by scipy.
The goal of a sparse representation it to consume less memory than the
ncolumns * 4 bytes (for integer arrays) required by the dense representation.
To do this, the array stores the values of only its non-zero elements. If the
array is sparse, i.e. mostly zeros, then this is tremendous memory saving.
In scipy’s CSR approach, the non-zero values are kept in the array’s
>>> X.toarray() array([[1, 2, 3], [0, 0, 0], [4, 0, 5], [0, 6, 7], [0, 0, 0]]) >>> X.data array([1, 2, 3, 4, 5, 6, 7])
data, the array must hold attribute(s) that determine where in the
array the values are. Perhaps the most obvious way to do this would be to have
arrays that give the coordinates of each non-zero value. In fact that is the
approach taken by
But it turns out that, for normal machine learning problems, where rows
represent samples and columns represent features, and the likely operations on
the data are somewhat constrained, CSR is usually more efficient, if a little
harder to grok.
In CSR the array holds
indices attributes. The
attribute is an array of length
nrows + 1. Each successive pair in
tells scipy where to look in the
indices array to find the columns it should
populate in that row, and where to look in
data to find the values it should
put in those columns.
>>> X.toarray() array([[1, 2, 3], [0, 0, 0], [4, 0, 5], [0, 6, 7], [0, 0, 0]]) >>> X.data array([1, 2, 3, 4, 5, 6, 7]) >>> X.indptr array([0, 3, 3, 5, 7, 7], dtype=int32) >>> X.indices array([0, 1, 2, 0, 2, 1, 2], dtype=int32)
So, to populate the 0th row of
X, you look at
X.indptr. In the example above, these are
3. You then take the
X.indices to get the columns in the 0th row that should be
populated. In this case
X.indices[0:3] == [0, 1, 2], i.e. every element of
the 0th row has a value. Finally you put the contents of the slice
X.data[0:3] in those positions.
To populate the 1st row of
X, you look at
3. The slice
[3:3] is empty, so this row is empty too.
To populate the 2nd row of
X, you look at
indices[3:5] == [0, 2], so those are the two columns
in the 2nd row we populate with
More generally and precisely, quoting from the scipy documentation:
the column indices for row
iare stored in
indices[indptr[i]:indptr[i+1]]and their corresponding values are stored in
All this means that, if successive elements of
indptr are identical, then
corresponding row is empty.
The last thing piece of the puzzle is the utility function
takes an array of length
n and returns an array of length
n-1 whose values
are the differences between successive elements of the argument.
All of which is to say, if the ith element of this thing is
then the ith row of
X is all zeros.
Incidentally, this means that finding empty rows in a big CSR array is much
faster than in a dense array, since it’s
O(nrows) rather than
Indexing regular Python sequences with numpy boolean arrays
One last thing. The output of this (and many other numpy operations) is a boolean numpy array. This can be used to index a numpy array:
>>> X.toarray() array([[1, 2, 3], [0, 0, 0], [4, 0, 5], [0, 6, 7], [0, 0, 0]]) >>> X[np.diff(X.indptr) != 0].toarray() array([[1, 2, 3], [4, 0, 5], [0, 6, 7]], dtype=int64)
But it can’t be used to index an arbitrary Python sequence:
>>> lst = list('abcde') >>> lst[np.diff(X.indptr) != 0] Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: only integer arrays with one element can be converted to an index
The solution is in the ever-useful
compress takes two
mask. It returns an iterator that yields the values of
seq when the corresponding value of
mask is truthy. Hence,
>>> list(compress(lst, np.diff(X.indptr) != 0)) ['a', 'c', 'd']