Empty rows in sparse arrays
2015-12-06Here is an indexing trick I learned recently. It’s not clever, but I’m putting it here partly because I learned a little bit about sparse arrays, and partly so I don’t have to search for the solution again.
Sparse arrays in the Python machine learning ecosystem
By default, scikit-learn’s feature extraction functions return data in a sparse array. This is actually good. If what you plan to do with the array is sparse-friendly — which is true of scikit-learn’s own classifiers and lda — then you save a ton of memory by keeping things sparse, and many common machine learning operations are faster when data is represented this way.
But some libraries such as the neural network library keras
don’t accept sparse arrays. This introduces another processing step (usually
the .toarray()
method) to densify the array.
Perhaps more seriously, you lose some of the numpy feature set when working with sparse arrays. It was this problem that I ran into when I wanted to find the empty rows of a Compressed Sparse Row (CSR) array.
Finding the non-empty rows of a sparse array
Finding the non-empty rows of a regular dense array is a one-liner:
>>> import numpy as np
>>> Y = np.array([[1, 2, 3],
... [0, 0, 0], # empty row
... [4, 0, 5],
... [0, 6, 7],
... [0, 0, 0]]) # empty row
>>> Y.any(axis=1)
array([ True, False, True, True, False], dtype=bool)
But a CSR array doesn’t have the any
method:
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix(Y)
>>> X
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
>>> hasattr(X, 'any')
False
and np.any
is apparently just the identity function when applied to a sparse
array:
>>> np.any(X)
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
>>> (np.any(X) != X).toarray()
array([[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
So the dense any()
approach doesn’t work.
The solution
Here’s the solution:
>>> np.diff(X.indptr) != 0
array([ True, False, True, True, False], dtype=bool)
You can stop reading here (or skip to the final section) if you don’t care why this works!
The reason this works
To understand why this works, you have to understand a little bit about the Compressed Sparse Row array representation used by scipy.
The goal of a sparse representation it to consume less memory than the nrows * ncolumns * 4
bytes (for integer arrays) required by the dense representation.
To do this, the array stores the values of only its non-zero elements. If the
array is sparse, i.e. mostly zeros, then this is tremendous memory saving.
In scipy’s CSR approach, the non-zero values are kept in the array’s data
attribute:
>>> X.toarray()
array([[1, 2, 3],
[0, 0, 0],
[4, 0, 5],
[0, 6, 7],
[0, 0, 0]])
>>> X.data
array([1, 2, 3, 4, 5, 6, 7])
Alongside data
, the array must hold attribute(s) that determine where in the
array the values are. Perhaps the most obvious way to do this would be to have
arrays that give the coordinates of each non-zero value. In fact that is the
approach taken by
COO.
But it turns out that, for normal machine learning problems, where rows
represent samples and columns represent features, and the likely operations on
the data are somewhat constrained, CSR is usually more efficient, if a little
harder to grok.
In CSR the array holds indptr
and indices
attributes. The indptr
attribute is an array of length nrows + 1
. Each successive pair in indptr
tells scipy where to look in the indices
array to find the columns it should
populate in that row, and where to look in data
to find the values it should
put in those columns.
>>> X.toarray()
array([[1, 2, 3],
[0, 0, 0],
[4, 0, 5],
[0, 6, 7],
[0, 0, 0]])
>>> X.data
array([1, 2, 3, 4, 5, 6, 7])
>>> X.indptr
array([0, 3, 3, 5, 7, 7], dtype=int32)
>>> X.indices
array([0, 1, 2, 0, 2, 1, 2], dtype=int32)
So, to populate the 0th row of X
, you look at X.indptr[0]
and
X.indptr[1]
. In the example above, these are 0
and 3
. You then take the
slice 0:3
from X.indices
to get the columns in the 0th row that should be
populated. In this case X.indices[0:3] == [0, 1, 2]
, i.e. every element of
the 0th row has a value. Finally you put the contents of the slice
X.data[0:3]
in those positions.
To populate the 1st row of X
, you look at X.indptr[1]
and X.indptr[2]
,
which are 3
and 3
. The slice [3:3]
is empty, so this row is empty too.
To populate the 2nd row of X
, you look at X.indptr[2]
and X.indptr[3]
,
which are 3
and 5
. indices[3:5] == [0, 2]
, so those are the two columns
in the 2nd row we populate with data[3:5]
More generally and precisely, quoting from the scipy documentation:
the column indices for row
i
are stored inindices[indptr[i]:indptr[i+1]]
and their corresponding values are stored indata[indptr[i]:indptr[i+1]]
All this means that, if successive elements of indptr
are identical, then
corresponding row is empty.
The last thing piece of the puzzle is the utility function np.diff()
which
takes an array of length n
and returns an array of length n-1
whose values
are the differences between successive elements of the argument.
All of which is to say, if the ith element of this thing is 0
np.diff(X.indptr)
then the ith row of X
is all zeros.
Incidentally, this means that finding empty rows in a big CSR array is much
faster than in a dense array, since it’s O(nrows)
rather than O(nrows * ncolumns)
.
Indexing regular Python sequences with numpy boolean arrays
One last thing. The output of this (and many other numpy operations) is a boolean numpy array. This can be used to index a numpy array:
>>> X.toarray()
array([[1, 2, 3],
[0, 0, 0],
[4, 0, 5],
[0, 6, 7],
[0, 0, 0]])
>>> X[np.diff(X.indptr) != 0].toarray()
array([[1, 2, 3],
[4, 0, 5],
[0, 6, 7]], dtype=int64)
But it can’t be used to index an arbitrary Python sequence:
>>> lst = list('abcde')
>>> lst[np.diff(X.indptr) != 0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer arrays with one element can be converted to an index
The solution is in the ever-useful itertools
module. compress
takes two
sequences, seq
and mask
. It returns an iterator that yields the values of
seq
when the corresponding value of mask
is truthy. Hence,
>>> list(compress(lst, np.diff(X.indptr) != 0))
['a', 'c', 'd']