# Empty rows in sparse arrays

2015-12-06Here is an indexing trick I learned recently. It’s not clever, but I’m putting it here partly because I learned a little bit about sparse arrays, and partly so I don’t have to search for the solution again.

## Sparse arrays in the Python machine learning ecosystem

By default, scikit-learn’s feature extraction functions return data in a sparse array. This is actually good. If what you plan to do with the array is sparse-friendly — which is true of scikit-learn’s own classifiers and lda — then you save a ton of memory by keeping things sparse, and many common machine learning operations are faster when data is represented this way.

But some libraries such as the neural network library keras
don’t accept sparse arrays. This introduces another processing step (usually
the `.toarray()`

method) to densify the array.

Perhaps more seriously, you lose some of the numpy feature set when working with sparse arrays. It was this problem that I ran into when I wanted to find the empty rows of a Compressed Sparse Row (CSR) array.

## Finding the non-empty rows of a sparse array

Finding the non-empty rows of a regular dense array is a one-liner:

```
>>> import numpy as np
>>> Y = np.array([[1, 2, 3],
... [0, 0, 0], # empty row
... [4, 0, 5],
... [0, 6, 7],
... [0, 0, 0]]) # empty row
>>> Y.any(axis=1)
array([ True, False, True, True, False], dtype=bool)
```

But a CSR array doesn’t have the `any`

method:

```
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix(Y)
>>> X
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
>>> hasattr(X, 'any')
False
```

and `np.any`

is apparently just the identity function when applied to a sparse
array:

```
>>> np.any(X)
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
>>> (np.any(X) != X).toarray()
array([[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
```

So the dense `any()`

approach doesn’t work.

## The solution

Here’s the solution:

```
>>> np.diff(X.indptr) != 0
array([ True, False, True, True, False], dtype=bool)
```

You can stop reading here (or skip to the final section) if you don’t care why this works!

## The reason this works

To understand why this works, you have to understand a little bit about the Compressed Sparse Row array representation used by scipy.

The goal of a sparse representation it to consume less memory than the ```
nrows *
ncolumns * 4
```

bytes (for integer arrays) required by the dense representation.
To do this, the array stores the values of only its non-zero elements. If the
array is sparse, i.e. mostly zeros, then this is tremendous memory saving.

In scipy’s CSR approach, the non-zero values are kept in the array’s `data`

attribute:

```
>>> X.toarray()
array([[1, 2, 3],
[0, 0, 0],
[4, 0, 5],
[0, 6, 7],
[0, 0, 0]])
>>> X.data
array([1, 2, 3, 4, 5, 6, 7])
```

Alongside `data`

, the array must hold attribute(s) that determine where in the
array the values are. Perhaps the most obvious way to do this would be to have
arrays that give the coordinates of each non-zero value. In fact that is the
approach taken by
COO.
But it turns out that, for normal machine learning problems, where rows
represent samples and columns represent features, and the likely operations on
the data are somewhat constrained, CSR is usually more efficient, if a little
harder to grok.

In CSR the array holds `indptr`

and `indices`

attributes. The `indptr`

attribute is an array of length `nrows + 1`

. Each successive pair in `indptr`

tells scipy where to look in the `indices`

array to find the columns it should
populate in that row, and where to look in `data`

to find the values it should
put in those columns.

```
>>> X.toarray()
array([[1, 2, 3],
[0, 0, 0],
[4, 0, 5],
[0, 6, 7],
[0, 0, 0]])
>>> X.data
array([1, 2, 3, 4, 5, 6, 7])
>>> X.indptr
array([0, 3, 3, 5, 7, 7], dtype=int32)
>>> X.indices
array([0, 1, 2, 0, 2, 1, 2], dtype=int32)
```

So, to populate the 0th row of `X`

, you look at `X.indptr[0]`

and
`X.indptr[1]`

. In the example above, these are `0`

and `3`

. You then take the
slice `0:3`

from `X.indices`

to get the columns in the 0th row that should be
populated. In this case `X.indices[0:3] == [0, 1, 2]`

, i.e. every element of
the 0th row has a value. Finally you put the contents of the slice
`X.data[0:3]`

in those positions.

To populate the 1st row of `X`

, you look at `X.indptr[1]`

and `X.indptr[2]`

,
which are `3`

and `3`

. The slice `[3:3]`

is empty, so this row is empty too.

To populate the 2nd row of `X`

, you look at `X.indptr[2]`

and `X.indptr[3]`

,
which are `3`

and `5`

. `indices[3:5] == [0, 2]`

, so those are the two columns
in the 2nd row we populate with `data[3:5]`

More generally and precisely, quoting from the scipy documentation:

the column indices for row

`i`

are stored in`indices[indptr[i]:indptr[i+1]]`

and their corresponding values are stored in`data[indptr[i]:indptr[i+1]]`

All this means that, if successive elements of `indptr`

are identical, then
corresponding row is empty.

The last thing piece of the puzzle is the utility function `np.diff()`

which
takes an array of length `n`

and returns an array of length `n-1`

whose values
are the differences between successive elements of the argument.

All of which is to say, if the ith element of this thing is `0`

```
np.diff(X.indptr)
```

then the ith row of `X`

is all zeros.

Incidentally, this means that finding empty rows in a big CSR array is much
faster than in a dense array, since it’s `O(nrows)`

rather than ```
O(nrows *
ncolumns)
```

.

## Indexing regular Python sequences with numpy boolean arrays

One last thing. The output of this (and many other numpy operations) is a boolean numpy array. This can be used to index a numpy array:

```
>>> X.toarray()
array([[1, 2, 3],
[0, 0, 0],
[4, 0, 5],
[0, 6, 7],
[0, 0, 0]])
>>> X[np.diff(X.indptr) != 0].toarray()
array([[1, 2, 3],
[4, 0, 5],
[0, 6, 7]], dtype=int64)
```

But it can’t be used to index an arbitrary Python sequence:

```
>>> lst = list('abcde')
>>> lst[np.diff(X.indptr) != 0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer arrays with one element can be converted to an index
```

The solution is in the ever-useful `itertools`

module. `compress`

takes two
sequences, `seq`

and `mask`

. It returns an iterator that yields the values of
`seq`

when the corresponding value of `mask`

is truthy. Hence,

```
>>> list(compress(lst, np.diff(X.indptr) != 0))
['a', 'c', 'd']
```