class: center, middle, full-bleed
## Probabilistic programming products

#### Mike Lee Williams • [@mikepqr](http://twitter.com/mikepqr) • [mike.place/talks/ppp](http://mike.place/talks/ppp/)

#### Fast Forward Labs • [@fastforwardlabs](https://twitter.com/fastforwardlabs) • [fastforwardlabs.com](http://www.fastforwardlabs.com)

???

---
class: full-bleed, center, middle

![](img/ff05.gif)

---

### Bayesian inference is great in theory...

- Quantify risk
- Incorporate institutional knowledge
- Interpretability
- Online learning

### but it's slow if you're not careful.

### Being careful requires cleverness...

- Metropolis Hastings
- Hamiltonian Monte Carlo with automatic differentiation and NUTS
- ADVI

### the cleverness is now ready to be abstracted away 😎

---
class: center, middle, full-bleed

![](img/ab.png)

???
Let's start with a story. I conduct this A/B test and get these results. Clearly layout B is better.

If I had a million visitors to both layouts then yes, very probably. But what if I had 25 visitors to layout A and 20 visitors to layout B? Are you confident that these results would generalize to the real world?

And what if it costs you $100k to deploy the change? Do you want to risk it?

---
class: center, middle, full-bleed

![](img/baddata.png)

???
This is the central problem of data analysis. I don't care how big your data is. It's still bad: measurement error, incompleteness. Given that's a fact of life, we need a way to quantify uncertainty.

And what if layout A has been running for weeks already before the test? We need a way to incorporate that knowledge into our analysis.

---
class: center, middle, full-bleed

![](img/bayesrule.png)

???
This is Bayes' rule, the only equation I'm going to show today. It's the central idea of data analysis, and it's extremely simple.

The thing on the left is the probability a hypothesis is true, given our data. That hypothesis might be "the conversion fraction for layout A is 3.57%". This is the posterior distribution, and it's almost always the thing we want to know. We can use it to answer questions like "is layout A better than layout B?"

How do we get it? From the things on the right. Far right is the likelihood. That is a thing you can calculate. It says: what if we _assumed_ the hypothesis were true? How likely is it we would have got the observations we in fact got?

And the other thing on the right is the _prior_. That's how likely we thought the hypothesis was prior to our collection of data.

---
class: center, middle, full-bleed

![](img/layouta.png)

???
How do we use this rule? Let's start with layout A. Let's say layout A had been running for a few days, and we suspected its conversion fraction was around 3.5%. That's the prior on the left. We're not ruling out 1% or 5%, but 3.5% is our best guess. Note here: we're incorporating institutional knowledge.

Now we use Bayes' rule to update the prior in the light of our data, to get the posterior. The right-hand plot is the probability of each possible conversion fraction for layout A given the data, which was 4 conversions out of 100 visitors.

---
class: center, middle, full-bleed

![](img/layoutb.png)

???
Maybe layout B is a new layout. We had no idea what its conversion fraction would be, so we use what's called an uninformative prior. We assume every possible value between 0 and 1 is equally likely. You can still use this approach when you have no institutional knowledge!

Now we use Bayes' rule to update the prior in the light of our data, to get the posterior. The right-hand plot is the probability of each possible conversion fraction for layout B given the data, which was 2 conversions out of 50 visitors.
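The next slide compares these two posteriors. Because both priors here are Beta distributions and the observations are yes/no conversions, these updates have a closed form, so a few lines of Python reproduce the whole analysis. A minimal sketch (the strength of layout A's prior, Beta(7, 193) with mean 3.5%, is my own illustrative choice, not a number from the talk):

```python
from scipy import stats

# Layout A: informative prior centred near 3.5%, then 4 conversions in 100 visitors.
# Beta prior + binomial data -> Beta posterior (conjugacy).
post_a = stats.beta(7 + 4, 193 + 96)

# Layout B: uniform prior Beta(1, 1), then 2 conversions in 50 visitors.
post_b = stats.beta(1 + 2, 1 + 48)

# How likely is it that B's conversion fraction really is higher than A's?
samples_a = post_a.rvs(100000)
samples_b = post_b.rvs(100000)
print((samples_b > samples_a).mean())
```

The number that prints is exactly the "how confident should we be?" figure discussed on the next slide; its value depends on the prior you assume for layout A.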
---
class: center, middle, full-bleed

![](img/compareposteriors.png)

???
Finally, to answer the question of whether layout A or layout B is better, we compare the two posteriors. What's the answer? Layout A or layout B? We can't say! What we can say is precisely how likely it is that layout B is better than layout A. I think it's about 60% in this case. So if it costs you hundreds of thousands of dollars to change your website, you're probably not confident enough.

You don't only have the ability to say how likely possibility 1 is. You can say how likely all the other possibilities are! In that way you can quantify risk. This is the first huge advantage of this approach. We can assign probabilities to all possible outcomes. If we don't like what we see, we can run the experiment for longer, until we're more confident. And if we can't run the experiment for longer, we can decide how much to bet, now that we know the odds. That's the power of having the full probabilistic picture.

The ability to quantify exactly how likely every possibility is, is not just a slight improvement. It opens up qualitatively new kinds of products.

---
class: center, middle, full-bleed

![](img/online.png)

???
I'm now going to enumerate a couple of other advantages of this approach.

The first is that it naturally incorporates the idea that you become more certain as you get more data. So if we run the A/B test for longer, our posterior becomes _narrower_. We assign more and more probability to a narrower and narrower region. We become more confident.

In fact, you can take this idea to the limit and feed in data one example at a time. This is _online learning_. At any point in time, we have a full picture of what we know, and that picture is getting more and more complete.

This is great for bootstrapping a startup that wants to use machine learning but doesn't have any customers. In the early days, when you don't have much data, your predictions will look like your prior. They may not be perfect, but it's a start. As you gather data, your model will evolve to better reflect your customers' behaviour and become more confident.

---
class: center, middle, full-bleed

![](img/hierarch.png)

???
The next advantage is that it allows hierarchical modeling. This is particularly useful when you have segmented data and some of the segments are empty. Say you have users in different towns and countries, and one week you happen to have no visitors from Philadelphia, but you do have visitors from New York. You can set up a hierarchical model that allows you to fill in the blanks, taking advantage of the fact that people from Philadelphia are demographically more similar to people from New York than to people from Kuala Lumpur. In the absence of direct measurements, you've got the next best thing. I'll show a prototype we built in a second that depends critically on this idea.

---
class: center, middle, full-bleed

![](img/multiple.png)

???
Another advantage is that you can use multiple datasets. Let's say we're an insurance company, and we want to know the life expectancy of all Americans. We have data on our own customers, and clearly their health outcomes depend on this average American life expectancy. But we have access to another dataset: Medicare, which publishes statistics that also depend on the thing we want to measure. We can _combine_ these two datasets to learn as much as possible using the Bayesian approach.
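Before moving on to interpretability, here is the online-learning point from two slides back made concrete. With a Beta prior and yes/no conversions, "update the model as each visitor arrives" is just a running count, because the posterior after one visitor becomes the prior for the next. A minimal sketch (the visitor stream is made up):

```python
alpha, beta = 1, 1          # layout B's uniform prior, Beta(1, 1)

# visitors arrive one at a time; True means that visitor converted (made-up data)
visitor_stream = [False, False, True, False, True, False, False, False]

for converted in visitor_stream:
    alpha += converted      # a conversion adds to alpha...
    beta += not converted   # ...a non-conversion adds to beta
    mean = alpha / (alpha + beta)
    print(f'current belief: conversion fraction ~ {mean:.3f}, i.e. Beta({alpha}, {beta})')
```

The homework at the end asks you to do the same thing with `abayes()`.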
---
class: center, middle, full-bleed

![](img/inference.png)

???
Finally, there's interpretability. Bayesian inference requires you to write down a model that specifies how your data was made. You don't know the values of the parameters until you do the inference, but you do know the structure of their relationships. Because you created the model, it is _interpretable by construction_. You will always have an answer to the question: why was a particular prediction made?

This can be used to create actionable insights. The model says this customer is going to churn for reason X, so let's change X about that customer. And interpretability is in some cases not just nice to have, it's a legal requirement!

---
class: middle

```python
def abayes(data, prior_sampler, simulate, compare):
    '''Yield samples from the posterior by Approximate Bayesian Computation.'''
    for p in prior_sampler:
        if compare(simulate(p), data):
            yield p
```

???
That all sounds great, right? Unfortunately, generally speaking, it's not possible to use the equation directly. For interesting problems modeled in realistic ways, the calculations are expensive or impossible. So we solve problems by simulation.

Here's the brute-force way of doing it, which is called Approximate Bayesian Computation. This is the entire algorithm. There was a week-long conference about it in the United States last month, but the basic idea is one line of code.

How do we actually use it to solve a concrete problem?

---
class: middle

```python
import random

n_b, obs_b = 400, 20  # 400 visitors, 20 conversions

def uniform_prior_sampler():
    '''Yield a stream of random numbers in the interval (0, 1).'''
    while True:
        yield random.random()

def simulate_conversion(p, n_visitors=n_b):
    '''Return the number of visitors who convert, given conversion fraction p.'''
    return sum(random.random() < p for _ in range(n_visitors))

def compare_conversion(obs1, obs2):
    '''Return True if two observations are the same.'''
    return obs1 == obs2

posterior_sampler_b = abayes(obs_b, uniform_prior_sampler(),
                             simulate_conversion, compare_conversion)
```

???
Our ABC function requires:

- data
- a generator that yields a stream of samples from the prior
- a function to run a simulation, given a sample from the prior
- a function to compare the simulation to the observation

Here are those functions for the B branch of our A/B test (a uniform prior, 400 visitors, 20 conversions).

Now we have a generator `posterior_sampler_b` that is pregnant with possibility.

---
class: middle

```python
>>> from itertools import islice
>>> next(posterior_sampler_b)
0.04223951410146609

>>> from tqdm import tqdm
>>> samples = []
>>> for i in tqdm(range(500)):
...     samples.append(next(posterior_sampler_b))
 47%|██████████████████▎       | 234/500 [00:07<00:10, 25.84it/s]

>>> plt.hist(samples, normed=True)  # plot pdf
```

![](img/abayes_posterior.png)

???
Here's how you pull samples out of the generator. Obviously, to build up a complete picture you need more, so here's 500. I'm using `tqdm` to visualize the progress of the sampler. Note that we're getting 20-30 samples per second.

This, by the way, is the Beta distribution. This is where it comes from!
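A sanity check that isn't in the talk: with the uniform prior, 20 conversions in 400 visitors has the exact conjugate posterior Beta(1 + 20, 1 + 380), so you can overlay the closed-form curve on the ABC histogram and confirm they agree:

```python
>>> from scipy import stats
>>> x = [i / 1000 for i in range(120)]
>>> plt.plot(x, stats.beta(1 + 20, 1 + 380).pdf(x))  # exact posterior, for comparison
```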
---
class: middle

```python
>>> n_b *= 2
>>> obs_b *= 2
>>> for i in tqdm(range(500)):
...     samples.append(next(posterior_sampler_b))
 47%|██████████████████▎       | 234/500 [00:05<01:09, 6.62it/s]
```

???
That all sounds great, right? It's how I made the plots I showed you earlier.

The problem is that compare function. To accept a sample, we require exact agreement between the simulation and the observation. Even if our trial value of the conversion fraction is "correct", the probability of the simulation yielding exactly 20 conversions out of 400 is tiny.

What if we make the experiment twice as big? With 800 visitors and 40 conversions we get only 5 to 10 samples per second. For A/B tests, our algorithm is Θ(n^2)! This is disastrous!

One fix is to make the comparison less exact, i.e. require only approximate agreement. I'll leave that to you to try, and to think through the consequences for your posterior. This is basically the academic field of ABC in a nutshell.

But for more complicated problems, with more than one parameter, it gets even worse. Exploring a high-dimensional space at random, you are spectacularly unlikely to find the high-probability regions, and the high-probability regions aren't very high probability!

---
class: middle

```python
import numpy as np
import itertools as it

def sampler_mh(dist, x0=0, burnin=1000, alpha=0.5, verbose=False):
    '''Yield (sample, iteration, n_accepted) tuples from the unnormalized density dist.'''
    x = x0
    samples_accept = 0
    for i in it.count(1):
        # propose a move drawn from a Gaussian centred on the current position
        candidate = np.random.normal(loc=x, scale=alpha)
        # always accept uphill moves; accept downhill moves with probability dist(candidate)/dist(x)
        candidate_prob = min([1.0, dist(candidate) / dist(x)])
        accept = np.random.rand()
        if accept < candidate_prob:
            samples_accept += 1
            x = candidate
        if i > burnin:
            yield x, i, samples_accept
```

???
If rejection sampling is too slow, the next thing we can try is MH. It's only a few lines of code.

Conceptually: the sampler chooses a value for the parameter and returns that. To get the next sample, it evaluates the probability for that choice, and also at some other location. If the other location is higher probability, the sampler tosses a coin to decide whether to move over to that new location for the next round.

For this talk, the details aren't important. What is important is that this algorithm guarantees that the sampler will generally move towards values that are higher probability. Spending time in these locations makes the acceptance rate _much_ higher! Mission accomplished.

But sometimes even this is too slow.
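A usage sketch that isn't in the talk: to point `sampler_mh` at the same A/B problem, the `dist` argument only needs to be proportional to the posterior, so the flat prior times the binomial likelihood of 20 conversions in 400 visitors will do. The start point and step size below are illustrative choices, and `it` is the `itertools` alias imported above:

```python
from scipy import stats

def unnorm_posterior(p):
    '''Flat prior on (0, 1) times the likelihood of 20 conversions in 400 visitors.'''
    return stats.binom.pmf(20, 400, p) if 0 < p < 1 else 0.0

sampler = sampler_mh(unnorm_posterior, x0=0.05, alpha=0.01)
mh_samples = [x for x, i, n_accepted in it.islice(sampler, 5000)]
# mh_samples now approximates the same posterior as the ABC histogram,
# but with a far higher acceptance rate
```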
---
class: center, middle

![](img/hmc.png)

f(x) = y

f(x + ze) = y + f'(x)ze
???
Hamiltonian Monte Carlo is the latest and greatest. It's _much_ faster at finding the interesting, high-probability parts of the problem. It's practical even if there are thousands of parameters. But it does this at a cost: it needs to know not just the probability of a given choice of parameters, but also the rate of change of that probability. That's the gradient.

You get the gradient of a function by differentiating. If you remember your calculus, you'll remember that this wasn't always an easy thing to do, and it was rarely fun. The solution is automatic differentiation.

I promised there would be only one equation in this talk. I lied. Here's two more. They may look simple, but they're actually pretty complicated, because `e` is a thing called an abstract number. What this approach allows, however, is for you to leave the calculus to the computer. If you give it the function, it's able to efficiently and accurately evaluate its derivative, or gradient.

That allows Hamiltonian Monte Carlo to explore the probability space strategically, rather than at random as in rejection sampling, or semi-randomly as in Metropolis Hastings.

Hamiltonian Monte Carlo was further improved a couple of years ago by NUTS (the No U-Turn Sampler). That makes it more _robust_, which is the nice way of saying "idiot proof". Thanks to autodiff, you no longer have to differentiate a function to use HMC. And thanks to NUTS, you no longer have to tune sensitive parameters to get it to work.

---
class: center, middle, full-bleed

![](img/advi.png)

???
And this is automatic differentiation variational inference (ADVI), the new hotness. Computational statisticians are still trying to figure out to what extent the results can be trusted, but it's blazingly fast.

---

### Bayesian inference is great in theory...

- Quantify risk
- Incorporate institutional knowledge
- Interpretability
- Online learning

--

### but it's slow if you're not careful.

--

### Being careful requires cleverness...

- Metropolis Hastings
- Hamiltonian Monte Carlo with automatic differentiation and NUTS
- ADVI

--

### the cleverness is now ready to be abstracted away 😎

---
class: center, middle, full-bleed

![](img/stan.png)![](img/pymc3.png)

???
These complexities and more get abstracted away in a programming paradigm called probabilistic programming. In these systems, industrial-strength implementations of the latest and greatest inference algorithms are baked into the language as one-liners. This is similar to the way a fiddly algorithm like backpropagation is available in deep learning frameworks like TensorFlow or Torch.

All that's left for the user of these systems to do is specify the model, i.e. describe the random process that, for a given choice of parameters, generates the observations. For a problem like A/B testing that's very simple: there's an unknown conversion probability, and each visitor converts with that probability. For more complicated problems it can be trickier to specify the model. Probabilistic programming languages make things easier by including fundamental ideas like probability distributions and random variables as primitives of the language.

Now at this point, if you're a certain age (say my age or up), you may be saying this sounds a lot like BUGS (or WinBUGS). And yes, they were early probabilistic programming systems. But what's changed in the last two or so years, with the arrival of Stan and pymc3, is that HMC is now widely available. And that means that internet-scale problems are now tractable.
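An aside that isn't in the deck: the dual-number trick behind those two equations fits in a few lines of Python. This toy `Dual` class is purely illustrative; the autodiff inside Stan and pymc3 is far more sophisticated:

```python
class Dual:
    '''A value carried together with its derivative: val + deriv*e, where e*e = 0.'''
    def __init__(self, val, deriv=0.0):
        self.val, self.deriv = val, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # the product rule, applied automatically
        return Dual(self.val * other.val,
                    self.val * other.deriv + self.deriv * other.val)

    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2

y = f(Dual(2.0, 1.0))              # seed the derivative of x with 1
print(y.val, y.deriv)              # 17.0 14.0: the value and the exact gradient in one pass
```

This forward-mode flavour is only for intuition, but the underlying idea, that the computer propagates derivatives through your program for you, is what lets HMC get gradients of the log posterior without anyone doing calculus by hand.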
---
class: middle

```C
data {
    int<lower=0> N;
    int<lower=0> N_features;
    matrix[N, N_features] X;
    int<lower=0, upper=1> repaid[N];
}
parameters {
    vector[N_features] p_coef;
}
model {
    vector[N] p;
    p_coef ~ cauchy(0, 2.5);
    p = inv_logit(X * p_coef);
    repaid ~ bernoulli(p);
}
```

???
What does working with these languages look like? Well, here's almost the complete Stan listing for a product I'm going to show you in a second that predicts loan repayment based on income, outstanding debt, loan purpose, etc.

And yes, it's very short! This is a side benefit of distributions and random variables being primitives of the language.

---
class: middle

```python
# munge data
import pandas as pd

X = pd.read_csv('mydata.csv')
repaid = X['repaid']
X = X.drop('repaid', axis=1)
data = {'N': len(X), 'N_features': len(X.columns),
        'X': X.values, 'repaid': repaid}

# compile model
import pystan
with open('model.stan') as f:
    model_code = f.read()
model = pystan.StanModel(model_code=model_code)

# run HMC
fit = model.sampling(data=data)
posterior = fit.extract()
```

???
You write this Stan code. Then you write some data munging code in the general-purpose language of your choice (Python in our case), and you call Stan from within that language. Stan translates the Stan code to C++, which it compiles and runs for you on the data you feed in from your glue code.

If that sounds a bit hacky, I am not necessarily going to disagree with you. But it works! You've got the posterior distribution for each parameter of your model.

---
class: middle

```python
def inv_logit(u):
    return 1 / (1 + np.exp(-u))

def predict_from_posterior_samples(X, posterior_sample):
    p_coef = posterior_sample['p_coef'].values
    p = inv_logit(np.dot(X, p_coef.T)).squeeze()
    return np.random.binomial(1, p) == 1
```

???
The downside of this hackiness is that, in order to use these posterior samples to predict outcomes for completely new data, we need to _reimplement_ the model in our general-purpose language.
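For instance, scoring one new applicant against every posterior draw looks something like this. The feature values are made up, and `posterior` is the dict returned by `fit.extract()` earlier, whose `'p_coef'` entry has one row per draw:

```python
x_new = np.array([52000.0, 8000.0, 1.0, 0.0])  # made-up applicant, same column order as X

p_coef_draws = posterior['p_coef']             # shape (n_draws, N_features)
p_repaid = inv_logit(p_coef_draws @ x_new)     # one repayment probability per draw

p_repaid.mean()                                # a point estimate...
np.percentile(p_repaid, [5, 95])               # ...and the uncertainty around it
```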
---
class: full-bleed

???
But once you've done that you can wrap it all in a Flask API and build a live, interactive product.

---
class: full-bleed
---
class: full-bleed
---

## Options

#### Stan

- great for offline analysis 👍
- but it's very awkward to productize

#### pymc3

- algorithmically half a step behind (much less true than it used to be)
- much easier to build products with 😎

#### Others

- Anglican
- Edward
- Figaro

???
Every time you change your Stan model, you need to change your application code too. They risk getting out of sync in ways that break things spectacularly or, even worse, subtly.

---

## Facebook Prophet

![](img/prophet.png)

???
Facebook released a probabilistic programming product last month! It's a pair of Python and R packages that use Stan under the hood for general-purpose seasonal forecasting.

---

![](img/mauna.png)

???
We made this plot with it in 30 minutes last week. It's great!

---

## Use probabilistic programming

When you

- need to **quantify the probability of all possibilities**, not just determine which is most likely
- want to make use of **institutional knowledge**
- want to do **online learning** (i.e., continually update your model as data arrives)
- want to do **active learning** (i.e., gather more information until your predictions reach some threshold of confidence)
- want to use data to **decide if a more complicated model is justified**
- have several **heterogeneous datasets**
- need to **explain your decisions** to customers or regulators
- have **sparse data** with a shared or hierarchical structure

---

## Homework!

- Use `abayes()` to implement online learning
- Implement a fuzzy comparison to speed up `abayes()`
- Use it to solve the German Tank Problem (and check you get the analytic solution)

## Further reading

- NYC real estate prototype, [www.fastforwardlabs.com/pre](http://www.fastforwardlabs.com/pre/)
- [Understanding Bayes: How to become a Bayesian in eight easy steps](https://alexanderetz.com/2016/02/07/understanding-bayes-how-to-become-a-bayesian-in-eight-easy-steps/)
- Our blog post on [the algorithms behind probabilistic programming](http://blog.fastforwardlabs.com/2017/01/30/the-algorithms-behind-probabilistic-programming.html)
- [The history of pymc3](http://stronginference.com/pymc3-release.html)
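???
One more sketch, not from the talk, to go with the Options slide: roughly the same repayment model written in pymc3 (illustrative and untested; check the pymc3 docs for current API details). The appeal is that the model and the prediction code live in the same Python process, which is what makes it the easier option to productize.

```python
import pymc3 as pm

# X and repaid come from the data-munging slide earlier
with pm.Model() as repayment_model:
    p_coef = pm.Cauchy('p_coef', alpha=0, beta=2.5, shape=X.shape[1])
    p = pm.math.invlogit(pm.math.dot(X.values, p_coef))
    pm.Bernoulli('repaid', p=p, observed=repaid.values)
    trace = pm.sample(1000)   # HMC/NUTS, chosen automatically
```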
---
class: center, middle, full-bleed

## Probabilistic programming products

#### Mike Lee Williams • [@mikepqr](http://twitter.com/mikepqr) • [mike.place/talks/ppp](http://mike.place/talks/ppp/)

#### Fast Forward Labs • [@fastforwardlabs](https://twitter.com/fastforwardlabs) • [fastforwardlabs.com](http://www.fastforwardlabs.com)

???
Last slide