class: center, middle, full-bleed
## Probabilistic programming products

#### Mike Lee Williams • [@mikepqr](http://twitter.com/mikepqr) • [mike.place/talks/ppp](http://mike.place/talks/ppp/)

#### Fast Forward Labs • [@fastforwardlabs](https://twitter.com/fastforwardlabs) • [fastforwardlabs.com](http://www.fastforwardlabs.com)

???

---
class: full-bleed, center, middle

![](img/ff05.gif)

---

### Bayesian inference is great in theory...

- Quantify risk
- Incorporate institutional knowledge
- Interpretability
- Online learning

### but it's slow if you're not careful.

### Being careful requires cleverness...

- Metropolis Hastings
- Hamiltonian Monte Carlo with automatic differentiation and NUTS
- ADVI

### the cleverness is now ready to be abstracted away 😎

---
class: center, middle, full-bleed

![](img/ab.png)

???
Let's start with a story. I conduct this A/B test and get these results. Clearly layout B is better.

If I had a million visitors to both layouts then yes, very probably. But what if I had 25 visitors to layout A and 20 visitors to layout B? Are you confident that these results would generalize to the real world?

And what if it costs you $100k to deploy the change? Do you want to risk it?

---
class: center, middle, full-bleed

![](img/baddata.png)

???
This is the central problem of data analysis. I don't care how big your data is. It's still bad: measurement error, incompleteness. Given that's a fact of life, we need a way to quantify uncertainty.

And what if layout A has been running for weeks already before the test? We need a way to incorporate that knowledge into our analysis.

---
class: center, middle, full-bleed

![](img/bayesrule.png)

???
This is Bayes' rule, the only equation I'm going to show today. It's the central idea of data analysis, and it's extremely simple.

The thing on the left is the probability a hypothesis is true, given our data. That hypothesis might be "the conversion fraction for layout A is 3.57%". This is the posterior distribution, and it's almost always the thing we want to know. We can use it to answer questions like "is layout A better than layout B?"

How do we get it? From the things on the right. Far right is the likelihood. That is a thing you can calculate. It says: what if we _assumed_ the hypothesis were true? How likely is it we would have got the observations we in fact got?

And the other thing on the right is the _prior_. That's how likely we thought the hypothesis was prior to our collection of data.

---
class: center, middle, full-bleed

![](img/layouta.png)

???
How do we use this rule? Let's start with layout A. Let's say layout A had been running for a few days, and we suspected its conversion fraction was around 3.5%. That's the prior on the left. We're not ruling out 1% or 5%, but 3.5% is our best guess. Note here: we're incorporating institutional knowledge.

Now we use Bayes' rule to update the prior in the light of our data, to get the posterior. The right-hand plot is the probability of each possible conversion fraction for layout A given the data, which was 4 conversions out of 100 visitors.

---
class: center, middle, full-bleed

![](img/layoutb.png)

???
Maybe layout B is a new layout. We had no idea what its conversion fraction would be, so we use what's called an uninformative prior. We assume every possible value between 0 and 1 is equally likely. You can still use this approach when you have no institutional knowledge!

Now we use Bayes' rule to update the prior in the light of our data, to get the posterior. The right-hand plot is the probability of each possible conversion fraction for layout B given the data, which was 2 conversions out of 50 visitors.
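The next slide compares these two posteriors. Because both priors here are Beta distributions and the observations are yes/no conversions, these updates have a closed form, so a few lines of Python reproduce the whole analysis. A minimal sketch (the strength of layout A's prior, Beta(7, 193) with mean 3.5%, is my own illustrative choice, not a number from the talk):

```python
from scipy import stats

# Layout A: informative prior centred near 3.5%, then 4 conversions in 100 visitors.
# Beta prior + binomial data -> Beta posterior (conjugacy).
post_a = stats.beta(7 + 4, 193 + 96)

# Layout B: uniform prior Beta(1, 1), then 2 conversions in 50 visitors.
post_b = stats.beta(1 + 2, 1 + 48)

# How likely is it that B's conversion fraction really is higher than A's?
samples_a = post_a.rvs(100000)
samples_b = post_b.rvs(100000)
print((samples_b > samples_a).mean())
```

The number that prints is exactly the "how confident should we be?" figure discussed on the next slide; its value depends on the prior you assume for layout A.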
---
class: center, middle, full-bleed

![](img/compareposteriors.png)

???
Finally, to answer the question of whether layout A or layout B is better, we compare the two posteriors. What's the answer? Layout A or layout B? We can't say! What we can say is precisely how likely it is that layout B is better than layout A. I think it's about 60% in this case. So if it costs you hundreds of thousands of dollars to change your website, you're probably not confident enough.

You don't only have the ability to say how likely possibility 1 is. You can say how likely all the other possibilities are! In that way you can quantify risk. This is the first huge advantage of this approach. We can assign probabilities to all possible outcomes. If we don't like what we see, we can run the experiment for longer, until we're more confident. And if we can't run the experiment for longer, we can decide how much to bet, now that we know the odds. That's the power of having the full probabilistic picture.

The ability to quantify exactly how likely every possibility is, is not just a slight improvement. It opens up qualitatively new kinds of products.

---
class: center, middle, full-bleed

![](img/online.png)

???
I'm now going to enumerate a couple of other advantages of this approach.

The first is that it naturally incorporates the idea that you become more certain as you get more data. So if we run the A/B test for longer, our posterior becomes _narrower_. We assign more and more probability to a narrower and narrower region. We become more confident.

In fact, you can take this idea to the limit and feed in data one example at a time. This is _online learning_. At any point in time, we have a full picture of what we know, and that picture is getting more and more complete.

This is great for bootstrapping a startup that wants to use machine learning but doesn't have any customers. In the early days, when you don't have much data, your predictions will look like your prior. They may not be perfect, but it's a start. As you gather data, your model will evolve to better reflect your customers' behaviour and become more confident.

---
class: center, middle, full-bleed

![](img/hierarch.png)

???
The next advantage is that it allows hierarchical modeling. This is particularly useful when you have segmented data and some of the segments are empty. Say you have users in different towns and countries, and one week you happen to have no visitors from Philadelphia, but you do have visitors from New York. You can set up a hierarchical model that allows you to fill in the blanks, taking advantage of the fact that people from Philadelphia are demographically more similar to people from New York than to people from Kuala Lumpur. In the absence of direct measurements, you've got the next best thing. I'll show a prototype we built in a second that depends critically on this idea.

---
class: center, middle, full-bleed

![](img/multiple.png)

???
Another advantage is that you can use multiple datasets. Let's say we're an insurance company, and we want to know the life expectancy of all Americans. We have data on our own customers, and clearly their health outcomes depend on this average American life expectancy. But we have access to another dataset: Medicare, which publishes statistics that also depend on the thing we want to measure. We can _combine_ these two datasets to learn as much as possible using the Bayesian approach.
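Before moving on to interpretability, here is the online-learning point from two slides back made concrete. With a Beta prior and yes/no conversions, "update the model as each visitor arrives" is just a running count, because the posterior after one visitor becomes the prior for the next. A minimal sketch (the visitor stream is made up):

```python
alpha, beta = 1, 1          # layout B's uniform prior, Beta(1, 1)

# visitors arrive one at a time; True means that visitor converted (made-up data)
visitor_stream = [False, False, True, False, True, False, False, False]

for converted in visitor_stream:
    alpha += converted      # a conversion adds to alpha...
    beta += not converted   # ...a non-conversion adds to beta
    mean = alpha / (alpha + beta)
    print(f'current belief: conversion fraction ~ {mean:.3f}, i.e. Beta({alpha}, {beta})')
```

The homework at the end asks you to do the same thing with `abayes()`.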
---
class: center, middle, full-bleed

![](img/inference.png)

???
Finally, there's interpretability. Bayesian inference requires you to write down a model that specifies how your data was made. You don't know the values of the parameters until you do the inference, but you do know the structure of their relationships. Because you created the model, it is _interpretable by construction_. You will always have an answer to the question: why was a particular prediction made?

This can be used to create actionable insights. The model says this customer is going to churn for reason X, so let's change X about that customer. And interpretability is in some cases not just nice to have, it's a legal requirement!

---
class: middle

```python
def abayes(data, prior_sampler, simulate, compare):
    '''Yield samples from the posterior by Approximate Bayesian Computation.'''
    for p in prior_sampler:
        if compare(simulate(p), data):
            yield p
```

???
That all sounds great, right? Unfortunately, generally speaking, it's not possible to use the equation directly. For interesting problems modeled in realistic ways, the calculations are expensive or impossible. So we solve problems by simulation.

Here's the brute-force way of doing it, which is called Approximate Bayesian Computation. This is the entire algorithm. There was a week-long conference about it in the United States last month, but the basic idea is one line of code.

How do we actually use it to solve a concrete problem?

---
class: middle

```python
import random

n_b, obs_b = 400, 20  # 400 visitors, 20 conversions

def uniform_prior_sampler():
    '''Yield a stream of random numbers in the interval (0, 1).'''
    while True:
        yield random.random()

def simulate_conversion(p, n_visitors=n_b):
    '''Return the number of visitors who convert, given conversion fraction p.'''
    return sum(random.random() < p for _ in range(n_visitors))

def compare_conversion(obs1, obs2):
    '''Return True if two observations are the same.'''
    return obs1 == obs2

posterior_sampler_b = abayes(obs_b, uniform_prior_sampler(),
                             simulate_conversion, compare_conversion)
```

???
Our ABC function requires:

- data
- a generator that yields a stream of samples from the prior
- a function to run a simulation, given a sample from the prior
- a function to compare the simulation to the observation

Here are those functions for the B branch of our A/B test (a uniform prior, 400 visitors, 20 conversions).

Now we have a generator `posterior_sampler_b` that is pregnant with possibility.

---
class: middle

```python
>>> from itertools import islice
>>> next(posterior_sampler_b)
0.04223951410146609

>>> from tqdm import tqdm
>>> samples = []
>>> for i in tqdm(range(500)):
...     samples.append(next(posterior_sampler_b))
 47%|██████████████████▎       | 234/500 [00:07<00:10, 25.84it/s]

>>> plt.hist(samples, normed=True)  # plot pdf
```

![](img/abayes_posterior.png)

???
Here's how you pull samples out of the generator. Obviously, to build up a complete picture you need more, so here's 500. I'm using `tqdm` to visualize the progress of the sampler. Note that we're getting 20-30 samples per second.

This, by the way, is the Beta distribution. This is where it comes from!
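A sanity check that isn't in the talk: with the uniform prior, 20 conversions in 400 visitors has the exact conjugate posterior Beta(1 + 20, 1 + 380), so you can overlay the closed-form curve on the ABC histogram and confirm they agree:

```python
>>> from scipy import stats
>>> x = [i / 1000 for i in range(120)]
>>> plt.plot(x, stats.beta(1 + 20, 1 + 380).pdf(x))  # exact posterior, for comparison
```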
---
class: middle

```python
>>> n_b *= 2
>>> obs_b *= 2
>>> for i in tqdm(range(500)):
...     samples.append(next(posterior_sampler_b))
 47%|██████████████████▎       | 234/500 [00:05<01:09, 6.62it/s]
```

???
That all sounds great, right? It's how I made the plots I showed you earlier.

The problem is that compare function. To accept a sample, we require exact agreement between the simulation and the observation. Even if our trial value of the conversion fraction is "correct", the probability of the simulation yielding exactly 20 conversions out of 400 is tiny.

What if we make the experiment twice as big? With 800 visitors and 40 conversions we get only 5 to 10 samples per second. For A/B tests, our algorithm is Θ(n^2)! This is disastrous!

One fix is to make the comparison less exact, i.e. require only approximate agreement. I'll leave that to you to try, and to think through the consequences for your posterior. This is basically the academic field of ABC in a nutshell.

But for more complicated problems, with more than one parameter, it gets even worse. Exploring a high-dimensional space at random, you are spectacularly unlikely to find the high-probability regions, and the high-probability regions aren't very high probability!

---
class: middle

```python
import numpy as np
import itertools as it

def sampler_mh(dist, x0=0, burnin=1000, alpha=0.5, verbose=False):
    '''Yield (sample, iteration, n_accepted) tuples from the unnormalized density dist.'''
    x = x0
    samples_accept = 0
    for i in it.count(1):
        # propose a move drawn from a Gaussian centred on the current position
        candidate = np.random.normal(loc=x, scale=alpha)
        # always accept uphill moves; accept downhill moves with probability dist(candidate)/dist(x)
        candidate_prob = min([1.0, dist(candidate) / dist(x)])
        accept = np.random.rand()
        if accept < candidate_prob:
            samples_accept += 1
            x = candidate
        if i > burnin:
            yield x, i, samples_accept
```

???
If rejection sampling is too slow, the next thing we can try is MH. It's only a few lines of code.

Conceptually: the sampler chooses a value for the parameter and returns that. To get the next sample, it evaluates the probability for that choice, and also at some other location. If the other location is higher probability, the sampler tosses a coin to decide whether to move over to that new location for the next round.

For this talk, the details aren't important. What is important is that this algorithm guarantees that the sampler will generally move towards values that are higher probability. Spending time in these locations makes the acceptance rate _much_ higher! Mission accomplished.

But sometimes even this is too slow.
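A usage sketch that isn't in the talk: to point `sampler_mh` at the same A/B problem, the `dist` argument only needs to be proportional to the posterior, so the flat prior times the binomial likelihood of 20 conversions in 400 visitors will do. The start point and step size below are illustrative choices, and `it` is the `itertools` alias imported above:

```python
from scipy import stats

def unnorm_posterior(p):
    '''Flat prior on (0, 1) times the likelihood of 20 conversions in 400 visitors.'''
    return stats.binom.pmf(20, 400, p) if 0 < p < 1 else 0.0

sampler = sampler_mh(unnorm_posterior, x0=0.05, alpha=0.01)
mh_samples = [x for x, i, n_accepted in it.islice(sampler, 5000)]
# mh_samples now approximates the same posterior as the ABC histogram,
# but with a far higher acceptance rate
```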
---
class: center, middle

![](img/hmc.png)

f(x) = y

f(x + ze) = y + f'(x)ze
???
Hamiltonian Monte Carlo is the latest and greatest. It's _much_ faster at finding the interesting, high-probability parts of the problem. It's practical even if there are thousands of parameters. But it does this at a cost: it needs to know not just the probability of a given choice of parameters, but also the rate of change of that probability. That's the gradient.

You get the gradient of a function by differentiating. If you remember your calculus, you'll remember that this wasn't always an easy thing to do, and it was rarely fun. The solution is automatic differentiation.

I promised there would be only one equation in this talk. I lied. Here's two more. They may look simple, but they're actually pretty complicated, because `e` is a thing called an abstract number. What this approach allows, however, is for you to leave the calculus to the computer. If you give it the function, it's able to efficiently and accurately evaluate its derivative, or gradient.

That allows Hamiltonian Monte Carlo to explore the probability space strategically, rather than at random as in rejection sampling, or semi-randomly as in Metropolis Hastings.

Hamiltonian Monte Carlo was further improved a couple of years ago by NUTS (the No U-Turn Sampler). That makes it more _robust_, which is the nice way of saying "idiot proof". Thanks to autodiff, you no longer have to differentiate a function to use HMC. And thanks to NUTS, you no longer have to tune sensitive parameters to get it to work.

---
class: center, middle, full-bleed

![](img/advi.png)

???
And this is automatic differentiation variational inference (ADVI), the new hotness. Computational statisticians are still trying to figure out to what extent the results can be trusted, but it's blazingly fast.

---

### Bayesian inference is great in theory...

- Quantify risk
- Incorporate institutional knowledge
- Interpretability
- Online learning

--

### but it's slow if you're not careful.

--

### Being careful requires cleverness...

- Metropolis Hastings
- Hamiltonian Monte Carlo with automatic differentiation and NUTS
- ADVI

--

### the cleverness is now ready to be abstracted away 😎

---
class: center, middle, full-bleed

![](img/stan.png)![](img/pymc3.png)

???
These complexities and more get abstracted away in a programming paradigm called probabilistic programming. In these systems, industrial-strength implementations of the latest and greatest inference algorithms are baked into the language as one-liners. This is similar to the way a fiddly algorithm like backpropagation is available in deep learning frameworks like TensorFlow or Torch.

All that's left for the user of these systems to do is specify the model, i.e. describe the random process that, for a given choice of parameters, generates the observations. For a problem like A/B testing that's very simple: there's an unknown conversion probability, and each visitor converts with that probability. For more complicated problems it can be trickier to specify the model. Probabilistic programming languages make things easier by including fundamental ideas like probability distributions and random variables as primitives of the language.

Now at this point, if you're a certain age (say my age or up), you may be saying this sounds a lot like BUGS (or WinBUGS). And yes, they were early probabilistic programming systems. But what's changed in the last two or so years, with the arrival of Stan and pymc3, is that HMC is now widely available. And that means that internet-scale problems are now tractable.
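An aside that isn't in the deck: the dual-number trick behind those two equations fits in a few lines of Python. This toy `Dual` class is purely illustrative; the autodiff inside Stan and pymc3 is far more sophisticated:

```python
class Dual:
    '''A value carried together with its derivative: val + deriv*e, where e*e = 0.'''
    def __init__(self, val, deriv=0.0):
        self.val, self.deriv = val, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # the product rule, applied automatically
        return Dual(self.val * other.val,
                    self.val * other.deriv + self.deriv * other.val)

    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2

y = f(Dual(2.0, 1.0))              # seed the derivative of x with 1
print(y.val, y.deriv)              # 17.0 14.0: the value and the exact gradient in one pass
```

This forward-mode flavour is only for intuition, but the underlying idea, that the computer propagates derivatives through your program for you, is what lets HMC get gradients of the log posterior without anyone doing calculus by hand.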
---
class: middle

```C
data {
    int<lower=0> N;
    int<lower=0> N_features;
    matrix[N, N_features] X;
    int<lower=0, upper=1> repaid[N];
}
parameters {
    vector[N_features] p_coef;
}
model {
    vector[N] p;
    p_coef ~ cauchy(0, 2.5);
    p = inv_logit(X * p_coef);
    repaid ~ bernoulli(p);
}
```

???
What does working with these languages look like? Well, here's almost the complete Stan listing for a product I'm going to show you in a second that predicts loan repayment based on income, outstanding debt, loan purpose, etc.

And yes, it's very short! This is a side benefit of distributions and random variables being primitives of the language.

---
class: middle

```python
# munge data
import pandas as pd

X = pd.read_csv('mydata.csv')
repaid = X['repaid']
X = X.drop('repaid', axis=1)
data = {'N': len(X), 'N_features': len(X.columns),
        'X': X.values, 'repaid': repaid}

# compile model
import pystan
with open('model.stan') as f:
    model_code = f.read()
model = pystan.StanModel(model_code=model_code)

# run HMC
fit = model.sampling(data=data)
posterior = fit.extract()
```

???
You write this Stan code. Then you write some data munging code in the general-purpose language of your choice (Python in our case), and you call Stan from within that language. Stan translates the Stan code to C++, which it compiles and runs for you on the data you feed in from your glue code.

If that sounds a bit hacky, I am not necessarily going to disagree with you. But it works! You've got the posterior distribution for each parameter of your model.

---
class: middle

```python
def inv_logit(u):
    return 1 / (1 + np.exp(-u))

def predict_from_posterior_samples(X, posterior_sample):
    p_coef = posterior_sample['p_coef'].values
    p = inv_logit(np.dot(X, p_coef.T)).squeeze()
    return np.random.binomial(1, p) == 1
```

???
The downside of this hackiness is that, in order to use these posterior samples to predict outcomes for completely new data, we need to _reimplement_ the model in our general-purpose language.
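For instance, scoring one new applicant against every posterior draw looks something like this. The feature values are made up, and `posterior` is the dict returned by `fit.extract()` earlier, whose `'p_coef'` entry has one row per draw:

```python
x_new = np.array([52000.0, 8000.0, 1.0, 0.0])  # made-up applicant, same column order as X

p_coef_draws = posterior['p_coef']             # shape (n_draws, N_features)
p_repaid = inv_logit(p_coef_draws @ x_new)     # one repayment probability per draw

p_repaid.mean()                                # a point estimate...
np.percentile(p_repaid, [5, 95])               # ...and the uncertainty around it
```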
---
class: full-bleed

???
But once you've done that you can wrap it all in a Flask API and build a live, interactive product.

---
class: full-bleed
---
class: full-bleed
---

## Options

#### Stan

- great for offline analysis 👍
- but it's very awkward to productize

#### pymc3

- algorithmically half a step behind (much less true than it used to be)
- much easier to build products with 😎

#### Others

- Anglican
- Edward
- Figaro

???
Every time you change your Stan model, you need to change your application code too. They risk getting out of sync in ways that break things spectacularly or, even worse, subtly.

---

## Facebook Prophet

![](img/prophet.png)

???
Facebook released a probabilistic programming product last month! It's a pair of Python and R packages that use Stan under the hood for general-purpose seasonal forecasting.

---

![](img/mauna.png)

???
We made this plot with it in 30 minutes last week. It's great!

---

## Use probabilistic programming

When you

- need to **quantify the probability of all possibilities**, not just determine which is most likely
- want to make use of **institutional knowledge**
- want to do **online learning** (i.e., continually update your model as data arrives)
- want to do **active learning** (i.e., gather more information until your predictions reach some threshold of confidence)
- want to use data to **decide if a more complicated model is justified**
- have several **heterogeneous datasets**
- need to **explain your decisions** to customers or regulators
- have **sparse data** with a shared or hierarchical structure

---

## Homework!

- Use `abayes()` to implement online learning
- Implement a fuzzy comparison to speed up `abayes()`
- Use it to solve the German Tank Problem (and check you get the analytic solution)

## Further reading

- NYC real estate prototype, [www.fastforwardlabs.com/pre](http://www.fastforwardlabs.com/pre/)
- [Understanding Bayes: How to become a Bayesian in eight easy steps](https://alexanderetz.com/2016/02/07/understanding-bayes-how-to-become-a-bayesian-in-eight-easy-steps/)
- Our blog post on [the algorithms behind probabilistic programming](http://blog.fastforwardlabs.com/2017/01/30/the-algorithms-behind-probabilistic-programming.html)
- [The history of pymc3](http://stronginference.com/pymc3-release.html)
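???
One more sketch, not from the talk, to go with the Options slide: roughly the same repayment model written in pymc3 (illustrative and untested; check the pymc3 docs for current API details). The appeal is that the model and the prediction code live in the same Python process, which is what makes it the easier option to productize.

```python
import pymc3 as pm

# X and repaid come from the data-munging slide earlier
with pm.Model() as repayment_model:
    p_coef = pm.Cauchy('p_coef', alpha=0, beta=2.5, shape=X.shape[1])
    p = pm.math.invlogit(pm.math.dot(X.values, p_coef))
    pm.Bernoulli('repaid', p=p, observed=repaid.values)
    trace = pm.sample(1000)   # HMC/NUTS, chosen automatically
```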
---
class: center, middle, full-bleed

## Probabilistic programming products

#### Mike Lee Williams • [@mikepqr](http://twitter.com/mikepqr) • [mike.place/talks/ppp](http://mike.place/talks/ppp/)

#### Fast Forward Labs • [@fastforwardlabs](https://twitter.com/fastforwardlabs) • [fastforwardlabs.com](http://www.fastforwardlabs.com)

???
Last slide