Probabilistic Programming for couples

I will start with a disclaimer: for probabilistic programming to work, both partners should belong to the Bayesian Order and should know their business, i.e. be domain experts (the latter is self-implied).

The Bayesian Order differs from the more Traditional Statistical Order by preserving uncertainty. Bayesians interpret a probability as a measure of belief, or confidence, in an event occurring. In contrast, Frequentists assume that probability is the long-run frequency of events.

Frequentist or a Bayesian?

How do you know if you are dating a Frequentist or a Bayesian? Simply ask them this question: “Do you like me?”

If, after having answered you “Yes” 1000 times before, they answer:

Yes

then they are Frequentists.

If they answer:

Yes, with probability 0.8; No, with probability 0.2.

That’s a good choice of partner, probably… Their uncertainty might seem frustrating in the beginning, but rest assured that your Bayesian partner is continuously incorporating evidence into their beliefs. For example, if their prior belief is “I expect him to bring me flowers today” and each day they are proved wrong, Bayesian inference will correct this belief, preserving the uncertainty that reflects the instability of your relationship.

Should you then switch to a Frequentist partner? Frequentists are still state-of-the-art in many areas (just think of those sexy neural networks), but they require a lot of your data and time. Bayesians work with less data, but you have to know your distributions well!

Speaking of data, what better data to analyse than your WhatsApp history? It will tell you whether there has been a change in the texting patterns with your partner. Exporting the chat history is very easy: go to your WhatsApp mobile app, choose your partner from the list and, from Options, select More -> Export chat.
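Once exported, the chat arrives as a plain-text file. A minimal sketch for turning it into daily message counts might look like the following (the file name and the DD/MM/YYYY, HH:MM timestamp format are assumptions; WhatsApp exports vary by locale and platform, so adjust the pattern accordingly):

import re
from collections import Counter
from datetime import datetime

# Each message line in the export is assumed to start with "DD/MM/YYYY, HH:MM - Sender: text".
line_pattern = re.compile(r"^(\d{2}/\d{2}/\d{4}), \d{2}:\d{2} - ")

daily_counts = Counter()
with open("WhatsApp Chat.txt", encoding="utf-8") as chat:
    for line in chat:
        match = line_pattern.match(line)
        if match:  # continuation lines of multi-line messages carry no timestamp and are skipped
            day = datetime.strptime(match.group(1), "%d/%m/%Y").date()
            daily_counts[day] += 1

# Order the counts chronologically; this series of daily counts is what the model below consumes.
count_data = [daily_counts[day] for day in sorted(daily_counts)]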

Probabilistic Programming

Now, you don’t have to start reading all the messages (though it can be tempting); you can simply count the number of messages exchanged daily. That small dataset will be the input to our Probabilistic Programming model. But wait, what is Probabilistic Programming all about?

Probabilistic Programming is not about writing software that behaves probabilistically, but instead is a tool for Bayesian modelling. The idea is to use code to infer behaviour from the WhatsApp-texting data.

Before moving to more Bayesian intimacy, let’s spend a few paragraphs discussing probability distributions. The explanations below are mainly extracted from Chapter 1 of Probabilistic Programming and Bayesian Methods for Hackers.

Probability distributions

Let $Z$ be some random variable. Then associated with $Z$ is a probability distribution function that assigns probabilities to the different outcomes $Z$ can take.

  • If $Z$ is discrete, then its distribution is called a probability mass function, which measures the probability that $Z$ takes on the value $k$, denoted $P(Z=k)$. One popular probability mass function is the Poisson (both the Poisson and the exponential appear in a short code sketch after this list).
    • If a random variable $Z$ has a Poisson mass function, we denote this by writing $$Z \sim \text{Poi}(\lambda) $$
    • One useful property of the Poisson distribution is that its expected value is equal to its parameter, i.e.: $$E\large[ \;Z\; | \; \lambda \;\large] = \lambda $$
    • $\lambda$ is called a parameter of the distribution, and it controls the distribution’s shape. For the Poisson distribution, $\lambda$ can be any positive number. By increasing $\lambda$, we add more probability to larger values; conversely, by decreasing $\lambda$ we add more probability to smaller values.
  • If $Z$ is continuous, then its distribution is called a probability density function. An example of a continuous random variable is one with an exponential density.
    • Unlike a Poisson variable, which takes only non-negative integer values, an exponential random variable can take on any non-negative value.
    • If a random variable $Z$ has an exponential distribution with parameter $\lambda$, we write this as $$Z \sim \text{Exp}(\lambda)$$
    • Given a specific $\lambda$, the expected value of an exponential random variable is equal to the inverse of $\lambda$, that is: $$E[\; Z \;|\; \lambda \;] = \frac{1}{\lambda}$$
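To make the two distributions concrete, here is a minimal sketch using TFP’s distribution classes (the library itself is introduced later in the post); the rate value of 4.0 is an arbitrary illustration:

import tensorflow_probability as tfp

tfd = tfp.distributions

poisson = tfd.Poisson(rate=4.0)           # discrete: counts 0, 1, 2, ...
print(poisson.prob(2.).numpy())           # probability mass P(Z = 2) under Poi(4)
print(poisson.mean().numpy())             # E[Z | lambda] = lambda = 4.0

exponential = tfd.Exponential(rate=4.0)   # continuous: any non-negative real value
print(exponential.prob(0.5).numpy())      # density at z = 0.5
print(exponential.mean().numpy())         # E[Z | lambda] = 1 / lambda = 0.25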

Now, you may ask yourself:

But what actually is $\lambda$?

In the real world, $\lambda$ is hidden from us, like the objects hidden from the prisoners in the myth of Plato’s cave; yet these hidden objects are, in fact, the true explanatory factors casting the shadows visible to the prisoners.

The myth of Plato’s cave, source: https://wellnessbeam.org/

The prisoners only see the shadows, our $Z$, and must work backwards to try and determine $\lambda$. Bayesian inference is concerned with beliefs about what $\lambda$ might be. Rather than trying to guess $\lambda$ exactly, we can only talk about what $\lambda$ is likely to be, by assigning a probability distribution to it.

Problem Definition

Next, let’s try to model the above-mentioned situation: you are given a series of daily counts of WhatsApp messages exchanged by two users of your system. The data, plotted over time, appears in the chart below. You are curious to know whether the users’ WhatsApp-chat habits have changed over time, either gradually or suddenly. How can you model this? (The chart below is, in fact, the WhatsApp history between me and my boyfriend.)

WhatsApp messages exchanged

How can we start to model this data? A Poisson distribution is generally good for modelling count data, and lambda is going to be the mean of this distribution:
$$ C_i \sim \text{Poisson}(\lambda) $$
Looking at the count data, it appears that the number of WhatsApp messages becomes lower in the later period. Therefore, we can let lambda change according to the following logic: $$\lambda = \begin{cases} \lambda_1 & \text{if } t \lt \tau \\ \lambda_2 & \text{if } t \ge \tau\end{cases}$$
If no sudden change occurred and indeed $\lambda_1 = \lambda_2$, then the posterior distributions of the two $\lambda$s should look about equal.

We are going to infer what $\lambda_1, \lambda_2$, and $\tau$ are. Since they are unknown, we assign a probability distribution to each of them.
Recall that $\lambda$ can be any positive number. An exponential distribution provides a continuous density over positive values, so it makes a suitable prior; it has its own parameter, a so-called hyperparameter, denoted $\alpha$.
$$\begin{align}&\lambda_i \sim \text{Exp}( \alpha ) \\\end{align}$$
A good rule of thumb is to set the exponential parameter equal to the inverse of the average of the count data.

What about $\tau$? Since we do not know when the breakpoint occurred, we should allow it to be any possible day in the data, so we can say that $\tau \sim \text{DiscreteUniform}(1, 81)$:
$$\begin{align}& \tau \sim \text{DiscreteUniform(1,81) }\\& \Rightarrow P( \tau = k ) = \frac{1}{81}\end{align}$$

In essence, we are modelling the counts with a Poisson distribution whose rate, lambda, is itself a random quantity drawn from an exponential prior, and whose value switches at the unknown day $\tau$.

Prior distributions

So to put all of our distributions together, we have:

$$\begin{align*}\lambda_{1}^{(0)} &\sim \text{Exponential}(\text{rate}=\alpha) \\\lambda_{2}^{(0)} &\sim \text{Exponential}(\text{rate}=\alpha) \\\tau &\sim \text{Uniform}[\text{low}=0,\text{high}=1) \\\text{for } i &= 1\ldots N: \\\lambda_i &= \begin{cases} \lambda_{1}^{(0)}, & \tau > i/N \\ \lambda_{2}^{(0)}, & \text{otherwise}\end{cases}\\ X_i &\sim \text{Poisson}(\text{rate}=\lambda_i)\end{align*}$$
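Note that in this formulation $\tau$ is expressed as a fraction of the observation window (a draw from Uniform[0, 1)) and is rescaled to a day index only after sampling, which is exactly what the implementation further down does. Before inferring anything, it helps to read the model generatively. The sketch below simply simulates one fake dataset from these priors (the value of $\alpha$ and the number of days are illustrative assumptions, not taken from the real data):

import numpy as np

rng = np.random.default_rng(0)

n_days = 81          # length of the observation window (illustrative)
alpha = 1. / 20.     # rule of thumb: inverse of the average daily count, here assumed to be ~20

# Draw one value from each prior...
lambda_1 = rng.exponential(scale=1. / alpha)
lambda_2 = rng.exponential(scale=1. / alpha)
tau = rng.uniform(0., 1.)

# ...and generate a fake series of daily counts from the switchpoint model:
# lambda_1 drives the days before the switchpoint, lambda_2 the days after it.
rates = np.where(np.arange(n_days) < tau * n_days, lambda_1, lambda_2)
fake_counts = rng.poisson(rates)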


Let’s implement this model using TF and TFP’s distributions.

TF Probability

TF Probability (TFP) is an open source Python library built using TF which makes it easy to combine deep learning with probabilistic models on modern hardware.

Since TF Probability is built on the TensorFlow (TF) stack, it brings the runtime benefits of TF to Bayesian analysis. These include write-once run-many (ability to run your development model in production) and speedups via GPUs and TPUs.


TFP distributions are a collection of probability distributions:

  • Examples: Normal, Binomial, Poisson, Gamma, Multivariate Normal, Dirichlet, etc.
  • Each one is a Python class which encodes some useful properties of a random variable.

Bijectors transform inputs to outputs and back again:

  • They are bijective, differentiable maps; TFP also tracks the change of volume they induce (via the log-determinant of the Jacobian).
  • They are useful because it is sometimes faster to do inference on a transformation of a distribution than on the original distribution (see the small example after this list).
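As a small illustration of the “back again” part, here is the Exp bijector mapping between the unconstrained real line and the positive reals (a minimal sketch, with arbitrary values):

import tensorflow_probability as tfp

tfb = tfp.bijectors

exp_bijector = tfb.Exp()
print(exp_bijector.forward(0.).numpy())   # R -> R+ : exp(0) = 1.0
print(exp_bijector.inverse(1.).numpy())   # R+ -> R : log(1) = 0.0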

Model definition

The implementation in TFP is arguably very close to being a 1:1 mapping of the probabilistic model. We want to know the posterior distribution $p(\lambda_1, \lambda_2, \tau | X)$, and Bayes’ rule gives us: $p(\lambda_1, \lambda_2, \tau | X) \propto p(X|\lambda_1, \lambda_2, \tau)\,p(\lambda_1)\,p(\lambda_2)\,p(\tau)$

Doing inference with this model in TF Probability requires creating a joint log probability function, which takes a sample of the parameters as input and returns the log probability of that sample under the model.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def joint_log_prob(count_data, lambda_1, lambda_2, tau):
    # Exponential priors on both rates; the rule of thumb sets alpha to the
    # inverse of the mean daily count.
    alpha = (1. / tf.reduce_mean(count_data))
    rv_lambda_1 = tfd.Exponential(rate=alpha)
    rv_lambda_2 = tfd.Exponential(rate=alpha)

    # Uniform(0, 1) prior on tau, expressed as a fraction of the observation window.
    rv_tau = tfd.Uniform()

    # For each day, pick lambda_1 before the switchpoint and lambda_2 afterwards:
    # the boolean (tau * number_of_days <= day_index) is 0 before the switch and 1 after it.
    lambda_ = tf.gather(
         [lambda_1, lambda_2],
         indices=tf.cast(
             tau * tf.cast(tf.size(count_data), dtype=tf.float32)
             <= tf.cast(tf.range(tf.size(count_data)), dtype=tf.float32),
             dtype=tf.int32))
    rv_observation = tfd.Poisson(rate=lambda_)

    # The joint log probability is the sum of the individual log probabilities.
    return (
         rv_lambda_1.log_prob(lambda_1)
         + rv_lambda_2.log_prob(lambda_2)
         + rv_tau.log_prob(tau)
         + tf.reduce_sum(rv_observation.log_prob(count_data))
    )

Inference by sampling posterior

We use Markov Chain Monte Carlo (MCMC), specifically the Hamiltonian Monte Carlo (HMC) kernel (see the kernel definition below), to generate posterior samples using the model defined above. We rely on a property of TF called automatic differentiation, which automatically computes the gradients that HMC needs.

We then wrap joint_log_prob in a closure over the observed count data, so that the joint log probability can be computed quickly from the free parameters alone.
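Neither the closure nor the step size appears in the snippets below, so here is a minimal sketch of how they might be defined (the initial states and the 0.05 step size are illustrative assumptions; count_data is assumed to be a float32 tensor of the daily message counts):

import tensorflow as tf

# Closure over the observed data: HMC only ever sees the three free parameters.
def unnormalized_log_posterior(lambda_1, lambda_2, tau):
    return joint_log_prob(count_data, lambda_1, lambda_2, tau)

# Initial states for lambda_1, lambda_2 and tau, plus a starting step size for HMC.
initial_chain_state = [
    tf.reduce_mean(count_data, name="init_lambda_1"),
    tf.reduce_mean(count_data, name="init_lambda_2"),
    tf.constant(0.5, name="init_tau"),
]
step_size = tf.constant(0.05, dtype=tf.float32)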

HMC samples live in the real number space, but our exponential samples live in R+ and our uniform sample lives in (0,1). So we add these 3 independent bijectors to constrain the space we are sampling in (basically, to make sure the HMC proposals have non-zero probability and the chain doesn’t get stuck).

kernel=tfp.mcmc.TransformedTransitionKernel(
        inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=unnormalized_log_posterior,
            num_leapfrog_steps=2,
            step_size=step_size,
            state_gradients_are_stopped=True),
        bijector=unconstraining_bijectors)

where the bijectors are defined as follows:

unconstraining_bijectors = [
    tfp.bijectors.Exp(),       # Maps a positive real to R.
    tfp.bijectors.Exp(),       # Maps a positive real to R.
    tfp.bijectors.Sigmoid(),   # Maps [0,1] to R.  
]

Posterior distributions

Let’s plot the posterior distributions generated by the kernel:

# Run the chain and unpack the posterior samples for lambda_1, lambda_2 and tau.
[
    lambda_1_samples,
    lambda_2_samples,
    posterior_tau,
], kernel_results = graph_sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=initial_chain_state,
    kernel=kernel)

# tau was sampled in (0, 1); rescale it to a day index.
tau_samples = tf.floor(posterior_tau * tf.cast(tf.size(count_data), dtype=tf.float32))
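The helper graph_sample_chain is not defined in the snippets above; presumably it is a thin wrapper that runs tfp.mcmc.sample_chain inside a tf.function so that the sampling happens in graph mode. A minimal sketch of such a wrapper could look like this:

import tensorflow as tf
import tensorflow_probability as tfp

# Hypothetical wrapper: draws num_results samples after num_burnin_steps burn-in steps,
# tracing the kernel results alongside the sampled states.
@tf.function(autograph=False)
def graph_sample_chain(num_results, num_burnin_steps, current_state, kernel):
    return tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=current_state,
        kernel=kernel,
        trace_fn=lambda current_state, kernel_results: kernel_results)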

Results

You can see that the plotted samples for $\lambda_1$ and $\lambda_2$ sit far apart from each other and do not overlap at all, meaning that there was an actual change in the messaging pattern: quite an anomaly!

The posterior distribution of $\tau$ suggests with very high probability that the switch-over happened at around day 38 – not so easy to eyeball this one!
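For reference, histograms like the ones described above can be produced with a few lines of matplotlib (a minimal sketch; bin counts and labels are arbitrary choices):

import matplotlib.pyplot as plt
import tensorflow as tf

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))

# Posterior histograms of the two rates.
ax1.hist(lambda_1_samples.numpy(), bins=30, alpha=0.7, label=r"$\lambda_1$")
ax1.hist(lambda_2_samples.numpy(), bins=30, alpha=0.7, label=r"$\lambda_2$")
ax1.set_xlabel("expected number of daily messages")
ax1.legend()

# Posterior histogram of the switchpoint, in days.
ax2.hist(tau_samples.numpy(), bins=int(tf.size(count_data).numpy()), label=r"$\tau$")
ax2.set_xlabel("day of switchpoint")
ax2.legend()

plt.show()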

The full picture

Can only be depicted when we also add the labels for the days, along with the expected number of WhatsApp messages exchanged ($\lambda_1, \lambda_2$) as predicted by the model.

Expected number of WhatsApp messages exchanged

And, of course, also by looking at the full code:

Conclusion

This analysis shows:

  • there was a behavioural change ($\lambda_1$ is different from $\lambda_2$)
  • the change was sudden rather than gradual (as demonstrated by $\tau$’s peaked posterior distribution)
  • $\tau$ = day 38, which corresponds to March 12th 2020, the day I officially started working from home due to the Coronavirus lockdown.
  • a fun fact: there is a significant spike in the communication pattern on February 10th. I initially thought it was related to us making plans for Valentine’s Day, but looking at the logs I realised we had a fight. Communication is important! : )

Key take-away: when the data of interest is not big enough to train a neural network on, we can use a probabilistic model to get meaningful insights.

Resources:

[1]: Probabilistic Programming by Adrian Sampson at Cornell University
[2]: TensorFlow @ O’Reilly AI Conference, San Francisco ’18
[3]: Probabilistic Programming vs. Traditional ML
[4]: Probabilistic Programming & Bayesian Methods for Hackers
[5]: Probabilistic Programming for Programming Languages People
