The approximate answer to life, the universe and everything is neural networks. At least this is what the Universal Approximation Theorem (UAT) tells us.

Or, to put it in more earthly terms, quoting Ian Goodfellow:

“A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.”

But – I may add – it is, nonetheless, very useful.

Now, other earthlings may ask themselves, why, in the name of **Intuition**, does this even work?

A brief and not so obvious intuition would be this one: because of the nonlinearities introduced by the activation functions (as George Cybenko proved for sigmoid activations in 1989). Or, rather, because of the multilayer feedforward architecture itself (as Kurt Hornik showed in 1991).

These technical explanations might make sense to the more mathematically oriented minds, but the underlying ideas of the theorem can be easily followed via a simple example.

Let’s say you are a one-hidden-layer (or shallow) neural network and your task is to approximate this non-trivial function: $${x^3 - 2x + 2}$$

How would you solve it? Fire your internal neurons!

As per the simple architecture above, in order to approximate the given function, this particular neural network will have to activate all 6 of its neurons. The input, scaled by the weights and shifted by the biases, flows from the input layer to the middle layer, where activation functions (obviously, we will use ReLU, which is, as of 2017, the most popular activation function for deep learning) transform this input signal into output signals. Eventually, these 6 outputs merge into one final neuron: the approximate answer!

Or – simply put, in feed-forward pass terms, the neural network will approximate the function as:

$${ W^o \cdot ReLU(W^h \cdot x + b^h) + b^o },$$

where the superscript h marks the weights and biases of the hidden layer and o those of the output layer.
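Written out for a scalar input and six hidden neurons, the same pass becomes a plain sum. The index ${i}$ over the hidden neurons is my own notation, added here for illustration:

$${ o(x) = \sum_{i=1}^{6} W^o_i \cdot ReLU(W^h_i \cdot x + b^h_i) + b^o }$$

Each term is a ramp that is flat on one side and linear on the other, so six neurons give a piecewise-linear curve with at most six bends.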

Brave minds who already know the intricacies of a neural network can opt for thinking and behaving as one, using different combinations of ReLU functions:

Desmos is a nice playground for representing neural networks. Here I have simply drawn my version of UAT with all its components: the true function, the ReLU function, the 6 activated neurons and the final approximation. Visually, at least, it looks quite neat!

In the chart below, ${n_1, n_2, n_3, n_4, n_5, n_6}$ are the outputs of the 6 neurons in the hidden layer, calculated as ${ReLU(W^h\cdot x+b^h)}$.

You can think of these as being “high-level” features learned by the neural network. The weighted sum of these neurons – ${ o(x) }$ – is the regression value or the prediction.

All these hand-picked weights and biases can be plugged programmatically into a feed-forward pass to test their predictive performance:

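```python
import numpy as np

def relu(z):
    # Element-wise ReLU: max(0, z)
    return np.maximum(0, z)

def feed_forward(X, W1, b1, W2, b2):
    A1 = X               # input layer
    Z1 = A1 @ W1 + b1    # hidden-layer pre-activations
    A2 = relu(Z1)        # hidden-layer outputs (the 6 neurons)
    Z2 = A2 @ W2 + b2    # output layer: weighted sum plus bias
    return Z2
```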

As expected, the results are not great, but at least the predicted values follow the general shape of the actual values : )
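The Desmos values are not listed here, so as a stand-in, here is a minimal sketch of one systematic way to hand-pick weights for 6 ReLU neurons: put a bend at each of 6 evenly spaced knots and let the output weights be the changes in slope of the straight-line interpolant. The interval [-2, 2] and the helper names `target` and `interpolating_weights` are assumptions for illustration; this is not the set of weights used in the chart.

```python
import numpy as np

def target(x):
    # The function to approximate: x^3 - 2x + 2
    return x**3 - 2 * x + 2

def relu(z):
    return np.maximum(0, z)

def feed_forward(X, W1, b1, W2, b2):
    # One hidden ReLU layer followed by a linear output layer
    return relu(X @ W1 + b1) @ W2 + b2

def interpolating_weights(f, lo=-2.0, hi=2.0, n_hidden=6):
    # Assumption: knots evenly spaced on [lo, hi], one bend per hidden neuron
    knots = np.linspace(lo, hi, n_hidden + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)  # slope of each segment
    W1 = np.ones((1, n_hidden))                  # every neuron sees x unscaled
    b1 = -knots[:-1]                             # neuron i switches on at knots[i]
    W2 = np.diff(slopes, prepend=0.0).reshape(-1, 1)  # change of slope at each bend
    b2 = f(knots[0])                             # value at the left edge
    return W1, b1, W2, b2

X = np.linspace(-2, 2, 201).reshape(-1, 1)
W1, b1, W2, b2 = interpolating_weights(target)
pred = feed_forward(X, W1, b1, W2, b2)
max_err = np.max(np.abs(pred - target(X)))
print("max abs error on [-2, 2]:", max_err)
```

By construction, the prediction matches the target at the knots and connects them with straight lines in between, which is why the fit is worst where the cubic curves the most.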

Now, acting as a neural network can be a tedious job: you have to manually try various weights in order to change the slopes, or slightly nudge the biases to move these functions right or left. Therefore, I am going to leave this task to the expert: TensorFlow!
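To make that nudging concrete, here is a small sketch of a single hidden neuron. The helper names `neuron` and `kink` are hypothetical, chosen only for this illustration. The neuron switches on where ${W^h \cdot x + b^h = 0}$, so the bias moves the bend left or right and the weight sets the slope after it.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def neuron(x, w, b):
    # Output of one hidden neuron: ReLU(w * x + b)
    return relu(w * x + b)

def kink(w, b):
    # x-position of the bend, i.e. where w * x + b == 0 (requires w != 0)
    return -b / w

x = np.array([0.0, 1.0, 2.0])
print(kink(1.0, -1.0), neuron(x, 1.0, -1.0))  # bend at x = 1, slope 1 afterwards
print(kink(1.0, 1.0), neuron(x, 1.0, 1.0))    # a larger bias moves the bend left
print(kink(2.0, -1.0), neuron(x, 2.0, -1.0))  # a larger weight steepens the ramp
```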

Using Colab, we’ll design a neural network for regression and configure it to test our scenario. Here, we’ll use a `Sequential` model with one densely connected hidden layer of 6 neurons, and an output layer that returns a single, continuous value.
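A minimal sketch of what such a model could look like in Keras. The training interval [-2, 2], the Adam optimizer, the learning rate, the batch size and the number of epochs are assumptions for illustration and not necessarily the settings used in the Colab notebook; `target` and `build_model` are just helper names.

```python
import numpy as np
import tensorflow as tf

def target(x):
    # The function to approximate: x^3 - 2x + 2
    return x**3 - 2 * x + 2

def build_model(n_hidden=6):
    # One hidden Dense layer with ReLU, one linear output neuron
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    # Assumed settings: mean squared error with Adam at learning rate 0.01
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        loss="mse",
    )
    return model

# Assumed training data: 400 evenly spaced points on [-2, 2]
X = np.linspace(-2, 2, 400, dtype="float32").reshape(-1, 1)
y = target(X)

model = build_model()
model.fit(X, y, epochs=200, batch_size=64, verbose=0)
print("training MSE:", model.evaluate(X, y, verbose=0))

# Learned parameters, for comparison with hand-picked ones
W_h, b_h = model.layers[0].get_weights()
W_o, b_o = model.layers[1].get_weights()
```

The result depends on the random initialization, so the learned weights and the final error will differ from run to run.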

The weights generated by the TensorFlow model are comparable to the ones hand-picked in Desmos. And the model does a pretty good job as well, as seen in the plot below:

The universality construction we have developed so far uses only one hidden layer to approximate an arbitrary function and, although the results are not of direct practical use in implementing neural networks, I hope they are intuitive enough for you to see the potential of learning functions using deep learning. With this exercise, I am also aiming for a perception shift: what is the ultimate question about the UAT? It is not whether any particular function is computable, but rather what’s a *good* way to compute the function.

I will leave you with this food for thought: Each day we have a lot of questions designed by Deep Thought. For some of them, approximate answers are good enough.

Resources:

[1]: Chapter 4 from Neural Networks and Deep Learning

[2]: Universal approximation theorem