A while ago I was entertaining problems in the intersection of symbolic manipulation of expressions and Deep Learning. In particular I was interested in finding "optimal" expressions in some sense. So, imagine you have some grammar $G$ that describes a set of expressions, let's call it $\mathrm{Exp}_G$, and suppose we have a real-valued function that takes an expression and maps it to a number, $f: \mathrm{Exp}_G \to \mathbb{R}$. The problem I want to discuss is finding expressions that minimize that function:
$$ e^{\star} = \arg \min_{e \in \mathrm{Exp}_G} f(e) $$
Let’s discuss one particular toy version of this problem. It might look like a silly problem, but it’s simple enough to illustrate the concept and non-trivial enough to avoid simple solutions.
Suppose we have a training set that consists of a sample of strings. We hold the hypothesis that there's an underlying logic or grammar that generated those strings, and we want to describe common patterns in these strings by finding a regular expression that would match as many of those strings as possible. We'd like to use this regular expression to check whether future out-of-sample strings we come across display the same patterns and belong to this group.
As an example, suppose our training set consists of the names "Rafael", "Gabriel" and "Manoel". Of course, there are infinitely many regular expressions that match all those names. In particular we have the really trivial option /Rafael|Gabriel|Manoel/ that would certainly match all strings but would probably fail to match out-of-sample items that do match interesting patterns in the training set – for example "Emanuel". In a way, this looks a lot like overfitting. Another failure mode would be underfitting and choosing the regex /.*/, which certainly fits all in-sample items but would also accept all out-of-sample items, even those that don't match any interesting patterns in the training set – for example "Pedro". Probably what we want is something like /.+el/
(or perhaps something with a bit more structure if this is not sufficiently constrained and we want to avoid matching the string “chapel”).
So, we have to impose additional requirements and constraints or, in ML parlance, additional regularizations that would make our regular expression generalize better. It could be interesting, for example, to find the regular expression that fits as many items as possible but also fails to match a set of randomly chosen strings (a negative sample). Or to find the regular expression that generates a reasonably sized match set but still matches as many items as possible. That can perhaps be stated as finding the regular expression $e$ that minimizes a function:
$$ f(e) = - \sum_{s \in \mathrm{D}} \mathtt{Match}_e(s) + \lambda \Lambda(e) $$
Where:

- $\mathtt{Match}_e(s)$ is $1$ if the regular expression $e$ matches the string $s$ and $0$ otherwise;
- $\mathrm{D}$ is our training set of strings;
- $\Lambda(e)$ is a regularization term that penalizes overly permissive expressions (for example, by how many strings from a negative sample they match, or by the size of their match set);
- $\lambda$ controls how strong that penalty is.
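To make this toy objective concrete, here is a rough Python sketch of $f(e)$ over raw regex strings, using a random negative sample as a crude stand-in for $\Lambda(e)$ (the use of `fullmatch`, the particular penalty and the sample strings are all arbitrary choices of mine):

```python
import re

def fitness(regex: str, positives: list[str], negatives: list[str], lam: float = 1.0) -> float:
    """Toy f(e): reward matching the training strings, penalize matching a
    negative sample (a crude stand-in for the regularizer Lambda(e))."""
    try:
        pattern = re.compile(regex)
    except re.error:
        return float("inf")  # syntactically invalid expressions get the worst possible score
    matched = sum(1 for s in positives if pattern.fullmatch(s))
    leakage = sum(1 for s in negatives if pattern.fullmatch(s)) / max(len(negatives), 1)
    return -matched + lam * leakage

positives = ["Rafael", "Gabriel", "Manoel"]
negatives = ["Pedro", "chapel", "xyz"]  # hypothetical negative sample
print(fitness(r"Rafael|Gabriel|Manoel", positives, negatives))  # the overfit option
print(fitness(r".*", positives, negatives))                     # the underfit option
print(fitness(r".+el", positives, negatives))                   # the middle ground we are after
```

Note that with only a negative-sample penalty the overfit alternation still gets the best score here; that is exactly why the choice of $\Lambda(e)$ matters (penalizing the size of the match set is the kind of alternative mentioned above).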
This particular problem of symbolic optimization has been bugging me for a while. There seem to be a lot of things we could do if we could solve it. Here's a non-exhaustive list.
Now, this looks interesting, but it actually seems very intractable. First of all, the function $f$ might be expensive to calculate. Even if it isn't, the space of possible expressions is combinatorially large, probably factorially large in the size of the grammar $G$ and the depth of the expressions being considered. Pure enumeration is infeasible, and techniques like genetic programming don't deal well with high-dimensional inputs. Something like Simulated Annealing can deal better with big combinatorial spaces, but convergence is slow.
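For a concrete baseline, here is roughly what a simulated-annealing attack on the regex toy problem could look like, with the fitness sketch from above closed over the data and a hypothetical `random_edit` operator that applies a small local mutation (insert/delete/replace one character or operator):

```python
import math
import random

def anneal(initial: str, objective, random_edit, steps: int = 10_000, t0: float = 1.0):
    """Simulated annealing over regex strings. `objective` is a one-argument
    callable, e.g. lambda e: fitness(e, positives, negatives), and `random_edit`
    is a hypothetical local mutation of the regex string."""
    current = best = initial
    f_cur = f_best = objective(initial)
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9  # simple linear cooling schedule
        candidate = random_edit(current)
        f_cand = objective(candidate)
        # Always accept improvements; accept worse moves with probability exp(-delta/T).
        if f_cand <= f_cur or random.random() < math.exp((f_cur - f_cand) / temperature):
            current, f_cur = candidate, f_cand
            if f_cur < f_best:
                best, f_best = current, f_cur
    return best, f_best
```

Even in this toy setting the walk spends most of its time evaluating candidates that are syntactically broken or barely different from each other, which is more or less the slow-convergence problem mentioned above.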
Wouldn't it be nice if we could somehow turn this into gradient descent over a continuous space of analytical functions? That's the easiest and most tractable type of optimization we know. If we could somehow map, even approximately, a combinatorial optimization into a continuous, $\mathbb{R}^n$-valued optimization, that would be the best deal we could get, wouldn't it?
So, here's an idea that occurred to me a while ago, when I was reading the Neural Architecture Optimization paper2. Suppose we had a pair of functions, let's call it an Encoder/Decoder pair, that maps our expressions into and onto real-valued vectors. That is, we have a pair of functions:
$$E: \mathrm{Exp}_G \to \mathbb{R}^k$$
and
$$D: \mathbb{R}^k \to \mathrm{Exp}_G$$
such that $D(E(e)) = e$ for all $e \in \mathrm{Exp}_G$.
If we had such a pair of functions then, instead of finding the expression $e$ that optimizes $f(e)$, we could try to find the vector $\mathbf{x} \in \mathbb{R}^k$ that optimizes $f(D(\mathbf{x}))$. Now we're optimizing over a continuous space!!! That's a bit of progress, but it's still not perfect: we don't know how to calculate gradients of $f(D(\mathbf{x}))$ with respect to the vector $\mathbf{x}$, since there's the intervening combinatorial space of the expressions.
Let’s first focus on the Encoder/Decoder pair. What if we could train functions to learn this bijection? Even if it’s not an exact bijection, it could be a start, right?
So, imagine we have a parametric family of functions we could use to find a suitable encoder:
$$E: \Theta_E \times \mathrm{Exp}_G \to \mathbb{R}^k$$
where $\Theta_E$ is some space of parameters. We could try to find a function that evaluates how well the vector $E(\theta_E, e)$ encodes the expression $e \in \mathrm{Exp}_G$ and find the parameter $\theta_E \in \Theta_E$ that optimizes that “well-encodability” function. If the parameter space is nice and the “well-encodability” function is continuous this could be done by gradient descent.
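For instance, here is a minimal sketch of one possible parametrization in PyTorch: a character-level LSTM that reads the string form of an expression and produces a $k$-dimensional embedding (the vocabulary, the sizes and the architecture itself are placeholder choices, not anything the idea prescribes):

```python
import torch
import torch.nn as nn

class RegexEncoder(nn.Module):
    """A sketch of E(theta_E, e): embed the characters of the expression,
    run an LSTM over them and project the last hidden state to R^k."""

    def __init__(self, vocab_size: int = 128, hidden: int = 64, k: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, k)

    def forward(self, expr: str) -> torch.Tensor:
        # Represent the expression by the ASCII codes of its characters.
        ids = torch.tensor([[min(ord(c), 127) for c in expr]])
        _, (h, _) = self.lstm(self.embed(ids))
        return self.proj(h[-1]).squeeze(0)  # shape: (k,)
```

A TreeLSTM over the AST, which I come back to below, would be the more structured alternative to this flat character-level view.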
We could do the same with the decoder:
$$D: \Theta_D \times \mathbb{R}^k \to \mathrm{Exp}_G $$
If we put the encoder and decoder together, one good candidate for a measure of “well-encodability/decodability” is the reconstruction loss3:
$$ \ell(\theta_E, \theta_D) = \sum_{k=1}^N d(e_k, D(\theta_D, E(\theta_E, e_k))) $$
where $d(e, e')$ is a measure of how distant two expressions are. In summary, we would:

1. sample a set of expressions $e_1, \dots, e_N$ from the grammar;
2. run each one through the encoder and the decoder and measure how far the reconstruction $D(\theta_D, E(\theta_E, e_k))$ is from the original $e_k$;
3. gradient descend that reconstruction loss on $\theta_E$ and $\theta_D$ jointly.
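A sketch of that loop, under the (big) assumption that the decoder can be scored with teacher forcing, so that a character-level cross-entropy can play the role of the distance $d$; the `decoder(z, target_ids)` signature and the `to_ids` helper are hypothetical:

```python
import torch
import torch.nn.functional as F

def reconstruction_step(encoder, decoder, optimizer, expressions, to_ids):
    """One gradient step on the reconstruction loss. Assumes a hypothetical
    decoder(z, target_ids) that returns per-character logits under teacher
    forcing, so that cross-entropy stands in for d(e_k, D(E(e_k)))."""
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for expr in expressions:
        target = to_ids(expr)          # LongTensor of character ids, shape (L,)
        z = encoder(expr)              # the embedding x_k = E(theta_E, e_k)
        logits = decoder(z, target)    # shape (L, vocab_size)
        loss = loss + F.cross_entropy(logits, target)
    loss.backward()
    optimizer.step()
    return loss.item()
```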
Now, after we have learned the encoder and decoder, we can go back to our original optimization problem. We still want to find the best expression that minimizes $f(e)$, and we want to do that by minimizing $f(D(\theta_D, \mathbf{x}))$. Perhaps we can apply the same trick above? We could try to learn a function:
$$V: \Theta_V \times \mathbb{R}^k \to \mathbb{R} $$
that, given the vector embedding $\mathbf{x} = E(\theta_E, e)$ associated with an expression $e$, learns to estimate the value of $f(e)$. One possible way would be to:

1. take our sample of expressions $e_1, \dots, e_N$ and compute $f(e_k)$ for each of them;
2. compute their embeddings $E(\theta_E, e_k)$;
3. fit $V$ by regressing the embeddings onto the computed values.
That could be achieved by gradient descending the loss function:
$$ \ell(\theta_V) = \sum_{k=1}^{N} (V(\theta_V, E(\theta_E, e_k)) - f(e_k))^2 $$
So, in the end we could simultaneously descend on the three parameters $\theta_E$, $\theta_D$ and $\theta_V$ to minimize the following loss:
$$ \ell(\theta_E, \theta_D, \theta_V) = \sum_{k=1}^N\left[d(e_k, D(\theta_D, E(\theta_E, e_k))) + (V(\theta_V, E(\theta_E, e_k)) - f(e_k))^2\right] $$
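One gradient step on this joint loss could then look like the sketch below, reusing the same hypothetical teacher-forced decoder and a `value_net` module that maps an embedding to a single scalar; the batch carries precomputed values of $f$:

```python
import torch
import torch.nn.functional as F

def joint_step(encoder, decoder, value_net, optimizer, batch, to_ids):
    """One gradient step on the joint loss: reconstruction term plus the squared
    error between V(theta_V, E(theta_E, e_k)) and the precomputed f(e_k).
    `batch` is a list of (expression, f_value) pairs."""
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for expr, f_value in batch:
        target = to_ids(expr)
        z = encoder(expr)
        recon = F.cross_entropy(decoder(z, target), target)   # stands in for d(e_k, D(E(e_k)))
        surrogate = (value_net(z).squeeze() - f_value) ** 2   # (V(x_k) - f(e_k))^2
        loss = loss + recon + surrogate
    loss.backward()
    optimizer.step()
    return loss.item()
```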
Once we learn the best parameters $\theta_E^\star$, $\theta_D^\star$, $\theta_V^\star$, finding the best expression $e$ that minimizes $f(e)$ can be done by:

1. gradient descending the surrogate $V(\theta_V^\star, \mathbf{x})$ with respect to $\mathbf{x}$ to find a promising latent vector $\mathbf{x}^\star$;
2. decoding it back into an expression: $e^\star = D(\theta_D^\star, \mathbf{x}^\star)$.
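A sketch of that last search step, assuming a hypothetical `decode_to_expr` that runs the trained decoder in inference mode and returns a concrete expression string:

```python
import torch

def search_latent(value_net, decode_to_expr, k: int = 32, steps: int = 200, lr: float = 0.05):
    """Minimize the surrogate V over the latent vector x by gradient descent,
    then decode the result back into an expression."""
    x = torch.zeros(k, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        value_net(x).sum().backward()  # only x is updated; V's parameters stay fixed
        optimizer.step()
    return decode_to_expr(x.detach())
```

As far as I understand it, this is essentially the move the Neural Architecture Optimization paper makes in its own setting: walk the embedding along the gradient of the surrogate and decode it back.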
This will probably not be an exact solution, but hopefully one that is good enough.
That's a nice story, but is it actually possible to do? That's a great question. I have no idea. I tried attacking the regex problem with this idea before and hit several brick walls (which mainly stem from my lack of knowledge in symbolic computation, since my focus is 100% on ML and not on Computer Science in general).
The first difficulty is how to define the parametric family of encoders $E(\theta, e)$. Those functions must take as input elements of the set of possible expressions $\mathrm{Exp}_G$ (the set of all regular expressions in our toy example). We could work with a string representation of the expression and use whatever fancy architecture kids are using for processing strings nowadays. But it would be nice to be able to feed the Abstract Syntax Tree (AST) of the expression and process that directly. I wonder if that would add significant amounts of inductive bias about the problem and help the model to learn. I tried using TreeLSTMs4 for this problem in the past and the result was interesting. So I think this part is really feasible.
The parametric family of decoders $D(\theta_D, \mathbf{x})$ is a much harder problem though. Those functions have to produce syntactically correct expressions in order for the optimization to make sense. Using simple string-generation neural nets to produce a string representation of the expression will probably not guarantee syntactically valid expressions for every latent vector we feed in. One possibility is to build the decoder as a generative process that produces valid ASTs recursively. That can work, but in my past tries I had a lot of trouble with performance.
Also, randomly generating trees is not easy, and naive processes tend to produce trees with a very heavy-tailed depth distribution. In my experiments my decoder would frequently generate very shallow ASTs, but every once in a while it would produce a monstrously deep one. Manually clipping the maximum depth didn't work very well for me, but I might have done something dumb.
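To illustrate the recursive, valid-by-construction route and one crude mitigation for the depth problem, here is a toy AST for a tiny regex fragment with a recursive sampler that hard-caps the depth and raises the probability of stopping at a leaf as it recurses (a sketch of the idea, not something I claim fixes the heavy tail):

```python
import random
from dataclasses import dataclass
from typing import Union

# A toy regex AST: literals, Kleene star, concatenation and alternation.
@dataclass
class Lit:
    ch: str

@dataclass
class Star:
    child: "Node"

@dataclass
class Concat:
    left: "Node"
    right: "Node"

@dataclass
class Alt:
    left: "Node"
    right: "Node"

Node = Union[Lit, Star, Concat, Alt]

def sample_ast(max_depth: int, p_leaf: float = 0.4) -> Node:
    """Sample a random AST. Forcing a leaf at max_depth bounds the depth, and
    raising p_leaf at each level thins out the tail of the depth distribution."""
    if max_depth <= 1 or random.random() < p_leaf:
        return Lit(random.choice("abcdefghijklmnopqrstuvwxyz."))
    op = random.choice([Star, Concat, Alt])
    if op is Star:
        return Star(sample_ast(max_depth - 1, p_leaf + 0.1))
    return op(sample_ast(max_depth - 1, p_leaf + 0.1),
              sample_ast(max_depth - 1, p_leaf + 0.1))

def to_regex(node: Node) -> str:
    """Pretty-print an AST back to regex syntax; the output is always syntactically valid."""
    if isinstance(node, Lit):
        return node.ch
    if isinstance(node, Star):
        return f"(?:{to_regex(node.child)})*"
    sep = "" if isinstance(node, Concat) else "|"
    return f"(?:{to_regex(node.left)}{sep}{to_regex(node.right)})"
```

A learned decoder would replace the uniform `random.choice` calls with distributions conditioned on the latent vector, but the same structural trick (termination forced by the depth cap, leaf probability growing with depth) is what keeps the generated ASTs both valid and bounded.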
So, I didn't reach the point where I could test if this idea would work at all, but it looks like a promising direction for someone with actual time to test it.
Alternatively, some other strategies, like type-driven development in Idris, use information about typing and dependently typed proofs to suggest code by case analysis. Given a strong enough type constraint, Idris is able to generate the desired code for a function. Perhaps there's a way to marry this with symbolic manipulation of ASTs as well. ↩︎
Luo, R., Tian, F., Qin, T., Chen, E., & Liu, T. Y. (2018). Neural architecture optimization. arXiv preprint arXiv:1808.07233. ↩︎
Granted that we are able to find a distance function between expressions that can provide continuous gradients with respect to $\theta_D$ and $\theta_E$. ↩︎
Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. ↩︎