(This is a bit of a ramble - apologies. I just wanted to throw out the idea of tokenizing individual molecules rather than, say, the amino acid sequences of those molecules. This is an incomplete essay draft.)
The transformer architecture is taking over the world of machine learning. I believe we can use transformers to simulate whole cells - starting with bacteria and working our way up to human cells. Eventually, I believe we will be able to fairly accurately predict the outcome of any environmental perturbation or modification. As a synthetic biologist, though, I am just as excited about the ability to do forward engineering. I believe there is untapped potential in the transformer architecture that we can use for this.
The core idea behind modern LLMs is that we can treat words (or, more accurately, tokens) as vectors, and then have a self-attention mechanism in which every vector can interact with every other vector. Both the token embeddings and the self-attention weights are tuned to fit the data, with the goal of predicting the next token. The architecture is surprisingly simple, but it can be trained on massive amounts of data, and the self-attention mechanism lets the system learn complicated, interdependent patterns.
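For concreteness, here is a minimal single-head self-attention block in PyTorch - a sketch of the standard mechanism, not any particular LLM's implementation:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: every token vector attends to every other."""
    def __init__(self, dim):
        super().__init__()
        # Learned projections; these weights are what training tunes.
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):  # x: (batch, tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Pairwise interaction scores between every pair of tokens.
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        weights = scores.softmax(dim=-1)
        # Each token's output is a weighted mix of all tokens' value vectors.
        return weights @ v

# Toy usage: 8 tokens, 64-dimensional embeddings.
x = torch.randn(1, 8, 64)
out = SelfAttention(64)(x)  # same shape out: (1, 8, 64)
```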
I conceptualize tokens as actual molecules within a biological system. Self-attention, then, is how molecules interact with each other - the naturally sparse encoding of large language models is somehow capturing the sparse space of biological interactions that are physically possible. Rather than predicting the next token, the system predicts the change in the positional encoding - which, in this biological model, is actually the primary thing we’re looking for, since positional encodings here are the quantities of the molecules within the system. Sequentially running the model changes the positional encodings - i.e., the numbers of molecules - until you reach the point of cell division, where the whole system resets itself.
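A rough sketch of what one step of this could look like, assuming per-species abundances play the role of positional encodings and an output head predicts the change in each abundance. Every name here (`CellStep`, the division criterion, the reset rule) is a hypothetical placeholder, not a worked-out design:

```python
import torch
import torch.nn as nn

class CellStep(nn.Module):
    """One simulation step: molecular species attend to each other, and the
    model predicts the change in each species' abundance (the 'positional
    encoding' in this framing) rather than a next token."""
    def __init__(self, n_molecules, dim):
        super().__init__()
        # One learned identity vector per molecular species (the 'token').
        self.species = nn.Embedding(n_molecules, dim)
        # Inject the current copy number into each species' vector.
        self.abundance_proj = nn.Linear(1, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.delta_head = nn.Linear(dim, 1)  # predicted change in abundance

    def forward(self, abundances):  # abundances: (n_molecules,)
        ids = torch.arange(len(abundances))
        x = self.species(ids) + self.abundance_proj(abundances[:, None])
        x = x.unsqueeze(0)          # add batch dim for attention
        x, _ = self.attn(x, x, x)   # molecules 'interact'
        delta = self.delta_head(x).squeeze()  # (n_molecules,)
        return abundances + delta

# Roll the model forward until a (hypothetical) division trigger fires.
model = CellStep(n_molecules=200, dim=64)
state = torch.rand(200) * 10  # initial copy numbers
with torch.no_grad():
    for _ in range(100):
        state = model(state)
        if state[0] > 50:      # placeholder division criterion
            state = state / 2  # crude reset: halve everything at division
            break
```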
A separate model can then generate what it thinks a new token would look like in the embedding space (i.e., what a new protein would do within the model), and that prediction can then be actualized and trained against the actual system.
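As a sketch of that loop (again, all names here are hypothetical assumptions): a generator maps a candidate protein's sequence features to the embedding the simulator would assign it, and the training signal comes from the embedding the simulator learns once the actualized protein's measured behavior is folded back in:

```python
import torch
import torch.nn as nn

class NewTokenGenerator(nn.Module):
    """Hypothetical generator: protein sequence features -> predicted
    token embedding in the simulator's vector space."""
    def __init__(self, seq_feat_dim, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seq_feat_dim, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, seq_features):
        return self.net(seq_features)

gen = NewTokenGenerator(seq_feat_dim=1024, dim=64)
predicted = gen(torch.randn(1024))  # predicted embedding for a new protein

# After building the protein and refitting the simulator to the measured
# data, regress the generator's prediction onto the fitted embedding.
fitted = torch.randn(64)  # stand-in for the simulator's learned embedding
loss = nn.functional.mse_loss(predicted, fitted)
```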
Self-attention here is different than it is in language. It’s much more fluid. But the mechanism of self-attention, followed by densely encoding the result into a vector, stays the same.