Scene_Cast2 3 days ago |

Tangentially related - LLMs implicitly do parsing when they read code. I wonder how close or far it is from parsing algorithms such as this one.

codebje 3 days ago | root | parent |

Quite a long way away. They have different jobs to do: an LR parser for code is trying to produce an abstract syntax tree for a compiler (or static analyser), a tree that represents the structured program expressed in the source. An LLM for code is trying to extract intent, code style, and other useful signals for generating more code that "belongs" with the rest. The former really wants valid source; the latter will accept any text and do its best. Text that expresses intent tends to work better, so code with single-letter variable names or obfuscated methods will make an LLM struggle but compile fine, while pseudo-code explaining the intended outcome in prose will help get good code produced, but not compile at all.
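
A tiny, entirely made-up illustration of that asymmetry:

    # Compiles and runs fine, but the names give an LLM almost nothing to work with:
    def f(a, b, c):
        return [x for x in a if x % b == c]

    # The reverse case: prose pseudo-code no compiler will accept, but which
    # gives an LLM plenty of intent to generate good code from:
    #
    #   keep only the readings whose value, modulo the sampling rate,
    #   equals the channel offset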

An LR parser will produce a parse tree, showing how the given token stream satisfies the parser's grammar - or an error, if the token stream doesn't satisfy it. Your typical parser algorithm is deterministic, even if the grammar it matches isn't, so you'd expect to get the same parse tree for the same token stream, every time. By and large LR parsers (as described in TFA) are shift/reduce parsers: there's a bunch of states, and in each state the next token in the input stream determines whether the parser shifts (pushes the token onto a term stack, moves to the next input token, and jumps to another state) or reduces (pops a number of terms off the term stack, replaces them with the term being reduced to, and goes back to a prior state). TFA talks about the situation where more than one shift or reduce action is possible for a given state and input token.
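
As a concrete sketch of that shift/reduce loop (a hand-built table for a toy grammar, not TFA's algorithm or any real generator's output):

    # Toy shift/reduce (LR-style) parser for the grammar:
    #     E -> E '+' 'n'      (rule 1)
    #     E -> 'n'            (rule 2)
    # A real LR parser generator would derive this table from the grammar.

    # ACTION[state][lookahead] is ('shift', next_state), ('reduce', rule) or ('accept',)
    ACTION = {
        0: {'n': ('shift', 2)},
        1: {'+': ('shift', 3), '$': ('accept',)},
        2: {'+': ('reduce', 2), '$': ('reduce', 2)},
        3: {'n': ('shift', 4)},
        4: {'+': ('reduce', 1), '$': ('reduce', 1)},
    }
    GOTO = {0: {'E': 1}}                  # state to jump to after reducing to E
    RULES = {1: ('E', 3), 2: ('E', 1)}    # rule number -> (lhs, number of rhs symbols)

    def parse(tokens):
        tokens = tokens + ['$']           # end-of-input marker
        states = [0]                      # state stack
        terms = []                        # term stack (parse-tree fragments)
        i = 0
        while True:
            state, lookahead = states[-1], tokens[i]
            action = ACTION[state].get(lookahead)
            if action is None:
                raise SyntaxError(f"unexpected {lookahead!r} in state {state}")
            if action[0] == 'shift':      # push the token, advance, jump to a new state
                terms.append(lookahead)
                states.append(action[1])
                i += 1
            elif action[0] == 'reduce':   # pop the rule's rhs, push its lhs, consult GOTO
                lhs, n = RULES[action[1]]
                children, terms[-n:] = terms[-n:], []
                states[-n:] = []
                terms.append((lhs, children))
                states.append(GOTO[states[-1]][lhs])
            else:                         # accept: the whole input reduced to one term
                return terms[0]

    print(parse(['n', '+', 'n', '+', 'n']))
    # ('E', [('E', [('E', ['n']), '+', 'n']), '+', 'n'])

Run it twice on the same tokens and you get the same tree - the determinism mentioned above.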

An LLM uses embeddings. AFAIK the current state-of-the-art algorithm (method, perhaps?) for this is still transformers with attention: roughly speaking, each token in the input gets projected into an embedding vector, a query vector, and a key vector; attention means that, for each token, all the other tokens whose key vector is close to that token's query vector get to alter its embedding, allowing context to have an impact. Multiple layers of attention can run one after another, eventually producing a final set of token values. Different embedding models use different means to pick one of those values (eg, pick the last one in the input sequence, if your query/key vectors tend to push information 'forwards' to following tokens), and that's your whole input's embedding. Neural nets can be deterministic, but those used for embedding rarely are: the values produced by repeated runs over the same input will (or should) always be pretty close, but generally won't be exactly the same.
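
A rough numpy sketch of a single attention step along those lines (one head, one layer, random matrices standing in for learned projections; what I called the per-token "embedding vector" is the value vector here):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16                                # embedding dimension (made up)
    tokens = 5                            # length of the input sequence

    X  = rng.normal(size=(tokens, d))     # initial per-token embeddings
    Wq = rng.normal(size=(d, d))          # "learned" projections, random here
    Wk = rng.normal(size=(d, d))
    Wv = rng.normal(size=(d, d))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # query, key, value vectors per token

    scores = Q @ K.T / np.sqrt(d)         # how close each key is to each query
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the tokens

    context = weights @ V                 # each token's vector updated by the
                                          # tokens its query attends to

    # One common way to get a single embedding for the whole input:
    # take the final token's vector after the attention layer(s).
    sequence_embedding = context[-1]
    print(sequence_embedding.shape)       # (16,)

Real models stack many such layers and heads, and add positional information, normalisation, and feed-forward blocks on top of this.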