Regular Expressions as Containers
tl;dr: there's a way to do that ->
Ok, so, I need to rant about regular expressions a bit. Actually, regular expressions have some interesting properties that we could be getting more mileage out of.
Background refresher
Regular expression language
REs are themselves described by a small grammar, a minimalistic version of which is roughly this:
re: factor re # concatenation
| factor '|' re # alternation
| factor
factor: atom '*' # repetition; Kleene star
| atom '?' # optional
| atom
atom: [ -~] # literal
| '(' re ')' # grouping
I've skipped basically every feature that I can reconstruct from what I've included, but it should be sufficient for the discussion.
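As a sketch of how this little grammar might be read, here's a recursive-descent parser (the tuple-based AST and the function names are my own assumptions; postfix '*'/'?' bind to the preceding atom, and alternation binds loosest):

```python
# Hypothetical AST: ('lit', c), ('cat', a, b), ('alt', a, b),
# ('star', a), ('opt', a).
def parse(src):
    pos = 0

    def peek():
        return src[pos] if pos < len(src) else None

    def parse_alt():               # re:  cat ('|' re)?
        nonlocal pos
        node = parse_cat()
        if peek() == '|':
            pos += 1
            node = ('alt', node, parse_alt())
        return node

    def parse_cat():               # concatenation: one or more factors
        node = parse_postfix()
        while peek() not in (None, '|', ')'):
            node = ('cat', node, parse_postfix())
        return node

    def parse_postfix():           # factor: atom '*' | atom '?' | atom
        nonlocal pos
        node = parse_atom()
        while peek() in ('*', '?'):
            node = ('star' if peek() == '*' else 'opt', node)
            pos += 1
        return node

    def parse_atom():              # literal or '(' re ')'
        nonlocal pos
        c = peek()
        if c == '(':
            pos += 1
            node = parse_alt()
            assert peek() == ')', "unbalanced parenthesis"
            pos += 1
            return node
        assert c is not None and c not in '|*?)', "unexpected end or operator"
        pos += 1
        return ('lit', c)

    tree = parse_alt()
    assert pos == len(src), "trailing input"
    return tree
```

Error handling is just asserts here; a real parser would report positions.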
NFA Construction
These grammar productions map to Non-deterministic Finite Automaton (NFA) fragments. Note how the fragments' start and end states can always be merged without unintended surprises. "ε" means the empty string, that is, the NFA consumes no input.
Literal: [diagram of the NFA fragment]
Kleene star: [diagram of the NFA fragment]
After this, constructing the NFA is simply a matter of plugging the fragments together at the starts and ends.
DFA Conversion
After the NFA is constructed, it can either be used as is, as is done in NFA-based engines, or it can be converted to a DFA. Both rest on the same idea: follow all epsilon-transitions (those that consume no input) at once, keeping track of the set of states the NFA could be in. Using the NFA directly does this on the fly; converting to a DFA (the subset construction) records each such set of NFA states as a single DFA state.
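A sketch of that conversion, under my assumed representation (NFA as dict state -> list of (label, next), with label None for ε; each DFA state is the frozenset of NFA states it stands for):

```python
# Subset construction sketch: a DFA state is the ε-closed frozenset of
# NFA states the machine could be in at the same point in the input.
def eps_closure(nfa, states):
    """All states reachable from `states` via ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa.get(s, []):
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def determinize(nfa, start, accept):
    d_start = eps_closure(nfa, {start})
    dfa, work = {}, [d_start]
    while work:
        cur = work.pop()
        if cur in dfa:
            continue
        moves = {}
        for s in cur:
            for label, t in nfa.get(s, []):
                if label is not None:
                    moves.setdefault(label, set()).add(t)
        dfa[cur] = {lab: eps_closure(nfa, ts) for lab, ts in moves.items()}
        work.extend(dfa[cur].values())
    accepting = {q for q in dfa if accept in q}
    return dfa, d_start, accepting

def dfa_match(dfa, start, accepting, text):
    """Linear-time, constant-space, single-pass recognition."""
    q = start
    for c in text:
        if c not in dfa[q]:
            return False
        q = dfa[q][c]
    return q in accepting

# Example: hand-written NFA for a|b (0 is the start, 5 the accept state)
nfa_ab = {0: [(None, 1), (None, 3)], 1: [('a', 2)], 3: [('b', 4)],
          2: [(None, 5)], 4: [(None, 5)], 5: []}
dfa_ab, start_ab, acc_ab = determinize(nfa_ab, 0, 5)
```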
DFA Minimisation
Minimising the DFA basically works backwards from acceptance (think of it as a hidden state if you like), merging all states that are indistinguishable. So that, for example, all occurrences of AAA$ are merged into one: first because all ends $ are the same, then because all states with (exactly) one outgoing transition to $ are the same, and so on. The process ends when there are no more changes being made.
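A sketch of this refinement, phrased forwards in the usual Moore style (equivalent in effect; assumes a complete DFA as dict state -> {symbol: next_state}, a representation of my choosing):

```python
# Partition-refinement sketch: start from the accepting / non-accepting
# split, then keep splitting blocks whose members are distinguishable by
# which block a symbol takes them to, until nothing changes.
def minimise(dfa, accepting):
    states = list(dfa)
    block = {s: int(s in accepting) for s in states}
    while True:
        # Signature: own block plus the block each outgoing symbol leads to.
        sig = {s: (block[s],
                   tuple(sorted((a, block[t]) for a, t in dfa[s].items())))
               for s in states}
        ids = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        new_block = {s: ids[sig[s]] for s in states}
        if new_block == block:
            return block      # states sharing a block number can be merged
        block = new_block

# Example: counting a:s mod 4 with {0, 2} accepting is really just
# "even number of a:s", so the four states collapse to two.
mod4 = {0: {'a': 1}, 1: {'a': 2}, 2: {'a': 3}, 3: {'a': 0}}
blocks = minimise(mod4, {0, 2})
```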
The case for working with DFA:s
(Minimised) DFA:s have a few nice properties:
- Recognising input is trivially a linear-time, constant-space, single-pass algorithm.
- They're unique (up to renaming of states), so encoding transitions in sorted order reduces language comparison to comparison of the encoded representations.
- Nice set operations are efficient:
  - Inclusion (whether the language of A is a subset of that of B),
  - Emptiness (accepts nothing),
  - Universality (accepts anything),
  - Intersection, and
  - Negation.
So we should be able to get ==, <=, <, * and - operators, a zero and an infinity, as well as the possibility of doing recognition with Ahead-of-Time compiled code, and very tight code even without it.
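The workhorse behind these set operations is the product construction; here's a sketch (complete DFAs as dict state -> {symbol: next}, my assumed representation; inclusion falls out as "A minus B is empty"):

```python
# Product construction sketch: run both DFAs in lockstep; a product state
# is a pair of states, accepting when both components accept (intersection).
def intersect(da, sa, fa, db, sb, fb):
    start = (sa, sb)
    trans, acc, work = {}, set(), [start]
    while work:
        pair = work.pop()
        if pair in trans:
            continue
        p, q = pair
        if p in fa and q in fb:
            acc.add(pair)
        trans[pair] = {}
        for sym in set(da[p]) & set(db[q]):
            nxt = (da[p][sym], db[q][sym])
            trans[pair][sym] = nxt
            work.append(nxt)
    return trans, start, acc

def is_subset(da, sa, fa, db, sb, fb):
    """A <= B iff A ∩ complement(B) is empty; B must be complete."""
    # Only reachable pairs are explored, so "no accepting pair found"
    # is exactly the emptiness test.
    _, _, acc = intersect(da, sa, fa, db, sb, set(db) - set(fb))
    return not acc

# Example: "exactly one a" vs "any number of a:s", alphabet {a}
one_a = {0: {'a': 1}, 1: {'a': 2}, 2: {'a': 2}}   # accepting {1}; 2 is a sink
a_star = {0: {'a': 0}}                            # accepting {0}
both_trans, both_start, both_acc = intersect(one_a, 0, {1}, a_star, 0, {0})
```

Equality of two languages is then inclusion both ways, or, given minimal DFA:s, comparison of the canonical encodings mentioned above.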
Regular expressions as containers
To me, another intriguing possibility exists: To use the DFA as an indexed container. Defining it would look much the same as creating or compiling your regex does already, but the created object would have the added capabilities to produce a matching string named by an index, and, when matching (recognising), to tell which index a match has.
To find a match, start at the start state with the given index $i$, then divide the (remaining) $i$ by the number of outbound transitions (including acceptance), choose the transition given by the remainder, and make the quotient the new $i$. Repeat until the procedure has $i=0$ in an accepting state. (Or, if you prefer to think of acceptance as a state, until it transitions to the accepted state.)
Trying this with $i=0$, we'll find that the behaviour is very reminiscent of the maze-solving technique of following a wall. Ordering the edges by the sort order of their labels, this will prefer, e.g., 'a' over 'b', but the procedure will not generate matches in sorted order, for reasons that will become obvious momentarily. One consequence of note, however, is that acceptance must be the first transition if it is at all available. (Otherwise, it may never accept, even though it goes through a loop with several accepting states.)
If the pattern is infinite, e.g. like (aa*|bb*), then the indexing must alternate between the branches, because if it didn't, only the first infinite branch would ever be taken. (In the example, only ever a, aa, ...)
If the different transitions from a state lead to sub-graphs of different sizes, the indexing can't simply work by dividing the index by the number of transitions exiting the state and taking the remainder. What we can do, though, is work out the sizes of the languages the sub-graphs describe, take the remainder among all of them until the smallest sub-graph is exhausted, and then take the remainder without it.
If all sub-graphs are exhausted, we've found the end of the container, and if all the remaining sub-graphs are infinite, we can simply alternate forever, but all finite matches have finite indices.
a*|b*|ccc? would give us
"", "a", "b", "cc", "aa", "bb", "ccc", "aaa", "bbb", "aaaa", "bbbb", ...
(There's only one empty string, and "cc" comes before "aa" and "bb".)
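Restricting to finite languages, where no alternation between infinite branches is needed, the "produce the match with index $i$" operation can be sketched like this. Instead of interleaving, index ranges are simply allocated to the sub-graphs whole, acceptance first and then in symbol order; the dict representation and names are my own assumptions:

```python
from functools import lru_cache

def make_unrank(dfa, start, accepting):
    @lru_cache(maxsize=None)
    def count(s):
        """Number of strings accepted starting from state s (finite DFA)."""
        return (s in accepting) + sum(count(t) for t in dfa[s].values())

    def unrank(i):
        """The i-th match: acceptance first, then sub-graphs in symbol order."""
        s, out = start, []
        while True:
            if s in accepting:
                if i == 0:
                    return ''.join(out)
                i -= 1                  # accepting here used up index 0
            for sym in sorted(dfa[s]):
                c = count(dfa[s][sym])
                if i < c:               # the match lives in this sub-graph
                    out.append(sym)
                    s = dfa[s][sym]
                    break
                i -= c                  # skip the whole sub-graph
            else:
                raise IndexError("index out of range")

    return unrank, count

# Example: DFA for (a|b)(c|d)?, accepting states {1, 2}
dfa_ex = {0: {'a': 1, 'b': 1}, 1: {'c': 2, 'd': 2}, 2: {}}
unrank, count = make_unrank(dfa_ex, 0, {1, 2})
```

The inverse (match to index) runs the same loop, adding the skipped counts back up instead of subtracting them.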
So why don't we have this? Well, one reason is that DFA:s have some drawbacks. Notably, it's relatively easy to make the graph blow up: $(0|1)^{*}0(0|1)^{n}$ makes for $2^{n+1}$ states. It's literally called "state explosion".
Well, hm...