Markdown is not context free (or, writing parsers vs perser combinators)

If you visit my blog posts over Gemini[1]. You'll see even though all my posts are written in Gemtext[2]. Due to the high frequency that I need to seperate inline code from text. I still use backticks for code blocks even it's not a standard in Gemtext. It' just feels natural. And so my Gemtext to HTML converter also supports it. One day I find one of my post needs support for bold text to make sense in HTML. So I was adding more markdown-like syntax support. Oh boy, adding bold and italic is a rabbit hole.

The * is not context free

Quick DuckDuckGo-ing for the BNF for Markdowns shows there's none. Boiling down to two main issues

  • The * is not context free
  • All possible inputs are valid markdown (No syntax error)

Let's try an example grammar (Alternativelly I can prove my point by dumping the grmmar into Bison and let it spit our errors):

some-plantext ::= [\wA-Za-z0-9]*
strong        ::= "**" some-plantext "**"
italic        ::= "*" some-plantext "*"
text          ::= some-plantext | strong | italic
paragraph     ::= text | text paragraph

Compiler develpers should already see the problem. There's two explnation of the string **a**. First italic{} {text} italic{} and the second strong{text}. A BNF grammar is not sufficient to describe the language. And it gets worse. The grammar rejects *a and **a*. Which is in violation of the second rule we had. We can fix this by adding new rules.

some-plantext ::= [\wA-Za-z0-9]*
                  | "*" [\wA-Za-z0-9]* | "**" [\wA-Za-z0-9]*
                  | [\wA-Za-z0-9]* "*" | [\wA-Za-z0-9]* "**"

Which make things a whole lot worse. Now the same **a** has new explnations. text{**a} italic{}, etc... There's no "simple" way to parse Markdown. Hand writing a LL parser seems to be a bad idea too. All the state and backtracking is bound to be a pain.

Parser combinator

Parser combinator is a really cool technique. Instead of expressing the grammar in someform of BNF, then build or generate a LL/LR parser. Parser combinator, as the name suggests expresses a complex parser as a combination of simpler parsers. A parser in parser combinator is a function that takes a string and yields a new node in the AST and the remaining input it cannot process. Note the wording here, it's yielding instead of returning. I really like Fritz's little rhyme.

a parser of Things
is a function of Strings
to an Generator
of Things and Strings - Fritz Ruehr, modified by me

Take a simple digit parser for example (I'm written pesudo C++, but it is close enough). In our example, I'll let the node yield the parsed string for simplicity:

auto digit() -> Parser {
    return [](std::string_view input) -> ResultGenerator {
        if (input.empty()) co_return;
        else if (input[0] >= '0' && input[0] <= '9')
            co_yield { std::string(1, input[0]), input.substr(1) };
    };
}

auto digit_parser = digit(); // returns a parser for a digit

// now we can apply the parser to a string
yield(digit_parser("123")); // -> {"1", "23"}
yield(digit_parser("abc")); // -> None

See! The digit parser accepted the string "123" and yield. We can also chain parsers or select between them using a then or Or parser (which the implemention I won't show, but feel free to read it. It's super simple).

auto two_digits = then(digit(), digit());
auto three_digits = then(two_digits, digit());
auto two_or_three_digits = Or(two_digits, three_digits);

yield(two_or_three_digits("1")); // -> None
yield(two_or_three_digits("12")); // -> {"12", ""}

auto parse = two_or_three_digits("123");
yield(parse); // -> {"12", "3"}
yield(parse); // -> {"123", ""}

Now the two_or_three_digits won't accept any single digit. Because the first parser accepted the only digit. Then the second parser is left with a null string. Thus rejecting the input. And when a 3 digit number is parsed, it actually yields multiple times. The first consuming 2 digits since that is a subset of a 3 digit number. The second consuming all 3 digits. This is how parser combinator handles ambiguity. It tries something first, if later parsers rejects the remainding input, it tries the next parser. Until all possible parsers are iterated.

For syntatic sugar, I'll use the + and | operator to express then and Or parser. Strings are treated as literals. Which only accept when the string is prefixed by the literal. (Ex: "123" accepts input "123" and "1234" but not "12" or "12asd")

Another intresting parser is some. It takes a parser and accepts the input if the parser can be applied one or more times. nullstring only accepts when the source string is fully consumed.

auto digits = some(digit());

yield(digits("123")); // -> {"123", ""}
yield(digits("123abc")); // -> {"123", "abc"}

auto integer = ("+" | "-" | "") + digits + nullstring();
yield(integer("+123")); // -> {"+123", ""}
yield(integer("-123")); // -> {"-123", ""}
yield(integer("123")); // -> {"123", ""}
yield(integer("123e5")); // -> None

Hopefully you can see where this is going. We can chain these basic parsers together to build a Markdown parser. The parser is very readable, no external DSL like Yacc and Bison needed. And it will do all the backtracking, disambiguation, and so on for us. We just have to guide the parser to yield the more complicated parse first and take the first yield. I highly encourage you to have a look at Computerphile's parser combinator video.

Oh, parser combinator also has another interesting feature. You can insert a LL/LR/LALR/etc.. parser as one of the sub-parser. Very useful for parsing embedded languages without explicit boundaries. Like inline HTML in Markdown.

Markdown paragraph parser

I implemented a proof of concept parser to parse gemtext w/ Markdown syntax. The tricky thing is the markdown_text parser. To make the overall parser fast. Instead of parsing paragraph using character() | some(character()) which tries to yield and disambiguate every character. It attempts to disambiguate only when it sees a * or _ or backtick character.

auto normal_text = markdown_text();
auto code = "`" + normal_text + "`";
auto em = ("*" + (code | normal_text) + "*")
        | ("_" + (code | normal_text) + "_");
auto strong = ("**" + (em | normal_text | code) + "**")
        | ("__" + (em | normal_text | code) + "__");
auto span = strong | em | code | normal_text;
auto text_run = some(span) + nullstring();

And to parse a paragraph, we just need to apply the text_run parser to the input.

auto parsed_result = text_run(paragraph);
assert(parse.begin() != parse.end()); // just to be sure the parser accepted the input

auto [ast, _] = yield(parsed_result);

Here's my PoC that parses a paragraph (well, highlights the style indicators. Easy enough to turn in to a translator).

Parser combinator should be more widespread

I think Parser Combinator should be more popular. Once I was talking to a Google recruiter and was asked about algorithms. One topic I was asked is to build a parser for a made up language. I wasn't allowed to use the standard flex/bison stack. I guess they are expecting me to build the BNF grammar and hand write/optimize a LL parser. I wrote a parser combinator and they had no idea what I was talking about. It's less code and more maintainable than a hand written LL parser. How's that not the expected answer?

Stories aside. I wrote my paragraph parser in 8 lines of code. And like 120 lines of infrastructure to implement parsing primitives. It's much shorter than any minimal Yacc + Bison + disambiguating implementation. Not to mention it's a lot easier to read and maintain.

I really think compiler courses should start teaching parser combinator instead of Lex/Yacc just for the simplicity. Much less pain too.

Hand writing GLL parser

There's one minor issue with parser combinators. They require C++20 coroutines. But the Gemini to HTML translator for my website is provided by dremini[2], a Gemini server library I wrote. And that library targets C++17. So I'm forced to hand write a GLL parser.

As expected it's no fun. You can view the source in the following link. It's a gigantic monolithic function:

I'll try my best to explain how it works. First, there's the parser state. It contains information about where the parser is in the input, the order and set of text style applied and the generated output.

struct ParserState
{
    std::string result;
    size_t pos = 0;
    bool in_code = false;
    bool in_italic = false;
    bool in_strong = false;
    std::stack<std::string> styles;
};

The parser maintains an array (I use a dqueue, only because I thought I'll need to insert on both ends). The first element is the active state. And the last element is the one it'll fall back if the current one ends up in a rejected state after consuming the input string. An accepting state here is defined as the output text is not currently in code/strong/em (i.e. no dangling text style).

static std::string renderPlainText(const std::string_view input)
{
    std::deque<ParserState> state_stack;
    state_stack.push_front({});

    while(!state_stack.empty()) {
        auto& state = state_stack.front();
        if(state.pos >= input.size() && state.styles.empty())
            return state.result;
        else if (state.pos >= input.size()) {
            // Take the last element and swap it with the first.
            // Then remove the last element. This makes that the current state.
            // And the current one is deleted
            std::swap(*state_stack.begin(), *state_stack.rbegin());
            state_stack.pop_back();
            continue;
        }
    ...

Now to parse the paragraph. We scan for the special characters. Switch the parser state to the corresponding one. AND push a fallback state to the stack. In case this is a false switch. Like I said above, Markdown does not allow syntax error. So when the parser encounters a string like 4*4=16. It sees the * and switches to the italic state. But no other * exists to switch the parser back into normal text. Thus ending in a rejected state. At this point the parser must fallback and treat the * as a normal character.

if(state.in_code && !state.styles.empty() && state.styles.top() == "code") {
    ...
} else if(state.in_code == false) {
    // Create a fallback state and push it to the stack
    auto backtrack_state = state;
    backtrack_state.result += ch;
    state_stack.push_back(backtrack_state);

    // Modify the current state
    state.in_code = true;
    state.result += "<code>";
    state.styles.push("code");
}

Handling ** and __ is even more complicated. Any ** can either be the beginning/ending of strong or an empty italic. Given empty italic is uselss, we ought to prioritize the strong. This could be done by modifying the current state. Then push 2 new fallbacks. One for italic. One for plain text.

auto italic_state = state; // make a clone
if(state.in_strong == false && !next_is_space) {
    // backup state for plain text
    auto backtrack_state = state;
    ...

    // modify the current state
    state.in_strong = true;
    ...
}
...
if(italic_state.in_italic == false) {
    ......
    // push more state to the stack
}


See? A normal parser is much more complicated than a parser combinator. Instead of asking you to program all the state transitions and backtracing. Combinators combines simple parsers in a easy to understand day. However a normal parser is what I have to live with before I can upgrade dremini to C++20. And now I totally see why Solderpunk decided to remove text styling in Gemtext. It's not even close to simple to parse. All the while being prone to interpretation differences.

Anyways, that's two different ways to parser a paragraph. Enough blogging for today. I'm gonna sign off now.

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.


  • marty1885 \at protonmail.com
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df