Writing my HTML minimizer in half a day

This Saturday I got too bored and decided to do some long overdue housekeeping on my website. I wanted to minify the HTML files, one because this website runs on Drogon's template engine, which, just like PHP, won't collapse whitespace and indentation, and two because I dislike that the HTML I'm sending out looks like a mess. Yet I couldn't find an HTML minifier for C++ on GitHub. I've long known that HTML is difficult to parse, especially since HTML, like Markdown, does not have a BNF grammar. The best anyone can do is a simple .*. But I looked into HTML's parsing rules and thought to myself: this isn't that hard if I don't want to maintain it. It helps that the HTML5 standard is very clear on how to handle out-of-spec HTML.

"Fuck it," I said to myself. I'll just write the damn thing. I chose the most generic approach: parse the HTML into an AST, then serialize it back into text. That way I can remove unneeded fluff during the serialization step.
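To make the idea concrete, here is a minimal sketch of the serialization half, not the actual code from my minimizer. The `Node` struct and the `element`, `text`, `collapse_whitespace` and `serialize` names are made up for illustration. It ignores attributes, comments and void elements like `<br>`, and it assumes whitespace-only text can always be dropped, which is only really safe between block-level elements.

```cpp
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node: an element when `tag` is set, a text node otherwise.
struct Node {
    std::string tag;   // empty for text nodes
    std::string text;  // only used by text nodes
    std::vector<std::unique_ptr<Node>> children;
};

// Small helpers for building a tree by hand.
std::unique_ptr<Node> element(std::string tag) {
    std::unique_ptr<Node> n(new Node());
    n->tag = std::move(tag);
    return n;
}

std::unique_ptr<Node> text(std::string s) {
    std::unique_ptr<Node> n(new Node());
    n->text = std::move(s);
    return n;
}

// Replace every run of whitespace with a single space.
std::string collapse_whitespace(const std::string& in) {
    std::string out;
    bool prev_space = false;
    for (unsigned char c : in) {
        if (std::isspace(c)) {
            if (!prev_space) out += ' ';
            prev_space = true;
        } else {
            out += static_cast<char>(c);
            prev_space = false;
        }
    }
    return out;
}

// Write the tree back out as text, minifying text nodes on the way.
// `preserve` is true inside elements whose whitespace is significant.
void serialize(const Node& node, std::string& out, bool preserve = false) {
    if (node.tag.empty()) {  // text node
        if (preserve) {
            out += node.text;
            return;
        }
        std::string collapsed = collapse_whitespace(node.text);
        // Assumption: text that is nothing but whitespace is just indentation.
        if (collapsed != " ") out += collapsed;
        return;
    }
    out += '<' + node.tag + '>';
    bool keep = preserve || node.tag == "pre" || node.tag == "textarea";
    for (const auto& child : node.children) serialize(*child, out, keep);
    out += "</" + node.tag + '>';
}
```

Feeding it a `div` that contains some indentation, a `p` with the text "hello", a newline, more indentation and "world", then a trailing newline, gives back `<div><p>hello world</p></div>`.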

Turns out hand-writing an HTML parser is not that hard. However, you do need some knowledge of traditional parser design and fluency in your language of choice. And it takes some elbow grease to power through the complex parser state.

The first steps are the hardest. It took me 2 hours to parse HTML into its AST, without the tag attributes. An hour later I got minimization working. Yet another hour later I was able to run my website through the minimizer without causing visual differences. Finally, after another 2 hours, I had implemented most of the standard parse error handling that was convenient for me, and even some basic CDATA support! The rest can be handled by the browsers, as they don't affect the AST.

Oh, one interesting thing I didn't know about HTML before. Apparently HTML allows <! to open a comment block. It's not the correct syntax, but spec-conforming parsers must handle it. It's fun to see how languages besides C++ deal with legacy code.
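As a rough sketch of what that looks like in a parser (again, not lifted from my minimizer), this follows my reading of the "markup declaration open" state in the HTML standard: after `<!`, anything that isn't `--`, a case-insensitive `DOCTYPE`, or `[CDATA[` is treated as a "bogus comment" that runs until the next `>`. The `Declaration` enum and the function names are made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class Declaration { Comment, Doctype, CData, BogusComment };

// ASCII case-insensitive prefix check.
bool starts_with_icase(const std::string& s, const std::string& prefix) {
    if (s.size() < prefix.size()) return false;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(s[i])) !=
            std::tolower(static_cast<unsigned char>(prefix[i]))) {
            return false;
        }
    }
    return true;
}

// `rest` is the input immediately after "<!".
Declaration classify_declaration(const std::string& rest) {
    if (rest.compare(0, 2, "--") == 0) return Declaration::Comment;
    if (starts_with_icase(rest, "doctype")) return Declaration::Doctype;
    // Note: per the standard, CDATA is only real CDATA inside foreign content
    // (SVG, MathML). In plain HTML it is a bogus comment too. Not handled here.
    if (rest.compare(0, 7, "[CDATA[") == 0) return Declaration::CData;
    return Declaration::BogusComment;
}

// A bogus comment runs up to the next '>' or the end of input.
std::string bogus_comment_body(const std::string& rest) {
    return rest.substr(0, rest.find('>'));
}
```

So something like `<!this is odd>` ends up as a comment node containing "this is odd", and a minimizer can drop it like any other comment.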

You can find my HTML minimizer here. There are some optimizations I could still add, but I won't for now since it's fast enough.

Author's profile. Photo taken in VRChat by my friend Tast+
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

  • marty1885 \at protonmail.com
  • Matrix: @clehaxze:matrix.clehaxze.tw
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df