Common Gemini crawler pitfalls

Just like on the regular web, crawlers on Gemini run into pitfalls, though the impact is much lower. Gemtext pages are much smaller than HTML pages, and since Gemini does not support reusing the TCP connection, it takes much longer to mass-crawl a single capsule. That usually gives me the chance to catch an issue when I see the crawler still running late at night. Anyway, this is a list of issues I have seen.

1. Misconfigured robots.txt MIME type

A handful of capsules have a robots.txt, but serve it with the MIME type text/gemini instead of text/plain. I don't know why this is the case. What happens is that most Gemini crawlers (at least TLGS's and GUS's) will straight up ignore the robots.txt, which is kind of the right thing to do according to web standards. The consequence is that the crawler will crawl the whole capsule, even APIs and other things that just shouldn't be crawled. Capsule owners can check this with a quick command.

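export CAPSULE_HOST="example.com"
# A Gemini request must end with CRLF. printf emits \r\n portably; plain echo in bash would send a literal "\r".
printf 'gemini://%s/robots.txt\r\n' "${CAPSULE_HOST}" | ncat --ssl "${CAPSULE_HOST}" 1965 | sed -n '1p'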

A correctly configured capsule should return 20 text/plain. If you see 20 text/gemini, then the capsule is misconfigured.
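
The same check can be sketched in Python, the way a crawler might decide whether to honor a fetched robots.txt. This is only an illustration of the rule described above, not TLGS's or GUS's actual code; the function names are made up for this example, and it assumes the first line of the response has already been read from the socket.

```python
def parse_gemini_header(header_line: str) -> tuple[int, str]:
    """Split a Gemini response header such as '20 text/plain; charset=utf-8'."""
    status_text, _, meta = header_line.strip().partition(" ")
    return int(status_text), meta.strip()


def robots_txt_is_usable(header_line: str) -> bool:
    """True only for a success status (2x) with a text/plain media type."""
    status, meta = parse_gemini_header(header_line)
    if not 20 <= status <= 29:
        return False
    # Drop parameters like "; charset=utf-8" before comparing the media type.
    media_type = meta.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"
```

With this, robots_txt_is_usable("20 text/plain") is true, while "20 text/gemini" and any non-success status are rejected, so the crawler behaves as if no robots.txt existed.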

2. Paths in robots.txt misusing the trailing /

There's a minor but important difference between /search and /search/. According to the robots.txt RFC draft[1], the former blocks /search?test but the latter allows it. Some capsules meant to block an endpoint itself, but with the trailing / they only blocked the subpaths under it.

[1]: robots.txt RFC draft
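
The difference comes from robots.txt rules being matched as plain prefixes of the path and query. A minimal sketch in Python, assuming only Disallow rules and leaving out wildcards and Allow handling; is_disallowed is a hypothetical helper, not part of any crawler mentioned here.

```python
def is_disallowed(path_and_query: str, disallow_rules: list[str]) -> bool:
    """Plain prefix match; an empty rule disallows nothing."""
    return any(rule and path_and_query.startswith(rule) for rule in disallow_rules)


# "/search" covers the endpoint itself, with or without a query.
print(is_disallowed("/search?test", ["/search"]))      # True
# "/search/" only covers paths below it, so the endpoint stays crawlable.
print(is_disallowed("/search?test", ["/search/"]))     # False
print(is_disallowed("/search/results", ["/search/"]))  # True
```

Note that a prefix rule like /search also matches /searching, so the rule without the trailing / is the broader of the two.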

3. Crawler robots.txt caching

In HTTP, caching of robots.txt can be controlled by the Cache-Control header. But in Gemini there are no headers whatsoever, so crawlers can only make a best effort. TLGS currently caches robots.txt for a week. On rare occasions this gets TLGS into trouble: the cached copy is not invalidated when a capsule adds a new endpoint to its disallowed paths.
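
To make the trade-off concrete, here is a minimal sketch of a time-based robots.txt cache in Python. It is not how TLGS implements its cache; the RobotsCache class and its parameters are assumptions for illustration, and the clock is injectable only so the expiry logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class RobotsCache:
    """Keeps robots.txt bodies per host and forgets them after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, host: str, body: str) -> None:
        self._entries[host] = (self.clock(), body)

    def get(self, host: str) -> Optional[str]:
        entry = self._entries.get(host)
        if entry is None:
            return None
        fetched_at, body = entry
        if self.clock() - fetched_at >= self.ttl_seconds:
            del self._entries[host]  # stale: the caller should refetch
            return None
        return body


# One week, as in the text. A shorter TTL means more refetches but fresher rules.
cache = RobotsCache(ttl_seconds=7 * 24 * 3600)
```

Anything a capsule adds to its disallowed paths during the TTL window stays invisible to the crawler until the entry expires.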

I've been considering reducing the caching period to a day, but I'm not sure about the performance impact. Even with that, it's still possible for a TOCTOU (time-of-check to time-of-use) bug to creep in: a capsule may create new endpoints right after its robots.txt is cached. There's no real way to prevent this. The best a crawler can do is to keep a limit on how many pages it crawls on the same capsule, and stop if things start to smell fishy.
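
The per-capsule limit can be as simple as a counter. A minimal sketch in Python, assuming a fixed cap per host; the PageBudget name and the cap value are hypothetical, and a real crawler would likely pick the cap from experience with how large legitimate capsules get.

```python
class PageBudget:
    """Counts pages crawled per host and refuses once a host hits the cap."""

    def __init__(self, max_pages_per_host: int):
        self.max_pages_per_host = max_pages_per_host
        self._counts: dict[str, int] = {}

    def allow(self, host: str) -> bool:
        count = self._counts.get(host, 0)
        if count >= self.max_pages_per_host:
            return False  # something smells fishy: stop crawling this host
        self._counts[host] = count + 1
        return True


budget = PageBudget(max_pages_per_host=10_000)  # the cap here is an arbitrary example
if budget.allow("example.com"):
    pass  # fetch the next page from example.com here
```

The budget does not fix the TOCTOU window, it only bounds how much damage a stale robots.txt or an endless endpoint can do.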

4. Too many open sockets

Gemini does not support reusing TCP connections. And after the remote host closes a connection, the socket is not freed right away; it lingers in a teardown state such as CLOSE_WAIT or TIME_WAIT. It's likely that you crawl so fast that old connections are not fully closed before new ones are opened, leading to a build-up of open sockets that reaches the limit after a short while. This then either crashes the crawler or causes a lot of subsequent crawl attempts to fail. The only real solution is to periodically check for open sockets and wait for them to close.
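
The "check and wait" step could look like the following Python sketch. How the open sockets are counted is left to the caller, since that is platform-specific (on Linux one could count what a tool like ss reports); count_open_sockets, the limit, and the polling interval are all assumptions for illustration.

```python
import time
from typing import Callable


def wait_for_socket_headroom(
    count_open_sockets: Callable[[], int],
    max_open_sockets: int,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until the open-socket count is below the limit.

    Returns how many times the function had to sleep before there was room.
    """
    polls = 0
    while count_open_sockets() >= max_open_sockets:
        sleep(poll_interval)
        polls += 1
    return polls
```

A crawler would call this before opening each new connection, so a burst of fast requests pauses until the OS has released enough sockets instead of running into the limit.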

Ideas for a capsule linter

I first heard the idea from gemi.dev: basically a service that also scans capsules for common issues. At the time he proposed that the linter check for invalid gemtext, and I objected, since according to the gemtext spec all inputs are valid. Now I think there are a few subjects that could be interesting to check. Such a service could then provide a centralized place for owners to check their capsules.

Valid robots.txt/security.txt/Atom feeds
Capsules exposing sensitive/compute-intensive endpoints to crawlers
Capsules exposing infinitely extending links

Martin Chang

Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

martin \at clehaxze.tw
Matrix: @clehaxze:matrix.clehaxze.tw
Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df