RE: My common Gemini crawler pitfalls

I opened Cosmos today and saw a reply to my previous post about Gemini crawler pitfalls[1]. In it, the author (I assume) complains about a crawler sending requests to gemini://, which is non-existent.

> Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like `gemini://` or `gemini://`. The former, I can't wrap my brain around how it got that link [4] (and every request comes from the same IP address) ...
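For anyone writing a crawler: correct relative-link handling comes down to RFC 3986 reference resolution against the URL of the page the link appeared on. A minimal Python sketch (illustrative only; my crawler is C++, and the URLs below are made up):

```python
from urllib.parse import urljoin, uses_netloc, uses_relative

# urllib only performs relative-reference resolution for schemes it
# knows about, so register gemini:// first (a common workaround).
for known in (uses_relative, uses_netloc):
    if "gemini" not in known:
        known.append("gemini")

def resolve(base: str, link: str) -> str:
    """Resolve a gemtext => link against the URL of the page it was found on."""
    return urljoin(base, link)

resolve("gemini://example.com/a/b.gmi", "c.gmi")   # gemini://example.com/a/c.gmi
resolve("gemini://example.com/a/b.gmi", "/c.gmi")  # gemini://example.com/c.gmi
```

Forgetting to resolve against the page's own URL (or resolving against the capsule root) is exactly how a crawler ends up requesting links that don't exist.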

Yeah... that's my crawler. I shall fix it. Upon investigation, it seems to be an issue with the capsule itself. For full transparency, I'll share the DB queries and commands that I ran to figure out what's going on. First, let's figure out which page contains a link to the page in question:

tlgs> SELECT url, is_cross_site FROM links WHERE to_url = 'gemini://'
| url                                            | is_cross_site   |
| gemini:// | False           |

Ok, so that invalid link comes from gemini://. Let's see what's causing it.

❯ gmni gemini:// -j once

=> /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1 [4] /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1

Here it is. The capsule itself is providing that link, so there's really not much I can do about it. Or maybe I misinterpreted the robots.txt?

❯ gmni gemini:// -j once
# Following content a mirror of

User-agent: archiver
Disallow: /boston

User-agent: *

It doesn't seem the crawler is disallowed from crawling the link either. The capsule's robots.txt uses the archiver virtual agent from the Robots.txt subset for Gemini[2], which my crawler does honor. But my crawler runs as an indexer agent, so the archiver rule doesn't apply to it. All in all, I think it's behaving as intended.
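To spell out the virtual-agent logic: an indexer should apply rules from its own group and from the catch-all `*` group, but not from `archiver`. A minimal sketch of that check in Python (illustrative; the helper names are mine, and a real parser handles many more edge cases):

```python
def parse_robots(text: str) -> dict:
    """Parse robots.txt into {agent: [disallowed path prefixes]}."""
    groups, agents, in_rules = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:                 # a new record starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key == "disallow":
            in_rules = True
            if value:                    # empty Disallow means "allow all"
                for agent in agents:
                    groups[agent].append(value)
    return groups

def allowed(groups: dict, agent: str, path: str) -> bool:
    """Check the crawler's own (virtual) agent first, then fall back to *."""
    for name in (agent, "*"):
        if name in groups:
            return not any(path.startswith(p) for p in groups[name])
    return True

robots = """User-agent: archiver
Disallow: /boston

User-agent: *
"""
g = parse_robots(robots)
allowed(g, "indexer", "/boston/2008/04/30")   # True: rule targets archiver only
allowed(g, "archiver", "/boston/2008/04/30")  # False
```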

As for the other issue of crawling the non-existent host:

tlgs> SELECT url, is_cross_site FROM links WHERE to_url LIKE 'gemini://'
| url                                               | is_cross_site   |
| gemini://              | True            |
| gemini:// | True            |
| gemini://                     | True            |

It seems several capsules are pointing to that specific host, causing the crawler to try crawling it. I can certainly filter out links coming from known pages, such as search engines, but I can't filter out every link from other people wrongfully pointing to that host. Is there anything I can do in this case?
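The best mitigation I can think of is a host blocklist applied when extracted links are queued. A tiny Python sketch (the hostnames and wiring are made up for illustration; this is not what my crawler does today):

```python
from urllib.parse import urlparse

# Hypothetical blocklist of hosts known to be dead or bogus.
DEAD_HOSTS = {"bad.example.org"}

def should_enqueue(url: str) -> bool:
    """Drop outbound links whose host is on the blocklist or missing."""
    host = urlparse(url).hostname
    return host is not None and host not in DEAD_HOSTS

should_enqueue("gemini://bad.example.org/page.gmi")  # False
should_enqueue("gemini://good.example.org/")         # True
```

This only helps for hosts you already know are bad; it doesn't solve the general problem of third parties linking to garbage.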

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes I do AI research. Chronic VRChat addict.
  • marty1885 \at
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df