RE: My common Gemini crawler pitfalls

I opened Cosmos today and saw a reply to my previous post about Gemini crawler pitfalls[1]. In it, the author (I assume) complains about a crawler sending requests to gemini://, which is non-existent.

> Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like `gemini://` or `gemini://`. The former, I can't wrap my brain around how it got that link [4] (and every request comes from the same IP address) ...
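For anyone writing a crawler: correct relative-link handling comes down to RFC 3986 reference resolution against the URL of the page the link appeared on. A minimal Python sketch (illustrative only; my crawler is C++, and the URLs below are made up):

```python
from urllib.parse import urljoin, uses_netloc, uses_relative

# urllib only performs relative-reference resolution for schemes it
# knows about, so register gemini:// first (a common workaround).
for known in (uses_relative, uses_netloc):
    if "gemini" not in known:
        known.append("gemini")

def resolve(base: str, link: str) -> str:
    """Resolve a gemtext => link against the URL of the page it was found on."""
    return urljoin(base, link)

resolve("gemini://example.com/a/b.gmi", "c.gmi")   # gemini://example.com/a/c.gmi
resolve("gemini://example.com/a/b.gmi", "/c.gmi")  # gemini://example.com/c.gmi
```

Forgetting to resolve against the page's own URL (or resolving against the capsule root) is exactly how a crawler ends up requesting links that don't exist.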

Yeah... that's my crawler. I shall fix it. Upon investigation, it seems to be an issue with the capsule itself. For full transparency, I'll share the DB queries and commands that I ran to figure out what's going on. First, let's figure out which page contains a link to the page in question:

tlgs> SELECT url, is_cross_site FROM links WHERE to_url = 'gemini://'
| url                                            | is_cross_site   |
| gemini:// | False           |

Ok, so that invalid link comes from gemini://. Let's see what's causing it.

❯ gmni gemini:// -j once

=> /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1 [4] /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1

Here it is. The capsule itself is providing that link, so there's really not much I can do about it. Or maybe I misinterpreted the robots.txt?

❯ gmni gemini:// -j once
# Following content a mirror of

User-agent: archiver
Disallow: /boston

User-agent: *

It doesn't seem the crawler is disallowed from crawling the link either. The capsule's robots.txt uses the archiver virtual agent from the Robots.txt subset for Gemini[2], which my crawler does honor. But my crawler runs as an indexer agent, so the archiver rule doesn't apply to it. All in all, I think it's behaving as intended.
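To spell out the virtual-agent logic: an indexer should apply rules from its own group and from the catch-all `*` group, but not from `archiver`. A minimal sketch of that check in Python (illustrative; the helper names are mine, and a real parser handles many more edge cases):

```python
def parse_robots(text: str) -> dict:
    """Parse robots.txt into {agent: [disallowed path prefixes]}."""
    groups, agents, in_rules = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:                 # a new record starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key == "disallow":
            in_rules = True
            if value:                    # empty Disallow means "allow all"
                for agent in agents:
                    groups[agent].append(value)
    return groups

def allowed(groups: dict, agent: str, path: str) -> bool:
    """Check the crawler's own (virtual) agent first, then fall back to *."""
    for name in (agent, "*"):
        if name in groups:
            return not any(path.startswith(p) for p in groups[name])
    return True

robots = """User-agent: archiver
Disallow: /boston

User-agent: *
"""
g = parse_robots(robots)
allowed(g, "indexer", "/boston/2008/04/30")   # True: rule targets archiver only
allowed(g, "archiver", "/boston/2008/04/30")  # False
```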

As for the other issue of crawling the non-existent host:

tlgs> SELECT url, is_cross_site FROM links WHERE to_url LIKE 'gemini://'
| url                                               | is_cross_site   |
| gemini://              | True            |
| gemini:// | True            |
| gemini://                     | True            |

It seems several capsules are pointing to that specific host, causing the crawler to try crawling it. I can certainly filter out links coming from known pages, such as search engines, but I can't filter out every link from other people wrongfully pointing to that host. Is there anything I can do in this case?
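The best mitigation I can think of is a host blocklist applied when extracted links are queued. A tiny Python sketch (the hostnames and wiring are made up for illustration; this is not what my crawler does today):

```python
from urllib.parse import urlparse

# Hypothetical blocklist of hosts known to be dead or bogus.
DEAD_HOSTS = {"bad.example.org"}

def should_enqueue(url: str) -> bool:
    """Drop outbound links whose host is on the blocklist or missing."""
    host = urlparse(url).hostname
    return host is not None and host not in DEAD_HOSTS

should_enqueue("gemini://bad.example.org/page.gmi")  # False
should_enqueue("gemini://good.example.org/")         # True
```

This only helps for hosts you already know are bad; it doesn't solve the general problem of third parties linking to garbage.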

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes I do AI research. Chronic VRChat addict.
  • marty1885 \at
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df