RE: Gemini as a fertile frontier for hacking

gemi.dev has an interesting short article that I encourage people to read. For me, as a search engine developer at least, it poses some important questions.

gemi.dev - Gemini Space for hacking

As of January 2022, there’s only around 200,000 pages across 1200 domains. This is small enough to be manageable, but large enough to be interesting to work on. To put things in perspective, Gemini space is roughly the size of WWW in 1992 right now.

This is totally true. The fact that Gemini is the size of the early Internet is a blessing for developers. Commoncrawl[1] is a massive crawling effort on the common web, which resulted in 73.5TB of COMPRESSED crawled data as of January 2022. Meanwhile the entirety of TLGS is about 850MB uncompressed, including the index. Gemini is much easier to work with and gets you something tangible quicker. Developers are not forced to deal with boring database scaling before the fun stuff. It is also much cheaper! Storing 74TB of data alone is no joke, not to mention the actual index.

[1]: Common crawl January dataset stats

I, too, believe Gemini is a breeding ground for the kind of tech that the FOSS community has lost since the early internet. I want to put forward some of my thoughts on his questions, though not necessarily answer them.

Searching Gemini: This is a big one. We have a few search engines, but the results can be hit or miss. How can we improve this? - gemi.dev

I can't speak for geminispace.info (GUS)[2] or AuraGem[3], but here are a few things I found while developing TLGS[4]:

Link density on Gemini is very low
A capsule can live on multiple different domains
General lack of content

Most search engines use a combination of text ranking and link analysis to rank pages against a search query. In particular, geminispace.info uses some variant of PageRank and TLGS uses SALSA. As the name "link analysis" suggests, their performance depends on the quality of links between pages. According to Commoncrawl, the average in-degree on the common web is ~3.6[5], while the average on Gemini is ~0.4549 (based on TLGS's index; this is higher than the actual value because TLGS does not index documents, RFCs, etc.). That means on average a page on the common web has 3.6 pages pointing to it, while a page on Gemini has only 0.45. This makes link analysis algorithms provide less information on Gemini.
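To make the in-degree figure concrete, here is a minimal Python sketch of how such an average could be computed from a crawl. The function name and the counting rules (ignoring duplicate links, self-links and links to pages outside the index) are my assumptions for illustration. I am not claiming this is exactly how Commoncrawl or TLGS count.

```python
def average_in_degree(pages, links):
    """Average number of distinct inbound links per indexed page.

    pages: iterable of page URLs in the index.
    links: iterable of (source_url, target_url) pairs found while crawling.
    """
    pages = set(pages)
    # Assumption: count each (source, target) pair once, and ignore
    # self-links and links that leave the set of indexed pages.
    unique_links = {
        (src, dst)
        for src, dst in links
        if src != dst and src in pages and dst in pages
    }
    return len(unique_links) / len(pages) if pages else 0.0
```

Since every link has exactly one target, the average in-degree is simply the number of counted links divided by the number of pages.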

My guess is there are two factors in play. First, Gemini doesn't have inline links, so people are less generous with linking. Second, Gemini has many more Gemlogs compared to documents, discussion boards, Q&A, etc. There's nothing wrong with that, but Gemlogs are more about expressing thoughts and link to fewer outside pages.

[2]: geminispace.info - Gemini Search Engine

[3]: AuraGem Search Engine

[4]: TLGS - "Totally Legit" Gemini Search

[5]: Commoncrawl web graph statistics 2019-2022

Another thing complicating search is duplicated content: capsules served on multiple domains and/or proxied as subfolders of another capsule. For example, both gemini://kngsly.io/discord-rise-and-bad-privacy-practices and gemini://kngsly.smol.pub/discord-rise-and-bad-privacy-practices point to the exact same document and are likely owned by the same user. If the two pages were popular, with a lot of backlinks, they would have a higher score if link analysis could attribute one's backlinks to the other. But that would be unfair if one of them is stolen content. I can never be sure whether someone cloned the contents of another capsule, and the two may share an IP address simply because of shared hosting. There's no real way to tell them apart.
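Finding the duplicates themselves is the easy part. A minimal sketch, assuming the crawler keeps each page's raw body as bytes (group_duplicates is a hypothetical helper, not something from TLGS):

```python
import hashlib
from collections import defaultdict


def group_duplicates(pages):
    """Group URLs whose bodies are byte-for-byte identical.

    pages: dict mapping URL -> response body (bytes).
    Returns a list of URL groups, each with at least two members.
    """
    by_digest = defaultdict(list)
    for url, body in pages.items():
        by_digest[hashlib.sha256(body).hexdigest()].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]
```

This only says that two URLs serve the same bytes. It says nothing about which one is the original or whether they share an owner, which is exactly the hard part described above.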

Finally, as of writing this Gemlog, Gemini is empirically mostly Gemlogs. No doubt they are very valuable content. But I'd argue that most of the time people search for a solution or a discussion thread, like an answer on StackOverflow or a discussion on unix.com, which Gemini is lacking. On Gemini we have station[6], Geddit[7] and ♊︎ch[8]. To me these are social sites like Reddit and 4chan/2ch, which people rarely search for.

[6]: station - where capsuleers hang out

[7]: Geddit

[8]: ♊︎ch

Besides the general reasons inherent to geminispace, there are some shortcomings in TLGS itself.

TLGS's recent switch from HITS to SALSA should provide a considerable bump in search quality. Common search terms like "gemini", "privacy" and "directory" show better ranking. gemini.circumlunar.space is finally the first link when searching for gemini. If link density increases as Gemini grows - for example, if Gemini-only news sites pointing to other capsules or forums start to pop up on Gemini - it's possible we could jump from a 2000s algorithm to 2007 by adding HillTop to the mix, which according to the Hilltop paper should drastically increase search quality.
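For readers who haven't met SALSA, here is a textbook-style sketch of its authority scores in Python. This is not TLGS's code. The function name, the fixed iteration count and running it over a whole edge list (rather than a query-specific subgraph) are assumptions for illustration. The idea is a random walk that steps backward from a page to one of the pages linking to it, then forward along one of that page's outgoing links.

```python
from collections import defaultdict


def salsa_authority(edges, iterations=50):
    """Approximate SALSA authority scores by power iteration.

    edges: iterable of (source, target) links.
    Returns a dict mapping each linked-to page to its score (sums to 1).
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    authorities = list(in_links)
    # Start from a uniform distribution over pages that have inbound links.
    score = {a: 1.0 / len(authorities) for a in authorities}

    for _ in range(iterations):
        # Backward step: each authority splits its mass among its in-links.
        hub_mass = defaultdict(float)
        for a, s in score.items():
            share = s / len(in_links[a])
            for h in in_links[a]:
                hub_mass[h] += share
        # Forward step: each hub splits its mass among its out-links.
        nxt = defaultdict(float)
        for h, m in hub_mass.items():
            share = m / len(out_links[h])
            for a in out_links[h]:
                nxt[a] += share
        score = {a: nxt[a] for a in authorities}
    return score
```

On the toy graph [("a", "c"), ("b", "c"), ("a", "d")], page c ends up with about 2/3 and d with about 1/3. Within a connected group of pages the scores end up proportional to in-degree, which is why a sparse link graph leaves the algorithm so little to work with.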

I'd also like to replace PostgreSQL's full text index with ElasticSearch or Manticore. PostgreSQL FTS is OK, but leaves a lot of room for improvement: no fuzzy search, slow text highlighting, and no BM25 or even TF-IDF scoring. Though I have to say the ts_rank_cd function is quite good at short queries.

How interconnecting are capsules and pages? How smol is this smol space? - gemi.dev

I don't have data on the web of 1992, but I imagine Gemini is much less interconnected, especially given that there is no real academic interest in it, whereas the WWW was designed for scientists to share data. However, I think this, along with the others, is a good question to write a paper on. At least it's something I can pass to my future employers and make them think I published something 😛

Capsule linter: Crawling gemini I see a lot of problems with various capsules. Broken links, bad Mimetypes, invalid gemtext, content in other languages missing "lang=" attributes, etc. - gemi.dev

I think invalid gemtext doesn't exist per se? The spec basically says that if a line can't be parsed as a special line, it's plain text. The broken links one is interesting.

How do we notify people to update/repair broken links?
Should search engines ignore broken links completely?

People join and leave Gemini constantly. We can't expect someone to keep a server up forever. However, content loss is a big issue, especially while Gemini is young. Losing Gemini-only content is even sadder, as we will never get it back. But sometimes a link is only broken because the capsule was restructured and the owner didn't bother to put a redirect response at the old location. Is there a way to encourage proper redirection?

The other is important to search engines. Crawling broken links sometimes takes more time than crawling actual files. Some capsules hang during the SSL handshake, so crawlers have to wait until timeout, degrading performance. But it's also unreasonable to blacklist an entire capsule once it has disappeared. What if the owner decides to rejoin Gemini sometime later? TLGS has a strike system: if the SSL handshake or content transfer times out 3 times during a crawl, the domain is ignored for the rest of that crawl. This still has downsides. We still waste time on the 3 timeouts, and maybe the owner just has a slow CGI script that they forgot to block. There must be a better method.
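The strike system boils down to a per-domain counter that lives for one crawl. A rough Python sketch of the idea follows. TLGS itself is not written in Python, and the class and method names here are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


class StrikeList:
    """Per-crawl timeout counter; create a fresh one for every crawl."""

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = Counter()

    @staticmethod
    def _host(url):
        # Strikes are tracked per hostname, not per URL.
        return urlsplit(url).hostname or ""

    def record_timeout(self, url):
        """Call when a handshake or content transfer for url timed out."""
        self.strikes[self._host(url)] += 1

    def should_skip(self, url):
        """True once the URL's domain has used up its strikes."""
        return self.strikes[self._host(url)] >= self.max_strikes
```

Because the counter is thrown away at the end of the crawl, a capsule that comes back online gets picked up again next time. The cost is still paying for up to three timeouts per dead domain on every crawl.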

Personally I'm hacking on:

How can I scale TLGS to handle the common web?
How do I compare Gemini to the early internet?
What's Gemini's use case for non-technical users?
How could we make client certificates more user friendly?

And ideas in my head:

Move TLGS from PostgreSQL to ElasticSearch or Manticore for Full Text index
How do we solve expired certificates on Gemini?
Gemini users come and go. I want to archive the last good state of each capsule.

Martin Chang

Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

martin \at clehaxze.tw
Matrix: @clehaxze:matrix.clehaxze.tw
Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df