The design of TLGS's crawler

TLGS is a search engine. Crawling is a very important part of it's oprtation. A crawler must follow links that it sees, retrive the page and follow more links on that page. While writing a basic crawler with DFS is relatively easy. Implementing a crawler that handles concurrent crawls well, runs on multiple cores, caches reusable metadata with minimal overhead is, hard.

Besides performnace, there's other features that I want my crawler to have

  • Can run across multiple machines, distributed
  • No single point of failure
  • Index and crawl list is stored on SQL database

The simplest crawler possible

Ignoring all the performance and avaliablity goals. Just build the most basic crawler then extend it later. A recursive crawler that crawls one page at a time.

Presumably it works something like this:

  • Query a page from a page list to crawl from the DB
  • Crawls and parses said page. Update full text index
  • For each link on that page. Check against the robots.txt of that domain.
  • If not blocked. Add it to the page list
  • Repeat until no more page to crawl

Developers with some DB experience could implement this in an afternoon. And it is actually enough for Geminispace - About 20K pages, ignoring robots.txt [1]. Believe or not. Pre-public versions of TLGS runs on this kind of carwler. Finishes around an entire day of work (depend on where you are. most capsules located in EU and US). IIRC geminispace.info/GUS' crawler still works in this fashion.

TLGS's crawler

The basic design works. And I'm a beliver of "Ain't broke, don't fix". But what if Gemini exploides in popularity someday? Besides, there's more and more Gemini capsules comming online. And existing capsules are adding more and more pages. It might a problem of time until it takes the basic crawler more than 3 days to update the index - the time between TLGS reindexing. So I decided to go to the extreme. Make it as scalable as it can. That means three tings the crawler much handle.

  • concurrent connections on a single thread.
  • Multiple threads on a single machine.
  • Distributable onto multiple machines. (Ignoring DB scalablity. It's not a part of the crawler)

Archicutring the crawler

My first idea is to build a microservice of crawl masters and workers. The masters maintains a list of pages to crawl and workers polling master to task. Reporting when they have data ready. This sounds scalable as the hard work is done by the workers while master is basically a glorifed interface to the DB. However, 5 minutes of thinking detures me. First of all, it is complicated. I had to setup a communication channel between the masters and workers. Which never goes well without detailed design. Sedondly, it's either complicated or moot. Either I implement some fallabck mechanism so workers uses backup masters in case of the master crashed. Or masters each maintain their own list and have consensus to avoid duplicated work. In the latter case, there's no difference between a master-worker and a bunch or workers.

So I decided on the latter. To just have multiple workers and let them reach consensus. I'm not going to write my own Raft or Paxos. Instead, since we can fetch a large list of pages to crawl per batch, from the DB, thus a low frequency of queries. It is OK to just realy on DB transactions to avoid data race. Each crawler has it's own buffer of pages to crawl. When that buffer is empty, crawler queries the DB for N pages and mark them as taken. Other crawlers are not allowed to take the taken pages for 5 minutes.

This design has some properties I really like. First, it is self healing. No action is needed when a crawler crashed. The pages are avaliable for taking again after 5 minutes. Other crawlers will pick them up eventually. Sedondly, there only one lock during transaction. That lock is very short lived and could be minigated by randomizing crawling order. It's unlikely a transaction conflict would occur. (Though I haven't implement randomizing yet)

Lock free crawl dispatching

Turns out fully concurrent job dispatch is a very tricky problem. I want the crawler to be mutex-free for maximal scalablity. While solving the following problems:

  • Page crawls can run on multiple threads
  • At most N concurrent page crawls can go on at any single time
  • Must dispatch as many pages as possible
  • Cannot terminate early
  • It's possible for a page to submit new pages to crawl

The simple for loop approach simply won't work due to variable lifetime and C++ coroutines running upon await. A recursive approach is needed.

void dispatch() {
    static atomic<size_t> counter = 0;
    static atomic<bool> ended = false;
    if(ended)
        return;

    if(counter < N) {
        counter++;
        // call dispatch again to generate more jobs
        dispatch();

        async_run([&counter](){
            bool no_more_pages = crawlPage();
            counter--;
            if(no_more_pages && counter == 1)
                ended = true;
            if(no_more_pages)
                return;

            // we are done. regenerate jobs
            dispatch();
        })
    }

}

It works like this. A atomic counter keeps track of how many jobs are running. When it reaches N, it stops generating jobs. Decrementing when a job finishes. And we exponencially recursively call dispatch to generate more jobs. Until counter reaches N. Then when each job finishes, it calls dispatch again to generate more jobs.

Next, when there's no more pages to crawl, we don't nessarily know crawling ended yet. It's possible one of the other page crawls could discover more pages. We only set the ended flag when we are the last crawler running. Finally we reject all future dispatches.

Superising this algorithm scales to multiple machines. The algorithm itself is still lock free and stateless. No special care is needed for spinning up more machines. It does not track ongoing jobs on other machines. Thus potentially terminating early when there could be more task. But that's a rare enough problem when I need to scale to multiple machines.

Dealing with too many sockets

One problem I run into soon being too many sockets. Unlike HTTP, Gemini does not allow reuse of connections. So connections become CLOSE_WAIT upon server finish sending all packets. TLGS can start processing the page as soon as all packets arrive and the FIN packet is received. However the socket is in the CLOSE_WAIT state while all this happens. The OS is waiting for peer acknowledge the FIN packet. Turns out TLGS is so fast that sockets are not released fast enough. And most distros have a limit of ~1000 open sockets per process. Soon TLGS tries open a new socket for a page. Fails. Trantor (the network library I use) aborts because that's faster than C++ exceptions.

I just workarounded this by checking the count of opened sockets periodically. And literarly wait until it's lower than the limit.

Detecting ASCII arts

Gemini does not support inline images. Instead people use ASCII art for decoration. But they mess up the search preview and index. I have do my best to remove them.

Luckily there are obvious patterns in most ASCII arts. First is covered by W3's detection algorithm 2A[1] (wow, I didn't expect W3 would be helpful on Gemini). Second is generally people put ASCII art at the very beginning of the page. Thus I can simply remove all preformatted blocks at the very begging of the page.

  • One or more occurrences of 4 or more same characters consecutively.

That handled 95% of the ASCII arts. The rest has to be dealt with manually. A blacklist is used to detect characters that definitely is not code. Like ⣿ and ☆. And specific patters are also blacklisted. This got me up to ~99% detection rate. Which is good enough.

Result

With all these design and work. TLGS crawler can crawl more than 200 pages per second on 2 slow VPS CPUs. I'm very happy with the result. The design should scale to much more machines. Let's get Gemini popular and stress test it 😀

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.


  • marty1885 \at protonmail.com
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df