Doing better than DuckDuckGo, some ideas

These are my thoughts on Drew's article[1], in which he explains his ideas for an ideal, open search engine. I want to share my thoughts too. Hopefully I'm qualified to write about this. I am not a researcher on related topics, but I did create TLGS. Hopefully some of these thoughts can help people realize an open search engine. I highly recommend reading his post.

[1]: Drewdevault - We can do better than DuckDuckGo

Ranking, data model, scaling and accessibility

A search engine must be very scalable to index and query data from the entire internet. Some solutions come to mind: microservices, data sharding and aggressive caching. These are great, but they also make the system very difficult for individual developers to set up, which in turn inhibits people from contributing to the project. That puts a limit either on the complexity of the project or on how many people can help.

Database design wise, I think sharding based on the URL makes the most sense. An index stores 3 main pieces of information: the full text, the backlinks to a page and the links that exist in said page. Something like the following pseudo code in C++:

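#include <string>
#include <vector>

// One indexed page
struct IndexData {
    std::string url;                     // URL of the page, also the sharding key
    std::string content_text;            // full text of the page
    std::vector<std::string> links;      // links that exist in this page
    std::vector<std::string> backlinks;  // pages that link to this page
};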

Luckily a search engine is very tolerant of data desync and inconsistencies, and the data is very simple. The only part that needs care is the backlinks. Backlinks are links on other pages that point to the current page. In SQL this would be a subtable that gets JOIN-ed for each search query, which prevents sharding the DB. The solution is to treat SQL like NoSQL. Instead of aggregating data, we store backlinks as HSTORE (PostgreSQL) or another key-value pair type. Before re-indexing a page, we go through its existing links and remove the page from the backlink list of each link target. Or we could just use NoSQL with some external full text search system. It scales well both ways. The search engine just queries every shard to get the full result. Ideally there's also a memory cache sitting in front of each SQL shard to cache recent, common results.
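
To make the sharding and backlink bookkeeping above a bit more concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: `PageRecord`, `ShardedIndex`, `shard_for`, `index_page` and `search` are hypothetical names, a `std::unordered_map` stands in for one database shard, and a naive substring match stands in for a real full text search. It is not how TLGS or any real engine does it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical record for one page; sets make backlink removal cheap.
struct PageRecord {
    std::string content_text;
    std::unordered_set<std::string> links;      // outgoing links
    std::unordered_set<std::string> backlinks;  // pages linking here
};

// Stand-in for one database shard, keyed by URL.
using Shard = std::unordered_map<std::string, PageRecord>;

struct ShardedIndex {
    std::vector<Shard> shards;

    explicit ShardedIndex(std::size_t shard_count) : shards(shard_count) {
        assert(shard_count > 0);
    }

    // The shard depends on the URL alone, so any node can route a request.
    // Assumption: std::hash is good enough here. It is not stable across
    // builds, so a real deployment would need a fixed hash function.
    Shard& shard_for(const std::string& url) {
        return shards[std::hash<std::string>{}(url) % shards.size()];
    }

    // (Re-)index a page: first retract the backlinks created by its old
    // outgoing links, then store the new data and register new backlinks.
    void index_page(const std::string& url, const std::string& text,
                    const std::vector<std::string>& new_links) {
        const std::unordered_set<std::string> old_links = shard_for(url)[url].links;
        for (const auto& target : old_links)
            shard_for(target)[target].backlinks.erase(url);

        PageRecord& page = shard_for(url)[url];
        page.content_text = text;
        page.links.clear();
        page.links.insert(new_links.begin(), new_links.end());
        for (const auto& target : page.links)
            shard_for(target)[target].backlinks.insert(url);
    }

    // Query every shard and merge the hits.
    std::vector<std::string> search(const std::string& term) const {
        std::vector<std::string> hits;
        for (const auto& shard : shards)
            for (const auto& entry : shard)
                if (entry.second.content_text.find(term) != std::string::npos)
                    hits.push_back(entry.first);
        return hits;
    }
};
```

Note that no step needs a JOIN: each write touches one record per link target, and each of those records lives on whichever shard its own URL hashes to.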

Ranking (link analysis) is a much more difficult problem to solve. There are 2 kinds of ranking algorithms. The first kind takes in the entire crawled database and generates a single rank for every site. The famous PageRank, used by Google in its early days, is an example. You have to either have a very good computer to run them or be able to distribute them efficiently. With a dataset as large as the entire WWW, it's unlikely that the network graph fits inside a single machine, unless you've got a fully specced IBM mainframe, which is ultra expensive and doesn't run an open source OS (it runs z/OS or z/VM, and you run Linux on top). Distributing PageRank is also hard. PageRank is an O(M*N) algorithm per iteration, where M is the average number of links on a page and N is the number of pages. It iterates through every page in the index and updates each page's ranking based on the pages connected to it. Rinse and repeat until the ranking converges.
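
The iterate-until-converged loop can be sketched as below. This is a textbook-style power iteration, not Google's actual implementation. The damping factor of 0.85, the tolerance, the iteration cap and the even redistribution of rank from pages without outgoing links are all assumptions on my side, and the function name `pagerank` is made up for this example. The whole graph is held in memory, which is exactly the part that stops working at WWW scale.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// links[i] lists the pages that page i links to, as indices into `links`.
// Returns one rank per page; the ranks sum to 1.
std::vector<double> pagerank(const std::vector<std::vector<std::size_t>>& links,
                             double damping = 0.85, double tolerance = 1e-9,
                             int max_iterations = 100) {
    assert(damping > 0.0 && damping < 1.0);
    const std::size_t n = links.size();
    if (n == 0) return {};

    std::vector<double> rank(n, 1.0 / n);
    for (int iter = 0; iter < max_iterations; ++iter) {
        // Every page starts the round with the "random jump" share.
        std::vector<double> next(n, (1.0 - damping) / n);
        double dangling = 0.0;  // rank held by pages without outgoing links

        // One pass over every link: this is the O(M*N) part.
        for (std::size_t page = 0; page < n; ++page) {
            if (links[page].empty()) {
                dangling += rank[page];
                continue;
            }
            const double share = damping * rank[page] / links[page].size();
            for (std::size_t target : links[page]) next[target] += share;
        }

        // Spread the dangling rank evenly and measure how much changed.
        double change = 0.0;
        for (std::size_t page = 0; page < n; ++page) {
            next[page] += damping * dangling / n;
            change += std::abs(next[page] - rank[page]);
        }
        rank = std::move(next);
        if (change < tolerance) break;  // converged
    }
    return rank;
}
```

Each round needs the previous rank of every page that links in, so a distributed version has to exchange rank values between machines on every iteration.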

The second class of algorithms runs on a subset of the network graph: the pages that match the search query. HITS and SALSA are examples of this kind. They have the downside that their results depend on the search query, so they have to be re-evaluated every time. Even though they are also O(M*N), they tend to scale better to larger graphs as the memory requirements are lower, but at the cost of much more compute per search.
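
For comparison, here is a minimal sketch of HITS over such a query subgraph. It assumes the pages matching the query have already been collected and renumbered from 0 to n-1. The fixed iteration count and the names `HitsScores`, `normalize_l2` and `hits` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct HitsScores {
    std::vector<double> authority;  // high when good hubs link to the page
    std::vector<double> hub;        // high when the page links to good authorities
};

// Scale a score vector to unit Euclidean length; an all-zero vector is left alone.
static void normalize_l2(std::vector<double>& v) {
    double sum_sq = 0.0;
    for (double x : v) sum_sq += x * x;
    if (sum_sq == 0.0) return;
    const double norm = std::sqrt(sum_sq);
    for (double& x : v) x /= norm;
}

// links[i] lists the pages (indices within the subgraph) that page i links to.
HitsScores hits(const std::vector<std::vector<std::size_t>>& links,
                int iterations = 50) {
    assert(iterations > 0);
    const std::size_t n = links.size();
    HitsScores s{std::vector<double>(n, 1.0), std::vector<double>(n, 1.0)};

    for (int iter = 0; iter < iterations; ++iter) {
        // Authority update: sum the hub scores of the pages linking in.
        std::vector<double> authority(n, 0.0);
        for (std::size_t page = 0; page < n; ++page)
            for (std::size_t target : links[page])
                authority[target] += s.hub[page];
        normalize_l2(authority);

        // Hub update: sum the new authority scores of the pages linked to.
        std::vector<double> hub(n, 0.0);
        for (std::size_t page = 0; page < n; ++page)
            for (std::size_t target : links[page])
                hub[page] += authority[target];
        normalize_l2(hub);

        s.authority = std::move(authority);
        s.hub = std::move(hub);
    }
    return s;
}
```

Only the subgraph has to fit in memory, but the whole loop runs again for every query, which is the compute cost mentioned above.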

There's a 3rd class of algorithms: a hybrid between link analysis and content ranking, Hilltop being one of them. They tend to provide better search results given a good content ranking. Currently Google uses it for their search results. However, I don't expect an open source search engine to adopt such an algorithm. Content ranking is by its nature very prone to bias and misinterpretation. It also likely requires real humans to classify sites before the index is usable by the search engine, which is not something an open source project is likely to be able to do. Personally, I see an open source project adopting HITS or SALSA initially, then transitioning to an optimized PageRank when it can afford the machines to run it.

There's no algorithm of truth

One thing I have observed is that the general public, i.e. not technically minded people, have no clue about OpSec and trust. They basically trust whatever they see on the internet. Instead of treating the internet and search engines as tools, something that is used by humans, they more or less treat search engines as an oracle. Send some question to Google and out comes the answer. Fewer and fewer people are applying critical thinking skills to what they find on the internet. I don't blame them. The internet is unthinkably huge and companies have put in so much effort to make people's experience on the internet easy and seamless.

However, this also causes a lot of the internet censorship that we, the FOSS and InfoSec community, absolutely hate. Personally, even though I disagree with far left/right websites, I don't think search engines should blacklist them and make them vanish from search results. It's well known in the IT world that hiding a problem always leads to a bigger problem down the line. That's why we publish CVEs and proactively patch bugs. Instead of creating a safe space, I think we should allow ideas to propagate and let people decide which is right and which is not. That also includes the idea that we should censor online activities. This is, in my eyes, a part of how we got the YouTube demonetization debacle. Kids also watch YouTube. When they see "adult content" (which is likely just deep discussions on different topics) on YouTube, parents complain. Then YouTube is forced to discourage creators from making those kinds of content.

In short, in my opinion, companies censor the internet because people have put too much trust into the internet. And there's no algorithm, not even with infinite computing power, that can determine whether something is trustworthy/appropriate or not. The best companies can do is to throw humans at the problem and ask them to decide which sites are too dangerous to show, which introduces bias into the system. How an open source search engine would approach this is unknown. I would leave everything uncensored. People should not look at what they don't want to see. In case they do, they have important lessons to learn. It's a bug of the user, not a bug in the search engine.

However, I bet my idea is not going to fly when the search engine gets sued in court for stupid reasons.

Who's going to pay for it?

This is the literal million dollar question. Paid search results and ads obviously just ain't going to work. Maybe paid API access? I don't think so. General search APIs that provide knowledge are kinda worthless in terms of monetary value. In my experience, most companies want news and site searches to monitor their public relations. I don't have any good idea here. It's very tricky to keep open access, privacy preservation and revenue at the same time. Nor are VCs going to fund such a project. Google is much better for what they want, and they have the money to get Google to do what they want anyway.

We can't beat Google (and Bing) in search quality

Should go without saying. Google and Microsoft are big corporations that hire good engineers. Unlike most stories where open source developers beat companies because of the sheer quantity of talent and passion and how shitty the company is, Google and Microsoft also have the most talented developers working on their search engines. On top of that, they can pay for the engineering effort, while an open source search engine can mostly rely only on contributions, and maybe one or two full-time developers.

Besides the wide gap in human resources, Google and Microsoft collect user data to improve their search results. It is very invasive of privacy, but it does allow the collectors to understand their users, to know what they want and implement it. Surveys can't hope to match the detail that web data collection can achieve. I don't know a way we can improve on this. It's a tradeoff we made early on to build something that respects privacy. And this is the exact reason why companies are collecting as much user data as they can.

Drew, count me in if you decide to build your search engine.

Martin Chang

Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

martin \at clehaxze.tw
Matrix: @clehaxze:matrix.clehaxze.tw
Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df