The stupid socket file descriptor leak in Dremini

TLGS'[1] crawler has had a crashing issue from the very beginning. The crawler tries to open too many sockets at the same time and dies with the error message "sockets::createNonblockingOrDie: too many open sockets". The crawl dispatcher already limits how many concurrent crawls can happen at any time. At first I thought it was because I made the dispatcher lock-free and screwed it up, but no edge cases were discovered after days of simulation and testing. So I made a compromise: the crawler counts open sockets periodically and asynchronously waits for some to close before continuing. Just so I can still claim everything is lock-free. That solved the crashes, but in return the crawler hangs after some time. Apparently some sockets never close, so the count never drops below the threshold needed to continue crawling. It happened much less often, so I lived with it and just restarted the crawler every time it hung.

That lasted until last weekend. It was a long weekend and I had a lot of free time on hand. This was bound to be fun...

Background

A crawler's job is to grab a page, index it and find all the links on the page. Follow the links. And repeat. TLGS is a search engine and its crawler works the same way. But instead of speaking HTTP(S), it crawls pages served over the Gemini protocol[2]. The entirety of TLGS is written in C++ using the Drogon web application framework[3] and the Dremini plugin[4], both of which are based on Trantor, the async TCP network library[5].

To be clear: C and C++ are known to be languages that cause trouble, because of memory issues. That is not the case here. As you'll see later, the bug it ended up being can just as well happen in Rust, Lisp or any other language.

The first attempt

It had been about 5 months since I last looked at this, so I started by gathering some information about the issue. I knew the hang was caused by us waiting for sockets to close while the count never dropped below the threshold. Just to sanity-check my hypothesis, I ran the crawler (in production, hehe 😥) until it hung. Then I ran ls -l /proc/<crawler pid>/fd | grep socket | wc -l and saw the number of open file descriptors was high. I also ran netstat -nap | grep tlgs_crawler and saw most of the sockets were in the CLOSE_WAIT state.

Ok.. I confirmed there were excessive open sockets. And the remote hosts had finished transmitting and sent their FIN packets. CLOSE_WAIT means the peer has closed its end and the kernel is waiting for us to close() ours, which points at descriptors we never release.

Since there's definitely a descriptor leak, I brought out the usual first-aid kit I use in most cases: AddressSanitizer and CppCheck. Unfortunately, although ASan was able to find some memory leaks, no file descriptor leaks were detected. Nor could CppCheck find any known patterns of resource leaks. Maybe some UB caused the leak and what I needed was UBSan? Ran that, still no UB found.

At this point I had no tangible directions to go on. I started blindly attaching GDB to the hung crawler, trying to see what was going on. But the async nature of the crawler makes it hard to debug. Most of the time the crawler is blocked in epoll_wait, waiting for events.

The second attempt

I was going nowhere with debugging tools. Let's go back to the fundamentals. trantor::Socket is governed by RAII from trantor::TcpConnection through a unique_ptr. So either the TcpConnection failed to be released, or there's a code path that releases the socket without calling close() on the file descriptor. The latter is much less likely, as RAII is very difficult to screw up; and I checked multiple times. I added debug prints to socket construction and destruction to record the descriptor number. I ran it (limited to one concurrent connection max) and got logs like this:

Socket created with fd: 11
Processed gemini://tlgs.one/
Socket created with fd: 12
Processed gemini://tlgs.one/known-hosts
Destroyed socket with fd: 11
Destroyed socket with fd: 12

This. Makes no sense. Assuming I didn't screw up the RAII and coroutine implementation, the socket should have been destructed right after we got a response. My first instinct was that I missed some edge case in Drogon's Task<T> implementation. Maybe symmetric transfer[1] is broken? Drogon's coroutine concept is designed to be compatible with CppCoro's, so I swapped out the entire implementation. Nope, same issue using CppCoro. Maybe I needed symmetric transfer in the SQL and Gemini response awaiters? Hours were spent trying that. Turns out that's a) impossible and b) makes no sense. Callbacks in Drogon are, in more than 99% of cases, called from the event loop itself. They are bound to have a very shallow call stack, thus no chance of stack overflow, thus symmetric transfer is not needed.

At this point, in my mind, either something subtle was broken somewhere, or both Drogon and CppCoro had the same bug. The latter is very unlikely. I decided to look for a pattern before further debugging. I wrote an awk script to comb through the crawler logs and list the descriptors that failed to close.

22
24
36
40
61
...

That doesn't look like a pattern. Again I thought maybe I screwed up the crawl dispatcher. I did try to be a cool kid and make it wait-free. I updated the client spawning logic to store all active clients in a map so I could inspect them with GDB. Ran the new code and waited for the hang. 0 active clients. Woohoo! I was wrong yet again.

Third time's the charm

Dremini's Gemini client is quite small. Only a few hundred lines. Out of desperation, I copied the entire client into Drogon's repo so I could connect the internal workings together and get more insight. I also added a tight loop to see when the socket is destroyed:

for(;;)
    co_await dremini::sendRequestCoro("gemini://gemini.clehaxze.tw/");

Same issue! The logs have the same pattern of Socket created with fd: <fd> without Destroyed socket with fd: <fd> following immediately.

Socket created with fd: 11
Socket created with fd: 12
Socket created with fd: 13
Destroyed socket with fd: 12
Destroyed socket with fd: 13
Destroyed socket with fd: 11

Nice. Now the scope is much, much smaller: something in the client itself or the coroutine. I also asked the other maintainers of Drogon/Trantor in the meantime. They pointed out that I could have accidentally created a circular reference to trantor::TcpConnection, or held a reference somewhere with a long lifetime. Nothing turned up on my radar after more code reading. But that sounded possible.

After some more banging my head against the wall, I decided to rewrite the response awaiter to be much simpler and have far fewer features. Surprise! Sockets destructed in the correct order!

Socket created with fd: 11
Destroyed socket with fd: 11
Socket created with fd: 12
Destroyed socket with fd: 12
Socket created with fd: 11
Destroyed socket with fd: 11

I added the features back one by one until the old pattern showed up again. Turns out enabling the timeout was the culprit. Going through the client I found this:

auto weakPtr = weak_from_this();
// Note: connPtr is a shared_ptr to the TcpConnection, captured by value.
timeoutTimerId_ = loop_->runAfter(timeout_, [weakPtr, connPtr](){
    auto thisPtr = weakPtr.lock();
    if(!thisPtr)
        return;
    ...

D'oh. I captured the shared_ptr to the connection in the handler. Ohhhhhhh dang. RIP me. I never thought this would be an issue, as timers are invalidated upon getting a response or connection failure. But invalidating a timer doesn't free its callback: the lambda, and the shared_ptr it captured, is only destroyed when the timer eventually times out and runs the now no-op handler. Until then, the captured connPtr keeps the connection, and its socket, alive.

Anyways. Fixing that and refactoring the response generation solved the issue. Now the socket count is much lower and rises much more slowly. It's still weird that the count keeps rising instead of having an upper bound, all while the log shows we are releasing the sockets. So TIME_WAIT or CLOSE_WAIT, I guess? Still, figuring out the larger issue makes me happy. Now I can schedule the crawler and leave it running without manual intervention.

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes I do AI research. Chronic VRChat addict.
  • marty1885 \at protonmail.com
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df