The trouble with Trantor's threading design

This is just something that's been on my mind and that I'm trying to figure out how to solve. Many web frameworks use a work-stealing thread pool to improve throughput. Say one thread is overburdened with 100 tasks in its queue while the other 15 threads sit idle. The idle threads should be able to take tasks away from the overburdened one. Well, I can't do that (at least not easily) in Drogon and Trantor.

For context, Drogon[1] is a C++ web framework and Trantor[2] is its base TCP library.

For more background, read my previous article on Drogon's threading model.

What's the problem? In Trantor, each event loop is tied to a thread, so a 16-thread pool has 16 event loops. Each event loop has its own epoll/kqueue event system to handle network events. Event loops also provide the queueInLoop and runInLoop functions, which enqueue a task on a specific thread or, in the case of runInLoop, run it immediately if we happen to already be on that thread.
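
To make those two functions concrete, here is a minimal sketch of a loop-owned task queue. MiniLoop, runPending and demoLoopOrdering are hypothetical names made up for this illustration. This is not Trantor's EventLoop; only the method names queueInLoop, runInLoop and isInLoopThread are borrowed, and the sketch assumes the thread that constructs the loop is the one that owns it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event loop. This is NOT Trantor's EventLoop;
// only the three method names are borrowed from the article.
class MiniLoop
{
  public:
    // Assumption of this sketch: the constructing thread owns the loop.
    MiniLoop() : owner_(std::this_thread::get_id()) {}

    bool isInLoopThread() const
    {
        return std::this_thread::get_id() == owner_;
    }

    // Always enqueue; the task runs later on the owning thread.
    void queueInLoop(std::function<void()> task)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Run right away if we are already on the owning thread, else enqueue.
    void runInLoop(std::function<void()> task)
    {
        if (isInLoopThread())
            task();
        else
            queueInLoop(std::move(task));
    }

    // Drain queued tasks in FIFO order. Call this on the owning thread only.
    // A real loop would do this in between polling epoll/kqueue.
    std::size_t runPending()
    {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(batch, tasks_);
        }
        const std::size_t count = batch.size();
        for (; !batch.empty(); batch.pop())
            batch.front()();
        return count;
    }

  private:
    const std::thread::id owner_;
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};

// Returns the order in which the three tasks actually ran.
std::vector<int> demoLoopOrdering()
{
    MiniLoop loop;
    std::vector<int> order;
    loop.runInLoop([&] { order.push_back(1); });  // owner thread: runs now
    std::thread other([&] {
        loop.runInLoop([&] { order.push_back(2); });    // foreign thread: queued
        loop.queueInLoop([&] { order.push_back(3); });  // always queued
    });
    other.join();
    loop.runPending();  // tasks 2 and 3 run here, on the owner thread
    return order;
}
```

Calling demoLoopOrdering() from a main function shows the point: no matter which thread submits a task, the body only ever runs on the thread that owns the loop, and queued tasks keep their submission order.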

This lets us avoid threading issues in low-level code by putting tasks on the right thread. Take disconnecting from MySQL and executing SQL queries, for example:

void MysqlConnection::disconnect()
{
    auto thisPtr = shared_from_this();
    std::promise<int> pro;
    auto f = pro.get_future();
    loop_->runInLoop([thisPtr, &pro]() {
        thisPtr->status_ = ConnectStatus::Bad;
        thisPtr->channelPtr_->disableAll();
        thisPtr->channelPtr_->remove();
        thisPtr->mysqlPtr_.reset();
        pro.set_value(1);
    });
    f.get();
}

void MysqlConnection::execSql(string_view &&sql, ...)
{
    if (loop_->isInLoopThread())
        ...
    else
        loop_->queueInLoop(
            [thisPtr, ...,
             exceptCallback = std::move(exceptCallback)]() mutable {
                thisPtr->execSqlInLoop(std::move(sql),
                                       paraNum,
                                       std::move(parameters),
                                       std::move(length),
                                       ...
}

By forcing all client-related code onto a single thread, we don't have to worry about another thread using the MySQL client while we are disconnecting. Either we have already disconnected or we will disconnect later; there's no intermediate state.

However, this poses a problem for task stealing. In a task-stealing thread pool, threads look for tasks on other threads' queues, dequeue them and run them. Now suppose execSql and then disconnect are enqueued on the same queue. It's possible for another thread to steal the disconnect task and execute it while the owning thread is still executing execSql. Boom, suddenly we have race conditions and use-after-free. I guess this is usually solved by fine-grained locking. I just don't see us using it in Drogon. I believe our approach is much more elegant and less error-prone.

Potential solutions

I don't think there's an elegant solution. My current proposal is to have a multi-producer/multi-consumer queue in the thread pool. We submit tasks that can be shared to that queue. Then, when a thread goes idle, it checks whether the queue has something to do. If so, it dequeues and runs the task. However, this does not provide any execution guarantee. If all the per-thread task queues end up busy, tasks on the shared queue can starve and never execute. So, unlike a work-stealing thread pool, it's not a net improvement.
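
A rough sketch of that proposal, under loud assumptions: SharedQueuePool, submitPinned, submitShared and demoSharedQueue are all hypothetical names, nothing here exists in Drogon or Trantor, and a single mutex stands in for a proper MPMC queue just to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <set>
#include <thread>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Hypothetical pool: one private queue per worker plus one shared queue.
// One mutex guards everything; a real design would use an MPMC queue.
class SharedQueuePool
{
  public:
    explicit SharedQueuePool(std::size_t n) : pinned_(n)
    {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { run(i); });
    }
    // Lets the workers drain every queued task, then joins them.
    ~SharedQueuePool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto &w : workers_)
            w.join();
    }
    // The task must run on worker i (think queueInLoop on one loop).
    void submitPinned(std::size_t i, Task t) { push(pinned_[i], std::move(t)); }
    // The task may run on whichever worker goes idle first.
    void submitShared(Task t) { push(shared_, std::move(t)); }

  private:
    void push(std::deque<Task> &q, Task t)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            q.push_back(std::move(t));
        }
        cv_.notify_all();
    }
    void run(std::size_t i)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;)
        {
            cv_.wait(lock, [&] {
                return stopping_ || !pinned_[i].empty() || !shared_.empty();
            });
            // Pinned work first; shared work only when this worker is idle.
            std::deque<Task> &q = !pinned_[i].empty() ? pinned_[i] : shared_;
            if (q.empty())
                return;  // stopping, and nothing left for this worker
            Task t = std::move(q.front());
            q.pop_front();
            lock.unlock();
            t();  // run the task without holding the lock
            lock.lock();
        }
    }
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<std::deque<Task>> pinned_;
    std::deque<Task> shared_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// True if pinned tasks ran in order on one thread and all shared tasks ran.
bool demoSharedQueue()
{
    std::vector<int> order;             // touched by worker 0 only
    std::set<std::thread::id> threads;  // touched by worker 0 only
    std::atomic<int> sharedDone{0};
    {
        SharedQueuePool pool(4);
        for (int k = 0; k < 50; ++k)
        {
            pool.submitPinned(0, [&, k] {
                order.push_back(k);
                threads.insert(std::this_thread::get_id());
            });
            pool.submitShared([&] { ++sharedDone; });
        }
    }  // destructor drains the queues and joins the workers
    bool ok = order.size() == 50u && threads.size() == 1u && sharedDone == 50;
    for (std::size_t k = 0; ok && k < order.size(); ++k)
        ok = order[k] == static_cast<int>(k);
    return ok;
}
```

Note how run() only touches the shared queue when the worker's own queue is empty. That keeps pinned tasks safe, and it is also exactly where the starvation comes from: a worker that never runs out of pinned work never helps with shared work.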

Another approach is to add a tag to tasks. When we try to dequeue for stealing, we check whether the task is sharable. If so, we dequeue it. If not, we act as if there's no task to steal. This is also not as efficient as a traditional work-stealing thread pool, simply because we have to check whether the task is sharable. I also think it's hard to make it completely lock-free.
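
The tagging idea could look something like the sketch below. TaggedTask, TaggedQueue, popOwn, trySteal and demoTaggedStealing are hypothetical names for illustration only, and I'm assuming the usual work-stealing layout where the owner takes the oldest task and a thief looks at the newest one. It uses a mutex and std::optional (C++17), so it sidesteps the lock-free question entirely.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical task wrapper carrying the "sharable" tag.
struct TaggedTask
{
    std::function<void()> fn;
    bool sharable;  // true: any thread may run it; false: owner thread only
};

// Hypothetical per-thread queue. Mutex-based, so not lock-free.
class TaggedQueue
{
  public:
    void push(std::function<void()> fn, bool sharable)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back({std::move(fn), sharable});
    }

    // Owner side: oldest task first, regardless of its tag.
    std::optional<TaggedTask> popOwn()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty())
            return std::nullopt;
        TaggedTask t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }

    // Thief side: look at the newest task only. If it is not sharable,
    // behave as if there were nothing to steal.
    std::optional<TaggedTask> trySteal()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty() || !tasks_.back().sharable)
            return std::nullopt;
        TaggedTask t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }

  private:
    std::mutex mutex_;
    std::deque<TaggedTask> tasks_;
};

// Single-threaded walk-through; returns a log of what happened, in order.
std::vector<std::string> demoTaggedStealing()
{
    std::vector<std::string> log;
    TaggedQueue q;
    q.push([&] { log.push_back("execSql"); }, false);     // owner only
    q.push([&] { log.push_back("disconnect"); }, false);  // owner only
    q.push([&] { log.push_back("sharableJob"); }, true);  // anyone

    // A thief may take the sharable job at the back of the queue...
    if (auto t = q.trySteal())
        t->fn();
    // ...but not disconnect, which is now the newest task and is pinned.
    if (!q.trySteal())
        log.push_back("steal refused");
    // The owner still runs its pinned tasks in submission order.
    while (auto t = q.popOwn())
        t->fn();
    return log;
}
```

The walk-through also shows a weakness: a pinned task sitting at the stealing end hides any sharable tasks queued before it, so a thief can walk away empty-handed even though there was work it could have taken.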

I don't know. Time to go to bed. Maybe I can come up with something while dreaming.

Author's profile
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.


  • marty1885 \at protonmail.com
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df