There are two possible ways to have HAProxy run on multiple CPU cores:
By using the multiprocess model, where HAProxy automatically starts a number of separate system processes (method available since HAProxy version 1.1.7)
By using the multithreading model, where HAProxy automatically starts a number of threads within a single process (method available since HAProxy version 1.8)
The traditional multiprocess approach currently achieves better performance, but the new multithreading model solves all of the limitations typically associated with multiprocess configurations and could certainly be interesting for early adopters who prefer ease of management over maximum performance.
The choice of the method is also somewhat dependent on the specific user needs and configuration. We know that SSL offloading and HTTP compression scale well in a multithreading model, at least on a relatively small number of threads (2 to 4). For other uses or a larger number of threads, we are still in the process of gathering definitive benchmarks and experiences.
In this blog post, we will take you on a tour of the multithreading functionality in HAProxy 1.8. We will provide you with background information, configuration instructions, a more detailed technical overview, and some debugging tips.
So let’s start!
From Multiprocess to Multithreading
Note: The nbproc directive described below was deprecated in HAProxy 2.3 and removed in HAProxy 2.5.
Starting with HAProxy version 1.1.7 released in 2002, it has been possible to automatically start multiple HAProxy processes. This was done using the configuration directive “nbproc”, and later individual processes were also mapped to individual CPU cores using “cpu-map”.
These multiprocess configurations were a standard way to scale users’ workloads and, with the correct settings, each individual process was able to take full advantage of the high-performance, event-driven HAProxy engine.
Also, multiprocess configurations had additional, specialized uses — for example, they were the configuration of choice for massive SSL offloading solutions. The general recipe for SSL offloading was:
Dedicate all but one HAProxy process to offloading SSL traffic
Have all those processes send decrypted traffic to the remaining process which handles the actual application logic (compression, HTTP headers modification, stickiness, routing, etc.)
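The recipe above can be sketched in 1.8-era configuration syntax. This is a simplified illustration, not a drop-in configuration: all names, paths, and addresses below are hypothetical.

```
global
    nbproc 4
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3

# Processes 2-4: SSL offloading only (certificate path is illustrative)
frontend fe-ssl
    bind :443 ssl crt /etc/haproxy/site.pem process 2
    bind :443 ssl crt /etc/haproxy/site.pem process 3
    bind :443 ssl crt /etc/haproxy/site.pem process 4
    default_backend be-pass

backend be-pass
    # hand decrypted traffic to process 1 over a local unix socket
    server loop unix@/var/run/haproxy-app.sock send-proxy-v2

# Process 1: the actual application logic
frontend fe-app
    bind unix@/var/run/haproxy-app.sock accept-proxy process 1
    # compression, header manipulation, stickiness, routing, etc. go here
    default_backend be-app

backend be-app
    server app1 192.0.2.10:8080
```

Each bind line is pinned to one process with the “process” keyword, so the SSL processes never run application logic and vice versa.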
However, multiprocess configurations come with certain limitations:
The HAProxy peers’ protocol, which is used to synchronize stick tables across HAProxy instances, may only be used in single-process configurations, leading to complexity when many tables need to be synchronized
There is no information sharing between HAProxy processes, so all data, configuration parameters, statistics, limits, and rates are per-process
Health checks are performed by each individual process, resulting in more health checking traffic than strictly necessary and causing transient inconsistencies during state changes
The HAProxy Runtime API is applicable to a single process, so any Runtime API commands need to be sent to all individual processes
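To illustrate the last point: assuming one stats socket has been declared per process in the global section (the socket paths here are purely illustrative), every Runtime API command has to be issued once per socket:

```
# global section (one socket per process):
#   stats socket /var/run/haproxy-1.sock process 1
#   stats socket /var/run/haproxy-2.sock process 2

for sock in /var/run/haproxy-*.sock; do
    echo "show info" | socat stdio "$sock"
done
```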
Consequently, while multiprocess configurations are useful in many cases, the performance benefits come combined with increased management complexity.
In the multithreading model, HAProxy starts multiple threads within a single process rather than starting multiple individual processes, and as such, it avoids all of the aforementioned problems.
Multithreading support is included in HAProxy starting with version 1.8.
Our goal for the first multithreading release was to produce a stable, thread-safe implementation with an innovative and extensible design. The initial work took us 8 months to complete, and we believe we have accomplished the task, but multithreading support will remain labeled experimental until we improve its overall performance and confirm stability in the largest installations.
Multithreading Support
Before multithreading can be activated, HAProxy must be compiled with multithreading support. This is the default on Linux 2.6.28 and newer, FreeBSD, OpenBSD, and Solaris. For other target platforms, it must be enabled explicitly by passing the make flag “USE_THREAD=1”. Conversely, on the platforms where it is enabled by default, it can be disabled by passing an empty “USE_THREAD=”.
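For example, an explicit thread-enabled build might look like the following (the target names are illustrative; pick the one matching your platform):

```
# force-enable thread support on a generic target
make TARGET=generic USE_THREAD=1

# force-disable it on a target where it defaults to on
make TARGET=linux2628 USE_THREAD=
```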
To check for multithreading support in your HAProxy, please run “haproxy -vv”. If multithreading is enabled, you will see the text “Built with multithreading support” in the output:
$ haproxy -vv
HA-Proxy version 1.8-rc4-358847-18 2017/11/20
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>
...
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
...
Built with multithreading support.
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
...
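With a thread-enabled build, multithreading is activated with the “nbthread” directive in the global section. A minimal sketch, assuming a 4-core machine (the thread count and CPU pinning are illustrative):

```
global
    nbthread 4
    # one possible pinning: threads 1-4 of process 1 onto CPUs 0-3
    cpu-map auto:1/1-4 0-3
```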
Advanced: Multithreading Architecture
This section provides a deeper technical overview for those wishing to get better insight and understanding of the multithreading functionality in HAProxy.
From an architectural point of view, numerous parts of HAProxy were improved as part of adding multithreading support.
Instead of having one thread for the scheduler and a number of worker threads, we decided to run a scheduler in every thread. This allowed the proven, high-performance, event-driven engine of HAProxy to run in each thread and remain essentially unchanged. It also made multithreading behave very similarly to multiprocess as far as usage is concerned, but without the multiprocess limitations!
In the multithreading model, each entity (a task, fd, or applet) is processed by only one thread at a time, and all entities attached to the same session are processed by the same thread. This means that all of the processing related to a specific session is serialized, avoiding most of the locking issues that would otherwise be present.
Thread affinity is also set on each entity. The session-related entities stick to the thread which accepted the incoming connection. Global entities (listeners, peers, checks, DNS resolvers, etc.) have no affinity, and all threads are likely to process them, but always one at a time in accordance with the description given in the previous paragraph.
Another important area is the handling of backend server state changes and their propagation. Server state changes are now applied in a single place, synchronously, removing the need for locks in the places where they would otherwise be required.
Any remaining multithreading topics mostly boil down to locks and atomic operations. In this initial release of HAProxy 1.8.0, some parts are conservatively locked to make them thread-safe, and we will surely improve performance over time by refining or removing some of these locks.
For example, one of the areas that will receive improvements in the future is Lua’s multithreading performance. Lua’s design forced us to use a global lock, which means that using Lua scripts with several threads will have a noticeable cost as the scripts will essentially run single-threaded.
Some other places in the code have been made thread-local to avoid the need for locks, but consequently, they could slightly change the expected behavior. For instance, reusable connections to backends are available to sessions sticky on the same thread only.
Finally, regarding the low-level details, it should be mentioned that we use pthreads to create threads and GCC’s atomic built-in functions to do atomic operations. We use progressive locks invented by our HAProxy Technologies CTO Willy Tarreau for all spinlocks and RWlocks. We use macros to abstract all the details, and all of this can be seen in the HAProxy header file “include/common/hathreads.h”.
Advanced: Debugging
As mentioned, the multithreading support is labeled experimental. User feedback and any bug reports will be very helpful to us in reaching the final desired level of performance and stability.
To help us diagnose locking costs or problems, you can enable lock debugging by compiling HAProxy with the option “DEBUG=-DDEBUG_THREAD”. With it, HAProxy will collect statistics on the locks. Currently, this information is printed when HAProxy stops, but we are considering adding an equivalent command to the Runtime API too.
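For example, on a 1.8-era Linux build, the flag might be passed like this (the target name is illustrative):

```
make TARGET=linux2628 DEBUG=-DDEBUG_THREAD
```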
From a performance perspective, it is always helpful to know the lock costs as this could help highlight bottlenecks. Here is an excerpt from a sample output:
Stats about Lock THREAD_SYNC:
# write lock : 0
# write unlock: 0 (0)
# wait time for write : 0.000 msec
# wait time for write/lock: 0.000 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock FDTAB:
# write lock : 139315
# write unlock: 139315 (0)
# wait time for write : 13.739 msec
# wait time for write/lock: 98.622 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Debugging information is also useful in tracing back deadlocks or double locks. For example, an attempt to do a double lock will fail, and HAProxy will exit. For each lock, we keep track of the last place where it was locked, and that information can then easily be printed in gdb:
(gdb) p rq_lock
$1 = {lock = 0, info = {owner = 0, waiters = 0, last_location = {function = 0x5abd80 <__func__.26911> "process_runnable_tasks", file = 0x5abd25 "src/task.c", line = 252}}}
Conclusion
We hope you have enjoyed this blog post providing an introduction to multithreading functionality in HAProxy, its configuration, and basic troubleshooting procedures.
If you want to use multithreading with HAProxy 1.8 in your infrastructure backed by enterprise support from HAProxy Technologies, please see our HAProxy Enterprise – Trial Version or contact us for expert advice.
Happy multithreading, and stay tuned!