With HAProxy, you can implement a circuit breaker to protect services from widespread failure.
Martin Fowler, the author of Refactoring and Patterns of Enterprise Application Architecture, hosts a website where he catalogs software design patterns. He defines the Circuit Breaker pattern like this:
The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.
A circuit breaker has these characteristics:
It monitors services in real time, checking whether failures exceed a threshold;
If failures become too prevalent, the circuit breaker shuts off the service for a time.
When enough errors are detected, a circuit breaker flips into the open state, which means it shuts off the service. When that happens, the calling service or application realizes that it cannot reach the service. It could have some contingency in place, such as increasing its retry period, rerouting to another service, or switching to some form of degraded functionality for a while. These fallback plans prevent the application from spending any more time trying to use the service.
You would never use a circuit breaker between end-users and your application since it would lead to a bad user experience. Instead, it belongs between services in your backend infrastructure that depend on one another. For example, if an order fulfillment service needs to call an address verification service, then there is a dependency between those two services. Dependencies like that are common in a distributed environment, such as in a microservices architecture. In that context, a circuit breaker isolates failure to a small part of your overall system before it seriously impacts other parts.
This blog post will teach you how to implement a circuit breaker with HAProxy. You’ll see a simple way, which relies on HAProxy’s observe keyword, and a more complex way that allows greater customization.
Why We Need Circuit Breaking
A circuit breaker isolates failure to a small part of your overall system. To understand why it’s needed, let’s consider what other mechanisms HAProxy has in place to protect clients from faulty services.
First up, consider active health checks. When you add the check parameter to a server line in your HAProxy configuration, HAProxy pings that server to see if it’s up and working properly. This can either be an attempt to connect over TCP/IP or an attempt to send an HTTP request and get back a valid response. If the ping fails enough times, HAProxy stops load balancing to that server.
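For example, a minimal sketch of an active HTTP health check might look like this (the /health URL and the addresses are placeholders for illustration):

backend serviceA
    # Send an HTTP GET to /health instead of only opening a TCP connection
    option httpchk GET /health
    # check enables health checking; inter sets the interval between checks,
    # fall/rise set how many checks mark the server down or back up
    server s1 192.168.0.10:80 check inter 2s fall 3 rise 2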
Because they target the service’s IP and port or a specific URL, active health checks monitor a narrow range of the service. They work well for detecting when a service is 100% down. But what if the error happens only when a certain API function is called, which is not monitored by the health checks? The health checks would report that the service is functioning properly, even if 70 or 80% of requests are calling the critical function and failing. Unlike active health checks, a circuit breaker monitors live traffic for errors, so it will catch errors in any part of the service.
Another mechanism built into HAProxy is automatic retries, which let you attempt a failed connection or HTTP request again. Retrying is an intrinsically optimistic operation: It expects that calling the service a second time will succeed, which is perfect for transient errors such as those caused by a momentary network disruption. Retries do not work as well when the errors are long-lived, such as those that happen when a bad version of the service has been deployed.
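As a point of comparison, a minimal retry setup might look like the sketch below; it assumes a recent HAProxy version that supports the retry-on directive, and the error classes listed are only examples:

backend serviceA
    # Attempt a failed request up to 3 more times
    retries 3
    # Retry on failed connections, response timeouts, and 503 responses
    retry-on conn-failure response-timeout 503
    # Allow a retry to be redispatched to a different server
    option redispatch
    server s1 192.168.0.10:80 check
    server s2 192.168.0.11:80 check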
A circuit breaker is more pessimistic. After errors exceed a threshold, it assumes that the disruption is likely to be long-lived. To protect clients from continuously calling the faulty service, it shuts off access for a specified period of time. The hope is that, given enough time, the service will recover.
You can combine active health checks, retries, and circuit breaking to get full coverage protection.
Implement a Circuit Breaker: The Simple Way
Since 2009, HAProxy has had the observe keyword, which enables live monitoring of traffic for detecting errors. It operates in either layer4 mode or layer7 mode, depending on whether you want to watch for failed TCP/IP connections or failed HTTP requests. When errors reach a threshold, the server is taken out of the load-balancing rotation for a set period of time.
Consider the following example of a backend section that uses the observe layer7 keyword to monitor traffic for HTTP errors:
backend serviceA
    default-server maxconn 30 check observe layer7 error-limit 50 on-error mark-down inter 1s rise 30 slowstart 20s
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
Keywords on the default-server line apply to all server lines that follow. These keywords mean the following:
Keyword | Description
maxconn 30 | How many connections HAProxy should open to the server in parallel.
check | Enables health checking.
observe layer7 | Monitors live traffic for HTTP errors.
error-limit 50 | If errors reach 50, trigger the on-error action.
on-error mark-down | What to do when the error limit is reached: mark the server as down.
inter 1s | How often to send an active health check; in conjunction with rise, this sets the period to keep the server offline.
rise 30 | How many active health checks must pass before bringing the server back online.
slowstart 20s | After the server recovers and is brought back online, send traffic to it gradually over 20 seconds until it reaches 100% of maxconn.
With these keywords in place, HAProxy will perform live monitoring of traffic at the same time as it performs active health checking. If 50 consecutive requests fail, the server is marked as down and taken out of the load balancing rotation. The period of downtime lasts for as long as it takes for the active health checks to report that the server is healthy again. Here, we’ve set the interval of the active health checks to one per second. There must be 30 successful health checks. So, the service will be shut off for a minimum of 30 seconds.
We’re also including the slowstart keyword, which eases the server back into full service once it becomes healthy, sending traffic to it gradually over 20 seconds. In circuit breaker terminology, this is called putting the server into a half-open state. A limited number of requests are allowed to invoke the service during this time.
With this implementation, each server is taken out of the load balancing rotation on a case-by-case basis as the load balancer detects a problem with it. However, if you prefer, you can add a rule that quickens your reaction time by removing the entire pool of servers from active service once a certain number of them have failed. For example, if you started with ten servers but six have failed and been circuit broken, you can assume it won’t be long before the other four fail too. So, circuit break them now.
To do this, add an http-request return line that uses the nbsrv fetch method to check how many servers are still up and, if that number falls below a threshold, return a 503 error status for all requests. HAProxy will continue to check the servers in the background and will bring them back online when they are healthy again.
backend serviceA
    default-server maxconn 30 check observe layer7 error-limit 50 on-error mark-down inter 1s rise 30 slowstart 20s

    # Circuit break the whole backend if the number
    # of servers becomes less than or equal to 4
    http-request return status 503 content-type "application/json" string "{ \"message\": \"Circuit Breaker tripped\" }" if { nbsrv() le 4 }

    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
    server s3 192.168.0.12:80
    server s4 192.168.0.13:80
    server s5 192.168.0.14:80
    server s6 192.168.0.15:80
    server s7 192.168.0.16:80
    server s8 192.168.0.17:80
    server s9 192.168.0.18:80
    server s10 192.168.0.19:80
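Alternatively, if you would rather reroute traffic than return an error when the pool shrinks, one hypothetical variation (the backend names here are placeholders) is to make that decision in the frontend and send requests to a backup backend:

frontend fe_main
    bind :80
    # If serviceA has 4 or fewer healthy servers, fail over to a backup backend
    use_backend serviceA_backup if { nbsrv(serviceA) le 4 }
    default_backend serviceA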
Implement a Circuit Breaker: The Advanced Way
There’s another way to implement a circuit breaker—one that isn’t as simple but offers more ways for you to customize the behavior. It relies on several of HAProxy’s unique features, including stick tables, ACLs, and variables.
Consider the following example. Everything in the backend section except for the server lines makes up our circuit breaker logic:
backend serviceA
    # Create storage for tracking client info
    stick-table type string size 1 expire 30s store http_req_rate(10s),gpc0,gpc0_rate(10s),gpc1

    # Is the circuit broken?
    acl circuit_open be_name,table_gpc1 gt 0

    # Reject request if circuit is broken
    http-request return status 503 content-type "application/json" string "{ \"message\": \"Circuit Breaker tripped\" }" if circuit_open

    # Begin tracking requests
    http-request track-sc0 be_name

    # Count HTTP 5xx server errors
    http-response sc-inc-gpc0(0) if { status ge 500 }

    # Store the HTTP request rate and error rate in variables
    http-response set-var(res.req_rate) sc_http_req_rate(0)
    http-response set-var(res.err_rate) sc_gpc0_rate(0)

    # Check if error rate is greater than 50% using some math
    http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 50 }

    server s1 192.168.0.10:80 check
    server s2 192.168.0.11:80 check
When our circuit breaker detects that more than 50% of recent requests have resulted in an error, it shuts off the entire backend—not only a single server—and rejects all incoming requests for the next 30 seconds.
Let’s step through this configuration. First, we define a stick table:
stick-table type string size 1 expire 30s store http_req_rate(10s),gpc0,gpc0_rate(10s),gpc1
A stick table tracks information about requests flowing through the load balancer. It stores a key, which in this case is a string, and associates it with counters. Here, the key is the name of the backend, serviceA. The counters include:
http_req_rate(10s) – the HTTP request rate over the last 10 seconds;
gpc0 – a general-purpose counter, which will store a cumulative count of errors;
gpc0_rate(10s) – the rate at which the general-purpose counter (errors) is increasing over 10 seconds;
gpc1 – a second general-purpose counter, which will store a 0 or 1 to indicate whether the circuit is open.
This stick table will store the HTTP request rate and the error rate for the backend. When errors make up more than 50% of requests, we set the second general-purpose counter, gpc1, to 1, opening the circuit and shutting off the service. The stick table’s expire parameter is set to 30s, or 30 seconds, which is how long the circuit will stay open before it closes again.
You can think of the stick table as looking like this when errors have reached the threshold, the circuit is open, and the service is offline for 30 seconds:
Key | Req rate | gpc0 (errors) | gpc0 rate (error rate) | gpc1 counter (circuit open) | expires
serviceA | 50 | 30 | 30 | 1 | 30 seconds
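As a side note, if you want to inspect these counters while testing, one option (sketched here with a placeholder socket path) is to expose HAProxy’s Runtime API in the global section and issue a show table serviceA command against it:

global
    # Expose the Runtime API so you can run "show table serviceA" against this socket
    stats socket /var/run/haproxy.sock mode 660 level admin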
After the stick-table line, we define an ACL named circuit_open:
acl circuit_open be_name,table_gpc1 gt 0
This line defines an expression that checks the gpc1 counter to see whether its value is greater than zero. If it is, then the circuit_open ACL will return true. Note that we’re using the table_gpc1 converter to get the value. There’s an important difference between this and the similar fetch method sc_get_gpc1(0). The sc_get_gpc1(0) fetch method will reset the expiration on the record when it’s used, but the table_gpc1 converter will not. In this instance, we do not want to reset the expiration because that would extend the time that the service is down every time someone makes a request. With the converter, the expiration counts down 30 seconds and then restores the service, regardless of whether clients are trying to call the service in the meantime.
After that, an http-request return line rejects all requests if the circuit is open, returning an HTTP 503 status:
http-request return status 503 content-type "application/json" string "{ \"message\": \"Circuit Breaker tripped\" }" if circuit_open
To give the caller more information, it sends back a JSON response with the message Circuit Breaker tripped. If this line is invoked, it ends the processing of the request, and the rest of the lines will not be called.
The next line begins tracking requests:
http-request track-sc0 be_name
It adds a record to the stick table if it doesn’t exist and updates the counters on each subsequent request. Note that the action is called track-sc0, which means it should start tracking sticky counter sc0. A sticky counter is a temporary variable that holds the value of the key long enough to add or fetch the record from the table. It is a slot HAProxy uses to track a request as it passes through. The http-request track-sc0 line selects the sticky counter to use, sc0, and stores the backend’s name in it.
By default, you get three sticky counters to use, which means that you can track three different aspects of a request simultaneously. Use the actions http-request track-sc0, track-sc1, and track-sc2. Increase the number of sticky counters by compiling HAProxy with the MAX_SESS_STKCTR compile-time variable.
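As a hypothetical sketch of tracking more than one aspect at once (the clients table below is a placeholder and isn’t part of the circuit breaker), you could track the backend name on sc0 and the client’s source IP on sc1:

# A backend used only as storage for a per-client stick table
backend clients
    stick-table type ip size 100k expire 30s store http_req_rate(10s)

backend serviceA
    # ...the circuit breaker configuration shown above...
    # Track the backend name on sticky counter 0, as before
    http-request track-sc0 be_name
    # Track the client IP address on sticky counter 1, stored in the clients table
    http-request track-sc1 src table clients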
The next line uses the sc-inc-gpc0(0) function to increment the first general-purpose counter in the stick table if the server returned a status greater than or equal to 500:
http-response sc-inc-gpc0(0) if { status ge 500 }
The expression “status ge 500” counts any errors in the HTTP 5xx range. We’re counting errors manually, which allows us to control which error codes we care about. Later, we calculate the rate of errors using the sc_gpc0_rate function.
How should you read the funky syntax of the sc-inc-gpc0(0) function? It says: Look up the sticky counter 0, the sticky counter we chose previously—the number in parentheses—and find the associated counter called gpc0. Then increment it. In other words, find the record that has the key serviceA and increment the error counter. Granted, in this configuration, the table will only ever have one record in it since the stick table is defined in one backend, but not used anywhere else.
There are built-in error counters named http_err_cnt and http_err_rate, but these look for errors with HTTP 4xx statuses only.
The next two lines store the HTTP request rate and error rate in variables. The error rate is the rate at which the gpc0 counter is being incremented. The helper functions sc_http_req_rate and sc_gpc0_rate return these values. We store them in variables named res.req_rate and res.err_rate:
http-response set-var(res.req_rate) sc_http_req_rate(0)
http-response set-var(res.err_rate) sc_gpc0_rate(0)
We must store them in variables because the next line uses the mul and div converters, which accept only an integer or a variable name as their argument, not another fetch expression:
http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 50 }
This line increments the second general-purpose counter, gpc1, if more than 50% of the requests resulted in errors. Once gpc1 is greater than zero, the circuit is open. There’s some math here that creates a percentage showing the rate of errors relative to the rate of all requests:
100 × error_rate / request_rate = X
if X > 50, open the circuit
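For example, if the stick table shows a request rate of 50 and an error rate of 30 over the last ten seconds, then 100 × 30 / 50 = 60, which is greater than 50, so gpc1 is incremented and the circuit opens.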
That’s it. This configures a circuit breaker for the backend that will shut off the service if more than 50% of the recent requests were errors. You can adjust the circuit breaker threshold by changing the number 50, or change how long the circuit stays open by adjusting the expire parameter on the stick table. You may also want to require a minimum request rate before the error rate is checked. For example, maybe you only care whether 50% of requests are errors if you’ve had more than 100 requests in the past ten seconds. If so, change the last line to this, which includes that extra condition:
http-response sc-inc-gpc1(0) if { var(res.req_rate) gt 100 } { int(100),mul(res.err_rate),div(res.req_rate) gt 50 }
Circuit Breaking in HAProxy: Conclusion
The circuit breaker pattern is ideal for detecting service failures that active health checks might not catch. It protects a system from widespread failure by isolating a faulty service and restricting access to it for a time. Clients can be designed to expect a circuit break and fall back to another service or simply deactivate that part of the application. HAProxy offers both a simple and an advanced way to implement the pattern, giving you plenty of flexibility.
Interested in advanced security and administrative features? HAProxy Enterprise is the world’s fastest and most widely used software load balancer. It powers modern application delivery at any scale and environment, providing the utmost performance, observability, and security. Organizations harness its cutting-edge features and enterprise suite of add-ons backed by authoritative expert support and professional services. Ready to learn more? Sign up for a free trial.