Service reliability

Circuit breakers

A circuit breaker monitors a service in real time, checking its responses for errors. If failures exceed a threshold, the circuit breaker flips into the open state and shuts off access to the service. Its purpose is to detect error conditions that may last a long time. Rather than letting dependent services keep calling the faulty service, it returns an error immediately and keeps them from using the service for a period of time.

Circuit breaker using the observe argument

A simple implementation of the circuit breaker pattern uses the observe argument to monitor live traffic for errors. Consider the following example, which disables access to a server after it detects 50 consecutive HTTP errors:

haproxy
backend myservice
  default-server maxconn 30 check observe layer7 error-limit 50 on-error mark-down inter 1s rise 30 slowstart 20s
  server s1 192.168.0.10:80
  server s2 192.168.0.11:80

How it works:

  • The default-server directive sets arguments that apply to all server lines in the backend section.
  • The check argument enables health checking of the server.
  • The observe layer7 argument enables monitoring of live HTTP responses from the server.
  • The error-limit 50 argument sets a threshold of 50 consecutive errors, after which the on-error action is triggered.
  • The on-error mark-down argument marks the server as DOWN when the error-limit is reached.
  • The inter 1s argument sets the interval between active health checks (1 second). After a server has failed, these checks determine when to bring it back online.
  • The rise 30 argument sets how many successful active health checks there must be (30) before bringing the server back online. When you multiply the inter value by the rise value, you get the minimum amount of time that the server will be removed from the load-balancing rotation (1 second x 30 = 30 seconds).
  • The slowstart 20s argument sends traffic to the server gradually over 20 seconds after it has recovered until it reaches 100% of its maximum connections, as set by maxconn.
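
As a rough illustration of the inter and rise arithmetic described above, the following variant keeps a failed server out of rotation for at least 2 seconds x 10 = 20 seconds. The values 2s and 10 are chosen only for this sketch and are not recommendations:

```haproxy
backend myservice
  # Illustrative values only: inter 2s x rise 10 = at least 20 seconds out of rotation
  default-server maxconn 30 check observe layer7 error-limit 50 on-error mark-down inter 2s rise 10 slowstart 20s
  server s1 192.168.0.10:80
  server s2 192.168.0.11:80
```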

You may also set observe to layer4 if you prefer to monitor for unsuccessful connections to a server rather than failed HTTP responses.
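
A minimal sketch of the layer4 variant, assuming the same servers and thresholds as the layer7 example; only the observe value changes:

```haproxy
backend myservice
  # Assumption: thresholds carried over from the layer7 example; only observe differs
  default-server maxconn 30 check observe layer4 error-limit 50 on-error mark-down inter 1s rise 30 slowstart 20s
  server s1 192.168.0.10:80
  server s2 192.168.0.11:80
```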

Circuit breaker using stick tables

In this more complex example, the load balancer monitors the number of HTTP 5xx errors returned from all servers in the backend. If that number exceeds 50% of all responses, it disables access to the service by rejecting all new requests for the next 30 seconds.

haproxy
backend myservice
  stick-table type string size 1 expire 30s store http_req_rate(10s),gpc0,gpc0_rate(10s),gpc1

  # Is the circuit open (no traffic can flow)?
  acl circuit_open be_name,table_gpc1 gt 0

  # Reject request if circuit is open
  http-request deny deny_status 503 if circuit_open

  # Begin tracking requests
  http-request track-sc0 be_name

  # Count HTTP 5xx server errors
  http-response sc-inc-gpc0(0) if { status ge 500 }

  # Store the HTTP request rate and error rate in variables
  http-response set-var(res.req_rate) sc_http_req_rate(0)
  http-response set-var(res.err_rate) sc_gpc0_rate(0)

  # Check if error rate is greater than 50% using some math
  http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 50 }

  server s1 192.168.0.10:80 check
  server s2 192.168.0.11:80 check

How it works:

  • The stick-table line defines the storage used to track requests entering the backend. It stores the HTTP request rate, the HTTP error count and error rate (captured with the generic counter gpc0 and its rate, gpc0_rate), and a counter that acts as a flag that opens the circuit (gpc1) when the error percentage exceeds a threshold. The expire argument sets how long to disable access to the service once the gpc1 flag has been incremented. In this example, the service is disabled for 30 seconds when it becomes faulty.
  • The circuit_open ACL checks whether the flag gpc1 is greater than 0. If it is, the circuit is open.
  • The http-request deny line rejects all requests while the circuit is open, returning an HTTP 503 - Service Unavailable response in the meantime.
  • The http-request track-sc0 line ensures that all requests entering the backend are monitored for errors.
  • The http-response sc-inc-gpc0(0) line increments the error counter (gpc0) every time a server returns an HTTP 5xx response (i.e. any HTTP error in the 500-599 range).
  • The http-response set-var lines set two variables. The first is res.req_rate, which holds the current HTTP request rate. The second is res.err_rate, which holds the current HTTP error rate.
  • The http-response sc-inc-gpc1(0) line increments the gpc1 flag if the error rate is more than 50% of the request rate. This opens the circuit. The circuit stays open and no requests are allowed into the backend until the record expires in the stick table after 30 seconds.

To use a different error rate threshold, change the value 50 on the http-response sc-inc-gpc1(0) line. To change how long the circuit stays open, change the expire argument on the stick table.
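
As a sketch of such an adjustment, the following variant opens the circuit when more than 25% of responses are errors and keeps it open for 60 seconds. The values 25 and 60s are illustrative assumptions, not recommendations; everything else matches the example above:

```haproxy
backend myservice
  # Illustrative change: expire 60s keeps the circuit open for 60 seconds
  stick-table type string size 1 expire 60s store http_req_rate(10s),gpc0,gpc0_rate(10s),gpc1

  acl circuit_open be_name,table_gpc1 gt 0
  http-request deny deny_status 503 if circuit_open
  http-request track-sc0 be_name

  http-response sc-inc-gpc0(0) if { status ge 500 }
  http-response set-var(res.req_rate) sc_http_req_rate(0)
  http-response set-var(res.err_rate) sc_gpc0_rate(0)

  # Illustrative change: open the circuit when the error rate is greater than 25%
  http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 25 }

  server s1 192.168.0.10:80 check
  server s2 192.168.0.11:80 check
```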
