Rendered at 19:34:54 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
bijowo1676 20 hours ago [-]
the article offers a simplified world model: Poisson arrivals and infinite queue, which is fine as a math model.
In the real world however, the bursts can be correlated, due to factors like timeouts/retries, thundering herd, correlated bursts.
so the real economics of load-balanced system is a simple reliability story: being able to reasonably serve the peak traffic, which leads to over-provisioning of those systems.
using cloud allows some form of scale up/down of resources, but doesn't completely solve the problem. I think the migration away from synchronyous systems towards async systems and letting clients gradually absorb the delays is a better approach (rather than forcing infrastructure to be dynamically scaled up/down and be billed per request-second by your cloud provider)
PaulHoule 7 hours ago [-]
I'll argue though there is a political and ideological dimension to queuing theory, especially when it comes to systems where humans are handling the work.
In the idealized case the comfortable place to operate a system is with around 2/3 utilization, like around there latency (customer experience, employee experience) is reasonable, slack is reasonable, etc.
A manager who is being managed by a manager who is being managed by a manager who is being managed (...) is going to see 0.99 utilization and want that last 0.01 and be oblivious to the fact that the math says the system is already past the breaking point, customers are furious, employees are worn out. Any slack at at all seems like an affront.
cindyllm 4 hours ago [-]
[dead]
genxy 16 hours ago [-]
Another technique to add to the mix if you can handle the additional complexity is to load or feature shed. If you can delay or just drop additional expensive application features during the exact time you need to scale or handle a burst, then your system has additional core app logic to handle requests. This can prevent the system getting wedged in a positive feedback loop.
See also the gamedev technique of having sacrificial assets or code, so when you need to free up space late in the schedule to ship, you have something you can actually shed.
Ylano 10 hours ago [-]
Having something intentionally non-essential to cut is much better than discovering under pressure that everything is load-bearing
iTokio 2 hours ago [-]
It’s also an old devops trick to put some GB files on the FS of critical systems (databases, ..) when there are no way to dynamically add more space/volumes and monitoring is not trusted.
Bit hard to explain though.
fmajid 10 hours ago [-]
As in the classic paper "Wide-area traffic: The failure of Poisson modeling" by Vern Paxson (author of GNU flex) and Sally Floyd (legend in the world of TCP/IP congestion control):
Hawkes processes are what other fields use to model this
Joel_Mckay 19 hours ago [-]
>In the real world however, the bursts can be correlated
Very true, as application-layer load-balancing often explicitly pre-bakes the traffic schedule to several hundred distributed IPs for data locality. Essentially bypassing the functional need for DNS and local round-robin traffic balancers.
One trades concurrent bandwidth for slightly higher latency, and dynamically adapted capacity as traffic load changes. =3
laz 16 hours ago [-]
If your clients are all this well behaved, then you’re definitely not exposed to the public internet.
The global edge networks that I’m aware of all use L4 LBs and L7 LBs. Cloudflare picks anycast over DNS LB, but DNS LB is still widely used.
I don’t see these things changing.
Joel_Mckay 14 hours ago [-]
> I don’t see these things changing.
Time Division Multiplexing is usually already used on cellular and Wifi wireless protocols. It only requires slight modification to turn it into an effective network traffic balancer to avoid the naive "everyone update on Tuesday 6am UTC", or "It is Christmas morning and game registration is open".
Notably, it also allows tracking specific accounts by encoding disjoint ingress host lists (siloed concurrent user groups with client certs and firewall whitelist rules.) And users do not have global network knowledge as hosts are cycled into temporary stewardship under load. Thus, only the coordinators for one-time new-user registration operates on classical DNS/round-robin host services.
With DNS, by expected function everyone knows the global published ingress points within minutes. Under a DoS the traffic just hammers down, and small firms usually just pay for the Cloudflare like services.
For systems I've known, TDM reduced peak resource capacity costs down by around 37x. Generally speaking, a 100 user group having fun will not share their server details/invites with folks that exhibit lag-switching or other network shenanigans.
But you are correct, in that it doesn't help if PIBKAC. =3
laz 13 hours ago [-]
We're definitely talking about different things. The open internet is hostile and you'll always need load shedding when the clients misbehave.
If you control all clients and servers and are on a closed network, you can do all sorts of fun things... though load shedding is helpful for when your good clients turn bad due to a code bug. A self DoS is the worst kind of outage.
Ylano 10 hours ago [-]
[dead]
mjb 20 hours ago [-]
A dead comment says:
> Of course, this assumes independent events. World Cup, super bowls, etc break these assumptions.
Yes, this is very true. The model here works for Poisson arrivals and exponential service time (the M/M), which are poor approximations of real-world traffic patterns (which tend to be non-stationary and non-ergodic, and include substantial seasonality). However, the frequency of that seasonality is typically rather low (e.g. daily cycles), and so these stronger assumptions are quite defensible for short time periods.
A better approach is to do simulation with real traffic patterns, or even with more sophisticated parametric models, and get better answers (e.g. https://stability-sim.systems/). The good news is that kind of simulation is cheaper to do than ever before.
megamalloc 20 hours ago [-]
What's conspicuously missing is the plot of performance when you do have a well tuned queue in front of the service. Yes, having a queue becomes less important the more backend servers you have, but here even with 10 servers the plot shows your latency remains >25% worse than it would be with a queue.
Also missing is discussion of how the variance in processing times affects you when you rely on load balancing alone.
mjb 20 hours ago [-]
> What's conspicuously missing is the plot of performance when you do have a well tuned queue in front of the service.
As in between the service and the load balancer? There's already an infinite queue in the load balancer. You can try that out on https://stability-sim.systems/ to see the effect, but the short version is that (in this model) it makes things worse.
If you're saying that the queue in the load balancer should be limited in size to reduce tail latency, then I agree.
megamalloc 19 hours ago [-]
No, I mean when you have a queue broker that the backends can pull work from when they become idle, rather than relying on load balancing which will send work to backends while they're still busy.
wmf 18 hours ago [-]
This scenario already works that way. The very first sentence says "servers, each of which can only handle a single concurrent request, and has no internal queuing". This implies that the load balancer waits for a server to finish a request then immediately sends the next one.
megamalloc 17 hours ago [-]
I don't believe it does. As I understand it, the load balancer has a queue in which it can buffer infinite requests, but it drains that queue by pushing work to the backend servers in what's probably a round-robin fashion. So there is secondary queueing at each server. Even the "least connections" strategies available through some load balancers do not usually behave as you might expect (by always sending the next request to a server that's idle). Pull-based load balancing via a queue has its own downsides but the big upside is to make latency essentially a constant low overhead regardless of the number of servers in the typical case.
NooneAtAll3 4 hours ago [-]
I think your imagination decided to rapidly overcomplicate what is literally (literally literally) Queuing Theory 101 example
If I were to guess there weren't any "backend servers" at all. It was just array of random increasing numbers (that stand for request arrival times) and arrays of numbers with minimum distance (that stand for time each consumer took a request)
there's no connections to "least-ify" the strategy about. There's no difference between consumers, no matter the amount of requests having been processed
juergn 10 hours ago [-]
M/M/c is not the right model for a typical loadbalancer, since a loadbalancer typically does not manage a shared queue but simply passes the jobs to one of the servers. The models are:
- M/M/1 (vertical scaling, one queue and one fast server): fastest response time
- M/M/c (thread-pool, one shared queue, c slow servers): c-times slower for low loads, asymptotically similar to M/M/1 for high loads
- c-times M/M/1 (loadbalancer, c slow servers each which its own queue): always c-times slower than M/M/1.
Only the response times are different. Throughtput is the same in the ideal case.
lmeyerov 4 hours ago [-]
curious how folks like to measure this stuff wrt load testing?
We are actively revisiting our traffic simulation approach, and a surprisingly non-obvious part has been which charts to focus on. Our case is a gpu-server-backed interactive analytics app, like a photoshop for data, so we do focus both on latency and errors, and especially around handling bursty sessions as discussed in the article.
crypttales 21 hours ago [-]
Of course, this assumes independent events. World Cup, super bowls, etc break these assumptions.
Still, queuing theory is so cool.
resters 15 hours ago [-]
It's not surprising if one has the mental model of the probability that the request gets enqueued. Then when you add variable time to process requests it becomes more clear why some requests can take unexpectedly long (there is a >0 probability that a request gets queued behind several of the slowest endpoints, for example). So even if 90% of the endpoints are fast and most of the requests aren't even queued, there will still be some that end up being quite slow.
nilsherzig 21 hours ago [-]
Why would anyone think that it would get linearly worse? What's the (wrong) assumption there?
gm678 4 hours ago [-]
I think people are reading it as a request every 0.8s that takes 1s to process, instead of 0.8 requests per second.
jldugger 12 hours ago [-]
I'm mostly just surprised the graph starts at 5 seconds for a mean value for all datapoints. I would have assumed it starts much closer to 1s. Which just makes the poll responses even crazier. Who is picking B when you have 25% more capacity than you need?
But I suppose the question is underspecified. How does the load balancer know which systems are busy? What happens to a request if the load balancer routes a request to a busy server?
physix 20 hours ago [-]
I thought the same thing. But, should we be surprised about what people believe in these days?
I think that the issue is in part due to the variables. Plotting the mean request time is less intuitive than plotting throughput.
If you plot throughput vs number of servers, it'll be a straight line. And asking people that, I think most would agree on a straight line. But who knows!
mjb 20 hours ago [-]
One explanation would be that more load could mean higher (absolute) variance in queue length, and therefore higher latency especially at higher percentiles. It doesn't work out that way (for reasons that Erlang actually writes about in one of his original works), but it's not an entirely unreasonable intuition.
PunchyHamster 21 hours ago [-]
I think author made it up just to have something more to show up on graph.
antonvs 20 hours ago [-]
It was a poll on Twitter, do you really expect good responses?
4 hours ago [-]
fabijanbajo 12 hours ago [-]
The footnote on exponential vs. log-normal service times is the part I'd push on.. in production I almost never see exponential, and heavy tails change the picture. Curious if you've looked at how robust this is under realistic distributions
Ylano 10 hours ago [-]
80% utilization is not a universal statement
12 hours ago [-]
jiggawatts 13 hours ago [-]
The problem with this kind of theoretical analysis is that most load balancers don't work this way, especially the typical "cloud" HTTP or TCP load balancers, which are stateless and avoid this kind of central queuing logic like the plague because it doesn't scale to their levels.
For example, most cloud load balancers I've worked with are stateless, non-queuing, and allocate work to back-ends strictly randomly.
Traditional non-cloud load balancers can implement this kind of perfect queuing, but these settings are generally off by default even when available.
Envoy, Apache, and Traefik have partial or limited support.
Conversely, most multi-threaded web server frameworks already do this by default! For example, ASP.NET has essentially an internal "load balancer" with a perfect queue if you pretend each core is a "node" and the whole server is the "scale out system".
zer00eyz 5 hours ago [-]
> can only handle a single concurrent request, and has no internal queuing
And the systems that load balancers front almost never behave this way...
Dont even get me started on client performance here as well, latency, speed, caching -- that can all be impacted by payload size.
The article is interesting, but it is an ideal that almost never turns up in the real world.
anchorapi 11 hours ago [-]
[flagged]
ukanwat 14 hours ago [-]
[dead]
bigcat12345678 20 hours ago [-]
Seemingly inconsequential article on hacker news and assume it probably is the kind of article that describes a profound idea with a naive title. And turns out it's actually very confusing as it puts overweight dramaticity over mundane intuition. Those type of writing belongs to literature sphere, not technology writing.
In the real world however, the bursts can be correlated, due to factors like timeouts/retries, thundering herd, correlated bursts.
so the real economics of load-balanced system is a simple reliability story: being able to reasonably serve the peak traffic, which leads to over-provisioning of those systems.
using cloud allows some form of scale up/down of resources, but doesn't completely solve the problem. I think the migration away from synchronyous systems towards async systems and letting clients gradually absorb the delays is a better approach (rather than forcing infrastructure to be dynamically scaled up/down and be billed per request-second by your cloud provider)
In the idealized case the comfortable place to operate a system is with around 2/3 utilization, like around there latency (customer experience, employee experience) is reasonable, slack is reasonable, etc.
A manager who is being managed by a manager who is being managed by a manager who is being managed (...) is going to see 0.99 utilization and want that last 0.01 and be oblivious to the fact that the math says the system is already past the breaking point, customers are furious, employees are worn out. Any slack at at all seems like an affront.
See also the gamedev technique of having sacrificial assets or code, so when you need to free up space late in the schedule to ship, you have something you can actually shed.
Bit hard to explain though.
https://www.osti.gov/servlets/purl/10107457
Hawkes processes are what other fields use to model this
Very true, as application-layer load-balancing often explicitly pre-bakes the traffic schedule to several hundred distributed IPs for data locality. Essentially bypassing the functional need for DNS and local round-robin traffic balancers.
One trades concurrent bandwidth for slightly higher latency, and dynamically adapted capacity as traffic load changes. =3
The global edge networks that I’m aware of all use L4 LBs and L7 LBs. Cloudflare picks anycast over DNS LB, but DNS LB is still widely used.
I don’t see these things changing.
Time Division Multiplexing is usually already used on cellular and Wifi wireless protocols. It only requires slight modification to turn it into an effective network traffic balancer to avoid the naive "everyone update on Tuesday 6am UTC", or "It is Christmas morning and game registration is open".
Notably, it also allows tracking specific accounts by encoding disjoint ingress host lists (siloed concurrent user groups with client certs and firewall whitelist rules.) And users do not have global network knowledge as hosts are cycled into temporary stewardship under load. Thus, only the coordinators for one-time new-user registration operates on classical DNS/round-robin host services.
With DNS, by expected function everyone knows the global published ingress points within minutes. Under a DoS the traffic just hammers down, and small firms usually just pay for the Cloudflare like services.
For systems I've known, TDM reduced peak resource capacity costs down by around 37x. Generally speaking, a 100 user group having fun will not share their server details/invites with folks that exhibit lag-switching or other network shenanigans.
But you are correct, in that it doesn't help if PIBKAC. =3
If you control all clients and servers and are on a closed network, you can do all sorts of fun things... though load shedding is helpful for when your good clients turn bad due to a code bug. A self DoS is the worst kind of outage.
> Of course, this assumes independent events. World Cup, super bowls, etc break these assumptions.
Yes, this is very true. The model here works for Poisson arrivals and exponential service time (the M/M), which are poor approximations of real-world traffic patterns (which tend to be non-stationary and non-ergodic, and include substantial seasonality). However, the frequency of that seasonality is typically rather low (e.g. daily cycles), and so these stronger assumptions are quite defensible for short time periods.
A better approach is to do simulation with real traffic patterns, or even with more sophisticated parametric models, and get better answers (e.g. https://stability-sim.systems/). The good news is that kind of simulation is cheaper to do than ever before.
As in between the service and the load balancer? There's already an infinite queue in the load balancer. You can try that out on https://stability-sim.systems/ to see the effect, but the short version is that (in this model) it makes things worse.
If you're saying that the queue in the load balancer should be limited in size to reduce tail latency, then I agree.
If I were to guess there weren't any "backend servers" at all. It was just array of random increasing numbers (that stand for request arrival times) and arrays of numbers with minimum distance (that stand for time each consumer took a request)
there's no connections to "least-ify" the strategy about. There's no difference between consumers, no matter the amount of requests having been processed
- M/M/1 (vertical scaling, one queue and one fast server): fastest response time
- M/M/c (thread-pool, one shared queue, c slow servers): c-times slower for low loads, asymptotically similar to M/M/1 for high loads
- c-times M/M/1 (loadbalancer, c slow servers each which its own queue): always c-times slower than M/M/1.
Only the response times are different. Throughtput is the same in the ideal case.
We are actively revisiting our traffic simulation approach, and a surprisingly non-obvious part has been which charts to focus on. Our case is a gpu-server-backed interactive analytics app, like a photoshop for data, so we do focus both on latency and errors, and especially around handling bursty sessions as discussed in the article.
Still, queuing theory is so cool.
But I suppose the question is underspecified. How does the load balancer know which systems are busy? What happens to a request if the load balancer routes a request to a busy server?
I think that the issue is in part due to the variables. Plotting the mean request time is less intuitive than plotting throughput.
If you plot throughput vs number of servers, it'll be a straight line. And asking people that, I think most would agree on a straight line. But who knows!
For example, most cloud load balancers I've worked with are stateless, non-queuing, and allocate work to back-ends strictly randomly.
Traditional non-cloud load balancers can implement this kind of perfect queuing, but these settings are generally off by default even when available.
- NetScaler: surgeProtection + maxClient=1
- F5 BIG-IP LTM: request queuing + pool/member connectionLimit=1
- HAProxy: server maxconn 1 + timeout queue
- NGINX Plus: server max_conns=1 + queue
Envoy, Apache, and Traefik have partial or limited support.
Conversely, most multi-threaded web server frameworks already do this by default! For example, ASP.NET has essentially an internal "load balancer" with a perfect queue if you pretend each core is a "node" and the whole server is the "scale out system".
And the systems that load balancers front almost never behave this way...
Dont even get me started on client performance here as well, latency, speed, caching -- that can all be impacted by payload size.
The article is interesting, but it is an ideal that almost never turns up in the real world.