what scale actually taught me about infra

this one is more of an infra diary than an announcement.
ive been thinking a lot about what it actually means to run software at scale. not the pretty architecture diagram version. the real version where a million tiny things happen at once and one weird server can ruin your night.
kubernetes used to feel like overkill. then it became necessary. and along the way the rest of the stack started teaching me things too.
rke2 was where i learned to breathe
my first real kubernetes setup was rke2 with rancher.
i did not know what i was doing. kubernetes felt like trying to assemble a spaceship from blog posts. there were pods, nodes, ingresses, storage classes, cert managers, helm charts, service meshes, and somehow every answer created three new tabs.
rancher helped because it made the cluster visible. i could click around, see what was running, see what was broken, and slowly build a mental model.
it also got me into prometheus, grafana, and loki. once you can actually see the system, you stop guessing as much.
rke2 was not perfect, but it gave me a place to learn. it made kubernetes feel possible.
k3s made it lighter
after that i moved toward k3s.
k3s felt lighter and more direct. less ceremony, less weight, still kubernetes. i ran real workloads on it with longhorn for storage and cilium for ingress, and it worked well.
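for flavor, bringing up that stack looked roughly like this. flags and chart names from memory, so treat it as a sketch and verify against the docs for your versions:

```bash
# illustrative k3s bring-up; disable the bundled defaults so cilium and longhorn can take over
curl -sfL https://get.k3s.io | sh -s - server \
  --disable traefik \
  --flannel-backend=none \
  --disable-network-policy

# cni and storage layered on as helm charts
helm repo add cilium https://helm.cilium.io/
helm repo add longhorn https://charts.longhorn.io
helm install cilium cilium/cilium --namespace kube-system
helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
```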
i provisioned the nodes with ansible, which got me partway to the thing i actually wanted: machines defined in code instead of in my memory. but ansible only describes how to set a server up, not what state it should always be in. drift creeps back in. someone runs a command. an upgrade rewrites a file. six months later the playbook and the actual server disagree, and you find out during an incident.
no dramatic breakup story here. k3s is good. ansible is good.
but the more traffic you handle, the more your taste changes. you stop caring about cleverness and start caring about whether the system can survive being boring at very high volume.
scaling is mostly not glamorous. its tuning kernel params so more websocket connections fit on a colocated box. its making sure deploys dont drop people mid session. its figuring out why one node got weird.
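concretely, its stuff like this. values are illustrative, not a recommendation; every box and workload is different:

```bash
# /etc/sysctl.d/99-websockets.conf -- illustrative values, tune for your own box
fs.file-max = 2097152                      # every socket is a file descriptor
net.core.somaxconn = 65535                 # accept backlog for listening sockets
net.ipv4.tcp_max_syn_backlog = 65535       # half-open connections during connect storms
net.ipv4.ip_local_port_range = 1024 65535  # more ephemeral ports for proxy hops
net.netfilter.nf_conntrack_max = 1048576   # or conntrack silently drops packets
# plus LimitNOFILE on the service unit itself, or none of this matters
```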
that kind of work is what changes your opinion about operating systems.
talos clicked
talos is a linux distribution built only for kubernetes.
not “ubuntu with kubernetes installed on top”. not “ssh into the box and tweak it until it feels right”. talos is an operating system that says: this machine exists to be a kubernetes node, no side quests.
no ssh. no shell. no local login.
the first time you hear that, it sounds annoying. after running real infrastructure, it starts sounding peaceful.
talos config is declarative. the node config is an artifact you can review, version, apply, and recreate. if something changes, you know where it came from. if a machine dies, you replace it, no funeral.
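an abridged sketch of what a node config looks like (fields trimmed heavily; the real schema is in the talos docs):

```yaml
# worker.yaml -- abridged talos machine config, illustrative values
# applied with: talosctl apply-config --nodes 10.0.0.2 --file worker.yaml
version: v1alpha1
machine:
  type: worker
  install:
    disk: /dev/nvme0n1
  sysctls:                     # the kernel tuning from earlier, declared instead of ssh'd in
    net.core.somaxconn: "65535"
  kubelet:
    extraArgs:
      max-pods: "250"
cluster:
  controlPlane:
    endpoint: https://10.0.0.1:6443
```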
that fits kubernetes better than traditional server management does. kubernetes wants desired state. talos brings that idea down into the operating system.
the big win is not that talos makes kubernetes simple. kubernetes is still kubernetes. the win is that talos removes a whole category of mystery state.
bare metal is the direction
imagine 50 AMD EPYC 4584PX worker nodes. overclocked, these things are absurd, especially for runtimes like nodejs, ruby, and elixir, where single thread performance is what actually moves request latency.
the goal is more of this. a fleet of fast bare metal nodes, owned outright, sitting on 50gbps pipes with no egress fees, beats hyperscaler economics by a wide margin at our scale. on aws, the egress alone for our traffic would run somewhere around 50k a month. at list prices of roughly 5 to 9 cents per GB, that works out to somewhere between half a petabyte and a petabyte of transfer. not compute. not databases. just bytes leaving their dungeon.
cloud is great for getting started. once your real bottleneck is bandwidth and single-thread perf, the math flips.
caching, carefully
we leaned harder on cloudflare caching once. rules looked correct. it still occasionally served one users hydrated html page to another user. nothing sensitive leaked, but the trust hit was permanent.
cdns are great for static assets and truly public pages. they are not where you want to be clever with logged-in html.
so the cache moved closer to the services, where the boundaries are explicit and an attacker (or a tired engineer) cant accidentally promote private data to public. cache the boring stuff aggressively. stay paranoid about anything personalized. invalidation lives next to the write, not in a ttl youre hoping holds.
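in code the shape is roughly this. names are made up (getProfileFromDb and friends are stand-ins); the point is the invalidation sitting next to the write:

```typescript
// illustrative cache-aside sketch; ioredis client, hypothetical db helpers
import Redis from "ioredis";

declare function getProfileFromDb(userId: string): Promise<object>;
declare function updateProfileInDb(userId: string, patch: object): Promise<void>;

const redis = new Redis(); // defaults to localhost:6379
const TTL_SECONDS = 300;   // a backstop, not the real invalidation mechanism

async function getProfile(userId: string) {
  const key = `profile:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const profile = await getProfileFromDb(userId);
  await redis.set(key, JSON.stringify(profile), "EX", TTL_SECONDS);
  return profile;
}

async function updateProfile(userId: string, patch: object) {
  await updateProfileInDb(userId, patch);
  await redis.del(`profile:${userId}`); // invalidation lives next to the write
}
```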
the dragonfly part
we use bullmq heavily. not just one off background jobs, but core stuff: moderation flows, notifications, retries, cleanup, message related work, and deploy spikes. when queues lag, users feel it.
which means the redis under it matters a lot. plain redis is single threaded, and at our request rate one core runs out before the machine does. dragonfly is a redis compatible drop in that uses every core. swap it in, bullmq stops being the bottleneck.
multi node redis also means picking between sentinel, cluster mode, or a managed product. dragonfly clusters with a few flags and has simply worked better for us.
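the swap itself is undramatic. bullmq just needs a connection, and dragonfly speaks redis. the hostname below is a placeholder, and the startup flags come from dragonfly's bullmq guidance, so verify them for your version:

```typescript
// bullmq pointed at dragonfly instead of redis -- the client neither knows nor cares.
// dragonfly side, per its bullmq guidance (verify for your version):
//   dragonfly --cluster_mode=emulated --lock_on_hashtags
// queue names get {hashtags} so each queue's keys land on one shard.
import { Queue, Worker } from "bullmq";

const connection = { host: "dragonfly.internal", port: 6379 }; // placeholder host

const notifications = new Queue("{notifications}", { connection });

const worker = new Worker(
  "{notifications}",
  async (job) => {
    console.log("processing", job.name, job.data); // hypothetical handler
  },
  { connection }
);

await notifications.add("mention", { userId: "123" }, { priority: 1 });
```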
load shedding instead of dying
the other thing scale teaches you is that you cannot always block bad traffic cleanly.
ddos and abuse traffic shows up constantly. sometimes it is obvious junk you can drop at the edge. sometimes it is a botnet of zerodayed android tv boxes sitting in someones living room, quietly sending traffic that looks identical to a real user. you cant just blanket block everything that looks suspicious without taking actual people down with it.
so the goal stops being “keep everything alive” and starts being “stay up for the people who matter, even if some traffic gets shed.”
that means real load shedding. rate limits at the edge with turnstile and cloudflare doing the first pass. cheap endpoints get more headroom, expensive ones get protected first. queues with priority so core flows do not get stuck behind a flood of background work. circuit breakers on backends so one slow dependency does not pull everything down with it.
the trick is to fail loudly and cheaply. a 429 with a clear reason is much better than a 30 second hang that ends in a 500. users hate slow way more than they hate “something went wrong”.
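a crude sketch of the shed-and-say-why idea, express style, with thresholds invented for illustration:

```typescript
// illustrative load-shedding middleware (express); limits are made-up numbers.
// cheap reads get headroom, expensive writes get shed first.
import express from "express";

const app = express();
const LIMITS: Record<string, number> = { GET: 1000, POST: 200 }; // invented thresholds
let inFlight = 0;

app.use((req, res, next) => {
  const limit = LIMITS[req.method] ?? 100;
  if (inFlight >= limit) {
    // fail loudly and cheaply: a fast 429 with a reason beats a 30s hang
    res.set("Retry-After", "2");
    return res.status(429).json({ error: "over capacity, retry shortly" });
  }
  inFlight++;
  res.once("close", () => { inFlight--; }); // fires whether the response finishes or aborts
  next();
});
```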
it is not glamorous. it is the difference between the site staying usable during a spike and the whole thing falling on its face.
what actually stuck
zoom out on all of this and the lessons are not really about any one tool.
the cloud was a great tutorial, then bandwidth and single thread perf flipped the math and bare metal started looking obvious. caching is great. queues run more of the system than people give them credit for. ddos and abuse never really stop, you just learn to lose gracefully.
underneath all of it, you want an operating system that does not fight you. that is where talos earned its place. not because it is shiny. because it is boring in the right ways.
infra problems rarely announce themselves as infra problems. they show up as product bugs. a slow page. a delayed job. a websocket reconnect. a deploy that behaves differently on one node. a cache that worked perfectly until it really did not.
there is no finished version of this. there is only the next thing the system teaches you.