> an if let expression over an RWLock assumed (reasonably, but incorrectly) in its else branch that the lock had been released. Instant and virulently contagious deadlock.
I believe this behavior is changing in the 2024 edition: https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
> I believe this behavior is changing
Past tense, the 2024 edition stabilized in (and has been the default edition for `cargo new` since) Rust 1.85.
Yes, I've already performed the upgrade for my projects, but since they hit this bug, I'm guessing they haven't.
They may have upgraded by now; their source links to a thread from a year ago, prior to the 2024 edition, which may be when they encountered that particular bug.
I see now that this incident happened in September 2024 as well.
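For anyone who hasn't hit this class of bug, here is a minimal sketch of the footgun being discussed (a hypothetical cache, not Corrosion's actual code): under Rust 2021, the read guard created in the `if let` scrutinee lives through the `else` branch; under the 2024 edition it is dropped before the `else` branch runs.

    use std::collections::HashMap;
    use std::sync::RwLock;

    // Illustrative only; the map and the "default" value are made up.
    fn lookup_or_insert(cache: &RwLock<HashMap<String, String>>, key: &str) -> String {
        // The temporary read guard from `cache.read()` is created here, in the
        // scrutinee of the `if let`.
        if let Some(v) = cache.read().unwrap().get(key) {
            v.clone()
        } else {
            // Rust 2021: the read guard above is still alive in this branch, so
            // taking the write lock here can block forever -- the "reasonable but
            // incorrect" assumption the article describes.
            // Rust 2024: the guard is dropped before the else branch runs.
            let mut w = cache.write().unwrap();
            w.insert(key.to_string(), "default".to_string());
            "default".to_string()
        }
    }

    fn main() {
        let cache = RwLock::new(HashMap::new());
        // On the 2024 edition this returns; on 2021 it can hang in the else branch.
        println!("{}", lookup_or_insert(&cache, "some-key"));
    }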
> Finally, let’s revisit that global state problem. After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call “regionalization”, which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies.
This tiered approach makes a lot of sense to mitigate the scaling limit per Corrosion node. Can you share how much data you wind up tracking in each tier in practice?
How compact is each entry in the application -> [regions] table? Does the constraint of running this on every node mean that this creates a global limit on the number of applications? It also seems like the region-level database would have a regional limit on the number of Fly Machines too?
> Like an unattended turkey deep frying on the patio, truly global distributed consensus promises deliciousness while yielding only immolation
Their writing is so good, always a fun and enlightening read.
> The bidding model is elegant, but it’s insufficient to route network requests. To allow an HTTP request in Tokyo to find the nearest instance in Sydney, we really do need some kind of global map of every app we host.
So is this a case of wanting to deliver a differentiating feature before the technical maturity is there and validated? It's an acceptable strategy if you are building a lesser product, but if you are selling Public Cloud, maybe having a better strategy than waiting for problems to crop up makes more sense? Consul, missing watchdogs, certificate expiry, CRDTs backfilling nullable columns - sure, in a normal case these are not very unexpected or to-be-ashamed-of problems, but for a product that claims to be Public Cloud you want to think of these things and address them before day 1. Cert expiry, for example - you should be giving your users tools to never have a cert expire, not fixing it for your stuff after the fact! (Most CAs offer APIs to automate all this - no excuse for it.)
I don't mean to be dismissive or disrespectful - the problem is challenging and the work is great - I'm merely thinking of loss of customer trust. People are never going to trust a newcomer that has issues like this, and for that reason "move fast, break things, and fix them when you find them" isn't a good fit for this kind of product.
It's not a "differentiating feature"; it eliminated a scaling bottleneck. It's also a decision that long predates Corrosion.
I was referring to the "HTTP request in Tokyo to find the nearest instance in Sydney" part, which felt to me like a differentiating feature - no other cloud provider seems to have bidding or HTTP-request-level cross-regional lookup or whatever.
The "decision that long predates Corrosion" is precisely the point I was trying to make - was it made too soon before understanding the ramifications and/or having a validated technical solution ready? IOW maybe the feature requiring the problem solution could have come later? (I don't know much about fly.io and its features, so apologies if some of this is unclear/wrongly assumes things.)
That's literally the premise of the service and always has been.
fwiw, I'm happily running a company and some contract work on fly, literally as "AWS, but what if it weren't the most massively complex pile of shit you've ever seen."
I have a couple reasonably sized, understandable toml files and another 100 lines of ruby that runs long-running rake tasks as individual fly machines. The whole thing works really nicely.
blog posts should have a date at the top
YES. THIS. ALWAYS!
Huge pet peeve. At least this one has a date somewhere (at the bottom, "last updated Oct 22, 2025").
> New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table
Is this a typo? Why does it backfill values for a nullable column?
It seems to be a quirk of cr-sqlite: it wants to keep track of clock values for the new column. It's not backfilling the field values, as far as I understand. There is a comment mentioning it could be optimized away:
https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...
I assume it would backfill values for any column, as a side-effect of propagating values for any column. But nullable columns are the only kind you can add to a table that already contains rows, and adding one means that every row immediately has an update that needs to be sent.
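If it helps to see why that is expensive at scale, here is a toy model of the clock-tracking interpretation above (this is not cr-sqlite's actual schema, just the shape of the bookkeeping): per-(row, column) versions mean one schema change on a big table manufactures one new clock entry per existing row, all of which then have to propagate.

    use std::collections::HashMap;

    // Toy model only: column-level CRDTs track a version per (row, column) cell
    // so concurrent writes can merge per column.
    #[derive(Default)]
    struct CellClocks {
        clocks: HashMap<(u64, String), u64>, // (row_id, column) -> db_version
    }

    impl CellClocks {
        // Adding a column to a table with N existing rows creates N clock
        // entries, each of which is now a change peers need to hear about.
        fn add_column(&mut self, rows: &[u64], column: &str, db_version: u64) -> usize {
            for &row in rows {
                self.clocks.insert((row, column.to_string()), db_version);
            }
            rows.len()
        }
    }

    fn main() {
        let rows: Vec<u64> = (0..1_000_000).collect();
        let mut clocks = CellClocks::default();
        let to_sync = clocks.add_column(&rows, "new_nullable_col", 42);
        println!("{to_sync} cell clocks to propagate for one ALTER TABLE");
    }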
always wondered at what scale gossip / SWIM breaks down and you need a hierarchy / partitioning. fly's use of corrosion seems to imply it's good enough for a single region, which is pretty surprising because IIRC Uber's Ringpop was said to face problems at around 3K nodes.
it would be super cool to learn more about how the world's largest gossip systems work :)
Back-of-napkin math I've done previously suggests it breaks down around 2 million members with Hashicorp's defaults. The defaults are quite aggressive though, and if you can tolerate seconds of latency (called out in the article) you could reach billions without a lot of trouble.
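To make the latency intuition concrete, here is a toy epidemic-broadcast model (the fanout and gossip interval below are illustrative assumptions, not Hashicorp's exact defaults): reaching N members takes roughly log_fanout(N) rounds, so going from millions to billions only adds a handful of rounds.

    // Back-of-napkin only: time for a rumor to reach roughly all N members.
    fn dissemination_secs(members: f64, fanout: f64, interval_secs: f64) -> f64 {
        (members.ln() / fanout.ln()) * interval_secs
    }

    fn main() {
        // ~2M members, fanout 3, 200ms gossip interval: a couple of seconds.
        println!("{:.1}s", dissemination_secs(2.0e6, 3.0, 0.2));
        // ~1B members: only about 6 more rounds, still single-digit seconds.
        println!("{:.1}s", dissemination_secs(1.0e9, 3.0, 0.2));
    }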
It's also about the frequency of changes and the granularity of state when sizing workloads. My understanding is that most Hashi shops would federate workloads of our size/global distribution; it would be weird to try to run one big cluster to capture everything.
SWIM is probably going to scale pretty much indefinitely. The issue we have with a single global SWIM broadcast domain isn't that the scale is breaking down; it's just that the blast radius for bugs (both in Corrosion itself, and in the services that depend on Corrosion) is too big.
We're actually keeping the global Corrosion cluster! We're just stripping most of the data out of it.
Someone needs to read about ant colony optimization. https://en.wikipedia.org/wiki/Ant_colony_optimization_algori...
This blog is not impressive for an infra company.
I respect Fly, and it does sound like a nice place to work, but honestly, you're onto something. You would expect an ostensibly Public Cloud provider to have a more solid grasp of networking. Instead, we're discovering how they're learning about things like OSPF!
Makes you think that's all.
What a weird thing to say. I wrote my first OSPF implementation in 1999. The point is that we noticed the solution we'd settled on owes more to protocols like OSPF than to distributed consensus databases, which are the mainstream solution to this problem. It's not "OMG we just discovered this neat protocol called OSPF". We don't actually run OSPF. We don't even do a graph->tree reduction. We're routing HTTP requests, not packets.
Look at one of the other comments:
> in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution"
This is what people saw as the key takeaway. If that takeaway is news to you then I don’t know what you are doing writing distributed systems.
While this message may not be what was intended, it was what was broadcast.
It seems weird to take an inaccurate paraphrase from a commenter and then use it to paint the authors with your desired brush.
Not sure the replies to that comment help the cause at all.
For the TL;DR folks: https://github.com/superfly/corrosion
Anybody used rqlite[1] in production? I'm exploring how to make my application fault-tolerant using multiple app vm instances. The problem of course is the SQLite database on disk. Using a network file system like NFS is a no-go with SQLite (this includes Amazon Elastic File System (EFS)).
I was thinking I'll just have to bite the bullet and migrate to PostgreSQL, but perhaps rqlite can work.
[1] https://rqlite.io
rqlite creator here. Right there on the rqlite homepage[1] two production users are listed: replicated.com[2] and textgroove.com.
[1] https://rqlite.io/
[2] https://www.replicated.com/blog/app-manager-with-rqlite
in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.
I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
> Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
But they have to. Physically, no solution will be instantaneous, because that's not how the speed of light nor relativity works - even two events right next to each other cannot find out about each other instantaneously. So then the question is "how long can I wait for this information?" And that's the part that I feel isn't answered - e.g. if the app dies, the TCP connections die, and in theory that information travels as quickly as anything else you send. It's not reliably detectable, but conceivably you could have an eBPF program monitoring death and notifying the proxies. That's the part that's really not explained in the article: why you need to maintain an eventually consistent view of the connectivity. I get maybe why that could be useful, but noticing app connectivity death seems wrong, considering I believe you're more tracking machine and cluster health, right? I.e. not noticing an app instance goes down, but noticing all app instances on a given machine are gone, and consensus deciding globally where the new app instance will be as quickly as possible?
A request routed to a dead instance doesn't fall into a black hole: our proxies reroute it. But that's very slow; to deliver acceptable service quality you need to minimize the number of times that happens. So you can't accept a solution that leaves large windows of time within which every instance that has gone down has a stale entry. Remember: instances coming up and down happens all the time on this platform! It's part of the point.
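To put a rough number on "very slow": each stale entry a proxy tries can cost up to a connect timeout before it falls through to the next candidate. A minimal sketch of that fallback shape (not Fly's proxy code; the addresses and timeout are made up):

    use std::net::{SocketAddr, TcpStream};
    use std::time::Duration;

    // Each dead instance in the candidate list can add up to a full connect
    // timeout before a healthy one is reached, so stale state means slow requests.
    fn connect_with_fallback(candidates: &[&str], timeout: Duration) -> Option<TcpStream> {
        for addr in candidates {
            let Ok(parsed) = addr.parse::<SocketAddr>() else { continue };
            if let Ok(stream) = TcpStream::connect_timeout(&parsed, timeout) {
                return Some(stream); // first healthy instance wins
            }
            // connection failed: pay the timeout, try the next candidate
        }
        None
    }

    fn main() {
        // Both addresses are made up; the first (TEST-NET, unroutable) eats the timeout.
        let candidates = ["192.0.2.1:80", "127.0.0.1:8080"];
        let stream = connect_with_fallback(&candidates, Duration::from_millis(500));
        println!("connected: {}", stream.is_some());
    }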
> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
Did you ever consider envoy xDS?
There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…
Nope. Talk a little about how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?
What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.
I haven't personally worked on envoy xDS, but it is what I have seen several BigCos use for routing from the edge to internal applications.
> Running consensus transcontinentally is very painful
You don't necessarily have to do that; you can keep your quorum nodes (let's assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency isn't unbearably high between the replicas.
I have seen the following scheme work for millions of workloads:
1. Etcd quorum across 3 close but independent regions
2. On startup, the app registers itself under a prefix that all other replicas of that app also register under
3. All clients of that app issue etcd watches on that prefix and are notified almost instantly when there is a change. This is baked in as a plugin within gRPC clients.
4. A custom gRPC resolver is used to do lookups by service name
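For concreteness, steps 2-3 might look roughly like this with the community etcd-client crate (the key layout, TTL, service name, and addresses are made-up, and the exact method names are from memory; a real registration would also keep the lease alive with lease_keep_alive):

    use etcd_client::{Client, EventType, PutOptions, WatchOptions};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut client = Client::connect(["127.0.0.1:2379"], None).await?;

        // 2. Register this replica under the service's prefix, tied to a lease
        //    so the key vanishes if we stop renewing it.
        let lease = client.lease_grant(10, None).await?;
        client
            .put(
                "/services/my-app/replica-1",
                "10.0.0.7:8080",
                Some(PutOptions::new().with_lease(lease.id())),
            )
            .await?;

        // 3. Clients watch the same prefix and hear about adds/removals quickly.
        let (_watcher, mut stream) = client
            .watch("/services/my-app/", Some(WatchOptions::new().with_prefix()))
            .await?;
        while let Some(resp) = stream.message().await? {
            for event in resp.events() {
                match event.event_type() {
                    EventType::Put => println!("endpoint registered/updated"),
                    EventType::Delete => println!("endpoint gone"),
                }
            }
        }
        Ok(())
    }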
I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.
Two other details that are super important here:
This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.
The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.
> I'm thrilled to have people digging into this, because I think it's a super interesting problem
Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!
Taking a step back, and thinking about the AWS outage last week which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery and it seems like we need a better solution which is either push based or gossip based to more quickly disseminate service discovery updates.
> you still need reliable realtime update of instances going down
The way I have seen this implemented is through a cluster of service watchers that ping all services once every X seconds and deregister a service when the pings fail.
Additionally, you can use gRPC with keepalives, which will detect on the client side when a service goes down and automatically remove it from the subset. gRPC also has client-side outlier detection, so the clients can automatically remove slow servers from the subset as well. This only works for gRPC though, so it's not generally useful if you are creating a cloud for HTTP servers…
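A minimal sketch of that watcher loop (the threshold, the TCP probe, and the deregistration hook are all illustrative placeholders, not any particular system's API):

    use std::collections::HashMap;
    use std::net::{SocketAddr, TcpStream};
    use std::time::Duration;

    const MAX_FAILURES: u32 = 3; // deregister after this many consecutive misses

    fn probe(addr: &str) -> bool {
        let Ok(parsed) = addr.parse::<SocketAddr>() else { return false };
        TcpStream::connect_timeout(&parsed, Duration::from_secs(1)).is_ok()
    }

    // One sweep over the registered endpoints; returns the ones to deregister
    // (e.g. by deleting their keys in the registry).
    fn sweep(endpoints: &[String], failures: &mut HashMap<String, u32>) -> Vec<String> {
        let mut to_deregister = Vec::new();
        for addr in endpoints {
            if probe(addr) {
                failures.remove(addr);
            } else {
                let n = failures.entry(addr.clone()).or_insert(0);
                *n += 1;
                if *n >= MAX_FAILURES {
                    to_deregister.push(addr.clone());
                }
            }
        }
        to_deregister
    }

    fn main() {
        // Made-up endpoints; the second (TEST-NET) will fail its probes.
        let endpoints = vec!["127.0.0.1:8080".to_string(), "192.0.2.1:80".to_string()];
        let mut failures = HashMap::new();
        for _ in 0..MAX_FAILURES {
            for addr in sweep(&endpoints, &mut failures) {
                println!("deregistering {addr}");
            }
        }
    }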
Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.
The solutions across different BigCorp Clouds vary depending on the SLA from their underlying network. Doing this on top of the public internet is very different than on redundant subsea fiber with dedicated BigCorp bandwidth!
Is it actually necessary to run transcontinental consensus? Apps in a given location are not movable, so it would seem that for a given app it's known which part of the network writes can come from. That would require partitioning the namespace, but given that apps are not movable, does that matter? It feels like there are other areas, like docs and tooling, that would benefit from relatively higher prioritization.
Apps in a given location are extremely movable! That's the point of the service!
We unfortunately lost our location with not a whole lot of notice, and the migration to a new one was not seamless, on top of things like the GitHub Actions being out of date (only supporting the deprecated Postgres service, not the new one).
I guess all the designers at fly were replaced by AI, because this article is using a gray bold font for the whole text. I remember these guys had a good blog some time ago.
The design hasn't changed in years. If someone has a screenshot and a browser version we can try to figure out why it's coming out fucky for you.
Looking at the css, there's a .text-gray-600 CSS style that would cause this, and it's overridden by some other style in order to achieve the actual desired appearance. Maybe the override style isn't loading - perhaps the GP has javascript disabled?
javascript is enabled but I don't see the problem on another phone, so yeah seems related
Thanks! Relayed.
Not sure if that was changed since then, but it's not bold for me and also readable. Maybe browser rendering?
Also not bold for me (Safari). Variable font rendering issue?
stock safari on ios 26 for me. is it another of 37366153 regressions of ios 26?
Looks normal to me on iOS 26.0.1
stock safari on ios
and I think the intended webfont is loaded, because the font is clearly weird-ish and non-standard, and the text is invisible for a good 2 seconds at first while it loads :)
Please try the article mode in your web browser. Firefox has a pretty good one but I understand all major browsers have this now.
I only use article mode in exceptional cases. I hold fly to a higher standard than that.
D'awwwwww.
latest macOS Firefox and Safari both show grey on white: legible, but contrast somewhat lacking, though otherwise rendered properly.
It's totally unreadable.
Looks like it always has, to me.
What's this obsession with SQLite? For all intents and purposes, what they've accomplished is effectively a Type 2 table with extra steps. CRDTs are totally overkill in this situation. You can implement this in Postgres easily, with very few changes to your access patterns... DISTINCT ON. Maybe this kind of "solution" is impressive for Rust programmers, I'm not sure what the deal is exactly, but all it tells me is Fly ought to hire actual networking professionals, maybe even compute-in-network guys with FPGA experience like everyone else, and develop their own routers that way—if only to learn more about networking.
What part of this problem do you think FPGAs would help with?
In what sense do you think we need specialty routers?
How would you deploy Postgres to address these problems?
So you know how, when you're making the first steps in whatever discipline, as per hacking ethos, you start experimenting, fiddling with things? Networking is a lot like that. Businesses in the cloud may often get away with staying largely ignorant of networking. However, as soon as you become a big operator and hit the scale, all the gaps in that knowledge show up real quick. If you care about routing, it helps to learn about routers, and how to make one! That is to say, modern networking at scale benefits greatly from hardware design competencies. (FWIW, design platforms such as AMD Alveo are an easy way to get started with high-bandwidth RDMA datapaths and everything.)
Re: Postgres, I thought it was fitting to describe the SQLite-CRDT application in simple terms.
(I used to work at fly on networking)
Fly has a lot of interesting networking issues, but I don't know that, like, the actual routing of packets is the big one? And even in the places where there are bottlenecks in the overlay mesh, I'm not sure that custom FPGAs are going to be the solution for now.
But also this blog post isn't about routing packets, it's about state tracking so we know _where_ to even send our packets in the first place.
I'm not sure you understand our problem space.