> an if let expression over an RWLock assumed (reasonably, but incorrectly) in its else branch that the lock had been released. Instant and virulently contagious deadlock.
I believe this behavior is changing in the 2024 edition: https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
> I believe this behavior is changing
Past tense, the 2024 edition stabilized in (and has been the default edition for `cargo new` since) Rust 1.85.
Yes, I've already performed the upgrade for my projects, but since they hit this bug, I'm guessing they haven't.
They may have upgraded by now; their source links to a thread from a year ago, prior to the 2024 edition, which may be when they encountered that particular bug.
I see now that this incident happened in September 2024 as well.
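For anyone who hasn't hit this class of bug, here is a minimal sketch of the footgun being discussed (a hypothetical cache, not Corrosion's actual code): under Rust 2021, the read guard created in the `if let` scrutinee lives through the `else` branch; under the 2024 edition it is dropped before the `else` branch runs.

    use std::collections::HashMap;
    use std::sync::RwLock;

    // Illustrative only; the map and the "default" value are made up.
    fn lookup_or_insert(cache: &RwLock<HashMap<String, String>>, key: &str) -> String {
        // The temporary read guard from `cache.read()` is created here, in the
        // scrutinee of the `if let`.
        if let Some(v) = cache.read().unwrap().get(key) {
            v.clone()
        } else {
            // Rust 2021: the read guard above is still alive in this branch, so
            // taking the write lock here can block forever -- the "reasonable but
            // incorrect" assumption the article describes.
            // Rust 2024: the guard is dropped before the else branch runs.
            let mut w = cache.write().unwrap();
            w.insert(key.to_string(), "default".to_string());
            "default".to_string()
        }
    }

    fn main() {
        let cache = RwLock::new(HashMap::new());
        // On the 2024 edition this returns; on 2021 it can hang in the else branch.
        println!("{}", lookup_or_insert(&cache, "some-key"));
    }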
> Finally, let’s revisit that global state problem. After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call “regionalization”, which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies.
This tiered approach makes a lot of sense to mitigate the scaling limit per Corrosion node. Can you share how much data you wind up tracking in each tier in practice?
How compact is each entry in the application -> [regions] table? Does the constraint of running this on every node mean that this creates a global limit on the number of applications? It also seems like the region-level database would have a regional limit on the number of Fly Machines too?
> Like an unattended turkey deep frying on the patio, truly global distributed consensus promises deliciousness while yielding only immolation
Their writing is so good, always a fun and enlightening read.
> The bidding model is elegant, but it’s insufficient to route network requests. To allow an HTTP request in Tokyo to find the nearest instance in Sydney, we really do need some kind of global map of every app we host.
So is this a case of wanting to deliver a differentiating feature before the technical maturity is there and validated? It's an acceptable strategy if you are building a lesser product, but if you are selling Public Cloud, maybe having a better strategy than waiting for problems to crop up makes more sense? Consul, missing watchdogs, certificate expiry, CRDTs backfilling nullable columns - sure, in a normal case these are not very unexpected or to-be-ashamed-of problems, but for a product that claims to be Public Cloud you want to think of these things and address them before day 1. Cert expiry, for example - you should be giving your users tools to never have a cert expire, not fixing it for your stuff after the fact! (Most CAs offer APIs to automate all this - no excuse for it.)
I don't mean to be dismissive or disrespectful - the problem is challenging and the work is great - I'm merely thinking of loss of customer trust. People are never going to trust a newcomer that has issues like this, and for that reason "move fast, break things, and fix them when you find them" isn't a good fit for this kind of product.
It's not a "differentiating feature"; it eliminated a scaling bottleneck. It's also a decision that long predates Corrosion.
I was referring to the "HTTP request in Tokyo to find the nearest instance in Sydney" part, which felt to me like a differentiating feature - no other cloud provider seems to have bidding or HTTP-request-level cross-regional lookup or whatever.
The "decision that long predates Corrosion" is precisely the point I was trying to make - was it made too soon before understanding the ramifications and/or having a validated technical solution ready? IOW maybe the feature requiring the problem solution could have come later? (I don't know much about fly.io and its features, so apologies if some of this is unclear/wrongly assumes things.)
That's literally the premise of the service and always has been.
fwiw, I'm happily running a company and some contract work on fly, literally as "AWS, but what if it weren't the most massively complex pile of shit you've ever seen."
I have a couple reasonably sized, understandable toml files and another 100 lines of ruby that runs long-running rake tasks as individual fly machines. The whole thing works really nicely.
blog posts should have a date at the top
YES. THIS. ALWAYS!
Huge pet peeve. At least this one has a date somewhere (at the bottom, "last updated Oct 22, 2025").
> New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table
Is this a typo? Why does it backfill values for a nullable column?
It seems to be a quirk of cr-sqlite: it wants to keep track of clock values for the new column. It's not backfilling the field values, as far as I understand. There is a comment mentioning it could be optimized away:
https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...
I assume it would backfill values for any column, as a side-effect of propagating values for any column. But nullable columns are the only kind you can add to a table that already contains rows, and adding one means that every row immediately has an update that needs to be sent.
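If it helps to see why that is expensive at scale, here is a toy model of the clock-tracking interpretation above (this is not cr-sqlite's actual schema, just the shape of the bookkeeping): per-(row, column) versions mean one schema change on a big table manufactures one new clock entry per existing row, all of which then have to propagate.

    use std::collections::HashMap;

    // Toy model only: column-level CRDTs track a version per (row, column) cell
    // so concurrent writes can merge per column.
    #[derive(Default)]
    struct CellClocks {
        clocks: HashMap<(u64, String), u64>, // (row_id, column) -> db_version
    }

    impl CellClocks {
        // Adding a column to a table with N existing rows creates N clock
        // entries, each of which is now a change peers need to hear about.
        fn add_column(&mut self, rows: &[u64], column: &str, db_version: u64) -> usize {
            for &row in rows {
                self.clocks.insert((row, column.to_string()), db_version);
            }
            rows.len()
        }
    }

    fn main() {
        let rows: Vec<u64> = (0..1_000_000).collect();
        let mut clocks = CellClocks::default();
        let to_sync = clocks.add_column(&rows, "new_nullable_col", 42);
        println!("{to_sync} cell clocks to propagate for one ALTER TABLE");
    }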
always wondered at what scale gossip / SWIM breaks down and you need a hierarchy / partitioning. fly's use of corrosion seems to imply it's good enough for a single region, which is pretty surprising because IIRC Uber's Ringpop was said to face problems at around 3K nodes.
it would be super cool to learn more about how the world's largest gossip systems work :)
Back-of-napkin math I've done previously suggests it breaks down around 2 million members with Hashicorp's defaults. The defaults are quite aggressive though, and if you can tolerate seconds of latency (called out in the article) you could reach billions without a lot of trouble.
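To make the latency intuition concrete, here is a toy epidemic-broadcast model (the fanout and gossip interval below are illustrative assumptions, not Hashicorp's exact defaults): reaching N members takes roughly log_fanout(N) rounds, so going from millions to billions only adds a handful of rounds.

    // Back-of-napkin only: time for a rumor to reach roughly all N members.
    fn dissemination_secs(members: f64, fanout: f64, interval_secs: f64) -> f64 {
        (members.ln() / fanout.ln()) * interval_secs
    }

    fn main() {
        // ~2M members, fanout 3, 200ms gossip interval: a couple of seconds.
        println!("{:.1}s", dissemination_secs(2.0e6, 3.0, 0.2));
        // ~1B members: only about 6 more rounds, still single-digit seconds.
        println!("{:.1}s", dissemination_secs(1.0e9, 3.0, 0.2));
    }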
It's also about the frequency of changes and the granularity of state when sizing workloads. My understanding is that most Hashi shops would federate workloads of our size/global distribution; it would be weird to try to run one big cluster to capture everything.
SWIM is probably going to scale pretty much indefinitely. The issue we have with a single global SWIM broadcast domain isn't that the scale is breaking down; it's just that the blast radius for bugs (both in Corrosion itself, and in the services that depend on Corrosion) is too big.
We're actually keeping the global Corrosion cluster! We're just stripping most of the data out of it.
Someone needs to read about ant colony optimization. https://en.wikipedia.org/wiki/Ant_colony_optimization_algori...
This blog is not impressive for an infra company.
I respect Fly, and it does sound like a nice place to work, but honestly, you're onto something. You would expect an ostensibly Public Cloud provider to have a more solid grasp of networking. Instead, we're discovering how they're learning about things like OSPF!
Makes you think that's all.
What a weird thing to say. I wrote my first OSPF implementation in 1999. The point is that we noticed the solution we'd settled on owes more to protocols like OSPF than to distributed consensus databases, which are the mainstream solution to this problem. It's not "OMG we just discovered this neat protocol called OSPF". We don't actually run OSPF. We don't even do a graph->tree reduction. We're routing HTTP requests, not packets.
Look at one of the other comments:
> in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution"
This is what people saw as the key takeaway. If that takeaway is news to you then I don’t know what you are doing writing distributed systems.
While this message may not be what was intended, it was what was broadcast.
It seems weird to take an inaccurate paraphrase from a commenter and then use it to paint the authors with your desired brush.
Not sure the replies to that comment help the cause at all.
For the TL;DR folks: https://github.com/superfly/corrosion
Anybody used rqlite[1] in production? I'm exploring how to make my application fault-tolerant using multiple app vm instances. The problem of course is the SQLite database on disk. Using a network file system like NFS is a no-go with SQLite (this includes Amazon Elastic File System (EFS)).
I was thinking I'll just have to bite the bullet and migrate to PostgreSQL, but perhaps rqlite can work.
[1] https://rqlite.io
rqlite creator here. Right there on the rqlite homepage[1] two production users are listed: replicated.com[2] and textgroove.com.
[1] https://rqlite.io/
[2] https://www.replicated.com/blog/app-manager-with-rqlite
in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.
I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
> Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
But they have to. Physically, no solution will be instantaneous, because that's not how the speed of light nor relativity works - even two events right next to each other cannot find out about each other instantaneously. So then the question is "how long can I wait for this information?" And that's the part that I feel isn't answered - e.g. if the app dies, the TCP connections die, and in theory that information travels as quickly as anything else you send. It's not reliably detectable, but conceivably you could have an eBPF program monitoring death and notifying the proxies. That's the part that's really not explained in the article: why you need to maintain an eventually consistent view of the connectivity. I get maybe why that could be useful, but noticing app connectivity death seems wrong, considering I believe you're more tracking machine and cluster health, right? I.e. not noticing an app instance goes down, but noticing all app instances on a given machine are gone, and consensus deciding globally where the new app instance will be as quickly as possible?
A request routed to a dead instance doesn't fall into a black hole: our proxies reroute it. But that's very slow; to deliver acceptable service quality you need to minimize the number of times that happens. So you can't accept a solution that leaves large windows of time within which every instance that has gone down has a stale entry. Remember: instances coming up and down happens all the time on this platform! It's part of the point.
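To put a rough number on "very slow": each stale entry a proxy tries can cost up to a connect timeout before it falls through to the next candidate. A minimal sketch of that fallback shape (not Fly's proxy code; the addresses and timeout are made up):

    use std::net::{SocketAddr, TcpStream};
    use std::time::Duration;

    // Each dead instance in the candidate list can add up to a full connect
    // timeout before a healthy one is reached, so stale state means slow requests.
    fn connect_with_fallback(candidates: &[&str], timeout: Duration) -> Option<TcpStream> {
        for addr in candidates {
            let Ok(parsed) = addr.parse::<SocketAddr>() else { continue };
            if let Ok(stream) = TcpStream::connect_timeout(&parsed, timeout) {
                return Some(stream); // first healthy instance wins
            }
            // connection failed: pay the timeout, try the next candidate
        }
        None
    }

    fn main() {
        // Both addresses are made up; the first (TEST-NET, unroutable) eats the timeout.
        let candidates = ["192.0.2.1:80", "127.0.0.1:8080"];
        let stream = connect_with_fallback(&candidates, Duration::from_millis(500));
        println!("connected: {}", stream.is_some());
    }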
> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
Did you ever consider envoy xDS?
There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…
Nope. Talk a little about how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?
What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.
I haven't personally worked on envoy xDS, but it is what I have seen several BigCos use for routing from the edge to internal applications.
> Running consensus transcontinentally is very painful
You don't necessarily have to do that; you can keep your quorum nodes (let's assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency isn't unbearably high between the replicas.
I have seen the following scheme work for millions of workloads:
1. Etcd quorum across 3 close but independent regions
2. On startup, the app registers itself under a prefix that all other replicas of that app also register under
3. All clients of that app issue etcd watches on that prefix and are notified almost instantly when there is a change. This is baked in as a plugin within gRPC clients.
4. A custom gRPC resolver is used to do lookups by service name
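For concreteness, steps 2-3 might look roughly like this with the community etcd-client crate (the key layout, TTL, service name, and addresses are made-up, and the exact method names are from memory; a real registration would also keep the lease alive with lease_keep_alive):

    use etcd_client::{Client, EventType, PutOptions, WatchOptions};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut client = Client::connect(["127.0.0.1:2379"], None).await?;

        // 2. Register this replica under the service's prefix, tied to a lease
        //    so the key vanishes if we stop renewing it.
        let lease = client.lease_grant(10, None).await?;
        client
            .put(
                "/services/my-app/replica-1",
                "10.0.0.7:8080",
                Some(PutOptions::new().with_lease(lease.id())),
            )
            .await?;

        // 3. Clients watch the same prefix and hear about adds/removals quickly.
        let (_watcher, mut stream) = client
            .watch("/services/my-app/", Some(WatchOptions::new().with_prefix()))
            .await?;
        while let Some(resp) = stream.message().await? {
            for event in resp.events() {
                match event.event_type() {
                    EventType::Put => println!("endpoint registered/updated"),
                    EventType::Delete => println!("endpoint gone"),
                }
            }
        }
        Ok(())
    }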
I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.
Two other details that are super important here:
This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.
The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.
> I'm thrilled to have people digging into this, because I think it's a super interesting problem
Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!
Taking a step back, and thinking about the AWS outage last week which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery and it seems like we need a better solution which is either push based or gossip based to more quickly disseminate service discovery updates.
> you still need reliable realtime update of instances going down
The way I have seen this implemented is through a cluster of service watchers that ping all services once every X seconds and deregister a service when the pings fail.
Additionally, you can use gRPC with keepalives, which will detect on the client side when a service goes down and automatically remove it from the subset. gRPC also has client-side outlier detection, so the clients can automatically remove slow servers from the subset as well. This only works for gRPC though, so it's not generally useful if you are creating a cloud for HTTP servers…
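A minimal sketch of that watcher loop (the threshold, the TCP probe, and the deregistration hook are all illustrative placeholders, not any particular system's API):

    use std::collections::HashMap;
    use std::net::{SocketAddr, TcpStream};
    use std::time::Duration;

    const MAX_FAILURES: u32 = 3; // deregister after this many consecutive misses

    fn probe(addr: &str) -> bool {
        let Ok(parsed) = addr.parse::<SocketAddr>() else { return false };
        TcpStream::connect_timeout(&parsed, Duration::from_secs(1)).is_ok()
    }

    // One sweep over the registered endpoints; returns the ones to deregister
    // (e.g. by deleting their keys in the registry).
    fn sweep(endpoints: &[String], failures: &mut HashMap<String, u32>) -> Vec<String> {
        let mut to_deregister = Vec::new();
        for addr in endpoints {
            if probe(addr) {
                failures.remove(addr);
            } else {
                let n = failures.entry(addr.clone()).or_insert(0);
                *n += 1;
                if *n >= MAX_FAILURES {
                    to_deregister.push(addr.clone());
                }
            }
        }
        to_deregister
    }

    fn main() {
        // Made-up endpoints; the second (TEST-NET) will fail its probes.
        let endpoints = vec!["127.0.0.1:8080".to_string(), "192.0.2.1:80".to_string()];
        let mut failures = HashMap::new();
        for _ in 0..MAX_FAILURES {
            for addr in sweep(&endpoints, &mut failures) {
                println!("deregistering {addr}");
            }
        }
    }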
Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.
The solutions across different BigCorp Clouds vary depending on the SLA from their underlying network. Doing this on top of the public internet is very different than on redundant subsea fiber with dedicated BigCorp bandwidth!
Is it actually necessary to run transcontinental consensus? Apps in a given location are not movable, so it would seem that for a given app it's known which part of the network writes can come from. That would require partitioning the namespace, but given that apps are not movable, does that matter? It feels like there are other areas, like docs and tooling, that would benefit from relatively higher prioritization.
Apps in a given location are extremely movable! That's the point of the service!
We unfortunately lost our location with not a whole lot of notice, and the migration to a new one was not seamless, on top of things like the GitHub Actions being out of date (only supporting the deprecated Postgres service, not the new one).
I guess all the designers at fly were replaced by AI, because this article is using a gray bold font for the whole text. I remember these guys had a good blog some time ago.
The design hasn't changed in years. If someone has a screenshot and a browser version we can try to figure out why it's coming out fucky for you.
Looking at the css, there's a .text-gray-600 CSS style that would cause this, and it's overridden by some other style in order to achieve the actual desired appearance. Maybe the override style isn't loading - perhaps the GP has javascript disabled?
javascript is enabled but I don't see the problem on another phone, so yeah seems related
Thanks! Relayed.
Not sure if that was changed since then, but it's not bold for me and also readable. Maybe browser rendering?
Also not bold for me (Safari). Variable font rendering issue?
stock safari on ios 26 for me. is it another of 37366153 regressions of ios 26?
Looks normal to me on iOS 26.0.1
stock safari on ios
and I think the intended webfont is loaded, because the font is clearly weird-ish and non-standard, and the text is invisible for a good 2 seconds at first while it loads :)
Please try the article mode in your web browser. Firefox has a pretty good one but I understand all major browsers have this now.
I only use article mode in exceptional cases. I hold fly to a higher standard than that.
D'awwwwww.
latest macOS Firefox and Safari both show grey on white: legible, but contrast somewhat lacking, though otherwise rendered properly.
It's totally unreadable.
Looks like it always has, to me.
What's this obsession with SQLite? For all intents and purposes, what they've accomplished is effectively a Type 2 table with extra steps. CRDTs are totally overkill in this situation. You can implement this in Postgres easily, with very few changes to your access patterns... DISTINCT ON. Maybe this kind of "solution" is impressive for Rust programmers, I'm not sure what the deal is exactly, but all it tells me is Fly ought to hire actual networking professionals, maybe even compute-in-network guys with FPGA experience like everyone else, and develop their own routers that way—if only to learn more about networking.
What part of this problem do you think FPGAs would help with?
In what sense do you think we need specialty routers?
How would you deploy Postgres to address these problems?
So you know how, when you're making the first steps in whatever discipline, as per hacking ethos, you start experimenting, fiddling with things? Networking is a lot like that. Businesses in the cloud may often get away with staying largely ignorant of networking. However, as soon as you become a big operator and hit the scale, all the gaps in that knowledge show up real quick. If you care about routing, it helps to learn about routers, and how to make one! That is to say, modern networking at scale benefits greatly from hardware design competencies. (FWIW, design platforms such as AMD Alveo are an easy way to get started with high-bandwidth RDMA datapaths and everything.)
Re: Postgres, I thought it was fitting to describe the SQLite-CRDT application in simple terms.
(I used to work at fly on networking)
Fly has a lot of interesting networking issues, but I don't know that, like, the actual routing of packets is the big one? And even in the places where there are bottlenecks in the overlay mesh, I'm not sure that custom FPGAs are going to be the solution for now.
But also this blog post isn't about routing packets, it's about state tracking so we know _where_ to even send our packets in the first place.
I'm not sure you understand our problem space.