Recently moved some of our background jobs from Graphile Worker to DBOS. Really recommend it for the simplicity. Took me half an hour.
I evaluated Temporal, Trigger, Cloudflare Workflows (highly not recommended), etc., and this was the easiest to adopt incrementally. Didn't need to change our infrastructure at all; I just plugged in the worker where I had Graphile Worker.
The hosted service's UX and frontend could use a lot of work, but it isn't necessary in order to use DBOS. OTEL support was there.
Why would you not recommend Cloudflare Workflows? I was thinking of using them in my current project.
They inherit all the limitations of Durable Objects (DOs). For example, if you want to do anything that requires more than 6 TCP connections, every fetch request will start failing silently because there are no more TCP connections available. This was a deal breaker for us. Their solution was to split our code into more workflows or DOs.
You are limited to 128 MB of RAM, which means everything has to be streamed. You will rewrite your code around this, because many Node libraries don't have streaming alternatives for operations that spike memory usage.
The observability tab was buggy, and the lifecycle chart is hard to read when you want to know when things will be evicted. There are lots of small hidden limitations. Rate limits are very low for any mid-scale application. Full Node compatibility is not there yet (it's a work in progress), so we needed to change some modules.
Overall, a gigantic waste of time unless you are doing something small scale. Just go with Restate/Upstash + Lambdas/Cloud Run if you want a simpler experience that scales in a serverless manner.
Agree on the UI - I wish it was improved
We heard you! Working on improvements based on user feedback. Stay tuned :)
What was the reason for the transition?
Needed checkpoints in some of our jobs that wrap around the AI agent, so we can reduce cost and increase reliability (the workflow resumes from a mid-workflow step instead of doing a complete restart).
We had already checkpointed the agent, but then figured it's better to have a generic abstraction for the other stuff we do.
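The resume-from-mid-step behavior described above can be sketched as a generic checkpointing abstraction: persist each step's result keyed by workflow ID and step index, and skip any step that already has a saved result. This is a toy illustration of the idea, not DBOS's actual implementation; the class name and schema here are made up, and SQLite stands in for Postgres to keep it self-contained:

```python
import json
import sqlite3

class StepCheckpointer:
    """Toy durable-step runner: each step's result is persisted keyed by
    (workflow_id, step_index). On a re-run after a crash, completed steps
    return their saved results instead of executing again."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step_index INTEGER, result TEXT, "
            "PRIMARY KEY (workflow_id, step_index))"
        )

    def run_step(self, workflow_id, step_index, fn, *args):
        row = self.conn.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step_index = ?",
            (workflow_id, step_index),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # already checkpointed: skip re-execution
        result = fn(*args)             # first run: execute, then persist
        self.conn.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (workflow_id, step_index, json.dumps(result)),
        )
        self.conn.commit()
        return result
```

With something like this in place, a restart after a crash replays quickly through already-completed steps (e.g., expensive LLM calls) and only re-executes from the step that never finished.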
Interesting!
What made you opt for DBOS over Temporal?
Temporal required re-architecting some stuff, their TypeScript SDK and sandbox are a bit unintuitive to use (so it would have been an additional thing for the team to grok), and it's additional infrastructure to maintain. There was a latency trade-off too, which in our case mattered.
Didn't face any issues with it, though. Temporal's observability and UI were better than DBOS's. It's just harder to do an incremental migration in an existing codebase.
So we do this exact thing in our software, and I implement it (along with other devs), and I was still entranced enough to read through to the end. There are no differences between our approach and theirs (this is a fairly common practice anyway), but the article is written in succinct, informative chunks with "images" (of code) in between.
This is how you write a technical article. Thanks to the author for the nice read :)
Some other lightweight solutions around:
https://github.com/iopsystems/durable
https://github.com/maxcountryman/underway
I've been following DBOS for a while and I think the model isn't too different from Azure Durable Functions (which uses Azure Queues/Tables under the covers to maintain state). https://learn.microsoft.com/en-us/azure/azure-functions/dura...
Perhaps the only difference is that Azure Durable Functions has more syntactic sugar in C# (versus DBOS's choice of Python) for preserving call results in persistent storage? Where else do they differ? In the end, all of them seem to be doing what Temporal does (which has its own shortcomings, and it's also possible to get it wrong, e.g., by calling a function directly instead of invoking it via an Activity).
This actually looks super amazing for C#, but it doesn't use Postgres?? All the backends seem to be purely Azure-related / Microsoft products, so although the framework is Apache-2.0, your infrastructure needs to rely on MS?
Both do durable workflows with similar guarantees. The big difference is that DBOS is an open-source library you can add to your existing code and run anywhere, whereas Durable Functions is a cloud offering for orchestrating serverless functions on Azure.
As far as I know, Azure Durable Functions doesn't have a proprietary server-side component; the framework and clients are actually fully open source as well. So it's not really a cloud offering per se. You can see the full implementations at:
* https://github.com/Azure/durabletask
* https://github.com/microsoft/durabletask-go
That's interesting, I'll take a look! I had always thought of it as an Azure-only thing.
I've been using https://www.pgflow.dev for workflows, which is built on pgmq, and I'm really impressed so far. Most of the logic lives in the database, so I'm considering building an Elixir adapter DSL.
Just curious, if you’re already in Elixir and using Postgres, why not use Oban[1]? It’s my absolute favorite background job library, and the thing I often miss most when working in other ecosystems.
[1] https://github.com/oban-bg/oban
Oban is so good! My startup has an extensive graph of background jobs all managed by Oban, and it's just rock solid, simple to use and gets out of the way.
what are you using the DSL for?
It’s used to generate the database migration that defines the flows. More syntax sugar than anything.
Been looking at DBOS for a while. Are there plans to port it to other languages such as Java or C#? Are you open to community ports?
Yeah, we plan to add more languages. DBOS currently supports Python and TypeScript, and Go and Java will be released soon. We're hosting a preview of DBOS Java at our user group meeting on August 28: https://lu.ma/8rqv5o5z. You're welcome to join us! We'd love to hear your feedback.
We welcome community contributions to the open source repos.
I've been looking at migrating to Temporal, but this looks interesting.
For context, we have a simple (read: home-built) "durable" worker setup that uses BullMQ for scheduling/queueing, but all of the actual jobs are Postgres-based.
Due to the cron nature of the many disparate jobs (bespoke AI-native workflows), we have workers that scale up/down basically on the hour, every hour.
Temporal is the obvious solution, but it will take some rearchitecting to get our jobs to fit their structure. We're also concerned with some of their limits (payload size, language restrictions, etc.).
Looking at DBOS, it's unclear from the docs how to scale the workers:
> DBOS is just a library for your program to import, so it can run with any Python/Node program.
In our ideal case, we can add DBOS to our main application for scheduling jobs, and then have a simple worker app that scales independently.
How "easy" would it be to migrate our current system to DBOS?
As another commenter said, Temporal is quite tricky to self-host/scale in a cost-effective manner. This is also reflected in their cloud pricing (which should've been the warning sign for us, tbh).
Overall it's a pretty heavy/expensive solution, and I've come to the conclusion that its usage is best limited to lower-frequency and/or higher-"value" (e.g., revenue or risk) tasks.
Orchestrating a food delivery that's paying you $3 in service fees: good use case. Orchestrating some high-frequency task that pays you $3/month: not so good.
I'd love to learn more about what you're building--just reach out at peter.kraft@dbos.dev.
One option is that you have DBOS workflows that schedule and submit jobs to an external worker app. Another option is that your workers use DBOS queues (https://docs.dbos.dev/python/tutorials/queue-tutorial). I'd have to better understand your use case to figure out what would be the best fit.
I'm also interested in what you think could become best practices for having (auto-scaling) worker instances that pick up DBOS workflows and execute them.
Do you think an app's (e.g., FastAPI) backend should be the DBOS client, submitting workflows to the DBOS instances? And then we could have multiple DBOS instances, each picking up jobs from a queue?
Yeah, I think in that case you should have auto-scaling DBOS workers all pulling from a queue and a FastAPI backend using the DBOS client to submit jobs to the queue.
Queue docs: https://docs.dbos.dev/python/tutorials/queue-tutorial Client docs: https://docs.dbos.dev/python/reference/client
Unless you're planning on using their (Temporal's) SaaS, you're in for building a very large database cluster for this if you need some scale.
(source: i run way more cassandra than i ever thought reasonable)
While DBOS looks like a nice system, I was really disappointed to learn that Conductor, which is the DBOS equivalent of the Temporal server, is not open source.
Without it, you get no centralized coordination of workflow recovery. On Kubernetes, for example, my understanding is that you'll need a StatefulSet to assign stable executor IDs, which you wouldn't need with Conductor.
I suppose that's their business model: provide a simplistic foundation where you have to pay money for the grown-up stuff.
> Conductor, which is the DBOS equivalent of the Temporal server,
Just to clarify, Conductor is nothing like the Temporal server. In Temporal, the server is a critical component that stores all the running state and is required for Temporal to work (it blocks your app from working if it's down).
Conductor is an out-of-band connector that gives Transact users access to the same observability and workflow management that DBOS Cloud users have, but it isn't required, and your app will keep working even if it breaks.
You can run a durable, scalable application with just Transact; it's just a lot harder without Conductor to help you.
You are correct that the business model is to provide add-ons for Transact applications, but I'd say it's unfair to call Transact a "simplistic foundation" and not "grown up".
Transact is absolutely enterprise-grade software that can run at massive scale.
I've often wondered whether it would be possible/advisable to combine DBOS with, e.g., Dagster if you have complex data-orchestration requirements. They seem to deal with orthogonal concerns but complement each other nicely. Is integration with orchestration frameworks something the DBOS team has thoughts on?
Would love to learn more about what you're building--what problems or parts of your system would you solve with Dagster vs DBOS?
Curious how this compares to Cloudflare, which is the other provider that is really going for simplified workflows
Every few years someone discovers FOR UPDATE SKIP LOCKED and re-presents it. I've watched this cycle repeat for at least 15 years.
The "someone" in this case happens to be Michael Stonebraker, the creator of Postgres and CTO of DBOS.
So glad someone else chuckled reading this. Two thumbs up for knowing better than the creator of the thing they're talking about!
Yup, some features are timeless and deserve a re-intro every now and then. SKIP LOCKED is definitely one of them.
with a nice NOWAIT when appropriate
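For anyone unfamiliar with the pattern being referenced: the core of a Postgres-backed job queue is a dequeue query where SKIP LOCKED lets concurrent workers claim different rows without blocking on each other. A minimal sketch (the jobs table and its columns are hypothetical):

```sql
BEGIN;

-- Claim one pending job; rows locked by other workers are skipped
-- rather than waited on, so workers never contend.
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 1;

-- ... process the job, then mark it done in the same transaction ...
UPDATE jobs SET status = 'done' WHERE id = $1;

COMMIT;
```

With NOWAIT in place of SKIP LOCKED, the query errors immediately on a locked row instead of silently skipping it, which is useful when the caller wants to know the row was contended.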
How does this compare with Inngest or Restate? We currently use Inngest and it works great, but the TypeScript API is a bit clunky.
Like Inngest and Restate, DBOS provides durable workflows. The difference is that DBOS is implemented as a Postgres-backed library you can "npm install" into your project (no external dependencies except Postgres), while Inngest and Restate require an external workflow orchestrator.
Here's a blog post explaining the DBOS architecture in more detail: https://www.dbos.dev/blog/what-is-lightweight-durable-execut...
Here's a comparison with Temporal, which is architecturally similar to Restate and Inngest: https://www.dbos.dev/blog/durable-execution-coding-compariso...
Why not just use Temporal?
We wanted to make workflows more lightweight--we're building a Postgres-backed library you can add to your existing application instead of an external orchestrator that requires you to rearchitect your system around it. This post goes into more detail: https://www.dbos.dev/blog/durable-execution-coding-compariso...
Anything that guarantees exactly once is selling snake oil. Side effects happen inside any transaction, and only when it commits (checkpoints) are the side effects safe.
Want to send an email, but the app crashes before committing? Now you're at-least-once.
You can shrink the window that causes at-least-once semantics, but it's always there. For this reason, the blog post oversells the capabilities of these types of systems as a whole. DBOS (and Inngest; see the disclaimer below) try to get as close to exactly-once as possible, but the risk always exists, which is why you should use idempotency keys in external API requests whenever the API supports them. Defense in layers.
Disclaimer: I built the original `step.run` APIs at https://www.inngest.com, which offers similar things on any platform... without being tied to DB transactions.
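The "defense in layers" point about idempotency can be made concrete: derive a stable idempotency key from the workflow ID and step index, so a retried step replays as a no-op on the provider's side. This sketch uses a fake email API; the class and key scheme are made up for illustration, though many real APIs expose a similar idempotency-key parameter:

```python
class EmailAPI:
    """Fake external email API that deduplicates on an idempotency key."""

    def __init__(self):
        self.sent = []    # emails actually delivered
        self._seen = {}   # idempotency key -> prior response

    def send(self, to, body, idempotency_key):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay: no second delivery
        self.sent.append((to, body))
        response = {"status": "sent", "id": len(self.sent)}
        self._seen[idempotency_key] = response
        return response

def send_welcome_email(api, workflow_id, step_index, to):
    # The key is stable across crash-and-retry, because the workflow ID
    # and step index don't change when the step is re-executed.
    key = f"{workflow_id}:{step_index}"
    return api.send(to, "Welcome!", idempotency_key=key)
```

If the app crashes after the call but before checkpointing, the retry resends the same key and the provider deduplicates, so the at-least-once execution still produces exactly-once delivery.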
As the post says, the exactly-once guarantee is ONLY for steps performing database operations. For those, you actually can get an exactly-once guarantee by running the database operations in the same Postgres transaction as your durable checkpoint. That's a pretty cool benefit of building workflows on Postgres! Of course, if there are side effects outside the database, those happen at-least-once.
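The mechanism being described can be sketched in plain SQL (table names hypothetical): because the step's database write and the checkpoint row commit atomically, a retry finds the checkpoint already present and skips the step, so the write can never be applied twice:

```sql
BEGIN;

-- The step's actual work: a database side effect.
UPDATE accounts SET balance = balance - 100 WHERE id = 42;

-- The durable checkpoint, recorded in the same transaction.
INSERT INTO workflow_steps (workflow_id, step_index, result)
VALUES ('wf-1', 3, '"ok"');

COMMIT;
```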
You can totally leverage postgres transactions to give someone... postgres transactions!
I just figured that the exactly-once semantics were worth discussing precisely because external side effects (which are what orchestration is for) aren't included in them, which is a big caveat.
> Anything that guarantees exactly once is selling snake oil.
That's a pretty spicy take. I'll agree that exactly-once is hard, but it's not impossible. Obviously there are caveats, but the beauty of DBOS using Postgres as the method of coordination, instead of an external server (like Temporal or Inngest), is that the exactly-once guarantees of Postgres can carry over to the application. Especially if you're using that same Postgres to store your application data.