I’m interested in self hosting this. What’s the recommendation here for state persistence and self healing? Wish there was a guide for a small team who wants to self host before trying managed cloud
The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].
The other piece of this is PostgreSQL, use your favorite managed provider which has point-in-time restores and backups. This is the core of our self-healing, I'm not sure where it makes sense to route writes if the primary goes down.
Let me know what you need for self-hosted docs, happy to write them up for you.