Source available · BUSL 1.1 · Free to self-host

AI agents that survive failures

Write-ahead log via NATS JetStream. Batched persistence to your database. 7× faster than checkpointing directly to a database — and crash-recoverable.

Per-event write latency · p50

skialith (NATS PubAck) 133 us
Database-first (MySQL INSERT) 986 us

Measured locally against NATS + MySQL. Reproduce with cargo run --bin benchmark.

Agents fail. Restarts are expensive.

A multi-step agent that crashes mid-run loses all progress and replays every LLM call from the beginning. Existing checkpointers write synchronously to a database — making every event as slow as a database round-trip.

Hot path is slow

Synchronous DB writes add ~1 ms per event. Across hundreds of steps, that adds up to hundreds of milliseconds spent waiting on persistence.

💥

Crashes lose work

Without a durable log, a process crash anywhere in the pipeline means restarting from zero.

🔄

Retries cause duplicates

At-least-once delivery without idempotency creates duplicate rows and corrupted state.
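To make the crash problem concrete, here is a toy sketch of a multi-step loop that persists its progress after every step and resumes from the last completed step after a simulated crash. It uses a local JSON file as the checkpoint store purely for illustration; `fake_llm_call` and `run_agent` are hypothetical names and are not part of Skialith.

```python
import json
import tempfile
from pathlib import Path

calls = []  # records every simulated LLM call that was actually made


def fake_llm_call(step: int) -> str:
    """Hypothetical stand-in for an expensive LLM call."""
    calls.append(step)
    return f"result-{step}"


def run_agent(checkpoint: Path, total_steps: int, crash_at=None):
    # Resume from the last completed step if a checkpoint file exists.
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next_step": 0, "results": []}

    for step in range(state["next_step"], total_steps):
        if step == crash_at:
            raise RuntimeError("simulated crash")
        state["results"].append(fake_llm_call(step))
        state["next_step"] = step + 1
        checkpoint.write_text(json.dumps(state))  # persist after each step
    return state["results"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    try:
        run_agent(path, total_steps=5, crash_at=3)  # dies before step 3
    except RuntimeError:
        pass
    results = run_agent(path, total_steps=5)  # restart after the crash

print(calls)  # each step was called exactly once: [0, 1, 2, 3, 4]
```

Without the checkpoint file, the second run would start at step 0 and repeat the first three calls.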

Write-ahead log first. Database second.

Every save_event call is acknowledged by NATS JetStream before returning to your agent. A background writer batches those events into your database — keeping the hot path fast and the data durable.

01

Agent calls save_event

Your agent publishes an event. Skialith serialises it and sends it to NATS JetStream.

02

NATS PubAck returned

JetStream confirms the write in ~133 us at p50. Your agent unblocks without waiting on the database.

03

Background batch to DB

A background task collects events and flushes in efficient batches with automatic retry.
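The three steps above can be sketched with a small in-memory model. This is not Skialith's implementation: `InMemoryLog` is a hypothetical stand-in for a JetStream stream, `batch_writer` is an assumed shape for the background task, and the `flush` callback stands in for the database write.

```python
import asyncio


class InMemoryLog:
    """Hypothetical stand-in for a durable log such as a JetStream stream."""

    def __init__(self) -> None:
        self.entries = []
        self.queue = asyncio.Queue()

    async def append(self, event: dict) -> int:
        # A real log would persist the event before acknowledging it.
        self.entries.append(event)
        seq = len(self.entries)
        await self.queue.put((seq, event))
        return seq  # plays the role of the PubAck sequence number


async def batch_writer(log, flush, batch_size=100, retries=3):
    """Drain the log in batches and pass each batch to `flush`, with retry."""
    while True:
        batch = [await log.queue.get()]  # block until one event is available
        while len(batch) < batch_size and not log.queue.empty():
            batch.append(log.queue.get_nowait())
        for attempt in range(retries):
            try:
                await flush(batch)
                break
            except ConnectionError:
                await asyncio.sleep(0.01 * 2 ** attempt)  # simple backoff
        # This sketch gives up after `retries` attempts; a real writer would
        # leave unflushed events in the log for redelivery.
        for _ in batch:
            log.queue.task_done()


async def demo():
    log = InMemoryLog()
    flushed = []
    failures_left = [1]  # make the first flush fail to exercise the retry

    async def flush(batch):
        if failures_left[0] > 0:
            failures_left[0] -= 1
            raise ConnectionError("database unavailable")
        flushed.append([seq for seq, _ in batch])

    for i in range(5):
        await log.append({"step": i})  # the caller unblocks here

    writer = asyncio.create_task(batch_writer(log, flush, batch_size=3))
    await log.queue.join()  # wait until every event has been flushed
    writer.cancel()
    await asyncio.gather(writer, return_exceptions=True)
    return flushed


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [[1, 2, 3], [4, 5]]
```

The caller only ever waits on `append`; the database write, including its retry, happens off the hot path.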

Agent
  |  save_event / checkpoint
  v
Skialith sidecar
  |-- NATS JetStream  <-- PubAck ~133us returned to caller
  |       |
  |       +-- Background batch writer
  |                 +-- MySQL / TiDB  (async, retried, idempotent)
  |
  +-- trace_ingest consumer  -->  agent_traces table
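The "retried, idempotent" note in the diagram matters because a retried batch may be delivered more than once. One common way to make such writes safe is a unique key per event plus an insert that skips existing keys. The sketch below uses SQLite from the Python standard library so it runs anywhere; the `agent_events` table and its columns are hypothetical, and with MySQL the comparable statements would be `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE agent_events (
        agent_id  TEXT NOT NULL,
        event_key TEXT NOT NULL,
        payload   TEXT NOT NULL,
        PRIMARY KEY (agent_id, event_key)  -- the idempotency key
    )
    """
)


def write_batch(conn, batch):
    """Insert a batch; rows whose key already exists are skipped."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR IGNORE INTO agent_events VALUES (?, ?, ?)",
            batch,
        )


batch = [
    ("my-agent", "step-1", '{"kind": "thought"}'),
    ("my-agent", "step-2", '{"kind": "action"}'),
]
write_batch(conn, batch)
write_batch(conn, batch)  # simulated redelivery after a retry

row_count = conn.execute("SELECT COUNT(*) FROM agent_events").fetchone()[0]
print(row_count)  # 2, not 4
```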

Numbers you can reproduce.

Run the benchmarks yourself against a local NATS + MySQL stack.

Scenario                    p50      p95      p99
save_event (NATS PubAck)    133 us   265 us   386 us
Baseline MySQL INSERT       986 us   1.5 ms   2.6 ms

cargo run --bin benchmark
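As a rough worked example of what the p50 figures imply, the snippet below computes the speedup ratio and an estimated cumulative wait for a run of 500 events. Multiplying a median by an event count is only an approximation of total time, and the 500-event run length is an arbitrary assumption.

```python
# p50 per-event write latency in microseconds, taken from the figures above.
P50_US = {
    "save_event (NATS PubAck)": 133,
    "Baseline MySQL INSERT": 986,
}


def total_ms(p50_us: float, events: int) -> float:
    """Estimated time spent waiting on writes, in milliseconds."""
    return p50_us * events / 1000


speedup = P50_US["Baseline MySQL INSERT"] / P50_US["save_event (NATS PubAck)"]
print(round(speedup, 1))  # 7.4

for name, p50 in P50_US.items():
    print(name, total_ms(p50, events=500), "ms")
# save_event (NATS PubAck) 66.5 ms
# Baseline MySQL INSERT 493.0 ms
```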

Drop in. No rewrites.

Agents are plain async functions. SDKs are thin HTTP clients — no Rust required.

Python

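import asyncio

from skialith import SkialithAgent


async def main() -> None:
    messages = []  # conversation history built up by your agent

    async with SkialithAgent(agent_id="my-agent") as agent:
        # Load previously saved state before continuing.
        state = await agent.resume()
        await agent.checkpoint(
            step=state.step_index,
            data={"messages": messages}
        )
        await agent.save_event("step-1", {
            "kind": "thought", "text": "..."
        })


asyncio.run(main())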

TypeScript

import { SkialithAgent } from "@skialith/agent-core";

// Conversation history built up by your agent.
const messages: unknown[] = [];

const agent = new SkialithAgent({ agentId: "my-agent" });
// Load previously saved state before continuing.
const state = await agent.resume();

await agent.checkpoint(state.stepIndex, { messages });
await agent.saveEvent("step-1", {
  kind: "thought", text: "..."
});

LangGraph

from skialith.langchain import SkialithCheckpointer

checkpointer = SkialithCheckpointer()
app = graph.compile(checkpointer=checkpointer)

# No other changes needed
result = await app.ainvoke(
    {"messages": [...]},
    config={"configurable": {"thread_id": "agent-1"}}
)

Building agents at scale?

We are collaborating with a small number of teams running AI agents in production. If you are hitting the limits of existing checkpointing approaches and want to shape what we build next, we would like to hear from you.