Sargalay — LLM API Gateway

Full-stack LLM reselling platform for Myanmar developers. Hono + Bun backend acts as a transparent proxy to 300+ AI models with real-time usage-based billing, balance-first proxy guard, and MMPay/MMQR integration for local Myanmar payments. Includes a Next.js 15 portal with custom i18n (English + Burmese), a streaming AI assistant with search_models tool and conversation summarization, and non-blocking model price caching.

Full Cycle
Next.js, Vite, TypeScript, Hono, PostgreSQL, Drizzle ORM, TanStack Query, shadcn, Gemini AI

AI Sargalay — LLM API Gateway

A full-stack LLM reselling platform built for Myanmar developers, providing OpenAI-compatible access to 100+ AI models (GPT-4, Claude, Gemini, Llama) with local payment integration.

Overview

AI Sargalay makes frontier AI models accessible to Myanmar developers through a unified OpenAI-compatible API with usage-based billing and local payment support. Developers get a single API endpoint and top up their balance using any Myanmar banking app — no USD card required.

Key Features

  • 100+ Models — OpenAI-compatible API access to GPT-4, Claude, Gemini, Llama, and more
  • Real-Time Billing — Actual token costs are captured from upstream SSE response chunks, marked up, and deducted from user balances immediately — no estimation error
  • Balance-First Guard — Balance is validated before every proxied request, preventing users from consuming credits they can't pay for
  • MMPay / MMQR Payments — Top up with KBZPay, WavePay, AYA Pay, or CB Pay — any Myanmar banking app — with real-time payment confirmation via Server-Sent Events
  • Model Catalog — Browseable catalog with MMK pricing, context windows, modality filters, and ISR rendering
  • AI Assistant — Streaming chat assistant that knows the platform: available models, pricing, payment methods, and integration guides
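The balance-first guard and the post-request deduction can be sketched roughly as below. This is a minimal illustration, not the platform's actual code: the `Account` shape, the `402` status, the `MIN_BALANCE_USD` threshold, and the markup factor used in the usage note are all assumptions made for the example.

```typescript
// Hypothetical account record; the real balance storage is not shown in this writeup.
interface Account {
  userId: string;
  balanceUsd: number;
}

type GuardResult =
  | { allowed: true }
  | { allowed: false; status: 402; message: string };

// Assumed threshold: a request is forwarded only while the balance is above this value.
const MIN_BALANCE_USD = 0;

// Runs before the request is proxied upstream.
function checkBalance(account: Account, minBalanceUsd: number = MIN_BALANCE_USD): GuardResult {
  if (account.balanceUsd > minBalanceUsd) {
    return { allowed: true };
  }
  return { allowed: false, status: 402, message: "Insufficient balance. Please top up." };
}

// Runs after the upstream reports the real cost: apply the markup and deduct it.
function settle(account: Account, upstreamCostUsd: number, markup: number): Account {
  const chargedUsd = upstreamCostUsd * markup;
  return { ...account, balanceUsd: account.balanceUsd - chargedUsd };
}
```

For example, with an illustrative markup of 1.5, an upstream cost of 0.5 USD reduces a 2 USD balance to 1.25 USD, and an account at 0 USD is rejected before any upstream call is made.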

Architecture

Monorepo — Separate Deployments

ai-gateway/
├── apps/
│   ├── worker/   → Cloudflare Worker (Hono RPC + better-auth, port 8787)
│   ├── web/      → Vercel (Next.js 15, public pages only)
│   └── portal/   → Cloudflare Pages (Vite + TanStack Router, user dashboard)
└── packages/
    ├── @ai-gateway/types    — shared Zod schemas (single source of truth)
    ├── @ai-gateway/db       — Drizzle ORM schema + D1 migrations
    ├── @ai-gateway/core     — shared repos, use cases, auth utils
    ├── @ai-gateway/proxy-core — LLM proxy logic (revenue-critical)
    └── @ai-gateway/ui       — shared React components (shadcn/ui)

All apps share typed API contracts via Hono RPC (AppType) — the frontend client is fully type-safe with no code generation step.

Edge-First Architecture

The entire backend runs on Cloudflare Workers at the edge — zero cold starts, globally distributed. The database is Cloudflare D1 (SQLite), co-located with the worker for low-latency reads. Model pricing is cached at the CF CDN layer (5-minute TTL) so the catalog is served without hitting the database on every request.
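The effect of a 5-minute TTL on catalog reads can be illustrated with a small read-through cache. This is an in-memory stand-in written for the example, not Cloudflare's cache: in the deployed worker the CDN plays this role, and the loader would be an asynchronous D1 query. The names `TtlCache` and `PRICING_TTL_MS` are hypothetical.

```typescript
// 5-minute TTL, matching the figure quoted above.
const PRICING_TTL_MS = 5 * 60 * 1000;

// In-memory stand-in for the CDN cache (assumption for illustration only).
class TtlCache<T> {
  private entry: { value: T; expiresAt: number } | null = null;
  private ttlMs: number;
  private now: () => number;

  // The clock is injected so the expiry logic can be exercised without waiting.
  constructor(ttlMs: number, now: () => number) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh; otherwise calls `load` and stores the result.
  get(load: () => T): T {
    const t = this.now();
    if (this.entry !== null && t < this.entry.expiresAt) {
      return this.entry.value;
    }
    const value = load();
    this.entry = { value, expiresAt: t + this.ttlMs };
    return value;
  }
}
```

With this shape, any number of catalog requests inside one 5-minute window trigger a single load, which is the property that keeps the database off the hot path.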

Per-User API Keys

Each user is provisioned a dedicated API key at registration. The platform proxies requests with this key, and the upstream provider is the source of truth for balance and usage. The top-up flow is: fetch the key's current limit_remaining, then PATCH limit = limit_remaining + topupUsd. This eliminates the need for local balance tracking or usage reconciliation.
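A minimal sketch of that top-up flow follows. The `UpstreamKeyClient` interface and its method names are hypothetical stand-ins for whatever key-management API the upstream provider exposes; only the `limit_remaining` and `limit` field names come from the description above.

```typescript
// Hypothetical client for the upstream provider's key-management API.
interface UpstreamKeyClient {
  getKey(keyId: string): Promise<{ limit_remaining: number }>;
  patchKey(keyId: string, body: { limit: number }): Promise<void>;
}

// New limit = what is left on the key + the amount just paid for.
function nextLimit(limitRemaining: number, topupUsd: number): number {
  if (!(topupUsd > 0)) {
    throw new Error("topupUsd must be a positive number");
  }
  return limitRemaining + topupUsd;
}

// Fetch the current remaining limit, then PATCH the new limit.
async function applyTopup(
  client: UpstreamKeyClient,
  keyId: string,
  topupUsd: number
): Promise<number> {
  const current = await client.getKey(keyId);
  const limit = nextLimit(current.limit_remaining, topupUsd);
  await client.patchKey(keyId, { limit });
  return limit;
}
```

Because this is a read followed by a write, two top-ups for the same key landing at the same moment could overwrite each other. A real implementation would likely need to serialize top-ups per key; how the platform handles this is not covered here.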

SSE Payment Flow

MMQR payments are confirmed via MMPay webhook callbacks. The portal listens on a Server-Sent Events stream (GET /me/topup/:id/stream) that polls D1 every second and pushes a status event when the payment completes — the stream closes automatically on success or timeout.
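The polling loop behind that stream might look roughly like this. It is a sketch under assumptions: `readStatus` stands in for the D1 query, `emit` for writing to the SSE response, the status values and the two-minute default timeout are invented for the example, and emitting an explicit `timeout` event is a design choice of this sketch rather than something stated above.

```typescript
// Assumed status values for a top-up row.
type TopupStatus = "pending" | "paid";

// Formats one Server-Sent Events frame: an event name, a JSON data line, and a blank line.
function sseFrame(event: string, data: unknown): string {
  return "event: " + event + "\n" + "data: " + JSON.stringify(data) + "\n\n";
}

interface PollOptions {
  readStatus: () => Promise<TopupStatus>; // stand-in for the D1 lookup
  emit: (frame: string) => void;          // stand-in for writing to the response stream
  sleep: (ms: number) => Promise<void>;   // injected so the loop can run without real delays
  intervalMs?: number;
  timeoutMs?: number;
}

// Polls once per interval; resolves when the payment completes or the timeout elapses.
async function streamTopupStatus(opts: PollOptions): Promise<"completed" | "timeout"> {
  const intervalMs = opts.intervalMs ?? 1000;
  const timeoutMs = opts.timeoutMs ?? 120000; // assumed default
  let elapsedMs = 0;
  while (elapsedMs < timeoutMs) {
    const status = await opts.readStatus();
    if (status === "paid") {
      opts.emit(sseFrame("status", { status }));
      return "completed"; // the caller closes the stream
    }
    await opts.sleep(intervalMs);
    elapsedMs += intervalMs;
  }
  opts.emit(sseFrame("timeout", {}));
  return "timeout";
}
```

On the client side, a standard `EventSource` listening for the `status` event would be enough to update the top-up screen when the frame arrives.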

AI Assistant

Streaming chat assistant built with Google Gemini 2.0 Flash:

  • search_models tool — filters the live model catalog by modality, provider, or price, returning IDs resolved client-side from React Query cache
  • browse_models tool — groups all models by output capability (text / image / audio / file)
  • Model reference syntax — LLM output can reference @provider/model@ which renders as an inline chip or full model card in the UI
  • Conversation summarization — older messages are compressed into a rolling summary every 8 turns, keeping token usage low without losing context
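The rolling summary in the last bullet can be sketched as follows. The data shapes, the definition of a turn (one user message plus its reply), and the choice to keep the two most recent turns verbatim are assumptions for illustration. In the real assistant the `summarize` step would be a model call; here it is an injected function so the compaction logic stays visible.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

interface Conversation {
  summary: string;    // rolling summary of everything already compacted
  recent: Message[];  // messages still kept verbatim
}

// Threshold taken from the description above.
const TURNS_PER_SUMMARY = 8;

// Assumption: one turn corresponds to one user message.
function countTurns(messages: Message[]): number {
  return messages.filter((m) => m.role === "user").length;
}

// Once the verbatim history reaches the threshold, fold the older messages
// into the summary and keep only the last `keepTurns` turns as-is.
function compact(
  conv: Conversation,
  summarize: (previousSummary: string, older: Message[]) => string,
  keepTurns: number = 2
): Conversation {
  if (countTurns(conv.recent) < TURNS_PER_SUMMARY) {
    return conv;
  }
  const keepCount = keepTurns * 2; // user + assistant message per turn
  const cut = Math.max(0, conv.recent.length - keepCount);
  const older = conv.recent.slice(0, cut);
  const recent = conv.recent.slice(cut);
  return { summary: summarize(conv.summary, older), recent };
}
```

The prompt sent to the model is then the summary plus the short verbatim tail, so its size stays roughly constant as the conversation grows.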

Key Technical Decisions

  • Proxy billing over estimation — costs are read from actual upstream SSE chunks (stream_options: { include_usage: true }), eliminating estimation drift
  • Shared Zod schemas — single source of truth for API validation and TypeScript types across worker, web, and portal
  • Hono RPC — end-to-end type safety between worker and frontends without a separate schema registry or code generation
  • Edge + D1 — Cloudflare Worker + D1 means the API, database, and CDN cache are all in the same infrastructure layer
  • Separate frontend apps: apps/web (Next.js, public/SEO) and apps/portal (Vite, authenticated dashboard) are deployed independently, keeping public ISR pages fast and the dashboard bundle lean
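To illustrate the first decision: when an OpenAI-compatible request sets stream_options: { include_usage: true }, the stream carries a chunk with a populated `usage` object shortly before the `data: [DONE]` marker. A minimal sketch of pulling that object out of SSE text is below. The function name `extractUsage` is hypothetical, the optional `cost` field is an assumption (the standard usage object only guarantees token counts), and the input is assumed to be already-complete frames.

```typescript
// Token counts follow the OpenAI-compatible usage object.
// `cost` is an assumed optional field some upstreams may add.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  cost?: number;
}

// Scans SSE text and returns the last usage object seen, or null if none was sent.
function extractUsage(sseText: string): Usage | null {
  let usage: Usage | null = null;
  for (const rawLine of sseText.split("\n")) {
    const line = rawLine.trim();
    if (line.indexOf("data:") !== 0) continue;  // skip blank lines and other SSE fields
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue;
    try {
      const chunk = JSON.parse(payload);
      // Content chunks carry `usage: null` or omit it; only the usage chunk is populated.
      if (chunk && chunk.usage) {
        usage = chunk.usage as Usage;
      }
    } catch (e) {
      // Not valid JSON (for example a truncated frame); ignore it in this sketch.
    }
  }
  return usage;
}
```

In a live proxy, network reads can split a frame in the middle, so the text would need to be buffered up to each blank-line boundary before being parsed this way.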

Tech Stack

Backend: Cloudflare Worker, Hono RPC, TypeScript, Drizzle ORM, Cloudflare D1 (SQLite), Zod

Frontend: Next.js 15 (public), Vite + TanStack Router (portal), React Query, shadcn/ui, Tailwind CSS

Auth: Google OAuth via better-auth — session cookies shared across web and portal via worker-proxied auth endpoints

Payments: MMPay / MMQR (KBZPay, WavePay, AYA Pay, CB Pay)

Infra: Cloudflare (Worker + D1 + CDN cache), Vercel (Next.js)

Role

Solo Full-Stack Developer — end-to-end design and implementation of edge backend, billing engine, LLM proxy, portal, payment integration, AI assistant, and monorepo tooling.