Private GPU inference across two offices
Prachyam Studios produces Indian cultural and dharmic content at scale, with a constant appetite for promotional art, regional-language voiceovers, audio dubs, and subtitle files. The team was paying per call to cloud APIs: Midjourney-equivalents for image generation, ElevenLabs-class services for TTS, and cloud transcription for subtitles, a combined burn rate conservatively estimated at $100–400+/month. Meanwhile, the Pune office had GPU-capable workstations sitting largely idle.
The fix was architectural: co-locate all inference workloads on the existing Pune GPU machine, serve them as simple HTTP endpoints, and route the Varanasi office's requests through the Tailscale mesh already in place for mail and file storage. The content team gets unlimited generation capacity at zero per-call cost. No cloud accounts, no egress fees, no per-character billing.
Model selection was deliberately India-first. Kokoro and Parrot were chosen over English-first defaults specifically for Hindi and regional-language quality — the output gap versus generic open-source TTS was significant enough that the wrong choice would have produced unusable voiceovers.
The team's cloud AI spend scaled directly with output volume, which created exactly the wrong incentive: creators self-censored requests, batched generation jobs, and accepted first-pass results to avoid burning budget. Per-call pricing was suppressing iteration and hurting asset quality.
The Varanasi office added a distribution constraint. Replicating models to both sites was a non-starter — Varanasi machines had no GPU capacity. A public endpoint would have exposed the inference server to the internet. The right answer was to centralise compute on Pune's GPU and tunnel Varanasi traffic through the existing private mesh, treating the GPU as a shared internal service rather than a local tool.
Flux handles image generation — promotional posters, thumbnails, and content art — served via a local API wrapper on the Pune GPU machine. Kokoro provides high-quality multi-lingual TTS for regional Indian languages, used for narration, promos, and content previews. Parrot handles audio dubbing, converting content tracks into regional-language dubs in-house. A Whisper-family model generates subtitle files from uploaded audio, eliminating per-minute cloud transcription cost.
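As a concrete sketch of the wrapper pattern, each model can sit behind a thin JSON-over-HTTP endpoint using nothing beyond the Python standard library. The route names (`/tts`, `/image`), payload shapes, and stub responses below are illustrative assumptions, not the studio's actual API:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen


class InferenceHandler(BaseHTTPRequestHandler):
    """Minimal JSON-over-HTTP wrapper. In production the branches would
    invoke the loaded Kokoro / Flux / Parrot / Whisper models; here they
    return stub file paths so the sketch stays self-contained."""

    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/tts":
            # Stub standing in for a Kokoro TTS call.
            result = {"audio_path": f"/outputs/{body['text'][:10]}.wav"}
        elif self.path == "/image":
            # Stub standing in for a Flux image-generation call.
            result = {"image_path": f"/outputs/{body['prompt'][:10]}.png"}
        else:
            self.send_response(404)
            self.end_headers()
            return
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass


def serve(port=0):
    """Start the wrapper on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    srv = serve()
    req = Request(
        f"http://127.0.0.1:{srv.server_port}/tts",
        data=json.dumps({"text": "namaste"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.loads(urlopen(req).read()))  # {'audio_path': '/outputs/namaste.wav'}
    srv.shutdown()
```

In a real deployment the server would bind to the machine's Tailscale interface rather than loopback, and a framework like FastAPI would express the same routes with less ceremony; the point is only that "served via a local API wrapper" needs very little machinery.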
All four workloads are served as HTTP endpoints on the Pune machine's Tailscale IP. The Varanasi office machines were already on the same Tailscale mesh used for Mailcow and Nextcloud — adding the GPU machine as another mesh node required zero additional networking configuration on the Varanasi side. Both offices hit the same endpoint URL; requests route over Tailscale; generated images and audio files return directly with no internet hop. The inference server never has a public IP.
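From the client side, the whole scheme reduces to one shared base URL. A minimal sketch, assuming a JSON API and a placeholder Tailscale address (100.64.0.10 is illustrative, not the studio's real mesh IP):

```python
import json
from urllib.request import Request, urlopen

# Both offices point at the GPU machine's Tailscale address. Because the
# machines are already on the same tailnet, no extra routing or VPN config
# is needed on either side. IP and routes below are hypothetical.
BASE_URL = "http://100.64.0.10:8080"


def infer(route: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON payload to an inference route and decode the JSON reply."""
    req = Request(
        f"{base_url}{route}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())


# Identical usage from Pune or Varanasi:
#   infer("/tts", {"text": "...", "lang": "hi"})
#   infer("/image", {"prompt": "..."})
```

The request never leaves the tailnet: Tailscale resolves the 100.x address peer-to-peer, so generated files come back with no internet hop and the server needs no public IP.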
Centralised inference over distributed. Pune had GPU hardware; Varanasi did not. Multi-node inference orchestration across asymmetric hardware would have introduced coordination complexity with no benefit. One GPU, one endpoint base URL, routed over the mesh: operationally minimal, with latency that is negligible for file-generation requests that already run in seconds.
Reusing the existing Tailscale mesh. Every internal service added to the mesh — Mailcow, Nextcloud, then the inference server — immediately became available to all connected offices without any new networking work. The mesh compounds in value with each node; the marginal cost of adding the GPU machine was near zero.
India-first model selection. Kokoro and Parrot required hands-on evaluation against Coqui, VITS variants, and OpenVoice. For Hindi and South Indian languages, the quality gap between models was pronounced. Selecting the wrong model would have produced output the content team couldn't use — the research time was the real cost of the project.
Framing the service as unlimited. Removing the per-call constraint changed team behaviour immediately. Creators iterated on prompts, generated alternatives, and produced higher-quality assets because there was no budget meter running. Unlimited local capacity was a product decision as much as an infrastructure one.
All cloud AI API costs for image generation, TTS, transcription, and audio dubbing were eliminated: a conservative $100–400+/month in recurring spend reduced to zero marginal cost on hardware the studio already owned. The content team gained the ability to produce regional-language voiceovers and audio dubs in-house without engaging a dubbing studio for every promotional asset. Both offices accessed the same inference endpoints transparently over Tailscale with no additional VPN setup. The Tailscale mesh pattern, proven across mail and file storage, extended cleanly to a fourth internal service, confirming it as the studio's composable private networking primitive for all subsequent infrastructure additions.
4 AI workloads self-hosted · 2 offices on the mesh · $0 per-call API cost · 100% cloud spend eliminated