Give your agents ZeroGPU to ship viral AI apps autonomously

Community Article Published May 26, 2026

How I created the LongCat-Video-Avatar 1.5 Space (running on ZeroGPU, 35% faster than the reference path, MIT-licensed model) in a single agent session.


The unlock: a Hugging Face PRO subscription gives your agent its own AI lab (a live ZeroGPU Space) and 40 min/day of Blackwell GPU time. Frame the goal, paste the gist, walk away. The agent designs, deploys, tests against the live API, fixes, and ships. Autonomously.

Below is the exact recipe. Anyone can do it.


What you need

  • Hugging Face PRO ($9/mo): host up to 10 ZeroGPU Spaces, get 40 min/day on Blackwell (48 GB) and priority in the queue. Beyond the quota: $1 per 10 min, paid from pre-paid credits if needed.
  • Any decent coding agent: Codex CLI, Claude Code, Cursor, whatever. Recommended: one that supports /goal (Codex CLI, Claude Code) so it can iterate autonomously toward the objective across many turns.
  • The model you want to demo: in our case meituan-longcat/LongCat-Video-Avatar-1.5.

That's it. No infra, no Docker, no Kubernetes, no GPU lease, no Vercel bill. You git push and a card with a public URL appears, served on a Blackwell GPU.
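To reason about the PRO pricing above, here is some quick arithmetic: 40 included minutes per day, then $1 per 10 minutes from pre-paid credits. This sketch assumes overage is prorated per minute. Actual billing granularity may differ, and `daily_overage_usd` is a hypothetical helper, not part of any Hugging Face tool.

```python
INCLUDED_MINUTES_PER_DAY = 40  # PRO daily ZeroGPU quota (from the article)
USD_PER_10_MINUTES = 1.0       # overage rate (from the article)


def daily_overage_usd(minutes_used: float) -> float:
    """Extra cost for one day's usage, assuming linear proration (an assumption)."""
    extra_minutes = max(minutes_used - INCLUDED_MINUTES_PER_DAY, 0)
    return extra_minutes / 10 * USD_PER_10_MINUTES


if __name__ == "__main__":
    for minutes in (25, 40, 55, 100):
        # Prints $0.00, $0.00, $1.50 and $6.00 respectively.
        print(f"{minutes} min -> ${daily_overage_usd(minutes):.2f}")
```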


Why ZeroGPU is the unlock

Normal cloud GPU = rent it 24/7, pay even when idle. ZeroGPU = the GPU attaches only when your function runs, then detaches. You decorate one function:

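import spaces

# `pipe` is your model pipeline, loaded and moved to "cuda" at module level
@spaces.GPU(duration=120)  # max GPU seconds requested per call
def generate(image, audio, prompt):
    return pipe(image, audio, prompt)
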
  • Your $9/mo lets people use your Space for free. Visitors don't need an HF account. Anonymous users get 2 min/day of GPU, free accounts 5 min/day, PRO 40 min/day. The quota is theirs, not yours.
  • You only burn GPU time during the decorated call. Idle = free.
  • Move the model to cuda at module level. ZeroGPU's PyTorch CUDA emulation handles this before a real GPU is attached.
  • Gradio SDK only; PyTorch 2.8+; Python 3.10 or 3.12.
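The following sketch makes "the quota is theirs, not yours" concrete. It is a simplified model of per-visitor daily quotas that uses the tier numbers above. It assumes a call is admitted only if its requested `duration` fits in the visitor's remaining daily seconds. The function names are hypothetical, and this is not how Hugging Face implements the check.

```python
# Simplified, hypothetical model of ZeroGPU per-visitor daily quotas.
# Tier limits come from the article; the admission rule is an assumption.
DAILY_QUOTA_SECONDS = {
    "anonymous": 2 * 60,
    "free": 5 * 60,
    "pro": 40 * 60,
}


def remaining_seconds(tier: str, used_seconds: float) -> float:
    """Seconds of GPU time the visitor has left today (never negative)."""
    return max(DAILY_QUOTA_SECONDS[tier] - used_seconds, 0)


def can_run(tier: str, used_seconds: float, requested_seconds: float) -> bool:
    """Assumed rule: admit the call only if the requested duration still fits."""
    return requested_seconds <= remaining_seconds(tier, used_seconds)


if __name__ == "__main__":
    # A function decorated with @spaces.GPU(duration=120) requests 120 s per call.
    for tier, quota in DAILY_QUOTA_SECONDS.items():
        print(tier, quota // 120, "full-length calls per day")
```

Under this model, a 120 s `duration` gives an anonymous visitor one full-length call a day, a free account two, and a PRO account twenty. That is one reason to keep `duration` as tight as the workload allows.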

This is the cheapest serious compute on the internet for shipping demos to a wide audience.


The recipe

Setup (one-time): Subscribe to Hugging Face PRO ($9/mo). This unlocks two things: hosting your own ZeroGPU Spaces, and 40 min/day of ZeroGPU quota that resets every 24h. Then install the hf CLI with the official one-liner and log in:

curl -LsSf https://hf.co/cli/install.sh | bash
# Windows: irm https://hf.co/cli/install.ps1 | iex

hf auth login

Paste this into your agent (Codex CLI or Claude Code, both have /goal):

/goal Build a ZeroGPU Space demoing <MODEL_CARD_URL>.

First, read https://gist.github.com/gary149/2aba2962375fa9ca56bb9ef53f00b73d.
These are the operational rules for iterating on HF Spaces.

Use the hf CLI to create the Space and clone it locally.

The deployed Space is your AI lab. Work autonomously: push, diagnose, fix,
repeat. Verify every change by calling the live API. Never declare a change
done without that round-trip.

Success = a Space that loads, runs the model on ZeroGPU, and returns a valid
result via the API.

That's the whole kickoff. The two non-obvious lines are "the deployed Space is your AI lab" and "verify every change by calling the live API." Together they license the agent to operate autonomously: it owns the deploy loop, it owns verification, you don't sit in the middle.

The gist link is doing the rest of the heavy lifting. It teaches the agent:

  • Builds are slow (1 to 15 min) but reading logs is instant, so iterate from logs, not guesses.
  • The iteration ladder: hot-reload → dev-mode SSH → selective upload → full rebuild.
  • ZeroGPU patterns: model on cuda at module level, @spaces.GPU on inference, a callable passed as duration= for dynamic GPU time, 4-bit NF4 for LLMs ≥10B.
  • Verification means actually calling the deployed API via gradio_client.Client and inspecting the output file.
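The dynamic-duration pattern lets the requested GPU time scale with the input instead of always reserving the maximum. As I understand the ZeroGPU docs, `duration` may be a callable that receives the same arguments as the decorated function and returns a number of seconds. The sketch below relies on that assumption. The overhead and per-second constants are made-up placeholders, and `clip_seconds` is a hypothetical parameter. The fallback decorator exists only so the file runs outside a Space.

```python
# Sketch of a dynamic ZeroGPU duration. All constants are hypothetical.
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Local fallback: a no-op decorator factory so the sketch runs anywhere.
    def gpu(*_args, **_kwargs):
        def wrap(fn):
            return fn

        return wrap


BASE_SECONDS = 20             # hypothetical fixed overhead per call
SECONDS_PER_CLIP_SECOND = 25  # hypothetical GPU cost per second of output video
MAX_CLIP_SECONDS = 4          # echoes the "limit generation to 4 seconds" steer


def clamp_clip(clip_seconds: float) -> float:
    """Keep the requested clip length between 1 s and MAX_CLIP_SECONDS."""
    return min(max(clip_seconds, 1), MAX_CLIP_SECONDS)


def get_duration(image, audio, prompt, clip_seconds):
    """Same signature as generate(); returns the GPU seconds to request."""
    return BASE_SECONDS + SECONDS_PER_CLIP_SECOND * clamp_clip(clip_seconds)


@gpu(duration=get_duration)
def generate(image, audio, prompt, clip_seconds):
    clip = clamp_clip(clip_seconds)
    # A real Space would call the model pipeline here; this sketch only
    # reports what it would have generated.
    return {"prompt": prompt, "clip_seconds": clip}
```

With these placeholder numbers, the 4-second cap also caps the request at 120 s, the same value as the fixed `duration=120` in the earlier decorator. Shorter clips ask for less GPU time, so they are easier to fit into a visitor's remaining quota.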

Once you have a first version live, steer it with one-liners to tweak behavior: "check the ZeroGPU docs about xlarge", "cache the Gradio examples", "limit generation to 4 seconds". The agent integrates each and keeps moving.


What the agent actually did

533 shell commands over ~2h. The loop: hf spaces logs (×97), hf spaces info (×50), selective hf upload (×18), hf spaces restart (×12), then gradio_client.Client(...).predict(...) to time the live API on every change.
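A verify step like the one in this loop calls the live API, times it, and inspects the file. It might look like the sketch below. `Client(...).predict(..., api_name=...)` is the real `gradio_client` entry point, but the Space id, endpoint name, and inputs depend on your app, so they are placeholders here. The output check is a cheap sanity test that assumes an MP4 result. MP4 files usually begin with an `ftyp` box whose type occupies bytes 4 to 8.

```python
import os
import time


def is_mp4_header(header: bytes) -> bool:
    """MP4 (ISO BMFF) files usually start with a box: 4-byte size, then b'ftyp'."""
    return len(header) >= 8 and header[4:8] == b"ftyp"


def looks_like_mp4(path: str) -> bool:
    """Cheap sanity check on an output file: it exists and has an MP4 header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return is_mp4_header(f.read(8))


def timed_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def verify_live_space(space_id: str, api_name: str, *inputs):
    """Call the deployed Space, time the round-trip, and inspect the returned file.

    space_id, api_name and inputs are placeholders for your own app.
    """
    from gradio_client import Client  # imported lazily: only needed for the live call

    client = Client(space_id)
    result, elapsed = timed_call(client.predict, *inputs, api_name=api_name)
    # Video outputs may come back as a dict with a "video" key rather than a
    # plain path (assumption; check your endpoint's return format).
    path = result["video"] if isinstance(result, dict) else result
    ok = looks_like_mp4(path)
    print(f"{api_name}: {elapsed:.1f}s, valid mp4: {ok}")
    return ok, elapsed
```

For file inputs such as the portrait and the audio track, `gradio_client` provides `handle_file(...)`, which wraps local paths or URLs before they are passed to `predict`.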

Shipped:

  • DBCache (from CacheDiT) caching denoise steps [2, 4, 6], cutting generation time by 35% (186s → 121s).
  • Gradio 6.10 + an 8-step DMD2 INT8 DiT.
  • cache_examples=True with cache_mode="lazy" (1.3s instead of 80s).
  • ElevenLabs voices for the examples.

When asked about xlarge, it read the docs, surfaced the trade-off (2× quota, longer queue, full Blackwell), and then deployed on it. That's autonomous decision-making, not you babysitting.
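DBCache itself is more involved than this. The toy below illustrates only the basic idea behind caching denoise steps [2, 4, 6]: on those steps, the loop reuses the previous step's model output instead of running a forward pass. It is not CacheDiT's API. The step indices are assumed to be 0-based, the "model" is a stand-in function, and the update rule is a placeholder for a real scheduler.

```python
from typing import Callable, Iterable, Tuple


def denoise(
    x: float,
    model: Callable[[float, int], float],
    num_steps: int = 8,
    cached_steps: Iterable[int] = (2, 4, 6),
) -> Tuple[float, int]:
    """Toy fixed-schedule step caching; returns (final x, forward passes run)."""
    cached = set(cached_steps)
    last_output = None
    forward_passes = 0
    for step in range(num_steps):
        if step in cached and last_output is not None:
            output = last_output  # reuse: no forward pass on this step
        else:
            output = model(x, step)
            forward_passes += 1
            last_output = output
        x = x - output  # placeholder for the scheduler update
    return x, forward_passes


def toy_model(x: float, step: int) -> float:
    """Stand-in for the DiT: predicts a fraction of the current sample."""
    return 0.1 * x


if __name__ == "__main__":
    _, with_cache = denoise(1.0, toy_model)
    _, without_cache = denoise(1.0, toy_model, cached_steps=())
    print(with_cache, without_cache)  # 5 8
```

With three cached steps, the loop skips three of eight forward passes. The 186s → 121s figure above was measured on the real pipeline; this toy does not predict it.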

Final tab: 1,834,906 tokens, ~2h 2m (and still $9/mo for the GPU).


Why this stack beats everything else right now

  • $9/mo flat ceiling for hosting. No per-request invoice surprises.
  • ZeroGPU = idle is free. A demo with 0 users or 10k costs you the same. One that goes viral autoscales on Hugging Face's infra.
  • Public URL out of the box. https://huggingface.co/spaces/victor/LongCat-Video-Avatar-1.5 is shareable, embeddable, and indexed.
  • Agent-native loop. hf CLI + gradio_client + --follow logs means an agent can drive the whole edit-deploy-verify cycle without a human in the loop.
  • The community sees it. A trending Space gets surfaced on the Hub homepage. Distribution is built in.

Let's go: pick a SOTA open-source model, give your agent the gist + a kickoff prompt, and ship.
