HunyuanImage-3.0 Developer Integration Guide: Transformers, API & Deployment
2025/10/06


Complete technical guide for developers: integrate HunyuanImage-3.0 using Transformers, deploy locally, optimize performance with FlashAttention & FlashInfer, and build production applications.

As a developer, you want more than just using HunyuanImage-3.0 through a web interface—you want to integrate it into your applications, deploy it on your infrastructure, and build production-ready solutions.

This comprehensive guide covers everything from quick Transformers integration to advanced deployment optimization, helping you harness the full power of the world's largest open-source text-to-image model.

Quick Start: Transformers Integration (5 Minutes)

Prerequisites

  • Python: 3.12+ (tested and recommended)
  • PyTorch: 2.7.1 with CUDA 12.8
  • GPU: NVIDIA with at least 24GB VRAM (80GB recommended for production)
  • Storage: 170GB for model weights

Installation

# Step 1: Install PyTorch with CUDA 12.8
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

# Step 2: Install Tencent Cloud SDK
pip install -i https://mirrors.tencent.com/pypi/simple/ \
    --upgrade tencentcloud-sdk-python

# Step 3: Install Transformers and dependencies
pip install transformers accelerate sentencepiece protobuf
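Before pulling 170GB of weights, it is worth confirming that PyTorch sees your GPUs and that the CUDA build matches. A quick sanity check using only standard PyTorch introspection (no HunyuanImage-specific code):

# verify_env.py -- optional environment check
import torch

print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available  : {torch.cuda.is_available()}")
print(f"CUDA version    : {torch.version.cuda}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / (1024 ** 3)
        print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB VRAM")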

Download Model Weights

# Using Hugging Face Hub (recommended)
pip install huggingface-hub

# Download model (170GB - this will take time)
huggingface-cli download tencent/HunyuanImage-3.0 \
    --local-dir ./HunyuanImage-3

Important: The directory name should NOT contain dots, as this can cause loading issues with Transformers.

Basic Usage with Transformers

from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3"

# Configuration for standard inference
kwargs = dict(
    attn_implementation="sdpa",  # Use "flash_attention_2" if installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",  # Use "flashinfer" if installed
)

# Initialize model
model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Generate image
prompt = "A brown and white dog running on the grass, \
photorealistic, professional photography"

image = model.generate_image(prompt=prompt, stream=True)
image.save("output.png")

print("Image saved to output.png")

Expected output:

  • First run: ~10 minutes (kernel compilation if FlashInfer is used)
  • Subsequent runs: 15-30 seconds per image
  • Quality: Production-ready, 1024x1024 resolution

Advanced Configuration Options

Memory Optimization

For limited VRAM environments:

import torch

# 4-bit quantization (experimental)
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)

Trade-offs:

  • ✅ Reduces VRAM usage by ~75%
  • ⚠️ Slight quality degradation (~5-10%)
  • ⚠️ Slower inference (~2x)

Multi-GPU Deployment

For distributed inference across multiple GPUs:

# Automatic distribution
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # Automatically distributes across available GPUs
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Manual device mapping (advanced)
# Note: each layer must be mapped individually; range keys like
# "model.layers.0-15" are not valid device_map entries.
device_map = {
    "model.embed_tokens": 0,
    "model.norm": 3,
    "lm_head": 3,
}
# Spread layers 0-63 across 4 GPUs, 16 layers per GPU
for i in range(64):
    device_map[f"model.layers.{i}"] = i // 16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

Recommended Setup:

  • 2x 80GB GPUs: Minimum, with reduced batch size
  • 3x 80GB GPUs: Optimal for production
  • 4x 80GB GPUs: Recommended for high throughput
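If you need to cap how much of each GPU the loader may use (for example, to leave headroom for activations), Transformers accepts a max_memory mapping alongside device_map="auto". A minimal sketch, assuming four 80GB GPUs; the exact limits are illustrative:

import torch
from transformers import AutoModelForCausalLM

# Leave headroom on each GPU; keys are GPU indices, "cpu" allows offload if needed
max_memory = {0: "75GiB", 1: "75GiB", 2: "75GiB", 3: "75GiB", "cpu": "120GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)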

Generation Parameters

Control output quality and style:

image = model.generate_image(
    prompt="Your detailed prompt here",

    # Quality settings
    diff_infer_steps=50,        # 50 (default), 100 (high quality)

    # Resolution
    image_size="auto",          # "auto", "1280x768", "16:9", etc.

    # Randomization
    seed=42,                    # None (random) or integer for reproducibility

    # Output
    stream=True,                # Show progress (True) or return final only (False)
)

Parameter Guide:

| Parameter | Values | Impact | Recommendation |
| --- | --- | --- | --- |
| diff_infer_steps | 20-100 | Quality vs. speed | 50 (balanced), 100 (max quality) |
| image_size | "auto", WxH, ratio | Output resolution | "auto" for smart sizing |
| seed | None, integer | Reproducibility | None for variety, fixed for consistency |
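A common pattern is to hold the prompt and step count fixed and sweep the seed to get distinct but comparable variants. A short sketch using the generate_image call shown above:

# Generate several variants of the same prompt with different seeds
prompt = "A watercolor painting of a lighthouse at dusk"

for seed in (1, 2, 3, 4):
    image = model.generate_image(
        prompt=prompt,
        diff_infer_steps=50,
        seed=seed,
        stream=False,
    )
    image.save(f"variant_seed{seed}.png")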

Local Installation & CLI Usage

For developers who prefer command-line workflows:

1. Clone Repository

git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/

2. Set Up Environment

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

pip install -i https://mirrors.tencent.com/pypi/simple/ \
    --upgrade tencentcloud-sdk-python

pip install -r requirements.txt

3. Download Model (if not done via HF Hub)

# Using HuggingFace Hub
huggingface-cli download tencent/HunyuanImage-3.0 \
    --local-dir ./HunyuanImage-3

4. Configure Prompt Enhancement (Optional)

HunyuanImage-3.0 supports automatic prompt rewriting via DeepSeek API:

# Set environment variables
export DEEPSEEK_KEY_ID="your_deepseek_key_id"
export DEEPSEEK_KEY_SECRET="your_deepseek_key_secret"

Get your API keys from Tencent Cloud.

5. Run CLI Generation

python3 run_image_gen.py \
    --model-id ./HunyuanImage-3 \
    --prompt "A photorealistic portrait of a woman in a garden" \
    --diff-infer-steps 50 \
    --image-size auto \
    --save output.png \
    --verbose 1 \
    --sys-deepseek-prompt "universal"

CLI Arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --model-id | Path to model weights | (required) |
| --prompt | Text description | (required) |
| --attn-impl | Attention: sdpa or flash_attention_2 | sdpa |
| --moe-impl | MoE: eager or flashinfer | eager |
| --seed | Random seed | None |
| --diff-infer-steps | Diffusion steps | 50 |
| --image-size | Resolution | auto |
| --save | Output path | image.png |
| --verbose | Logging level (0-1) | 0 |
| --rewrite | Enable prompt rewriting | 1 |
| --sys-deepseek-prompt | Rewrite style: universal or text_rendering | universal |
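If you prefer to stay on the CLI but automate it, a thin Python wrapper around run_image_gen.py works. A sketch (prompts and output filenames are illustrative); note that each invocation reloads the model, so for large batches the batch processor shown later in this guide is much faster:

# cli_batch.py -- drive the CLI from Python for several prompts
import subprocess

prompts = [
    "A misty pine forest at sunrise",
    "A neon-lit ramen shop on a rainy street",
]

for idx, prompt in enumerate(prompts):
    subprocess.run(
        [
            "python3", "run_image_gen.py",
            "--model-id", "./HunyuanImage-3",
            "--prompt", prompt,
            "--diff-infer-steps", "50",
            "--save", f"cli_output_{idx:02d}.png",
        ],
        check=True,
    )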

Performance Optimization

FlashAttention Integration (3x Speed Boost)

FlashAttention dramatically accelerates attention computation:

# Install FlashAttention (requires CUDA 11.8+)
pip install flash-attn==2.8.3 --no-build-isolation

Usage:

# In Python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # Enable FlashAttention
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
# In CLI
python3 run_image_gen.py \
    --model-id ./HunyuanImage-3 \
    --attn-impl flash_attention_2 \
    --prompt "Your prompt"

Performance Impact:

  • ~3x faster inference
  • Lower memory usage (~20% reduction)
  • No quality loss
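If your code may run on machines without flash-attn installed, you can pick the implementation at runtime instead of hard-coding it. A small sketch, reusing the model_id from the quick-start example:

import importlib.util
import torch
from transformers import AutoModelForCausalLM

# Fall back to PyTorch SDPA when flash-attn is not installed
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation=attn_impl,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)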

FlashInfer for MoE Optimization

FlashInfer optimizes Mixture of Experts inference:

# Install FlashInfer (v0.3.1 tested)
pip install flashinfer-python

# Requires GCC 9+ for compilation

Usage:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    moe_impl="flashinfer",  # Enable FlashInfer for MoE
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

Important Notes:

  • ⏱️ First run is slow (~10 minutes for kernel compilation)
  • Subsequent runs are fast (kernels cached)
  • Best performance when combined with FlashAttention
  • ⚠️ CUDA version must match PyTorch CUDA version

Combined Optimization (Maximum Performance)

# Install both optimizations
pip install flash-attn==2.8.3 --no-build-isolation
pip install flashinfer-python

# Run with all optimizations
python3 run_image_gen.py \
    --model-id ./HunyuanImage-3 \
    --attn-impl flash_attention_2 \
    --moe-impl flashinfer \
    --prompt "A cyberpunk cityscape at night"

Benchmark Results (Single Image):

| Configuration | Time | VRAM | Quality |
| --- | --- | --- | --- |
| Baseline (sdpa + eager) | 45s | 72GB | 100% |
| FlashAttention | 28s | 58GB | 100% |
| FlashInfer | 38s | 68GB | 100% |
| Both Optimizations | 18s | 52GB | 100% |
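Numbers like these depend heavily on hardware, drivers, and prompt length, so measure on your own machine. A minimal timing sketch, assuming the model is already loaded with the options above:

import time

prompt = "A cyberpunk cityscape at night"

# Warm-up run so kernel compilation and caching do not skew the measurement
model.generate_image(prompt=prompt, diff_infer_steps=10, stream=False)

start = time.perf_counter()
image = model.generate_image(prompt=prompt, diff_infer_steps=50, stream=False)
elapsed = time.perf_counter() - start

print(f"Generation took {elapsed:.1f}s")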

Building Production Applications

API Server with FastAPI

Create a production-ready API endpoint:

# server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM
import torch
import base64
from io import BytesIO

app = FastAPI()

# Load model once at startup
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = AutoModelForCausalLM.from_pretrained(
        "./HunyuanImage-3",
        attn_implementation="flash_attention_2",
        moe_impl="flashinfer",
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model.load_tokenizer("./HunyuanImage-3")

class GenerationRequest(BaseModel):
    prompt: str
    steps: int = 50
    seed: int | None = None
    image_size: str = "auto"

class GenerationResponse(BaseModel):
    image_base64: str
    metadata: dict

@app.post("/generate", response_model=GenerationResponse)
async def generate_image(request: GenerationRequest):
    try:
        image = model.generate_image(
            prompt=request.prompt,
            diff_infer_steps=request.steps,
            seed=request.seed,
            image_size=request.image_size,
            stream=False,
        )

        # Convert to base64
        buffered = BytesIO()
        image.save(buffered, format="PNG")
        img_base64 = base64.b64encode(buffered.getvalue()).decode()

        return GenerationResponse(
            image_base64=img_base64,
            metadata={
                "prompt": request.prompt,
                "steps": request.steps,
                "seed": request.seed,
            }
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the server:

# Install FastAPI and Uvicorn
pip install fastapi uvicorn python-multipart

# Start server
python server.py

Test the API:

curl -X POST "http://localhost:8000/generate" \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "A serene mountain landscape at sunset",
        "steps": 50,
        "image_size": "1024x768"
    }'
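The same endpoint can be called from Python. A small client sketch that decodes the base64 payload back into a PNG:

# client.py -- call the FastAPI server defined above
import base64
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "A serene mountain landscape at sunset",
        "steps": 50,
        "image_size": "1024x768",
    },
    timeout=600,  # generation can take a while
)
resp.raise_for_status()

payload = resp.json()
with open("api_output.png", "wb") as f:
    f.write(base64.b64decode(payload["image_base64"]))

print("Saved api_output.png", payload["metadata"])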

Batch Processing Pipeline

Process multiple prompts efficiently:

# batch_processor.py
from transformers import AutoModelForCausalLM
import torch
from pathlib import Path
import json

class HunyuanBatchProcessor:
    def __init__(self, model_path: str):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            attn_implementation="flash_attention_2",
            moe_impl="flashinfer",
            trust_remote_code=True,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.model.load_tokenizer(model_path)

    def process_batch(
        self,
        prompts: list[str],
        output_dir: str,
        steps: int = 50,
    ):
        output_path = Path(output_dir)
        output_path.mkdir(exist_ok=True)

        results = []

        for idx, prompt in enumerate(prompts):
            print(f"Processing {idx+1}/{len(prompts)}: {prompt[:50]}...")

            image = self.model.generate_image(
                prompt=prompt,
                diff_infer_steps=steps,
                stream=False,
            )

            filename = f"output_{idx:04d}.png"
            filepath = output_path / filename
            image.save(filepath)

            results.append({
                "prompt": prompt,
                "output": str(filepath),
            })

        # Save metadata
        with open(output_path / "metadata.json", "w") as f:
            json.dump(results, f, indent=2)

        return results

# Usage
if __name__ == "__main__":
    processor = HunyuanBatchProcessor("./HunyuanImage-3")

    prompts = [
        "A serene lake at dawn",
        "A bustling city street at night",
        "A cozy library with antique books",
    ]

    processor.process_batch(
        prompts=prompts,
        output_dir="./batch_output",
        steps=50,
    )

Gradio Web Interface

Deploy an interactive web UI:

# Install Gradio
pip install "gradio>=4.21.0"

# Configure environment
export MODEL_ID="./HunyuanImage-3"
export GPUS="0,1,2,3"
export HOST="0.0.0.0"
export PORT="443"

# Launch with optimizations
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2

Custom Gradio App:

# app.py
import gradio as gr
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HunyuanImage-3",
    attn_implementation="flash_attention_2",
    moe_impl="flashinfer",
    trust_remote_code=True,
    device_map="auto",
)
model.load_tokenizer("./HunyuanImage-3")

def generate(prompt, steps, seed):
    image = model.generate_image(
        prompt=prompt,
        diff_infer_steps=steps,
        seed=seed if seed != 0 else None,
        stream=True,
    )
    return image

# Create interface
demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt", lines=5),
        gr.Slider(20, 100, value=50, label="Steps"),
        gr.Number(value=0, label="Seed (0 for random)"),
    ],
    outputs=gr.Image(type="pil", label="Generated Image"),
    title="HunyuanImage-3.0 Generator",
    description="Generate stunning images with the world's largest open-source text-to-image model",
)

demo.launch(server_name="0.0.0.0", server_port=7860, share=True)

Deployment Architectures

1. Single-Server Setup (Small Scale)

Hardware:

  • 1x Server with 4x A100 80GB GPUs
  • 256GB RAM
  • 2TB NVMe SSD

Software Stack:

Nginx (Load Balancer) → FastAPI (API Server) → HunyuanImage-3.0
                      → Redis (Queue)

Capacity: ~100-200 images/hour
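The Redis queue in this stack simply decouples request intake from GPU work: the API server enqueues jobs and a GPU-bound worker drains them. A minimal sketch of the worker side, assuming jobs are pushed as JSON onto a list named "hunyuan:jobs" (the queue name and payload fields are illustrative):

# worker.py -- minimal Redis-backed generation worker (illustrative queue/payload names)
import json
import redis
import torch
from transformers import AutoModelForCausalLM

r = redis.Redis(host="localhost", port=6379)

# Load the model once per worker process
model = AutoModelForCausalLM.from_pretrained(
    "./HunyuanImage-3",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.load_tokenizer("./HunyuanImage-3")

while True:
    # BLPOP blocks until a job arrives on the queue
    _, raw = r.blpop("hunyuan:jobs")
    job = json.loads(raw)

    image = model.generate_image(
        prompt=job["prompt"],
        diff_infer_steps=job.get("steps", 50),
        stream=False,
    )
    image.save(job["output_path"])
    r.lpush("hunyuan:done", json.dumps({"output": job["output_path"]}))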

2. Multi-Server Cluster (Medium Scale)

Hardware:

  • 3x Inference Servers (4x A100 80GB each)
  • 1x Coordinator Server (8GB RAM)

Software Stack:

Nginx → Coordinator (FastAPI) → RabbitMQ → Worker Nodes (HunyuanImage-3.0)
     → PostgreSQL (Metadata)
     → S3 (Image Storage)

Capacity: ~500-1000 images/hour

3. Cloud-Native Kubernetes (Large Scale)

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hunyuan-inference
spec:
  replicas: 5
  selector:
    matchLabels:
      app: hunyuan
  template:
    metadata:
      labels:
        app: hunyuan
    spec:
      containers:
      - name: hunyuan
        image: your-registry/hunyuan:latest
        resources:
          limits:
            nvidia.com/gpu: 4
        env:
        - name: MODEL_PATH
          value: "/models/HunyuanImage-3"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc

Capacity: 1000+ images/hour (scales horizontally)

Troubleshooting Common Issues

Issue: CUDA Out of Memory

# Solution 1: Enable gradient checkpointing
# (only relevant when fine-tuning; it does not reduce memory during inference)
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size
# (If processing multiple prompts)

# Solution 3: Use 8-bit quantization
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_8bit=True)

Issue: Slow First Inference

Cause: FlashInfer kernel compilation

Solution: This is normal. First run takes ~10 minutes, subsequent runs are fast. Consider:

# Warm-up run during initialization
model.generate_image(prompt="test", diff_infer_steps=10)

Issue: Model Download Fails

# Use mirror (China users)
export HF_ENDPOINT=https://hf-mirror.com

# Or download individual weight shards with wget/aria2
# (the weights are sharded; copy the exact shard filenames from the model page)
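If the CLI download keeps failing, the huggingface_hub Python API can also fetch the full snapshot and skips files that already finished, which makes restarts cheap. A minimal sketch:

# Resume-friendly download via the huggingface_hub Python API
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/HunyuanImage-3.0",
    local_dir="./HunyuanImage-3",
)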

Issue: CUDA Version Mismatch

Error: RuntimeError: CUDA version mismatch

Solution:

# Check versions
python -c "import torch; print(torch.version.cuda)"
nvidia-smi

# Reinstall matching PyTorch
pip uninstall torch
pip install torch==2.7.1+cu128 --index-url https://download.pytorch.org/whl/cu128

Get Started Today

Whether you're building a SaaS product, research tool, or creative application, HunyuanImage-3.0 provides enterprise-grade image generation with complete control.

For Developers Who Want Full Control:

  1. Clone repository: git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
  2. Follow this guide for optimal setup
  3. Join developer community: Discord

For Developers Who Want Quick Integration:

Visit Yuanic.com/api for:

  • Instant API access - No infrastructure setup
  • 📚 Complete documentation - SDKs in Python, JavaScript, Go
  • 🔌 Easy integration - REST API with OpenAPI spec
  • 💰 Pay-as-you-go pricing - No upfront costs
  • 🛡️ Enterprise SLA - 99.9% uptime guarantee

HunyuanImage-3.0's open-source nature gives developers unprecedented freedom to build, customize, and deploy AI image generation at any scale. From hobby projects to enterprise applications, the tools are in your hands.
