
HunyuanImage-3.0 Developer Integration Guide: Transformers, API & Deployment
Complete technical guide for developers: integrate HunyuanImage-3.0 using Transformers, deploy locally, optimize performance with FlashAttention & FlashInfer, and build production applications.
As a developer, you want more than just using HunyuanImage-3.0 through a web interface—you want to integrate it into your applications, deploy it on your infrastructure, and build production-ready solutions.
This comprehensive guide covers everything from quick Transformers integration to advanced deployment optimization, helping you harness the full power of the world's largest open-source text-to-image model.
Quick Start: Transformers Integration (5 Minutes)
Prerequisites
- Python: 3.12+ (tested and recommended)
- PyTorch: 2.7.1 with CUDA 12.8
- GPU: NVIDIA with at least 24GB VRAM (80GB recommended for production)
- Storage: 170GB for model weights
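Before downloading 170GB of weights, it is worth confirming the environment meets these requirements. A minimal verification sketch (assuming PyTorch is already installed) could look like this:
import torch

# Report PyTorch/CUDA versions and per-GPU memory
print(f"PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}")
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB VRAM")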
Installation
# Step 1: Install PyTorch with CUDA 12.8
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
--index-url https://download.pytorch.org/whl/cu128
# Step 2: Install Tencent Cloud SDK (used by the optional DeepSeek prompt-enhancement step)
pip install -i https://mirrors.tencent.com/pypi/simple/ \
--upgrade tencentcloud-sdk-python
# Step 3: Install Transformers and dependencies
pip install transformers accelerate sentencepiece protobuf
Download Model Weights
# Using Hugging Face Hub (recommended)
pip install huggingface-hub
# Download model (170GB - this will take time)
huggingface-cli download tencent/HunyuanImage-3.0 \
--local-dir ./HunyuanImage-3
Important: The directory name should NOT contain dots, as this can cause loading issues with Transformers.
Basic Usage with Transformers
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3"

# Configuration for standard inference
kwargs = dict(
    attn_implementation="sdpa",  # Use "flash_attention_2" if installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",  # Use "flashinfer" if installed
)

# Initialize model
model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Generate image
prompt = (
    "A brown and white dog running on the grass, "
    "photorealistic, professional photography"
)
image = model.generate_image(prompt=prompt, stream=True)
image.save("output.png")
print("Image saved to output.png")
Expected output:
- First run: ~10 minutes (kernel compilation if FlashInfer is used)
- Subsequent runs: 15-30 seconds per image
- Quality: Production-ready, 1024x1024 resolution
Advanced Configuration Options
Memory Optimization
For limited VRAM environments:
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization (experimental)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
Trade-offs:
- ✅ Reduces VRAM usage by ~75%
- ⚠️ Slight quality degradation (~5-10%)
- ⚠️ Slower inference (~2x)
Multi-GPU Deployment
For distributed inference across multiple GPUs:
# Automatic distribution
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # Automatically distributes across available GPUs
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
# Manual device mapping (advanced): device_map keys must name individual
# modules, so build the per-layer assignment programmatically
device_map = {
    "model.embed_tokens": 0,
    "model.norm": 3,
    "lm_head": 3,
}
for i in range(64):  # layers 0-15 -> GPU 0, 16-31 -> GPU 1, ...
    device_map[f"model.layers.{i}"] = i // 16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
Recommended Setup:
- 4x 80GB GPUs: Recommended for high-throughput production
- 3x 80GB GPUs: Solid production baseline
- 2x 80GB GPUs: Minimum, with reduced batch size
A per-GPU memory budget can also be set alongside device_map="auto"; see the sketch below.
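Transformers accepts a max_memory mapping for this, which caps how much of each device the automatic device map may use (for example, to leave headroom for activations). The limits below are illustrative placeholders, not required values:
# Cap how much of each GPU the automatic device map may use (values are illustrative)
max_memory = {0: "75GiB", 1: "75GiB", 2: "75GiB", 3: "75GiB", "cpu": "120GiB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)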
Generation Parameters
Control output quality and style:
image = model.generate_image(
    prompt="Your detailed prompt here",
    # Quality settings
    diff_infer_steps=50,  # 50 (default), 100 (high quality)
    # Resolution
    image_size="auto",    # "auto", "1280x768", "16:9", etc.
    # Randomization
    seed=42,              # None (random) or integer for reproducibility
    # Output
    stream=True,          # Show progress (True) or return final image only (False)
)
Parameter Guide:
Parameter | Values | Impact | Recommendation |
---|---|---|---|
diff_infer_steps | 20-100 | Quality vs. speed | 50 (balanced), 100 (max quality) |
image_size | "auto", WxH, ratio | Output resolution | "auto" for smart sizing |
seed | None, integer | Reproducibility | None for variety, fixed for consistency |
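As an example of the seed parameter in practice, fixing it makes a prompt reproducible across runs, which is handy when A/B-testing prompt wording. The sketch below reuses the generate_image call from earlier; on identical hardware and settings, both outputs should match:
# Same prompt + same seed should reproduce the same image
prompt = "A lighthouse on a cliff at dusk, dramatic clouds"
image_a = model.generate_image(prompt=prompt, seed=1234, diff_infer_steps=50, stream=False)
image_b = model.generate_image(prompt=prompt, seed=1234, diff_infer_steps=50, stream=False)
image_a.save("run_a.png")
image_b.save("run_b.png")  # expected to match run_a.png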
Local Installation & CLI Usage
For developers who prefer command-line workflows:
1. Clone Repository
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
2. Set Up Environment
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
--index-url https://download.pytorch.org/whl/cu128
pip install -i https://mirrors.tencent.com/pypi/simple/ \
--upgrade tencentcloud-sdk-python
pip install -r requirements.txt
3. Download Model (if not done via HF Hub)
# Using HuggingFace Hub
huggingface-cli download tencent/HunyuanImage-3.0 \
--local-dir ./HunyuanImage-3
4. Configure Prompt Enhancement (Optional)
HunyuanImage-3.0 supports automatic prompt rewriting via DeepSeek API:
# Set environment variables
export DEEPSEEK_KEY_ID="your_deepseek_key_id"
export DEEPSEEK_KEY_SECRET="your_deepseek_key_secret"
Get your API keys from Tencent Cloud.
5. Run CLI Generation
python3 run_image_gen.py \
--model-id ./HunyuanImage-3 \
--prompt "A photorealistic portrait of a woman in a garden" \
--diff-infer-steps 50 \
--image-size auto \
--save output.png \
--verbose 1 \
--sys-deepseek-prompt "universal"
CLI Arguments:
Argument | Description | Default |
---|---|---|
--model-id | Path to model weights | (required) |
--prompt | Text description | (required) |
--attn-impl | Attention: sdpa or flash_attention_2 | sdpa |
--moe-impl | MoE: eager or flashinfer | eager |
--seed | Random seed | None |
--diff-infer-steps | Diffusion steps | 50 |
--image-size | Resolution | auto |
--save | Output path | image.png |
--verbose | Logging level (0-1) | 0 |
--rewrite | Enable prompt rewriting | 1 |
--sys-deepseek-prompt | Rewrite style: universal or text_rendering | universal |
Performance Optimization
FlashAttention Integration (3x Speed Boost)
FlashAttention dramatically accelerates attention computation:
# Install FlashAttention (requires CUDA 11.8+)
pip install flash-attn==2.8.3 --no-build-isolation
Usage:
# In Python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # Enable FlashAttention
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
# In CLI
python3 run_image_gen.py \
--model-id ./HunyuanImage-3 \
--attn-impl flash_attention_2 \
--prompt "Your prompt"
Performance Impact:
- ✅ ~3x faster inference
- ✅ Lower memory usage (~20% reduction)
- ✅ No quality loss
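Since attn_implementation="flash_attention_2" raises an error at load time if the package is missing, a small guard (an optional convenience, not part of the official API) lets your code fall back to SDPA automatically:
# Fall back to SDPA when flash-attn is not installed
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation=attn_impl,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)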
FlashInfer for MoE Optimization
FlashInfer optimizes Mixture of Experts inference:
# Install FlashInfer (v0.3.1 tested)
pip install flashinfer-python
# Requires GCC 9+ for compilation
Usage:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    moe_impl="flashinfer",  # Enable FlashInfer for MoE
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
Important Notes:
- ⏱️ First run is slow (~10 minutes for kernel compilation)
- ⚡ Subsequent runs are fast (kernels cached)
- ✅ Best performance when combined with FlashAttention
- ⚠️ CUDA version must match PyTorch CUDA version
Combined Optimization (Maximum Performance)
# Install both optimizations
pip install flash-attn==2.8.3 --no-build-isolation
pip install flashinfer-python
# Run with all optimizations
python3 run_image_gen.py \
--model-id ./HunyuanImage-3 \
--attn-impl flash_attention_2 \
--moe-impl flashinfer \
--prompt "A cyberpunk cityscape at night"
Benchmark Results (Single Image):
Configuration | Time | VRAM | Quality |
---|---|---|---|
Baseline (sdpa + eager) | 45s | 72GB | 100% |
FlashAttention | 28s | 58GB | 100% |
FlashInfer | 38s | 68GB | 100% |
Both Optimizations | 18s | 52GB | 100% |
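These numbers will vary with hardware, drivers, and prompts; a simple harness like the one below (prompt and step counts are illustrative) lets you measure your own setup after a warm-up run:
import time
import torch

prompt = "A cyberpunk cityscape at night"
model.generate_image(prompt=prompt, diff_infer_steps=10, stream=False)  # warm-up (kernel compilation/caching)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
image = model.generate_image(prompt=prompt, diff_infer_steps=50, stream=False)
elapsed = time.perf_counter() - start
print(f"Generation took {elapsed:.1f}s")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")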
Building Production Applications
API Server with FastAPI
Create a production-ready API endpoint:
# server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM
import torch
import base64
from io import BytesIO
app = FastAPI()
# Load model once at startup
model = None
@app.on_event("startup")
async def load_model():
    global model
    model = AutoModelForCausalLM.from_pretrained(
        "./HunyuanImage-3",
        attn_implementation="flash_attention_2",
        moe_impl="flashinfer",
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model.load_tokenizer("./HunyuanImage-3")

class GenerationRequest(BaseModel):
    prompt: str
    steps: int = 50
    seed: int | None = None
    image_size: str = "auto"

class GenerationResponse(BaseModel):
    image_base64: str
    metadata: dict

@app.post("/generate", response_model=GenerationResponse)
async def generate_image(request: GenerationRequest):
    try:
        image = model.generate_image(
            prompt=request.prompt,
            diff_infer_steps=request.steps,
            seed=request.seed,
            image_size=request.image_size,
            stream=False,
        )
        # Convert to base64
        buffered = BytesIO()
        image.save(buffered, format="PNG")
        img_base64 = base64.b64encode(buffered.getvalue()).decode()
        return GenerationResponse(
            image_base64=img_base64,
            metadata={
                "prompt": request.prompt,
                "steps": request.steps,
                "seed": request.seed,
            },
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Run the server:
# Install FastAPI and Uvicorn
pip install fastapi uvicorn python-multipart
# Start server
python server.py
Test the API:
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"prompt": "A serene mountain landscape at sunset",
"steps": 50,
"image_size": "1024x768"
}'
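From application code, the same endpoint can be called with any HTTP client. Here is a minimal Python client (field names follow the Pydantic models above) that decodes the base64 payload back into a PNG file:
import base64
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "A serene mountain landscape at sunset", "steps": 50},
    timeout=600,  # generation can take tens of seconds
)
resp.raise_for_status()
payload = resp.json()
with open("api_output.png", "wb") as f:
    f.write(base64.b64decode(payload["image_base64"]))
print(payload["metadata"])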
Batch Processing Pipeline
Process multiple prompts efficiently:
# batch_processor.py
from transformers import AutoModelForCausalLM
import torch
from pathlib import Path
import json
class HunyuanBatchProcessor:
    def __init__(self, model_path: str):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            attn_implementation="flash_attention_2",
            moe_impl="flashinfer",
            trust_remote_code=True,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.model.load_tokenizer(model_path)

    def process_batch(
        self,
        prompts: list[str],
        output_dir: str,
        steps: int = 50,
    ):
        output_path = Path(output_dir)
        output_path.mkdir(exist_ok=True)
        results = []
        for idx, prompt in enumerate(prompts):
            print(f"Processing {idx+1}/{len(prompts)}: {prompt[:50]}...")
            image = self.model.generate_image(
                prompt=prompt,
                diff_infer_steps=steps,
                stream=False,
            )
            filename = f"output_{idx:04d}.png"
            filepath = output_path / filename
            image.save(filepath)
            results.append({
                "prompt": prompt,
                "output": str(filepath),
            })
        # Save metadata
        with open(output_path / "metadata.json", "w") as f:
            json.dump(results, f, indent=2)
        return results

# Usage
if __name__ == "__main__":
    processor = HunyuanBatchProcessor("./HunyuanImage-3")
    prompts = [
        "A serene lake at dawn",
        "A bustling city street at night",
        "A cozy library with antique books",
    ]
    processor.process_batch(
        prompts=prompts,
        output_dir="./batch_output",
        steps=50,
    )
Gradio Web Interface
Deploy an interactive web UI:
# Install Gradio
pip install "gradio>=4.21.0"
# Configure environment
export MODEL_ID="./HunyuanImage-3"
export GPUS="0,1,2,3"
export HOST="0.0.0.0"
export PORT="443"
# Launch with optimizations
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
Custom Gradio App:
# app.py
import gradio as gr
from transformers import AutoModelForCausalLM
# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HunyuanImage-3",
    attn_implementation="flash_attention_2",
    moe_impl="flashinfer",
    trust_remote_code=True,
    device_map="auto",
)
model.load_tokenizer("./HunyuanImage-3")

def generate(prompt, steps, seed):
    image = model.generate_image(
        prompt=prompt,
        diff_infer_steps=int(steps),            # Gradio sliders/numbers return floats
        seed=int(seed) if seed != 0 else None,  # 0 means "random"
        stream=True,
    )
    return image

# Create interface
demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt", lines=5),
        gr.Slider(20, 100, value=50, step=1, label="Steps"),
        gr.Number(value=0, label="Seed (0 for random)"),
    ],
    outputs=gr.Image(type="pil", label="Generated Image"),
    title="HunyuanImage-3.0 Generator",
    description="Generate stunning images with the world's largest open-source text-to-image model",
)

demo.launch(server_name="0.0.0.0", server_port=7860, share=True)
Deployment Architectures
1. Single-Server Setup (Small Scale)
Hardware:
- 1x Server with 4x A100 80GB GPUs
- 256GB RAM
- 2TB NVMe SSD
Software Stack:
Nginx (Load Balancer) → FastAPI (API Server) → HunyuanImage-3.0
→ Redis (Queue)
Capacity: ~100-200 images/hour
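How the Redis queue plugs in is left open above; one minimal pattern (queue name and job schema are assumptions for illustration) is for the API layer to push jobs onto a list while a worker process that holds the model pops and executes them:
# worker.py - minimal Redis-backed worker sketch; `model` is assumed to be
# loaded as in the FastAPI example above
import json
import redis

r = redis.Redis(host="localhost", port=6379)

while True:
    _, raw = r.brpop("hunyuan:jobs")  # blocks until a job is enqueued
    job = json.loads(raw)
    image = model.generate_image(prompt=job["prompt"], stream=False)
    image.save(job["output_path"])
    r.lpush("hunyuan:done", json.dumps({"output": job["output_path"]}))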
2. Multi-Server Cluster (Medium Scale)
Hardware:
- 3x Inference Servers (4x A100 80GB each)
- 1x Coordinator Server (8GB RAM)
Software Stack:
Nginx → Coordinator (FastAPI) → RabbitMQ → Worker Nodes (HunyuanImage-3.0)
→ PostgreSQL (Metadata)
→ S3 (Image Storage)
Capacity: ~500-1000 images/hour
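In this layout each worker node consumes jobs from RabbitMQ; the sketch below uses pika with an assumed queue name and job schema, acknowledging messages only after the image is saved so a crashed worker's job gets redelivered:
# rabbit_worker.py - illustrative RabbitMQ consumer; `model` is assumed loaded at startup
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="coordinator-host"))
channel = connection.channel()
channel.queue_declare(queue="hunyuan_jobs", durable=True)

def on_message(ch, method, properties, body):
    job = json.loads(body)
    image = model.generate_image(prompt=job["prompt"], stream=False)
    image.save(job["output_path"])
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)  # one in-flight job per worker
channel.basic_consume(queue="hunyuan_jobs", on_message_callback=on_message)
channel.start_consuming()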
3. Cloud-Native Kubernetes (Large Scale)
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hunyuan-inference
spec:
  replicas: 5
  selector:
    matchLabels:
      app: hunyuan
  template:
    metadata:
      labels:
        app: hunyuan
    spec:
      containers:
        - name: hunyuan
          image: your-registry/hunyuan:latest
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: MODEL_PATH
              value: "/models/HunyuanImage-3"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
Capacity: 1000+ images/hour (scales horizontally)
Troubleshooting Common Issues
Issue: CUDA Out of Memory
# Solution 1: Enable gradient checkpointing (only helps if you are fine-tuning;
# it does not reduce memory for pure inference)
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size (if processing multiple prompts)

# Solution 3: Use 8-bit quantization
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_8bit=True)
Issue: Slow First Inference
Cause: FlashInfer kernel compilation
Solution: This is normal. First run takes ~10 minutes, subsequent runs are fast. Consider:
# Warm-up run during initialization
model.generate_image(prompt="test", diff_infer_steps=10)
Issue: Model Download Fails
# Use mirror (China users)
export HF_ENDPOINT=https://hf-mirror.com

# Or simply re-run the download; huggingface-cli resumes partially downloaded
# shards (the weights are split across multiple .safetensors files)
huggingface-cli download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3

# Or fetch individual shard files with wget/aria2 from
# https://huggingface.co/tencent/HunyuanImage-3.0/tree/main
Issue: CUDA Version Mismatch
Error: RuntimeError: CUDA version mismatch
Solution:
# Check versions
python -c "import torch; print(torch.version.cuda)"
nvidia-smi
# Reinstall matching PyTorch
pip uninstall torch
pip install torch==2.7.1+cu128 --index-url https://download.pytorch.org/whl/cu128
Get Started Today
Whether you're building a SaaS product, research tool, or creative application, HunyuanImage-3.0 provides enterprise-grade image generation with complete control.
For Developers Who Want Full Control:
- Clone repository:
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
- Follow this guide for optimal setup
- Join developer community: Discord
For Developers Who Want Quick Integration:
Visit Yuanic.com/api for:
- ⚡ Instant API access - No infrastructure setup
- 📚 Complete documentation - SDKs in Python, JavaScript, Go
- 🔌 Easy integration - REST API with OpenAPI spec
- 💰 Pay-as-you-go pricing - No upfront costs
- 🛡️ Enterprise SLA - 99.9% uptime guarantee
HunyuanImage-3.0's open-source nature gives developers unprecedented freedom to build, customize, and deploy AI image generation at any scale. From hobby projects to enterprise applications, the tools are in your hands.
Technical Resources:
- GitHub repository: https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
- Model weights: https://huggingface.co/tencent/HunyuanImage-3.0