LTX-2 + ComfyUI: Run 4K AI Video Generation Locally (Complete Guide)

· Chris Sherman

Why Generate AI Videos Locally?

Every major AI video tool—Sora, Veo 3, Runway Gen-4.5—runs in the cloud. You upload a prompt, wait in a queue, pay per second, and hope you don't hit a content filter. But as of January 2026, there's a genuine alternative: LTX-2, the first open-source model that generates 4K video with synchronized audio, running entirely on your own GPU.

Released at CES 2026 by Lightricks and optimized by NVIDIA, LTX-2 represents a fundamental shift. You own the model. There are no per-generation fees. No content restrictions. No internet required. And with the right hardware, you can generate a 720p video clip in about 25 seconds.

In this guide, we'll walk through everything you need to get LTX-2 running locally with ComfyUI—from hardware requirements to prompting techniques to the 4K upscaling pipeline.

What Is LTX-2?

LTX-2 is a 19-billion-parameter audio-video foundation model created by Lightricks, built on a DiT (Diffusion Transformer) backbone. It generates video and audio simultaneously through an asymmetric dual-stream transformer architecture—meaning dialogue, sound effects, background music, and motion are all produced together in a single pass.
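To make "asymmetric dual-stream" concrete, here is a conceptual PyTorch sketch of one such block: two token streams (video and audio) with their own widths and weights, coupled by cross-attention so sound stays aligned with motion. This illustrates the general pattern only; the layer choices and dimensions are invented, and Lightricks' actual block (norms, MLPs, timestep conditioning) is more involved.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Conceptual sketch of a dual-stream transformer block.

    Video and audio tokens each get their own self-attention
    ('asymmetric' -- different widths per stream), and cross-attention
    lets each stream attend to the other so they stay synchronized.
    Not Lightricks' implementation; dimensions are made up.
    """

    def __init__(self, d_video=1024, d_audio=512, heads=8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(d_video, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(d_audio, heads, batch_first=True)
        # Cross-attention: queries from one stream, keys/values from the other
        self.v_from_a = nn.MultiheadAttention(d_video, heads, kdim=d_audio,
                                              vdim=d_audio, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(d_audio, heads, kdim=d_video,
                                              vdim=d_video, batch_first=True)

    def forward(self, video, audio):
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        video = video + self.v_from_a(video, audio, audio)[0]
        audio = audio + self.a_from_v(audio, video, video)[0]
        return video, audio

# Toy shapes: 128 video tokens (1024-dim), 64 audio tokens (512-dim)
v, a = torch.randn(1, 128, 1024), torch.randn(1, 64, 512)
v, a = DualStreamBlock()(v, a)
print(v.shape, a.shape)  # torch.Size([1, 128, 1024]) torch.Size([1, 64, 512])
```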

Key Specifications

| Specification | LTX-2 |
| --- | --- |
| Parameters | 19 billion |
| Max Resolution | 4K (with RTX Video upscaling) |
| Max Frame Rate | 50 FPS |
| Max Duration | 20 seconds |
| Audio | Native (dialogue, SFX, music) |
| Inputs | Text, image, audio, depth maps, reference video |
| License | Open weights (Hugging Face) |
| Architecture | Asymmetric dual-stream DiT |

Why LTX-2 Is Different

Earlier open-source video models like Stable Video Diffusion generated silent clips a few seconds long. LTX-2 changes the game in three ways:

  1. Audio-video synchronization: Unlike Sora 2 and Runway Gen-4.5 (which generate silent video), LTX-2 produces synchronized audio from the start—matching Google Veo 3's biggest advantage
  2. Multi-keyframe support: You can specify keyframes to control the narrative arc across a clip
  3. Control LoRAs: Advanced conditioning lets you guide generation with depth maps, reference images, and motion cues

Hardware Requirements: What GPU Do You Need?

LTX-2's full-precision model demands 32GB+ VRAM. But thanks to NVIDIA's NVFP4/NVFP8 optimizations and community GGUF quantizations, you can run it on a much wider range of hardware.

| GPU | VRAM | Recommended Settings | Generation Time |
| --- | --- | --- | --- |
| RTX 5090 | 32 GB | 720p24, 4s clip, NVFP4 | ~25 seconds |
| RTX 4090 | 24 GB | 720p24, 4s clip, NVFP8 | ~45 seconds |
| RTX 4080 / 3090 | 16–24 GB | 540p24, 4s clip, GGUF Q4 | ~90 seconds |
| RTX 4070 / 3060 | 12 GB | 540p24, 4s clip, GGUF Q4_K_M | ~3 minutes |
| RTX 4060 / other 8 GB cards | 8 GB | 540p24, 4s clip, heavy quantization | ~5 minutes |

Performance tip: NVIDIA's NVFP4 format delivers 3x faster generation with 60% less VRAM compared to full precision. NVFP8 offers 2x speed with 40% VRAM reduction.
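The quantization savings follow directly from bytes per parameter. A rough back-of-the-envelope estimate for the 19B weights alone (activations, latents, and the VAE need memory on top of this, and NVFP formats carry small scale-factor overheads that this ignores):

```python
# Rough VRAM estimate for LTX-2's 19B weights at different precisions.
# Weights only -- activations, latents, and the VAE need additional memory.
PARAMS = 19e9

bytes_per_param = {
    "FP16 (full precision)": 2.0,
    "NVFP8": 1.0,
    "NVFP4": 0.5,
}

for fmt, bpp in bytes_per_param.items():
    gb = PARAMS * bpp / 1024**3
    print(f"{fmt:>22}: ~{gb:.1f} GB for weights")

# FP16 (full precision): ~35.4 GB  -> matches the 32GB+ requirement
#                 NVFP8: ~17.7 GB
#                 NVFP4:  ~8.8 GB
```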

Software Requirements

  • Python 3.12 or higher
  • CUDA 12.7 or higher
  • PyTorch 2.7+
  • ComfyUI (latest version from comfy.org)
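Unsure what your machine has? A quick check from Python, assuming PyTorch is already installed:

```python
import sys
import torch

# Interpreter and framework versions
print(f"Python : {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")

# CUDA availability and the GPU ComfyUI will see
if torch.cuda.is_available():
    print(f"CUDA   : {torch.version.cuda}")
    print(f"GPU    : {torch.cuda.get_device_name(0)}")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM   : {vram_gb:.1f} GB")
else:
    print("No CUDA device detected -- check your driver install.")
```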

Complete Setup Guide: LTX-2 + ComfyUI

Step 1: Install ComfyUI

  1. Visit comfy.org and download the latest installer for Windows
  2. Run the installer—it handles Python, CUDA, and dependency setup automatically
  3. Launch ComfyUI and verify it detects your GPU

Step 2: Download LTX-2

  1. In ComfyUI, open the Template Browser
  2. Navigate to the Video section
  3. Find LTX-2 and select your preferred variant:
    • NVFP4 — Best for RTX 50 Series (32 GB)
    • NVFP8 — Best for RTX 40 Series (24 GB)
    • GGUF Q4_K_M — Best for 8–16 GB GPUs
  4. ComfyUI will download the model weights automatically

Step 3: Generate Your First Video

  1. Load the LTX-2 workflow template
  2. Enter your prompt in the text node
  3. Set your preferred resolution and duration:
    • 24 GB+ VRAM: 720p, 24fps, 4 seconds, 20 steps
    • 8–16 GB VRAM: 540p, 24fps, 4 seconds, 20 steps
  4. Click Queue Prompt and wait for generation
  5. Preview the output—video and audio will play together
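If you'd rather queue generations from a script than click through the UI, ComfyUI exposes a small HTTP API on the local server (default port 8188). A minimal sketch, assuming you've exported your workflow with "Save (API Format)"; the node ID shown is hypothetical and must match your own export:

```python
import json
import urllib.request

# Load a workflow exported from ComfyUI via "Save (API Format)"
with open("ltx2_workflow_api.json") as f:
    workflow = json.load(f)

# (Hypothetical node ID "6" -- find the text node's ID in your own export.)
# workflow["6"]["inputs"]["text"] = "A woman in a red dress walks..."

# Queue the job on the local ComfyUI server (default port 8188)
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # response includes a prompt_id for tracking
```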

Step 4: Upscale to 4K

LTX-2 generates at 720p natively. To reach 4K, use the RTX Video Super Resolution node in ComfyUI:

  1. Connect the LTX-2 output to the RTX Video upscaler node
  2. Set target resolution to 4K (3840×2160)
  3. The upscaler runs in real time, sharpening edges and cleaning compression artifacts

The result: 4K video with audio, generated entirely on your local machine, with no cloud dependency.

Prompting Guide: Get Better Results

Basic Prompt Structure

LTX-2 responds best to structured, descriptive prompts. Think like a film director, not an image creator:

"A woman in a red dress walks through a rainy Tokyo street at night. Neon signs reflect off wet pavement. She opens an umbrella as thunder rumbles in the distance. Camera follows from behind at medium distance."

What Works Well

  • Action sequences: Describe what happens over time, not a static image
  • Audio cues: Include sound descriptions ("thunder rumbles," "jazz music plays," "crowd cheering")
  • Camera movement: Standard cinematography terms work (tracking, pan, dolly, close-up)
  • Environmental details: Lighting, weather, time of day
  • Emotional tone: "Tense," "joyful," "melancholic"

What to Avoid

  • Text and logos: AI video models still struggle to render readable text
  • Complex physics: Multi-object collisions, detailed finger movements
  • Overloaded scenes: Keep the focus on 1–2 subjects per clip
  • Static descriptions: Don't describe a photo—describe a scene unfolding

Advanced: Using Reference Images

LTX-2 supports image conditioning. Upload a reference image as the starting frame, then describe the motion that follows. This is especially useful for:

  • Product shots that need to match existing brand imagery
  • Character consistency across multiple clips
  • Animating still photographs into dynamic video
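Whatever the source, the reference image should match your generation resolution and aspect ratio. A minimal preprocessing sketch with Pillow, assuming a 720p target; file names are placeholders:

```python
from PIL import Image

TARGET = (1280, 720)  # assumes a 720p generation; adjust to your settings

img = Image.open("reference.png").convert("RGB")

# Scale to cover the target, then center-crop to the exact aspect ratio
scale = max(TARGET[0] / img.width, TARGET[1] / img.height)
resized = img.resize((round(img.width * scale), round(img.height * scale)))
left = (resized.width - TARGET[0]) // 2
top = (resized.height - TARGET[1]) // 2
resized.crop((left, top, left + TARGET[0], top + TARGET[1])).save("reference_720p.png")
```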

Local vs. Cloud: When Each Makes Sense

| Factor | Local (LTX-2) | Cloud (Sora / Veo / Runway) |
| --- | --- | --- |
| Cost | Free after GPU purchase | $12–$360/month |
| Privacy | Data never leaves your machine | Uploaded to third-party servers |
| Content restrictions | None | Platform-specific filters |
| Internet required | No (after model download) | Yes |
| Video quality (top-tier) | Good (approaching cloud) | Best (Gen-4.5, Veo 3) |
| Audio | Native (built-in) | Veo 3 only; others silent |
| Ease of use | Requires setup | Browser-based, instant |
| Generation speed | 25s–5min (depends on GPU) | 30s–2min typically |
| Full creative pipeline | Manual assembly | Genra: end-to-end (script → video) |
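The cost row is easy to make concrete. A rough break-even estimate, using the subscription range from the table and an illustrative (assumed, not quoted) GPU price:

```python
# Break-even: months of cloud subscription vs. a one-time GPU purchase.
# The GPU price is an illustrative assumption, not a quoted market price.
gpu_cost = 1600            # assumed one-time cost of a 24 GB card (USD)
cloud_monthly = (12, 360)  # subscription range from the comparison table

for monthly in cloud_monthly:
    months = gpu_cost / monthly
    print(f"At ${monthly}/mo, the GPU pays for itself in ~{months:.0f} months")

# At $12/mo, the GPU pays for itself in ~133 months
# At $360/mo, the GPU pays for itself in ~4 months
```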

When to Use Local Generation

  • Privacy-sensitive content: Medical, legal, or proprietary material
  • High volume: Hundreds of clips with no per-generation cost
  • Creative freedom: Content that cloud filters might block
  • Offline workflows: Travel, remote locations, air-gapped systems
  • Learning and experimentation: Unlimited iterations without cost anxiety

When to Use Cloud Tools

  • Highest quality needed: Runway Gen-4.5 and Veo 3 still lead on visual fidelity
  • No powerful GPU: Cloud tools work on any device with a browser
  • End-to-end workflow: Genra handles scripting, scene creation, music, and editing in one platform
  • Team collaboration: Shared projects, approvals, and version control

Advanced Workflows

Blender → LTX-2: 3D Scene Guidance

NVIDIA outlined a pipeline where Blender 3D scenes serve as structural guides for LTX-2 generation. You create a rough 3D layout, export depth maps, and use them as conditioning inputs. This gives you precise control over camera angles, object placement, and spatial composition—something pure text prompting can't achieve.
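A common glue step in this pipeline is converting Blender's raw Z-pass render into the normalized grayscale image a depth-conditioning input expects. A minimal sketch with NumPy and OpenCV; file names are placeholders, and check which depth polarity (near = bright or near = dark) your conditioning setup expects:

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # needed in some OpenCV builds
import cv2
import numpy as np

# Load a raw depth render (e.g., an EXR Z-pass exported from Blender)
depth = cv2.imread("scene_depth.exr", cv2.IMREAD_UNCHANGED).astype(np.float32)
if depth.ndim == 3:
    depth = depth[..., 0]  # keep a single channel

# Clip far-plane outliers, then normalize to 0-255
lo, hi = np.percentile(depth, [1, 99])
depth = np.clip(depth, lo, hi)
norm = 1.0 - (depth - lo) / (hi - lo)  # invert so nearer objects are brighter;
                                       # flip this if your setup expects the opposite
cv2.imwrite("scene_depth.png", (norm * 255).astype(np.uint8))
```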

Multi-Clip Storytelling

Since LTX-2 supports multi-keyframe generation, you can create longer narratives by:

  1. Planning your story in 4-second segments
  2. Using the final frame of clip N as the starting image for clip N+1
  3. Maintaining character consistency through reference images
  4. Assembling the final sequence in any video editor
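Step 2 is easy to script. A short OpenCV sketch that grabs the final frame of a finished clip and saves it as the next clip's conditioning image (file names are placeholders):

```python
import cv2

# Open the finished clip and jump to its last frame
cap = cv2.VideoCapture("clip_01.mp4")
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, total - 1)

ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("Could not read the final frame")

# Save it as the conditioning image for the next clip
cv2.imwrite("clip_02_start.png", frame)
```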

LoRA Fine-Tuning

LTX-2 supports Control LoRAs for style adaptation. Community members have already trained LoRAs for specific aesthetics (anime, film noir, product photography). This lets you create a consistent visual brand across all your generated content.

Current Limitations

LTX-2 is impressive for an open-source model, but it has clear limitations compared to cloud leaders:

  • Visual quality gap: Cloud models like Gen-4.5 and Veo 3 still produce more polished output, especially for complex human faces and fine details
  • Duration vs. quality tradeoff: Longer clips (8+ seconds) significantly increase generation time and can reduce quality
  • Hardware barrier: You need at least an 8 GB GPU even with heavy quantization, 12 GB for comfortable use, and 24 GB+ for the best experience
  • Setup complexity: ComfyUI's node-based interface has a learning curve for non-technical users
  • No built-in editing: Unlike Genra or Runway, there's no script-to-video pipeline—you assemble everything manually

The Verdict: Is Local AI Video Ready?

LTX-2 proves that local AI video generation is no longer a toy. With native audio, 4K upscaling, and NVIDIA optimization, it's a viable tool for creators who value privacy, cost control, and creative freedom.

But it's not a replacement for cloud tools—it's a complement. The ideal 2026 workflow might look like this:

  • LTX-2 for experimentation, prototyping, and high-volume generation
  • Genra for polished, end-to-end video production with scripting and music
  • Cloud models for final hero content that needs the absolute best quality

The era of AI video being locked behind cloud subscriptions is ending. LTX-2 just opened the door.

"With NVIDIA-optimized ComfyUI, LTX-2 delivers cloud-class 4K video locally—up to 3x faster with 60% less VRAM." — NVIDIA Blog, CES 2026