LTX-2 + ComfyUI: Run 4K AI Video Generation Locally (Complete Guide)

· Chris Sherman

Why Generate AI Videos Locally?

Every major AI video tool—Sora, Veo 3, Runway Gen-4.5—runs in the cloud. You upload a prompt, wait in a queue, pay per second, and hope you don't hit a content filter. But as of January 2026, there's a genuine alternative: LTX-2, the first open-source model that generates 4K video with synchronized audio, running entirely on your own GPU.

Released at CES 2026 by Lightricks and optimized by NVIDIA, LTX-2 represents a fundamental shift. You own the model. There are no per-generation fees. No content restrictions. No internet required. And with the right hardware, you can generate a 720p video clip in about 25 seconds.

In this guide, we'll walk through everything you need to get LTX-2 running locally with ComfyUI—from hardware requirements to prompting techniques to the 4K upscaling pipeline.

What Is LTX-2?

LTX-2 is a 19-billion-parameter audio-video foundation model created by Lightricks, built on a DiT (Diffusion Transformer) backbone. It generates video and audio simultaneously through an asymmetric dual-stream transformer architecture—meaning dialogue, sound effects, background music, and motion are all produced together in a single pass.
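To make "asymmetric dual-stream" concrete, here is a conceptual PyTorch sketch of one such block: two token streams (video and audio) with their own widths and weights, coupled by cross-attention so sound stays aligned with motion. This illustrates the general pattern only; the layer choices and dimensions are invented, and Lightricks' actual block (norms, MLPs, timestep conditioning) is more involved.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Conceptual sketch of a dual-stream transformer block.

    Video and audio tokens each get their own self-attention
    ('asymmetric' -- different widths per stream), and cross-attention
    lets each stream attend to the other so they stay synchronized.
    Not Lightricks' implementation; dimensions are made up.
    """

    def __init__(self, d_video=1024, d_audio=512, heads=8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(d_video, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(d_audio, heads, batch_first=True)
        # Cross-attention: queries from one stream, keys/values from the other
        self.v_from_a = nn.MultiheadAttention(d_video, heads, kdim=d_audio,
                                              vdim=d_audio, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(d_audio, heads, kdim=d_video,
                                              vdim=d_video, batch_first=True)

    def forward(self, video, audio):
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        video = video + self.v_from_a(video, audio, audio)[0]
        audio = audio + self.a_from_v(audio, video, video)[0]
        return video, audio

# Toy shapes: 128 video tokens (1024-dim), 64 audio tokens (512-dim)
v, a = torch.randn(1, 128, 1024), torch.randn(1, 64, 512)
v, a = DualStreamBlock()(v, a)
print(v.shape, a.shape)  # torch.Size([1, 128, 1024]) torch.Size([1, 64, 512])
```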

Key Specifications

| Specification | LTX-2 |
| --- | --- |
| Parameters | 19 billion |
| Max Resolution | 4K (with RTX Video upscaling) |
| Max Frame Rate | 50 FPS |
| Max Duration | 20 seconds |
| Audio | Native (dialogue, SFX, music) |
| Inputs | Text, image, audio, depth maps, reference video |
| License | Open weights (Hugging Face) |
| Architecture | Asymmetric dual-stream DiT |

Why LTX-2 Is Different

Earlier open-source video models like Stable Video Diffusion generated silent clips a few seconds long. LTX-2 changes the game in three ways:

  1. Audio-video synchronization: Unlike Sora 2 and Runway Gen-4.5 (which generate silent video), LTX-2 produces synchronized audio from the start—matching Google Veo 3's biggest advantage
  2. Multi-keyframe support: You can specify keyframes to control the narrative arc across a clip
  3. Control LoRAs: Advanced conditioning lets you guide generation with depth maps, reference images, and motion cues

Hardware Requirements: What GPU Do You Need?

LTX-2's full-precision model demands 32GB+ VRAM. But thanks to NVIDIA's NVFP4/NVFP8 optimizations and community GGUF quantizations, you can run it on a much wider range of hardware.

| GPU | VRAM | Recommended Settings | Generation Time |
| --- | --- | --- | --- |
| RTX 5090 | 32 GB | 720p24, 4s clip, NVFP4 | ~25 seconds |
| RTX 4090 | 24 GB | 720p24, 4s clip, NVFP8 | ~45 seconds |
| RTX 4080 / 3090 | 16–24 GB | 540p24, 4s clip, GGUF Q4 | ~90 seconds |
| RTX 4070 / 3060 | 12 GB | 540p24, 4s clip, GGUF Q4_K_M | ~3 minutes |
| RTX 4060 / other 8 GB cards | 8 GB | 540p24, 4s clip, heavy quantization | ~5 minutes |

Performance tip: NVIDIA's NVFP4 format delivers 3x faster generation with 60% less VRAM compared to full precision. NVFP8 offers 2x speed with 40% VRAM reduction.
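The quantization savings follow directly from bytes per parameter. A rough back-of-the-envelope estimate for the 19B weights alone (activations, latents, and the VAE need memory on top of this, and NVFP formats carry small scale-factor overheads that this ignores):

```python
# Rough VRAM estimate for LTX-2's 19B weights at different precisions.
# Weights only -- activations, latents, and the VAE need additional memory.
PARAMS = 19e9

bytes_per_param = {
    "FP16 (full precision)": 2.0,
    "NVFP8": 1.0,
    "NVFP4": 0.5,
}

for fmt, bpp in bytes_per_param.items():
    gb = PARAMS * bpp / 1024**3
    print(f"{fmt:>22}: ~{gb:.1f} GB for weights")

# FP16 (full precision): ~35.4 GB  -> matches the 32GB+ requirement
#                 NVFP8: ~17.7 GB
#                 NVFP4:  ~8.8 GB
```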

Software Requirements

  • Python 3.12 or higher
  • CUDA 12.7 or higher
  • PyTorch 2.7+
  • ComfyUI (latest version from comfy.org)
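Unsure what your machine has? A quick check from Python, assuming PyTorch is already installed:

```python
import sys
import torch

# Interpreter and framework versions
print(f"Python : {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")

# CUDA availability and the GPU ComfyUI will see
if torch.cuda.is_available():
    print(f"CUDA   : {torch.version.cuda}")
    print(f"GPU    : {torch.cuda.get_device_name(0)}")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM   : {vram_gb:.1f} GB")
else:
    print("No CUDA device detected -- check your driver install.")
```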

Complete Setup Guide: LTX-2 + ComfyUI

Step 1: Install ComfyUI

  1. Visit comfy.org and download the latest installer for Windows
  2. Run the installer—it handles Python, CUDA, and dependency setup automatically
  3. Launch ComfyUI and verify it detects your GPU

Step 2: Download LTX-2

  1. In ComfyUI, open the Template Browser
  2. Navigate to the Video section
  3. Find LTX-2 and select your preferred variant:
    • NVFP4 — Best for RTX 50 Series (32 GB)
    • NVFP8 — Best for RTX 40 Series (24 GB)
    • GGUF Q4_K_M — Best for 8–16 GB GPUs
  4. ComfyUI will download the model weights automatically

Step 3: Generate Your First Video

  1. Load the LTX-2 workflow template
  2. Enter your prompt in the text node
  3. Set your preferred resolution and duration:
    • 24 GB+ VRAM: 720p, 24fps, 4 seconds, 20 steps
    • 8–16 GB VRAM: 540p, 24fps, 4 seconds, 20 steps
  4. Click Queue Prompt and wait for generation
  5. Preview the output—video and audio will play together
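If you'd rather queue generations from a script than click through the UI, ComfyUI exposes a small HTTP API on the local server (default port 8188). A minimal sketch, assuming you've exported your workflow with "Save (API Format)"; the node ID shown is hypothetical and must match your own export:

```python
import json
import urllib.request

# Load a workflow exported from ComfyUI via "Save (API Format)"
with open("ltx2_workflow_api.json") as f:
    workflow = json.load(f)

# (Hypothetical node ID "6" -- find the text node's ID in your own export.)
# workflow["6"]["inputs"]["text"] = "A woman in a red dress walks..."

# Queue the job on the local ComfyUI server (default port 8188)
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # response includes a prompt_id for tracking
```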

Step 4: Upscale to 4K

LTX-2 generates at 720p natively. To reach 4K, use the RTX Video Super Resolution node in ComfyUI:

  1. Connect the LTX-2 output to the RTX Video upscaler node
  2. Set target resolution to 4K (3840×2160)
  3. The upscaler runs in real time, sharpening edges and cleaning compression artifacts

The result: 4K video with audio, generated entirely on your local machine, with no cloud dependency.

Prompting Guide: Get Better Results

Basic Prompt Structure

LTX-2 responds best to structured, descriptive prompts. Think like a film director, not an image creator:

"A woman in a red dress walks through a rainy Tokyo street at night. Neon signs reflect off wet pavement. She opens an umbrella as thunder rumbles in the distance. Camera follows from behind at medium distance."

What Works Well

  • Action sequences: Describe what happens over time, not a static image
  • Audio cues: Include sound descriptions ("thunder rumbles," "jazz music plays," "crowd cheering")
  • Camera movement: Standard cinematography terms work (tracking, pan, dolly, close-up)
  • Environmental details: Lighting, weather, time of day
  • Emotional tone: "Tense," "joyful," "melancholic"

What to Avoid

  • Text and logos: AI video models still struggle to render readable text
  • Complex physics: Multi-object collisions, detailed finger movements
  • Overloaded scenes: Keep the focus on 1–2 subjects per clip
  • Static descriptions: Don't describe a photo—describe a scene unfolding

Advanced: Using Reference Images

LTX-2 supports image conditioning. Upload a reference image as the starting frame, then describe the motion that follows. This is especially useful for:

  • Product shots that need to match existing brand imagery
  • Character consistency across multiple clips
  • Animating still photographs into dynamic video
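Whatever the source, the reference image should match your generation resolution and aspect ratio. A minimal preprocessing sketch with Pillow, assuming a 720p target; file names are placeholders:

```python
from PIL import Image

TARGET = (1280, 720)  # assumes a 720p generation; adjust to your settings

img = Image.open("reference.png").convert("RGB")

# Scale to cover the target, then center-crop to the exact aspect ratio
scale = max(TARGET[0] / img.width, TARGET[1] / img.height)
resized = img.resize((round(img.width * scale), round(img.height * scale)))
left = (resized.width - TARGET[0]) // 2
top = (resized.height - TARGET[1]) // 2
resized.crop((left, top, left + TARGET[0], top + TARGET[1])).save("reference_720p.png")
```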

Local vs. Cloud: When Each Makes Sense

| Factor | Local (LTX-2) | Cloud (Sora / Veo / Runway) |
| --- | --- | --- |
| Cost | Free after GPU purchase | $12–$360/month |
| Privacy | Data never leaves your machine | Uploaded to third-party servers |
| Content restrictions | None | Platform-specific filters |
| Internet required | No (after model download) | Yes |
| Video quality (top-tier) | Good (approaching cloud) | Best (Gen-4.5, Veo 3) |
| Audio | Native (built-in) | Veo 3 only; others silent |
| Ease of use | Requires setup | Browser-based, instant |
| Generation speed | 25s–5min (depends on GPU) | 30s–2min typically |
| Full creative pipeline | Manual assembly | Genra: end-to-end (script → video) |
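The cost row is easy to make concrete. A rough break-even estimate, using the subscription range from the table and an illustrative (assumed, not quoted) GPU price:

```python
# Break-even: months of cloud subscription vs. a one-time GPU purchase.
# The GPU price is an illustrative assumption, not a quoted market price.
gpu_cost = 1600            # assumed one-time cost of a 24 GB card (USD)
cloud_monthly = (12, 360)  # subscription range from the comparison table

for monthly in cloud_monthly:
    months = gpu_cost / monthly
    print(f"At ${monthly}/mo, the GPU pays for itself in ~{months:.0f} months")

# At $12/mo, the GPU pays for itself in ~133 months
# At $360/mo, the GPU pays for itself in ~4 months
```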

When to Use Local Generation

  • Privacy-sensitive content: Medical, legal, or proprietary material
  • High volume: Hundreds of clips with no per-generation cost
  • Creative freedom: Content that cloud filters might block
  • Offline workflows: Travel, remote locations, air-gapped systems
  • Learning and experimentation: Unlimited iterations without cost anxiety

When to Use Cloud Tools

  • Highest quality needed: Runway Gen-4.5 and Veo 3 still lead on visual fidelity
  • No powerful GPU: Cloud tools work on any device with a browser
  • End-to-end workflow: Genra handles scripting, scene creation, music, and editing in one platform
  • Team collaboration: Shared projects, approvals, and version control

Advanced Workflows

Blender → LTX-2: 3D Scene Guidance

NVIDIA outlined a pipeline where Blender 3D scenes serve as structural guides for LTX-2 generation. You create a rough 3D layout, export depth maps, and use them as conditioning inputs. This gives you precise control over camera angles, object placement, and spatial composition—something pure text prompting can't achieve.
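A common glue step in this pipeline is converting Blender's raw Z-pass render into the normalized grayscale image a depth-conditioning input expects. A minimal sketch with NumPy and OpenCV; file names are placeholders, and check which depth polarity (near = bright or near = dark) your conditioning setup expects:

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # needed in some OpenCV builds
import cv2
import numpy as np

# Load a raw depth render (e.g., an EXR Z-pass exported from Blender)
depth = cv2.imread("scene_depth.exr", cv2.IMREAD_UNCHANGED).astype(np.float32)
if depth.ndim == 3:
    depth = depth[..., 0]  # keep a single channel

# Clip far-plane outliers, then normalize to 0-255
lo, hi = np.percentile(depth, [1, 99])
depth = np.clip(depth, lo, hi)
norm = 1.0 - (depth - lo) / (hi - lo)  # invert so nearer objects are brighter;
                                       # flip this if your setup expects the opposite
cv2.imwrite("scene_depth.png", (norm * 255).astype(np.uint8))
```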

Multi-Clip Storytelling

Since LTX-2 supports multi-keyframe generation, you can create longer narratives by:

  1. Planning your story in 4-second segments
  2. Using the final frame of clip N as the starting image for clip N+1
  3. Maintaining character consistency through reference images
  4. Assembling the final sequence in any video editor
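Step 2 is easy to script. A short OpenCV sketch that grabs the final frame of a finished clip and saves it as the next clip's conditioning image (file names are placeholders):

```python
import cv2

# Open the finished clip and jump to its last frame
cap = cv2.VideoCapture("clip_01.mp4")
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, total - 1)

ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("Could not read the final frame")

# Save it as the conditioning image for the next clip
cv2.imwrite("clip_02_start.png", frame)
```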

LoRA Fine-Tuning

LTX-2 supports Control LoRAs for style adaptation. Community members have already trained LoRAs for specific aesthetics (anime, film noir, product photography). This lets you create a consistent visual brand across all your generated content.

Current Limitations

LTX-2 is impressive for an open-source model, but it has clear limitations compared to cloud leaders:

  • Visual quality gap: Cloud models like Gen-4.5 and Veo 3 still produce more polished output, especially for complex human faces and fine details
  • Duration vs. quality tradeoff: Longer clips (8+ seconds) significantly increase generation time and can reduce quality
  • Hardware barrier: You need at least an 8 GB GPU even with heavy quantization, 12 GB for comfortable use, and 24 GB+ for the best experience
  • Setup complexity: ComfyUI's node-based interface has a learning curve for non-technical users
  • No built-in editing: Unlike Genra or Runway, there's no script-to-video pipeline—you assemble everything manually

The Verdict: Is Local AI Video Ready?

LTX-2 proves that local AI video generation is no longer a toy. With native audio, 4K upscaling, and NVIDIA optimization, it's a viable tool for creators who value privacy, cost control, and creative freedom.

But it's not a replacement for cloud tools—it's a complement. The ideal 2026 workflow might look like this:

  • LTX-2 for experimentation, prototyping, and high-volume generation
  • Genra for polished, end-to-end video production with scripting and music
  • Cloud models for final hero content that needs the absolute best quality

The era of AI video being locked behind cloud subscriptions is ending. LTX-2 just opened the door.

"With NVIDIA-optimized ComfyUI, LTX-2 delivers cloud-class 4K video locally—up to 3x faster with 60% less VRAM." — NVIDIA Blog, CES 2026