Skip to content

Tencent-Hunyuan/OmniWeaving

Repository files navigation

OmniWeaving

OmniWeaving Logo

icon OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Kaihang Pan*1,2, Qi Tian*2, Jianwei Zhang2, Weijie Kong2, Jiangfeng Xiong2, Yanxin Long2, Shixue Zhang2, Haiyi Qiu1, Tan Wang3, Zheqi Lv1, Yue Wu§2, Liefeng Bo2, Siliang Tang§1, Zhao Zhong†2

1Zhejiang University   2Tencent Hunyuan   3Nanyang Technological University
*Equal Contribution   §Corresponding Authors   Project Leader
Work done during Kaihang Pan's internship at Tencent Hunyuan

🔥🔥🔥 News

  • 📌 OmniWeaving is developed by the HunyuanVideo team and is built upon the latest HunyuanVideo-1.5 as the backbone. If you find our work useful, please consider giving this repository a star and citing our paper~
  • 🚀 April 3, 2026: We release the code and model weights of OmniWeaving.
  • 🏃‍♂️ April 3, 2026: We release the IntelligentVBench.
  • 📖 Mar 26, 2026: We release the OmniWeaving paper on Arxiv.
  • 👋 Mar 25, 2026: We release the webpage of OmniWeaving.

📑 Open-source Plan

  • OmniWeaving
    • [✅] Inference Code
    • [✅] Model Checkpoints
    • [✅] Training Data Construction Code
    • [✅] Training Example Code
  • IntelligentVBench
    • [✅] Test cases
    • [✅] Evaluation Code

📋 Table of Contents

📖 Abstract

We propose OmniWeaving OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models.

🏗 Model Architecture

Following the paper, OmniWeaving is built as an integrated MLLM + MMDiT + VAE framework for unified free-form video generation. The MLLM serves as the semantic parser for interleaved text, images, and video inputs, mapping them into a high-level semantic space and forwarding its hidden states through an MLP connector. The VAE acts as the visual tokenizer, compressing visual inputs into low-level latents, while the MMDiT uses these semantic conditions together with latent noise to generate semantically aligned, high-fidelity videos.

On this basis, we further introduce two extra improvements tailored for advanced reasoning and composition.

  • (1) Activating Thinking Mode of the MLLM: Direct MLLM encoding of interleaved visual-text inputs often yields semantic ambiguity due to weak intra-correlations and unclear video creation intents. We elevate the MLLM from a passive feature extractor to an active reasoner. By activating the thinking mode to generate intermediate reasoning steps, it autonomously deduces a semantically precise, enhanced prompt. The hidden states of this enhanced prompt are then forwarded alongside the original MLLM features to condition the MMDiT, effectively bridging the cognitive gap between abstract user intent and pixel-level generation.
  • (2) Hidden States DeepStacking: Compositional video generation involving multiple subjects or intricate scenes often relies on both low- and high-level semantic representations. Drawing inspiration from the DeepStacking mechanism in Qwen3-VL, we extract hidden states from a broader range of intermediate MLLM layers to capture a rich semantic spectrum spanning from fine-grained details to high-level abstractions. An MLP connector projects these multi-level features into the MMDiT embedding space. These projected features are then directly added to the corresponding hidden states within the first three layers of the MMDiT conditioning branch, effectively injecting multi-granular semantic guidance into the generative process.
OmniWeaving Architecture

Figure 1. Overview of the OmniWeaving architecture, which consists of an MLLM for multimodal understanding and an MMDiT for generation.

🚀 Supported Tasks

OmniWeaving is flexible in its input and output configurations, supporting a wide range of unified video generation tasks:

Task Input Type Output Description Demo Input Demo Output
Text-to-Video (T2V) Text 📝 Video 🎬 Generating a video from text prompts.
First-Frame-to-Video (I2V) Image 🖼 + Text 📝 Video 🎬 Generating a video based on the first frame.
Key-Frames-to-Video 2 × Images 🖼 + Text 📝 Video 🎬 Generating a video conditioned on start and end frames.
Video-to-Video Editing Video 🎬 + Text 📝 Video 🎬 Instruction-based video manipulation and stylization.
Reference-to-Video Image 🖼 + Text 📝 Video 🎬 Single-subject reference-driven video generation.
Compositional Multi-Image-to-Video 2–4 × Images 🖼 + Text 📝 Video 🎬 Multi-subject compositional video generation.
Text-Image-Video-to-Video Video 🎬 + Image 🖼 + Text 📝 Video 🎬 Generating a video conditioned on text, image, and video inputs.

Reasoning-Augmented Video Generation Image(s) 🖼 + Text 📝 Reasoning 💭 + Video 🎬 Reasoning over user intent before generating the video.

🛠 Preparation

Step 1: Clone the Repository

git clone https://github.com/Tencent-Hunyuan/OmniWeaving
cd OmniWeaving

Step 2: Install Dependencies

OmniWeaving is built upon HunyuanVideo-1.5. The way to install dependencies is similar to HunyuanVideo-1.5. Specifically, you should install basic dependencies:

pip install -r requirements.txt

Additionally, install the attention libraries as needed (we use Flash Attention in practice):

  • Flash Attention: Install for faster inference and reduced GPU memory consumption. See Flash Attention for details.

  • Flex-Block-Attention: Required only for sparse attention to achieve faster inference:

    git clone https://github.com/Tencent-Hunyuan/flex-block-attn.git
    cd flex-block-attn
    git submodule update --init --recursive
    python3 setup.py install
  • SageAttention: For faster inference (will automatically disable Flex-Block-Attention):

    git clone https://github.com/cooper1637/SageAttention.git
    cd SageAttention 
    export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # Optional
    python3 setup.py install

Step 3: Download Models

OmniWeaving: Download from HuggingFace.

Detailed download instructions are available at download-checkpoint.md.

🔑 Inference

In our inference code, we define six task flags corresponding to the Supported Tasks. Their mapping is as follows:

Task Flag Full Name Description
t2v Text-to-Video Generate videos from text prompts.
i2v First-Frame-to-Video Animate a static image into a video guided by text.
interpolation Key-Frames-to-Video Generate a video conditioned on start and end frames.
reference2v Reference-to-Video / Compositional Multi-Image-to-Video Single- or multi-subject reference-driven video generation.
editing Video-to-Video Editing Instruction-based video manipulation and stylization.
tiv2v Text-Image-Video-to-Video Generate a video conditioned on text, image, and video inputs.

Among these, t2v, i2v, and interpolation can optionally enable thinking mode (--think) for Reasoning-Augmented Video Generation, where the MLLM first reasons over user intent before generating the video.

Common Configuration

All tasks share the following hyperparameters (configured at the top of generate.sh):

N_INFERENCE_GPU=8
SEED=0
ASPECT_RATIO=16:9
MODEL_PATH=/path/to/OmniWeaving

SAGE_ATTN=false ### Use Flash Attention
### SAGE_ATTN=true ### Use SageAttention
SPARSE_ATTN=false
OVERLAP_GROUP_OFFLOADING=false
ENABLE_CACHE=false
CACHE_TYPE=deepcache

Tips: If your GPU memory is limited and you encounter OOM errors, try:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

If you have limited CPU memory, disable overlapped group offloading by setting OVERLAP_GROUP_OFFLOADING=false.

Task-Specific Inference Scripts

1. Text-to-Video (t2v)

Generate a video from a text prompt.

PROMPT="Put Your Prompt Here"
NEGATIVE_PROMPT="overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion"
OUTPUT_PATH=./outputs/t2v.mp4

torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
  --task t2v \
  --prompt "$PROMPT" \
  --negative_prompt "$NEGATIVE_PROMPT" \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
  --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
  --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH \
  # --think \          # Optional: enable reasoning-augmented generation (see note below)

The --think flag activates the MLLM's thinking mode, in which it reasons over user intent and generates an enriched prompt before video generation. The --think flag is supported by t2v, i2v, and interpolation tasks.

2. First-Frame-to-Video (i2v)

Animate a first-frame image into a video guided by a text prompt.

PROMPT="Put Your Prompt Here"
IMAGE_PATH=/path/to/reference.png
OUTPUT_PATH=./outputs/i2v.mp4

torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
  --task i2v \
  --prompt "$PROMPT" \
  --image_path $IMAGE_PATH \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
  --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
  --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH \
  # --think \          # Optional: enable reasoning-augmented generation (see note below)

3. Key-Frames-to-Video (interpolation)

Generate a video that bridges two key frames, guided by a text prompt.

PROMPT="Put Your Prompt Here"
REF_IMAGE_PATHS=(/path/to/first_frame.png /path/to/last_frame.png)
OUTPUT_PATH=./outputs/interpolation.mp4

torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
  --task interpolation \
  --prompt "$PROMPT" \
  --ref_image_paths "${REF_IMAGE_PATHS[@]}" \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
  --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
  --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH \
  # --think \          # Optional: enable reasoning-augmented generation

4. Reference-to-Video / Compositional Multi-Image-to-Video (reference2v)

Generate a video featuring one or more reference subjects. Provide one or more reference images via --ref_image_paths.

PROMPT="Put Your Prompt Here"
# Supports 1–4 reference images.
# For best results with multiple images, use the same aspect ratio across all images,
# as they will be center-cropped to match the size of the first image.
REF_IMAGE_PATHS=(/path/to/img1.png /path/to/img2.png ... /path/to/img4.png)  # up to 4 input images
OUTPUT_PATH=./outputs/reference2v.mp4

torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
  --task reference2v \
  --prompt "$PROMPT" \
  --ref_image_paths "${REF_IMAGE_PATHS[@]}" \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
  --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
  --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH

5. Video-to-Video Editing (editing)

Edit an existing video according to the text instruction (e.g., style transfer, object replacement).

PROMPT="Put Your Prompt Here"
CONDITION_VIDEO_PATH=/path/to/source_video.mp4
OUTPUT_PATH=./outputs/editing.mp4

# If you have pre-extracted VAE latents for the condition video, pass them via
# --condition_video_latents_path /path/to/latents.pt to skip VAE encoding at inference.
torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
  --task editing \
  --prompt "$PROMPT" \
  --condition_video_paths $CONDITION_VIDEO_PATH \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
  --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
  --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH \
  # --condition_video_latents_path /path/to/latents.pt  # Optional: skip VAE encoding by providing pre-extracted latents

6. Text-Image-Video-to-Video (tiv2v)

Edit a video while incorporating reference subject images (e.g., insert a character from a reference image into a source video).

PROMPT="Put Your Prompt Here"
CONDITION_VIDEO_PATH=/path/to/source_video.mp4
# Only one reference image is supported for tiv2v.
# For best results, use a reference image whose aspect ratio is close to the output video's aspect ratio.
REF_IMAGE_PATHS=(/path/to/ref_image.png)
OUTPUT_PATH=./outputs/tiv2v.mp4

# If you have pre-extracted VAE latents for the condition video, pass them via
# --condition_video_latents_path /path/to/latents.pt to skip VAE encoding at inference.
torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
  --task tiv2v \
  --prompt "$PROMPT" \
  --condition_video_paths $CONDITION_VIDEO_PATH \
  --ref_image_paths "${REF_IMAGE_PATHS[@]}" \
  --aspect_ratio $ASPECT_RATIO \
  --seed $SEED \
  --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
  --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
  --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
  --output_path $OUTPUT_PATH \
  --model_path $MODEL_PATH \
  # --condition_video_latents_path /path/to/latents.pt  # Optional: skip VAE encoding by providing pre-extracted latents

Other Optional Arguments

The arguments below can be appended to any of the task commands above for further customization:

Argument Type Default Description
--negative_prompt str "" Negative prompt for video generation. Default is empty. Setting a negative prompt (e.g., 'overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion') can improve quality, especially for tasks like t2v.
--num_inference_steps int 50 Number of denoising steps
--video_length int 81 Number of frames to generate
--fps int Auto Output FPS (default: 16 for ≤81 frames, 24 for >81 frames)
--dtype str bf16 Data type: bf16 or fp32
--offloading bool true Enable CPU offloading
--group_offloading bool None Enable group offloading (auto-enabled with offloading)
--pipeline_config str omniweaving Pipeline configuration preset that controls guidance_scale and flow_shift. Available presets: omniweaving (guidance_scale=6.0, flow_shift=7.0), omniweaving2 (guidance_scale=6.0, flow_shift=5.0).

Tuning guidance_scale / flow_shift: You can switch presets via --pipeline_config (e.g., --pipeline_config omniweaving2). If the available presets do not meet your needs, you can add a new key to the PIPELINE_CONFIGS dict in hyvideo/commons/__init__.py with your desired values. We recommend guidance_scale=6.0 with flow_shift=5.0 or 7.0.

🗂 Training Data Construction

To train OmniWeaving, beyond Foundational Video Generation Tasks, we design Multimodal Composition Tasks and Reasoning-Augmented Tasks. We provide reference data construction pipelines under process_data/ for three representative tasks: Intent-Driven Image-to-Video, Text-Image-Video-to-Video, and Interleaved Text-and-Multi-Image-to-Video. Each pipeline uses a VLM (e.g., Qwen3-VL-235B) for annotation, quality filtering, and instruction rewriting, with the final output saved in Apache Arrow format.

Prerequisites

Launch a vLLM server for VLM inference (used by all pipelines). We use Qwen3-VL-235B-A22B-Instruct as the default VLM. Please refer to the Qwen3-VL vLLM documentation for detailed deployment instructions. Below is an example:

vllm serve /path/to/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --async-scheduling \
  --port 8080 \
  --max-model-len 65536 \
  --mm-encoder-tp-mode data \
  --dtype bfloat16 \
  --max-num-batched-tokens 256 \
  --distributed-executor-backend mp \
  --gpu-memory-utilization 0.9 \
  --limit-mm-per-prompt.video 2

1. Intent-Driven Image-to-Video

📂 process_data/intent_driven_image_to_video/

A Reasoning-Augmented Task, where the model learns to formulate a reasoning trace detailing the temporal progression when visual and textual inputs lack explicit linkage in i2v tasks. This pipeline constructs training pairs of (intent prompt, reasoning trace) from raw videos. The pipeline for training data construction consists of 4 steps:

Step Script Description
1 step1.py Extract first frame from video; use VLM to generate motion descriptions.
2 step2.py Use VLM to correct motion descriptions and predict user intent.
3 step3.py VLM self-check: verify corrected motion and intent against the original video.
4 step4_make_arrow.py Export valid training data to Arrow format.

2. Text-Image-Video-to-Video

📂 process_data/text_image_video_to_video/

A Multimodal Composition Task, where the inputs consist of three modalities (image, text, and video), requiring the model to seamlessly integrate target visual elements extracted from reference images into the temporal dynamics of a source video. This pipeline builds training data for edit-conditioned video generation from pairs of condition/edited videos. The pipeline for training data construction consists of 6 steps:

Step Script Description
1 step1.py Use VLM to judge whether the editing data is usable and extract a new-element description from before/after key frames of the video.
2 step2.py Generate standalone subject images from edited frames using Flux.
3 step3.py Use VLM to verify the quality of extracted reference images against the edited video.
4 step4.py Use VLM to rewrite the editing instruction to concisely reference the reference image.
5 step5.py Use VLM for final quality check: ensure consistency among reference image, rewritten instruction, and edited video.
6 step6_make_arrow.py Export valid training data to Arrow format.

3. Interleaved Text-and-Multi-Image-to-Video

📂 process_data/interleaved_text_and_multi_image_to_video/

A Multimodal Composition Task, where the inputs contain multiple reference images (capturing key visual elements such as subjects or scenes) interleaved with text, requiring the model to accurately compose these elements into a cohesive video sequence. This pipeline decomposes videos into multi-subject compositions by extracting individual subject images and background. For simplicity, the provided pipeline currently omits the use of SAM3; please refer to the in-file comments for details. The pipeline for training data construction consists of 9 steps:

Step Script Description
1 step1.py Use VLM to identify main moving objects (up to 3) from video key frames and rewrite the instruction around them.
2 step2.py Use VLM to rewrite the instruction by replacing object references with [object1][object3] placeholders.
3 step3.py Use Flux to extract each identified object from the frame as standalone reference images.
4 step4.py Use Flux to generate a background-only image with subjects removed.
5 step5.py Use VLM to quality-check extracted reference images against the original frame.
6 step6.py Use VLM to quality-check the background image (subjects properly removed, scene consistent).
7 step7.py Merge results from previous steps, select the best reference image per object, and use VLM to produce the final instruction.
8 step8.py Use VLM for final validation: verify that reference images and background align with the original video.
9 step9_make_arrow.py Export valid training data to Arrow format.

🎓 Training

Following HunyuanVideo-1.5, OmniWeaving is trained using the Muon optimizer, which accelerates convergence and improves training stability. Below we provide a simple multi-task training example to illustrate how to perform multi-task mixed training on OmniWeaving. See train.py for the full implementation. Here is how to use the training script (train.py):

1. Implement Your DataLoaders

Replace the create_dummy_dataloader*() functions in train.py with your own implementations. Each dataset's __getitem__ method should return a single sample. OmniWeaving supports multi-task training with different dataloader types mapped to different tasks:

Dataloader Task Description
create_dummy_dataloader1 t2v / i2v Text-to-video and image-to-video
create_dummy_dataloader2 multi_imgs_to_v Interleaved Text-and-Multi-Image-to-Video
create_dummy_dataloader3 key_frames_to_v Key-frames-to-Video
create_dummy_dataloader4 v2v Video-to-video editing
create_dummy_dataloader5 tiv2v Text-Image-Video-to-Video

Data fields:

Field Used By Type Description
"pixel_values" All torch.Tensor Pixels in [-1, 1] with the shape of [C, F, H, W] (F must be 4n+1)
"text" All str Text prompt
"data_type" All str "video"
"latents" All torch.Tensor Pre-encoded VAE latents (skips VAE encoding)
"byt5_text_ids" All torch.LongTensor Pre-tokenized byT5 input ids
"byt5_text_mask" All torch.LongTensor Pre-tokenized byT5 attention mask
"condition" multi_imgs_to_v, key_frames_to_v list[PIL.Image] Condition images (subject references or key frames)
"frame_id" key_frames_to_v list[int] Each key-frame image's position in the target video (e.g., [0, 15] places two images at frame 0 and frame 15)
"input_video_path" v2v, tiv2v str Path to the conditioning video
"input_video_latents" v2v, tiv2v torch.Tensor Pre-encoded conditioning video latents
"input_img" tiv2v list[PIL.Image] Reference image(s) for image conditioning

See the create_dummy_dataloader*() functions in train.py for detailed format documentation and examples.

2. Run Training

Single GPU:

python train.py --pretrained_model_root <path_to_pretrained_model> --dataloader_probs <task_probs> [other args]

Multi-GPU:

N=8
torchrun --nproc_per_node=$N train.py --pretrained_model_root <path_to_pretrained_model> --dataloader_probs <task_probs> [other args]

Example:

torchrun --nproc_per_node=8 train.py \
  --pretrained_model_root ./ckpts \
  --learning_rate 1e-5 \
  --batch_size 1 \
  --max_steps 10000 \
  --output_dir ./outputs \
  --enable_fsdp \
  --enable_gradient_checkpointing \
  --sp_size 8 \
  --dataloader_probs "t2v:0.3,i2v:0.2,key_frames_to_v:0.1,multi_imgs_to_v:0.1,v2v:0.15,tiv2v:0.15"

3. Key Training Parameters

Parameter Description Default
--dataloader_probs Multi-task mixing probabilities (e.g., "t2v:0.3,i2v:0.2,...") (required) Required
--pretrained_model_root Path to pretrained model (required) ckpts
--learning_rate Learning rate 1e-5
--batch_size Batch size 1
--max_steps Maximum training steps 10000
--warmup_steps Warmup steps 500
--gradient_accumulation_steps Gradient accumulation steps 1
--enable_fsdp Enable FSDP for distributed training true
--enable_gradient_checkpointing Enable gradient checkpointing true
--sp_size Sequence parallelism size (must divide world_size) 8
--use_muon Use Muon optimizer true
--resume_from_checkpoint Resume from checkpoint directory None
--use_lora Enable LoRA fine-tuning false
--lora_r LoRA rank 8
--lora_alpha LoRA alpha scaling parameter 16
--lora_dropout LoRA dropout rate 0.0
--pretrained_lora_path Path to pretrained LoRA adapter None
--deepstack DeepStacking layer indices for text encoder 8 16 24

4. Monitor Training

  • Checkpoints are saved to output_dir at intervals specified by --save_interval
  • Validation videos are generated at intervals specified by --validation_interval
  • Training logs are printed to console at intervals specified by --log_interval

📊 Evaluation on IntelligentVBench

We introduce IntelligentVBench, the first comprehensive benchmark for evaluating intelligent unified video generation. The benchmark covers four task categories with test cases provided under IntelligentVBench/csvs/. All evaluations use Gemini 2.5 Pro as the judge model, scoring each sample on three dimensions (1–5 scale) and reporting both AVG (mean) and MIN (minimum) across dimensions.

Benchmark Tasks

Task CSV Description Key Input Fields
Implicit I2V Implicit_I2V.csv First-frame-to-video with abstract/implicit text prompts ref_image_path, prompt
Interpolative DI2V Interpolative_DI2V.csv Generate video bridging first and last frames first_frame, last_frame, prompt
TIV2V TIV2V.csv Text-image-guided video editing condition_video, condition_image, prompt
Compositional MI2V Compositional_MI2V_subject{1,2,3}.csv Multi-subject-and-scene compositional generation (1–3 subjects + background) subject_img_{1,2,3}, background_img, prompt

CSV Field Descriptions

Field Used In Description
index All Unique sample identifier; the generated video should be named {index}.mp4
prompt All Text instruction describing the desired video content or editing operation
ref_image_path Implicit I2V Path to the reference first-frame image to be animated
first_frame Interpolative DI2V Path to the starting frame image
last_frame Interpolative DI2V Path to the ending frame image
edited_type TIV2V Editing category: local_change (object replacement), local_add (add object), or back_change (background swap)
condition_video TIV2V Path to the source video to be edited
condition_image TIV2V Path to the reference image providing the target appearance for editing
subject_img_1 Compositional MI2V Path to the first subject reference image
subject_img_2 Compositional MI2V (subject2, subject3) Path to the second subject reference image
subject_img_3 Compositional MI2V (subject3) Path to the third subject reference image
background_img Compositional MI2V Path to an optional background image (empty if not used)

Running Evaluation

Step 1: Download the IntelligentVBench test data from HuggingFace:

huggingface-cli download --repo-type dataset --resume-download midbee/IntelligentVBench \
  --local-dir ./IntelligentVBench \
  --local-dir-use-symlinks False

The downloaded dataset has the following structure:

IntelligentVBench/
├── csvs/                                   # Test case definitions
│   ├── Implicit_I2V.csv                    # 250 test cases
│   ├── Interpolative_DI2V.csv              # 250 test cases
│   ├── TIV2V.csv                           # 210 test cases
│   ├── Compositional_MI2V_subject1.csv     # 130 test cases (1 subject + background)
│   ├── Compositional_MI2V_subject2.csv     # 120 test cases (2 subjects + background)
│   └── Compositional_MI2V_subject3.csv     #  70 test cases (3 subjects + background)
├── images/
│   ├── Implicit_I2V/                       # Reference first-frame images ({index}.png)
│   ├── Interpolative_DI2V/                 # Start/end frame pairs ({index}_first.png, {index}_last.png)
│   ├── TIV2V/                              # Reference images for editing ({index}.png)
│   └── Compositional_MI2V/
│       ├── subject1/                       # Single subject images ({index}_subject1.png, optional {index}_background.png)
│       ├── subject2/                       # Two subject images ({index}_subject{1,2}.png, optional background)
│       └── subject3/                       # Three subject images ({index}_subject{1,2,3}.png, optional background)
└── videos/
    └── TIV2V/                              # Source videos for editing ({index}.mp4)

Step 2: Generate videos for each test case. Each video should be named {index}.mp4 (using the index field from the CSV) and saved under a task-specific output directory. For example:

# Implicit I2V (250 videos)
/path/to/implicit_i2v_videos/
├── 00001.mp4
├── 00002.mp4
├── ...
└── 00250.mp4

# Interpolative DI2V (250 videos)
/path/to/interpolative_di2v_videos/
├── 00001.mp4
├── 00002.mp4
├── ...
└── 00250.mp4

# TIV2V (210 videos)
/path/to/tiv2v_videos/
├── 00001.mp4
├── 00002.mp4
├── ...
└── 00210.mp4

# Compositional MI2V — one directory per subject count
/path/to/subject1_videos/        # 130 videos
├── 00001.mp4
├── ...
└── 00130.mp4
/path/to/subject2_videos/        # 120 videos
├── 00001.mp4
├── ...
└── 00120.mp4
/path/to/subject3_videos/        # 70 videos
├── 00001.mp4
├── ...
└── 00070.mp4

Step 3: Set your Gemini API credentials in each evaluation script (evaluate_implicit_i2v.py, evaluate_interpolative_di2v.py, evaluate_tiv2v.py, evaluate_compositional_mi2v.py). Open the file and fill in the api_key and gemini_url fields at the top:

api_key = "your-api-key"
gemini_url = "https://your-gemini-endpoint"

Step 4: Run the corresponding evaluation script:

# Implicit I2V
python IntelligentVBench/evaluate_implicit_i2v.py \
  --input_csv IntelligentVBench/csvs/Implicit_I2V.csv \
  --edit_path /path/to/generated_videos \
  --output_csv /path/to/implicit_i2v_eval.csv

# Interpolative DI2V
python IntelligentVBench/evaluate_interpolative_di2v.py \
  --input_csv IntelligentVBench/csvs/Interpolative_DI2V.csv \
  --edit_path /path/to/generated_videos \
  --output_csv /path/to/interpolative_di2v_eval.csv

# TIV2V
python IntelligentVBench/evaluate_tiv2v.py \
  --input_csv IntelligentVBench/csvs/TIV2V.csv \
  --edit_path /path/to/generated_videos \
  --output_csv /path/to/tiv2v_eval.csv

# Compositional MI2V
python IntelligentVBench/evaluate_compositional_mi2v.py \
  --input_csv1 IntelligentVBench/csvs/Compositional_MI2V_subject1.csv \
  --input_csv2 IntelligentVBench/csvs/Compositional_MI2V_subject2.csv \
  --input_csv3 IntelligentVBench/csvs/Compositional_MI2V_subject3.csv \
  --edit_path1 /path/to/subject1_videos \
  --edit_path2 /path/to/subject2_videos \
  --edit_path3 /path/to/subject3_videos \
  --output_dir /path/to/compositional_mi2v_eval

--file_parent_path option: The image and video paths stored in the CSV files (e.g., ref_image_path, first_frame, condition_video, subject_img_1, etc.) are relative paths. You can use --file_parent_path to prepend a root to all relative paths. For example:

python IntelligentVBench/evaluate_implicit_i2v.py \
  --input_csv IntelligentVBench/csvs/Implicit_I2V.csv \
  --edit_path /path/to/generated_videos \
  --output_csv /path/to/implicit_i2v_eval.csv \
  --file_parent_path /path/to/dataset/root

If --file_parent_path is not specified (default None), the paths in the CSV are used as-is.

For Compositional MI2V, the --output_dir will produce the following structure:

/path/to/compositional_mi2v_eval/
├── subject1/
│   └── gemini-2.5-pro_Compositional_MI2V_subject1.csv
├── subject2/
│   └── gemini-2.5-pro_Compositional_MI2V_subject2.csv
└── subject3/
    └── gemini-2.5-pro_Compositional_MI2V_subject3.csv

Step 5: Calculate aggregated scores:

# For Implicit I2V (change --input_csv, --output_csv, --task_type accordingly for other tasks)
python IntelligentVBench/calculate_score.py \
  --input_csv IntelligentVBench/csvs/Implicit_I2V.csv \
  --output_csv /path/to/implicit_i2v_eval.csv \
  --task_type implicit_i2v  # options: implicit_i2v, interpolative_di2v, tiv2v

# For Compositional MI2V
python IntelligentVBench/calculate_score_compositional_mi2v.py \
  --input_csv1 IntelligentVBench/csvs/Compositional_MI2V_subject1.csv \
  --input_csv2 IntelligentVBench/csvs/Compositional_MI2V_subject2.csv \
  --input_csv3 IntelligentVBench/csvs/Compositional_MI2V_subject3.csv \
  --output_csv1 /path/to/compositional_mi2v_eval/subject1/gemini-2.5-pro_Compositional_MI2V_subject1.csv \
  --output_csv2 /path/to/compositional_mi2v_eval/subject2/gemini-2.5-pro_Compositional_MI2V_subject2.csv \
  --output_csv3 /path/to/compositional_mi2v_eval/subject3/gemini-2.5-pro_Compositional_MI2V_subject3.csv

🎬 Qualitative Examples

For more qualitative examples across all tasks, please refer to our project page.

Task Input Output
Text-to-Video (T2V)
📋 Show promptA beautiful Chinese woman with long hair, wearing sunglasses and a long fur coat over a fitted knit dress and high heels, slowly lay down in the thick snow.
📋 Show promptIn the tranquil waters of the lake, a crocodile stealthily lurks, gliding through the depths. Gradually, it emerges, its massive, scaled head breaking the surface.
First-Frame-to-Video (I2V)
📋 Show promptThe woman is pleading for her life while being threatened with a firearm.

📋 Show promptHighlight the facial expressions and reactions of women after experiencing terrible events.
Key-Frames-to-Video
📋 Show promptCamera tilts up to reveal giant anime girl's face.

📋 Show promptMan turns head toward camera, smiling subtly with mouth slightly open.
Video-to-Video Editing
📋 Show promptConvert the sketch-animated video to a live-action, photorealistic format.

📋 Show promptChange man's hair color to platinum blonde while keeping hair shape.
Reference-to-Video
📋 Show promptThe man in the image, initially speaks with a sad expression. As he continues, he gently tilts his head down toward the other person.
Compositional Multi-Image-to-Video
📋 Show promptThe woman in the first image holds the handbag in the second image up towards the camera while speaking over the background in the third image.

📋 Show promptThe pink handbag in the first image, the green handbag in the second image, and the blue handbag in the third image remain stationary but shift slightly in the frame due to the camera's slow upward movement over the background in the fourth image.
Text-Image-Video-to-Video

📋 Show promptReplace the man in the video with the man in the image.


📋 Show promptAdd the coffee mug in the image onto the desk near the keyboard in the video.

📚 Citation

If you find our work helpful, please consider giving us a star ⭐ on this repo and citing our papers as follows:

OmniWeaving

@article{pan2026omniweaving,
  title={OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning},
  author={Pan, Kaihang and Tian, Qi and Zhang, Jianwei and Kong, Weijie and Xiong, Jiangfeng and Long, Yanxin and Zhang, Shixue and Qiu, Haiyi and Wang, Tan and Lv, Zheqi and others},
  journal={arXiv preprint arXiv:2603.24458},
  year={2026}
}

HunyuanVideo 1.5

@article{wu2025hunyuanvideo,
  title={Hunyuanvideo 1.5 technical report},
  author={Wu, Bing and Zou, Chang and Li, Changlin and Huang, Duojun and Yang, Fang and Tan, Hao and Peng, Jack and Wu, Jianbing and Xiong, Jiangfeng and Jiang, Jie and others},
  journal={arXiv preprint arXiv:2511.18870},
  year={2025}
}

🙏 Acknowledgements

We would like to thank the contributors to HunyuanVideo 1.5, Transformers, Diffusers, HuggingFace and Qwen-VL, for their open research and exploration.

About

Official Implementation of OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages