Official code for the paper UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations.
UniCom is a unified large-scale multimodal model that performs generation directly over compressed visual embeddings. This repository includes the inference pipeline for text-to-image generation, image editing, and image reconstruction.
Figure: We compare different unified modeling choices in terms of convergence speed and consistency on editing tasks, and ultimately build UniCom with the Path I transfusion-style formulation rather than the Path II query-guided design.
- Model: We propose UniCom, a unified large-scale multimodal model that performs generation directly over compressed visual embeddings and serves as a unified interface for both understanding and generation.
- Paradigm: We establish an effective paradigm for unifying visual understanding and generation by predicting continuous compressed visual embeddings, and show that compressing visual features along the channel dimension is a particularly effective way to preserve both semantics and fine-grained details.
- Results: UniCom achieves state-of-the-art or competitive performance across image reconstruction, text-to-image generation, and challenging image editing tasks, with especially strong performance on editing benchmarks.
Download all checkpoints at once via huggingface-cli:
huggingface-cli download tencent/Unicom-Unified-Multimodal-Modeling-via-Compressed-Continuous-Semantic-Representations --repo-type model --local-dir ./model_zoo/ --resume-download

You can also download each component separately:
| Component | Local Path | Link |
|---|---|---|
| UniCom (text → SigLIP) | model_zoo/unicom_hf_model/ | Download |
| Decoder Transformer (SigLIP → image) | model_zoo/unicom_decoder_transformer.pt | Download |
| Flux VAE | model_zoo/flux-vae/ | Download |
| SigLIP2 | model_zoo/siglip2-so400m-patch16-naflex/ | Download |
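The all-at-once download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper name `download_all` is illustrative and not part of this repository.

```python
# Sketch only: mirrors the huggingface-cli command using huggingface_hub.
# `download_all` is a hypothetical helper, not a function shipped with UniCom.

REPO_ID = (
    "tencent/Unicom-Unified-Multimodal-Modeling-via-"
    "Compressed-Continuous-Semantic-Representations"
)


def download_all(local_dir="./model_zoo/"):
    """Download every checkpoint in the model repo into `local_dir`."""
    # Imported lazily so this file can be imported without the package installed.
    from huggingface_hub import snapshot_download

    # Returns the local folder that now contains the repository snapshot.
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print("Checkpoints stored in:", download_all())
```

Interrupted downloads can usually be continued by running the script again, since files that are already complete are not fetched a second time.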
After downloading, verify the expected directory layout:
model_zoo/
├── unicom_hf_model/
├── unicom_decoder_transformer.pt
├── flux-vae/
└── siglip2-so400m-patch16-naflex/
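The layout check can be automated. The sketch below only tests that the four entries listed above exist under `model_zoo/`; it does not validate their contents, and the helper name `missing_components` is illustrative rather than part of this repository.

```python
from pathlib import Path

# Expected entries under model_zoo/, taken from the layout above.
EXPECTED = {
    "unicom_hf_model": "dir",
    "unicom_decoder_transformer.pt": "file",
    "flux-vae": "dir",
    "siglip2-so400m-patch16-naflex": "dir",
}


def missing_components(root="./model_zoo"):
    """Return the names of expected entries that are absent under `root`."""
    root = Path(root)
    missing = []
    for name, kind in EXPECTED.items():
        path = root / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_components()
    if missing:
        print("Missing components:", ", ".join(missing))
    else:
        print("All components present.")
```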
conda create -n unicom python=3.12 -y
conda activate unicom

Install PyTorch first according to your CUDA version. Example for CUDA 12.8:
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--prompt "A ginger kitten tangled in a ball of wool, looking puzzled." \
--output-dir ./output/t2i_demo \
--diff-infer-steps 50 \
--seed 42 \
--image-size auto \
--n-samples-per-prompt 4

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--prompt "Add a blue baseball cap on the boy's head" \
--image ./UniCom/assets/demo_imgs/input_0.jpg \
--image-size auto \
--seed 42 \
--output-dir ./output/ti2i_demo \
--diff-infer-steps 50

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--prompt "Place the chair from the second image onto the snow in the third image, and then place the coffee cup from the first image onto the chair." \
--image ./UniCom/assets/demo_imgs/input_1_0.png ./UniCom/assets/demo_imgs/input_1_1.png ./UniCom/assets/demo_imgs/input_1_2.png \
--image-size auto \
--seed 42 \
--output-dir ./output/ti2i_multi_demo \
--diff-infer-steps 50

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--csv-path ./UniCom/eval/t2i.csv \
--output-dir ./output/t2i_demo_csv \
--num-gpus 8 \
--decoder-device 0,1,2,3,4,5,6,7 \
--image-size auto \
--diff-infer-steps 50 \
--n-samples-per-prompt 4

# no cot
python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--csv-path ./UniCom/eval/t2i.csv \
--output-dir ./output/t2i_demo_csv_nocot \
--num-gpus 8 \
--decoder-device 0,1,2,3,4,5,6,7 \
--image-size auto \
--diff-infer-steps 50 \
--bot-task vanilla \
--use-system-prompt en_vanilla \
--n-samples-per-prompt 4

The pipeline first exports latent representations, then decodes them into images:
output_dir/
|-- latents/
|   |-- results.csv
|   `-- *.pt
`-- images/
    `-- *.png
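To confirm that a run finished both stages, the output directory can be summarized with a few lines of Python. This is a sketch that relies only on the layout shown above; it makes no assumption about the column names in `results.csv` and simply reports whatever header it finds. The helper name `summarize_output` is illustrative.

```python
import csv
from pathlib import Path


def summarize_output(output_dir):
    """Count exported latents, decoded images, and rows in results.csv."""
    out = Path(output_dir)
    latents = sorted((out / "latents").glob("*.pt"))
    images = sorted((out / "images").glob("*.png"))

    csv_path = out / "latents" / "results.csv"
    columns, rows = [], 0
    if csv_path.is_file():
        with csv_path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Reading fieldnames consumes the header row only.
            columns = list(reader.fieldnames or [])
            rows = sum(1 for _ in reader)

    return {
        "latents": len(latents),
        "images": len(images),
        "csv_rows": rows,
        "csv_columns": columns,
    }


if __name__ == "__main__":
    print(summarize_output("./output/t2i_demo"))
```

A latent count with no matching images suggests the export stage completed but decoding did not.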
UniCom_Decoder also supports reconstruction directly from input images.
bash UniCom_Decoder/scripts/run.sh \
--config-file UniCom_Decoder/configs/reconstruction_demo.yaml

The demo images are stored in UniCom_Decoder/assets/demo_recon_imgs/.
Each saved output is a side-by-side comparison:
- left: input image
- right: reconstructed image
The default demo config already uses the recommended settings:
mode: eval_gt
aba_mode: compression_64_siglip
condition_mode: siglip2
cfg_scale: 1.0
infer_steps: 50
flow_shift: 3.0
siglip2_max_num_patches: 1024
This project builds upon several excellent open-source projects and research efforts.
If you find UniCom useful for your research, please cite:
@misc{zhao2026unicomunifiedmultimodalmodeling,
title={UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations},
author={Yaqi Zhao and Wang Lin and Zijian Zhang and Miles Yang and Jingyuan Chen and Wentao Zhang and Zhao Zhong and Liefeng Bo},
year={2026},
eprint={2603.10702},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.10702},
}