Tencent Robotics X Γ HY Vision Team
HY-Embodied-0.5-X is an enhanced open-source embodied multimodal foundation model
jointly released by Tencent Robotics X and the HY Vision Team. Built on
top of the HY-Embodied-0.5 MoT-2B architecture (4B total parameters with
only 2B activated), it is specifically optimized for the core loop of
real-world robotics β "understand, reason, and act".
The model reaches state-of-the-art performance on 10 mainstream embodied task-planning benchmarks, ranking 1st among edge-side domain models on 7 of them. Compared with general-purpose multimodal models, HY-Embodied-0.5-X focuses more tightly on the problems that matter in real-world robot interaction, with dedicated improvements in fine-grained manipulation understanding, spatial reasoning, action prediction, risk assessment, multimodal reference grounding, and long-horizon planning β pushing the model from "seeing" to "doing".
[2026-04-24]π Released HY-Embodied-0.5-X, an embodied-focused enhancement on top of HY-Embodied-0.5 MoT-2B, together with inference and training code.
- π§ Stronger Spatial Understanding β accurately reasons about object positions, scene layout, relative spatial relations, and manipulation states, providing a reliable perceptual basis for action decisions.
- π Stronger Long-Horizon Planning β handles multi-step, strongly-dependent complex tasks, producing stable task decomposition, action planning, and execution decisions across continuous interactions.
- π€ Stronger Embodied Interaction β beyond visual understanding and dialogue, supports task parsing, reference resolution, action decisions, risk judgement, and failure reflection, closely matching the real robot interaction loop.
- π¦ Edge-Friendly β built on the MoT-2B architecture (4B total / 2B activated), suitable for on-device deployment and real-time response.
HY-Embodied-0.5-X combines self-collected first-person robot manipulation data, robotic-arm manipulation data, and open-source embodied data into a high-quality corpus that covers manipulation understanding, first-person task reasoning, and multimodal reference grounding:
- Robotic-arm / human-hand trajectories β dedicated data for state understanding, next-action prediction, manipulation-risk assessment, failure diagnosis, and pairwise candidate-action comparison.
- First-person embodied tasks β fine-grained action recognition, subtask progress estimation, hand spatial localization, depth estimation, relative spatial reasoning, camera pose inference, and more.
- Multimodal interactive reference grounding β data built around ambiguous real-world instructions such as "put this over there", combining speech and gesture cues.
All core samples are paired with chain-of-thought (CoT) annotations and a full "generate β verify β correct β eval-regression" data-quality loop. Embodied, internet, and 3D data are further unified through a standardized reconstruction pipeline that turns heterogeneous sources into consistent, high-quality embodied reasoning data.
Training follows a staged iterative strategy:
- Quickly validate training configs and data cleaning on a small, high-quality subset.
- Progressively scale up training data and compute.
- Kick off full-scale training only after the optimal data mix and training strategy are confirmed.
This ensures each unit of compute is invested in the most valuable data.
Across 10 open-source benchmarks covering planning, spatial reasoning, embodied QA, visual reference, and trajectory understanding, HY-Embodied-0.5-X stays in the top tier.
We built an internal embodied-planning benchmark on AI2Thor with 1,011 tasks across four household scenes (kitchen, bedroom, living room, bathroom), evaluating planning and execution on navigation, grasping, placement, appliance operation, and food cutting. HY-Embodied-0.5-X shows clear gains on long-horizon manipulation, self-awareness, and spatial understanding:
HY-Embodied-0.5-X is integrated with the PlaygroundX simulation framework (built on Tairos). It produces full plans for household instructions such as "throw the potato into the trash", "close the fridge door", or "put the tomato in the fridge", and adjusts execution based on environmental feedback β including on-the-fly replanning when an initial plan fails, forming a complete ReAct loop: reason β execute β detect failure β replan.
A one-click conda setup script setup_env.sh is provided. It creates the
environment, installs PyTorch / flash_attn / transformers (native
HY-Embodied support) / and all remaining dependencies. flash_attn
compiles from source and takes ~10β20 minutes:
bash setup_env.sh
conda activate hy_embodied_x
# (optional) expose the package as a console script
pip install -e .| Item | Requirement |
|---|---|
| OS | Linux |
| Python | 3.12 |
| CUDA | 12.6 |
| PyTorch | 2.10.0 |
| GPU | NVIDIA GPU with β₯ 16 GB VRAM |
Key dependencies:
transformers(specific commit, native HY-Embodied support),flash_attn==2.8.3,accelerate,deepspeed,timm,liger-kernel. Seesetup_env.shandrequirements.txtfor the pinned list.
hf download tencent/HY-Embodied-0.5-X \
--local-dir ckpts/HY-Embodied-0.5-XWeights (*.safetensors) are git-ignored and expected under
ckpts/HY-Embodied-0.5-X/. The inference and training code also accepts
the Hub repo id directly, which triggers on-demand download via
transformers.
# Default: thinking mode disabled
python -m hy_embodied.cli.infer \
--model ckpts/HY-Embodied-0.5-X \
--image ./assets/demo.jpg \
--prompt "Describe this image"
# Enable thinking mode (chain-of-thought reasoning)
python -m hy_embodied.cli.infer \
--model ckpts/HY-Embodied-0.5-X \
--image ./assets/demo.jpg \
--prompt "Describe this image" \
--enable-thinkingThe legacy python inference.py ... invocation also works (it forwards to
the same code path).
import torch
from hy_embodied.inference import GenerationConfig, HyEmbodiedPipeline
pipe = HyEmbodiedPipeline.from_pretrained(
"ckpts/HY-Embodied-0.5-X",
device="cuda",
torch_dtype=torch.bfloat16,
)
# Default: thinking disabled
print(pipe.generate(
"Describe the image in detail.",
image="./assets/demo.jpg",
generation_config=GenerationConfig(max_new_tokens=32768, temperature=0.05),
))
# Enable thinking mode
print(pipe.generate(
"Describe the image in detail.",
image="./assets/demo.jpg",
generation_config=GenerationConfig(
max_new_tokens=32768,
temperature=0.05,
enable_thinking=True,
),
))See docs/inference.md for batch inference and
multi-image / video examples.
Launch a server that exposes the standard /v1/chat/completions endpoint:
# Quick start
bash scripts/run_server.sh
# Or with custom options
python -m hy_embodied.cli.server \
--model ckpts/HY-Embodied-0.5-X \
--host 0.0.0.0 --port 8080
# After `pip install -e ".[serve]"`
hy-embodied-server --model ckpts/HY-Embodied-0.5-X --port 8080Then use any OpenAI-compatible client to call the model:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
# Text-only (thinking disabled by default)
resp = client.chat.completions.create(
model="HY-Embodied-0.5-X",
messages=[{"role": "user", "content": "How to open a fridge?"}],
)
print(resp.choices[0].message.content)
# With image (thinking disabled by default)
resp = client.chat.completions.create(
model="HY-Embodied-0.5-X",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/img.jpg"}},
{"type": "text", "text": "Describe this image."},
],
}],
)
# Enable thinking mode (chain-of-thought reasoning)
resp = client.chat.completions.create(
model="HY-Embodied-0.5-X",
messages=[{"role": "user", "content": "How to open a fridge?"}],
extra_body={"enable_thinking": True},
)
# Streaming
stream = client.chat.completions.create(
model="HY-Embodied-0.5-X",
messages=[{"role": "user", "content": "Plan how to clean the table."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)Supports curl as well:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "HY-Embodied-0.5-X",
"messages": [{"role":"user","content":"Hello!"}]
}'
# Enable thinking mode
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "HY-Embodied-0.5-X",
"messages": [{"role":"user","content":"Hello!"}],
"enable_thinking": true
}'See docs/inference.md for full server documentation.
# Single GPU smoke test β no torchrun / DeepSpeed required (β₯ 16 GB VRAM)
CUDA_VISIBLE_DEVICES=0 python -m hy_embodied.cli.train \
--config configs/sft/example_small_single_gpu.yaml
# or simply:
bash scripts/run_sft_single_gpu.sh
# 1 node Γ 8 GPUs with DeepSpeed ZeRO-2
bash scripts/run_sft_1node_8gpu.sh
# 4 nodes Γ 8 GPUs
bash scripts/run_sft_4node_8gpu.shTwo reference configs are shipped:
configs/sft/example_small_single_gpu.yamlβ single-GPU config with DeepSpeed disabled. Can be launched with plainpython -m(notorchrunneeded). Best for quick validation and debugging.configs/sft/example_small.yamlβ multi-GPU config with DeepSpeed ZeRO-2 enabled. Must be launched viatorchrunoraccelerate. Its training/optimizer defaults match the release recipe; new users typically only need to editdata.train_data_paths/data.train_data_sampling_ratiosto point at their own JSONL mixture.
Both configs ship with data_examples/data_demo.jsonl (14 samples across
6 capabilities, images bundled in the repo) so the default commands run
end-to-end with no external data.
See docs/training.md for the config reference and
distributed strategies.
- Point:
(x, y)or[(x1, y1), (x2, y2)] - Box:
[xmin, ymin, xmax, ymax] - Coordinates are normalized to the integer range (0, 1000).
- Thinking mode (when enabled): The response is structured as
<think>[reasoning]</think><answer>[answer]</answer>. - Direct mode (default): The response contains only the answer without a reasoning section.
HY-Embodied-0.5-X/
βββ README.md # this file
βββ LICENSE # Apache-2.0
βββ pyproject.toml # packaging + console scripts
βββ requirements.txt # full pinned dependency list
βββ setup_env.sh # one-click env setup
β
βββ src/hy_embodied/ # Python package
β βββ cli/ # `python -m hy_embodied.cli.train / .infer / .server`
β βββ training/ # SFT trainer, data pipeline, chat template
β βββ inference/ # HyEmbodiedPipeline
β
βββ configs/
β βββ sft/ # training config (example_small.yaml)
β βββ accelerate/ # accelerate launcher configs
β βββ deepspeed/ # ZeRO configs
β βββ fsdp/ # FSDP configs
β
βββ scripts/ # shell launchers (1-node / 4-node)
βββ data_examples/ # per-capability sample JSONLs (+ README)
βββ docs/ # data_format / training / inference / architecture
βββ assets/ # images used in docs / README
βββ ckpts/ # (gitignored) `hf download` target
βββ outputs/ # (gitignored) training run outputs
βββ inference.py # backward-compat shim for the legacy CLI
See docs/architecture.md for the architectural
rationale and dependency rules.
HY-Embodied-0.5-X targets the following embodied scenarios:
- Home service / tabletop manipulation β spatial reasoning, fine-grained manipulation reasoning, task understanding, and failure reflection in real environments.
- Task planning & simulation evaluation β planning evaluation and multimodal interaction research in simulated settings.
- Local deployment & development β on-device validation and downstream development of embodied capabilities.
@article{tencent2026hyembodied05x,
title = {HY-Embodied-0.5-X: An Enhanced Embodied Foundation Model for Real-World Agents},
author = {Tencent Robotics X and HY Vision Team},
year = {2026}
}Thanks to the Hugging Face community, and all open-source contributors. By open-sourcing HY-Embodied-0.5-X we hope to offer the embodied-AI community a more deployment-oriented foundation, and to push models from general understanding toward real-world execution.



