TokenHSI

Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

Liang Pan¹,², Zeshi Yang³, Zhiyang Dou², Wenjia Wang², Buzhen Huang⁴,
Bo Dai²,⁵, Taku Komura², Jingbo Wang¹,†

¹Shanghai AI Laboratory, ²The University of Hong Kong, ³Independent Researcher,
⁴Southeast University, ⁵Feeling AI

(†: corresponding author)

CVPR 2025
🏆️ Oral Presentation (Top 3.3%)

Also Spotlight in the 1st Workshop on Humanoid Agents at CVPR 2025

Introducing TokenHSI, a unified model that enables physics-based characters to perform diverse human-scene interaction tasks.

Abstract

Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating multi-task training. Moreover, our policy architecture supports variable-length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks.
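To make the masking idea concrete, the following is a minimal sketch, not the released implementation. It assumes a single attention head, toy 2-D token embeddings, and hypothetical task names; the shared proprioception token is always visible, while task tokens that are not active are masked out before the softmax.

```python
import math

# Hypothetical task names, used only for illustration.
TASKS = ["follow", "sit", "climb", "carry"]

def build_mask(active_tasks):
    """Index 0 is the shared proprioception token, which is never masked."""
    return [True] + [task in active_tasks for task in TASKS]

def masked_attention(query, keys, values, active):
    """Single-head scaled dot-product attention that ignores inactive tokens."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if on else float("-inf")
        for key, on in zip(keys, active)
    ]
    top = max(scores)  # finite, because the proprioception token is always active
    exps = [math.exp(s - top) for s in scores]  # masked tokens contribute exactly 0
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights

# Toy token embeddings (assumed values, not learned ones).
tokens = [
    [1.0, 0.0],  # proprioception (shared)
    [0.0, 1.0],  # follow
    [1.0, 1.0],  # sit
    [0.5, 0.5],  # climb
    [0.0, 2.0],  # carry
]

mask = build_mask({"sit"})
out, weights = masked_attention(tokens[0], tokens, tokens, mask)
```

With only the sitting task active, the proprioception query attends to itself and to the sitting token; the other three task tokens receive zero weight, so the same network can serve any subset of tasks.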

Pipeline

TokenHSI consists of two stages: (left) foundational skill learning and (right) policy adaptation.

Foundational Skill Learning

TokenHSI excels at seamlessly unifying multiple foundational HSI skills within a single transformer.

Path-following

Sitting

Climbing

Carrying

Policy Adaptation

The learned skills can be flexibly and efficiently adapted to challenging new tasks through our transformer-based policy adaptation.

(1) Skill Composition

We train a new task tokenizer to combine each of path-following, sitting, and climbing with carrying to create new composite skills.
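One way to picture this step is as appending a tokenizer to an existing set, which a variable-length token sequence allows without changing the rest of the policy. The sketch below is illustrative only: the `LinearTokenizer` class, the observation sizes, and the choice to keep the previously learned tokenizers frozen are all assumptions made for the example and are not stated on this page.

```python
import random

TOKEN_DIM = 4  # assumed token width

class LinearTokenizer:
    """Hypothetical tokenizer: one linear layer from observation to token."""

    def __init__(self, obs_dim, trainable, seed):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
            for _ in range(TOKEN_DIM)
        ]
        self.trainable = trainable

    def __call__(self, obs):
        return [sum(w * x for w, x in zip(row, obs)) for row in self.weight]

# Tokenizers from foundational skill learning (assumed frozen here).
tokenizers = {
    "proprio": LinearTokenizer(obs_dim=6, trainable=False, seed=0),
    "sit": LinearTokenizer(obs_dim=3, trainable=False, seed=1),
    "carry": LinearTokenizer(obs_dim=3, trainable=False, seed=2),
}

# Composite skill: add one new, trainable tokenizer for the combined task.
tokenizers["sit+carry"] = LinearTokenizer(obs_dim=5, trainable=True, seed=3)

def tokenize(observations):
    """Turn each named observation into a token; the sequence length may vary."""
    return {name: tokenizers[name](obs) for name, obs in observations.items()}

observations = {
    "proprio": [0.0] * 6,
    "sit": [0.1, 0.2, 0.3],
    "carry": [0.4, 0.5, 0.6],
    "sit+carry": [0.1, 0.2, 0.3, 0.4, 0.5],
}
token_seq = tokenize(observations)
trainable = [name for name, tok in tokenizers.items() if tok.trainable]
```

In this sketch only the new `sit+carry` tokenizer would receive gradient updates, while the token sequence simply grows from three entries to four.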

(2) Object Shape Variation

We fine-tune the task tokenizer (previously trained for box-carrying) to generalize it to more objects, such as chairs and tables.

(3) Terrain Shape Variation

We introduce a new height map tokenizer to enable the humanoid to perform path-following and carrying tasks on uneven terrain.
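As a rough illustration of what a height map observation could look like before it is tokenized, the sketch below samples terrain heights on a small grid centred on the character and aligned with its heading. The grid size, spacing, function names, and the staircase terrain are assumptions for the example; the page does not specify how the actual height map is built.

```python
import math

def sample_height_map(terrain_height, root_xy, heading, size=5, spacing=0.25):
    """Sample a size x size grid of heights around the character.

    terrain_height: callable (x, y) -> height, a stand-in for a simulator query.
    heading: yaw angle in radians; the grid is rotated to face this direction.
    """
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    half = (size - 1) / 2.0
    heights = []
    for i in range(size):
        for j in range(size):
            local_x = (i - half) * spacing  # forward offset in the character frame
            local_y = (j - half) * spacing  # lateral offset in the character frame
            world_x = root_xy[0] + cos_h * local_x - sin_h * local_y
            world_y = root_xy[1] + sin_h * local_x + cos_h * local_y
            heights.append(terrain_height(world_x, world_y))
    return heights  # flat list, ready to feed to a tokenizer

def stairs(x, y):
    """Toy terrain: steps that are 1 m deep and 0.1 m high along +x."""
    return 0.1 * max(0, math.floor(x))

obs = sample_height_map(stairs, root_xy=(1.0, 0.0), heading=0.0)
```

Here the character stands at the edge of the first step, so the rows of the grid behind it read 0.0 and the rows at or ahead of it read 0.1.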

(4) Long-horizon Task Completion in a Complex Dynamic Environment

We jointly fine-tune multiple task tokenizers to tackle challenges in long-horizon tasks, such as skill transition and collision avoidance.

BibTeX
