I am a third-year Ph.D. student at The Chinese University of Hong Kong, advised by Prof. Pheng-Ann Heng and Prof. Chi-Wing Fu. My research focuses on multimodal agents and audio-visual reasoning: building foundation models that can see, hear, reason, and act.
Before CUHK, I worked at Tencent AI Lab on RL-based game agents, designing reward strategies for diverse playstyles. During my Ph.D., I interned at Alibaba Qwen, building agentic RL frameworks, algorithms, and data pipelines for omni-modal reasoning in foundation models. I hold an M.Sc. in Big Data Technology from HKUST and a B.Sc. (Hons) in Computer Science and Technology from Beijing Normal-Hong Kong Baptist University.
My long-term goal is to build multimodal agents that understand the world as it unfolds. I see video, naturally coupled with audio, as the bridge from multimodal perception to agentic interaction in open-ended environments.
Outside the lab, I keep a steady routine around basketball, tennis, and regular gym workouts.
News & Announcements
- Jul 2026: Invited to give a talk on OmniAgent at the Alibaba Qwen booth during ICML 2026 in Seoul (#B400, Jul 8, 13:20–13:40).
- Jun 2026: Selected for the Tencent Project Up Scholarship for ICML 2026.
- May 2026: One paper accepted to ICML 2026! OmniAgent introduces the first native omni-modal agent for active perception, treating video perception as an iterative reasoning process. The arXiv paper, code, and models are now publicly released.
- Mar 2026: Qwen3.5-Omni is here!
- Jan 2026: One paper accepted to ICLR 2026.
- Jan 2026: One paper accepted to IJCV.
- Sep 2025: Qwen3-Omni is here!
- Sep 2025: One paper accepted to NeurIPS 2025.
- Feb 2025: One paper accepted to CVPR 2025.
- Sep 2024: One paper accepted to TIP.
- Jul 2024: Moved into campus housing after a long wait!
Selected Publications
* marks joint first authors. Includes peer-reviewed publications and technical reports.
- Native Active Perception as Reasoning for Omni-Modal Understanding (Conference, ICML 2026) [arXiv, Code, Model]
- Qwen3.5-Omni Technical Report (Technical Report, 2026) [arXiv, Qwen Chat]
- Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception (Conference, ICLR 2026) [arXiv, Code]
- Qwen3-Omni Technical Report (Technical Report, 2025) [arXiv, Code, Qwen Chat]
- EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning (Technical Report, 2025) [arXiv, Code]
- EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights (Conference, CVPR 2025) [Paper, Code]
- Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era (Journal, IJCV 2026) [Paper, arXiv, Code]
- Video Instance Shadow Detection Under the Sun and Sky (Journal, TIP 2024) [Paper, arXiv, Code]
Invited Talks
- Native Active Perception as Reasoning for Omni-Modal Understanding