Zhenghao Xing

I am a third-year Ph.D. student at The Chinese University of Hong Kong, advised by Prof. Pheng-Ann Heng and Prof. Chi-Wing Fu. My research focuses on multimodal agents and audio-visual reasoning: building foundation models that can see, hear, reason, and act.

Before CUHK, I worked at Tencent AI Lab on RL-based game agents, designing reward strategies for diverse playstyles. During my Ph.D., I interned at Alibaba Qwen, building agentic RL frameworks, algorithms, and data pipelines for omni-modal reasoning in foundation models. I hold an M.Sc. in Big Data Technology from HKUST and a B.Sc. (Hons) in Computer Science and Technology from Beijing Normal-Hong Kong Baptist University.

My long-term goal is to build multimodal agents that understand the world as it unfolds. I see video, naturally coupled with audio, as the bridge from multimodal perception to agentic interaction in open-ended environments.

Outside the lab, I keep a steady routine around basketball, tennis, and regular gym workouts.

News & Announcements

Jul 2026 Invited to give a talk on OmniAgent at the Alibaba Qwen booth during ICML 2026 in Seoul (#B400, Jul 8, 13:20–13:40).
Jun 2026 Selected for the Tencent Project Up Scholarship for ICML 2026.
May 2026 One paper accepted to ICML 2026! OmniAgent introduces the first native omni-modal agent for active perception, treating video perception as an iterative reasoning process. The arXiv paper, code, and models are now publicly released.
Mar 2026 Qwen3.5-Omni is here!
Jan 2026 One paper accepted to ICLR 2026.
Jan 2026 One paper accepted to IJCV.
Sep 2025 Qwen3-Omni is here!
Sep 2025 One paper accepted to NeurIPS 2025.
Feb 2025 One paper accepted to CVPR 2025.
Sep 2024 One paper accepted to TIP.
Jul 2024 Moved into campus housing after a long wait!

Selected Publications

* marks joint first authors. Includes peer-reviewed publications and technical reports.

Native Active Perception as Reasoning for Omni-Modal Understanding Zhenghao Xing*, Ruiyang Xu*, Yuxuan Wang*, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, and Pheng-Ann Heng. Conference ICML 2026 arXiv Code Model
Qwen3.5-Omni Technical Report Qwen Team Technical Report 2026 arXiv Qwen Chat
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception Ziyang Ma*, Ruiyang Xu*, Zhenghao Xing*, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, and Xie Chen. Conference ICLR 2026 arXiv Code
Qwen3-Omni Technical Report Qwen Team Technical Report 2025 arXiv Code Qwen Chat
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, and Pheng-Ann Heng. Technical Report 2025 arXiv Code
EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights Zhenghao Xing*, Hao Chen*, Binzhu Xie, Jiaqi Xu, Ziyu Guo, Xuemiao Xu, Jianye Hao, Chi-Wing Fu, Xiaowei Hu, and Pheng-Ann Heng. Conference CVPR 2025 Paper Code
Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era Xiaowei Hu*, Zhenghao Xing*, Tianyu Wang, Chi-Wing Fu, and Pheng-Ann Heng. Journal IJCV 2026 Paper arXiv Code
Video Instance Shadow Detection Under the Sun and Sky Zhenghao Xing*, Tianyu Wang*, Xiaowei Hu, Haoran Wu, Chi-Wing Fu, and Pheng-Ann Heng. Journal TIP 2024 Paper arXiv Code

Invited Talks

Native Active Perception as Reasoning for Omni-Modal Understanding Alibaba Qwen Booth #B400, ICML 2026 Jul 8, 2026, 13:20–13:40 · Seoul, Korea

Academic Service

Reviewer: CVPR 2026, ICML 2026 (Silver Reviewer), ECCV 2026