About Zhihang Yuan

I am a member of ByteDance Seed, where I lead a multimodal large language model efficiency team. I earned my bachelor’s degree from Peking University in 2017, followed by my PhD from the School of Computer Science at Peking University in 2022. To date, I have developed various products powered by multimodal LLMs, including full-duplex real-time video and audio chat, as well as music generation, audio/video generation, and sound recognition models.

My research interests center on neural network compression algorithms (diffusion distillation, quantization, pruning, speculative decoding, sparse attention), software-hardware co-design, and reinforcement learning for generative models. According to Google Scholar, my published work has received over 3,500 citations.

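Zhihang Yuan received his bachelor's degree from Peking University in 2017 and his PhD from the School of Computer Science at Peking University in 2022, where his PhD advisor was Guangyu Sun. His research area is efficient AI, and he has published multiple papers in conferences and journals including ICLR, NeurIPS, ICML, CVPR, ICCV, ECCV, TPAMI, TCAD, and HPCA. According to Google Scholar, his work has been cited more than 3,500 times in total. He currently works on multimodal large models on the ByteDance Seed team, focusing on neural network compression algorithms, reinforcement learning for large models, and efficient neural network design. He enjoys getting outdoors: he took part in the Peking University Cycling Association's 2014 summer expedition and the Peking University Mountaineering Association's (山鹰社) 2022 Yengisar scientific expedition.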

Publications

Please visit my Google Scholar page for more detailed information.

* indicates equal contribution, + indicates corresponding author

Wang Z, Yuan Z, et al. OmniFit: Bridging Modalities via Layer-Adaptive Token Compression for Omnimodal Large Language Models. ICML 2026 (spotlight).
Li A*, Wang Y*, Yuan Z*, et al. LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs. arXiv preprint, 2025.
Guo Y*, Wang W*, Yuan Z*, et al. SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling. arXiv preprint, 2025.
Zhang H*, Su R*, Yuan Z+, et al. DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers, ICCV 2025.
Yuan Z*, Xie R*, Shang Y, et al. VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate, ICCV 2025.
Hu X, Chen Z, Yang D, et al. MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance. ICML 2025.
Duanmu H, Li X, Yuan Z, et al. MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design. ICML 2025.
Zhou S*, Yuan Z*, Yang D, et al. PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram, CVPR 2025.
Wang K, Shi M, Zhou Y, et al. A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training, CVPR 2025.
Yuan Z*, Wang S*, Shang Y, et al. DLFR-VAE: Dynamic Latent Frame Rate VAE for Efficient Video Generation, ACM MM, 2025.
Hu X*, Cheng Y*, Yang D*…, Yuan Z+. OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting, ICLR 2025.
Xu Z*, Yue Y*, Hu X, et al. MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods, ICLR 2025.
Yuan Z*, Shang Y*, Zhang H, et al. E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling. arXiv preprint arXiv:2412.14170, 2024.
Yuan Z*, Lu P*, Zhang H*, et al. DiTFastAttn: Attention Compression for Diffusion Transformer Models, NeurIPS 2024.
Duanmu H*, Yuan Z*, Li X, et al. SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models. COLM 2024 (Oral).
Han Y*, Liu Z*, Yuan Z*, et al. Latency-aware unified dynamic networks for efficient image recognition. TPAMI 2024.
Yuan Z*, Shang Y*, Zhou Y*, et al. LLM Inference Unveiled: Survey and Roofline Model Insights. arXiv preprint arXiv:2402.16363, 2024.
Yue Y*, Yuan Z*, Duanmu H, et al. WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More. arXiv preprint arXiv:2402.12065, 2024.
Shang Y*, Yuan Z*, et al. PB-LLM: Partially Binarized Large Language Models. ICLR 2024.
Zhang C, Yuan Z, et al. Algorithm-hardware co-design for Energy-Efficient A/D conversion in ReRAM-based accelerators. DATE 2024.
Guo A, Chen X, Dong F, et al. 34.3 A 22nm 64kb Lightning-Like Hybrid Computing-in-Memory Macro with a Compressed Adder Tree and Analog-Storage Quantizers for Transformer and CNNs, IEEE International Solid-State Circuits Conference (ISSCC) 2024.
Yuan Z*, Shang Y*, et al. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv 2023.
Shang Y, Yuan Z, et al. MIM4DD: Mutual Information Maximization for Dataset Distillation, NeurIPS 2023.
Yuan Z*, Lin N*, Liu J, et al. RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv preprint arXiv:2304.01089, 2023.
Niu L, Liu J, Yuan Z+, et al. Improving Post-Training Quantization on Object Detection with Task Loss-Guided Lp Metric. arXiv preprint arXiv:2304.09785, 2023.
Yuan Z*, Liu J*, Wu J, et al. Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance. AdvML-Frontiers 2023.
Shang Y*, Yuan Z*, Xie B, et al. Post-training Quantization on Diffusion Models. CVPR 2023.
Liu J, Niu L, Yuan Z+, et al. PD-Quant: Post-Training Quantization based on Prediction Difference Metric. CVPR 2023.
Han Y*, Yuan Z*, Pu Y, et al. Latency-aware Spatial-wise Dynamic Networks, NeurIPS 2022.
Li X*, Yuan Z*, Guan Y, et al. Flatfish: a Reinforcement Learning Approach for Application-Aware Address Mapping. TCAD 2022.
Li X, Wu B, Sun G, et al. Enabling High-Quality Uncertainty Quantification in a PIM Designed for Bayesian Neural Network. HPCA 2022.
Yuan Z*, Xue C*, Chen Y, et al. PTQ4ViT: Post-Training Quantization Framework for Vision Transformers. ECCV 2022.
Yuan Z, Chen Y, Xue C, et al. PTQ-SL: Exploring the Sub-layerwise Post-training Quantization. arXiv preprint arXiv:2110.07809, 2021.
Yuan Z, Liu J, Li X, et al. NAS4RRAM: Neural Network Architecture Search for Inference on RRAM-based Accelerators. SCIENCE CHINA Information Sciences (SCIS), 2021.
Ding M, Kang Y, Yuan Z, et al. Detection of facial landmarks by a convolutional neural network in patients with oral and maxillofacial disease. International Journal of Oral and Maxillofacial Surgery, 2021, 50(11): 1443-1449.
Yuan Z*, Wu B*, Sun G, et al. S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search. ECCV 2020 (oral).
Yuan Z, Liu X, Wu B, et al. ENAS4D: Efficient Multi-stage CNN Architecture Search for Dynamic Inference. arXiv preprint, 2020.
Guan Y, Sun G, Yuan Z, et al. Crane: Mitigating Accelerator Under-utilization Caused by Sparsity Irregularities in CNNs. IEEE Transactions on Computers (TC), 2020.
Guan Y, Yuan Z, Sun G, et al. FPGA-based accelerator for long short-term memory recurrent neural networks. Asia and South Pacific Design Automation Conference (ASP-DAC), 2017.
Wu B, Liu Z, Yuan Z, et al. Reducing overfitting in deep convolutional neural networks using redundancy regularizer. International Conference on Artificial Neural Networks (ICANN), 2017.