About Zhihang Yuan

Hello! I’m Zhihang Yuan. I received my Bachelor’s degree from Peking University in 2017 and my Ph.D. in Computer Science from Peking University in 2022, advised by Professor Guangyu Sun. I work on Efficient AI, focusing on model compression and inference acceleration for neural networks and on software-hardware co-optimization for deep learning.

In 2021, I joined Houmo AI, a startup building AI accelerators based on computing-in-memory (CIM) technology. There, I took part in the design of the AI accelerators and led the development of quantization algorithms and tools. In 2024, I joined Infini-AI as an innovative algorithm researcher.

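I received my Bachelor’s degree from Peking University in 2017 and my Ph.D. from the School of Computer Science at Peking University in 2022; my doctoral advisor was Guangyu Sun, a tenured associate professor. My research focuses on Efficient AI, in particular quantization and inference acceleration of neural networks and software-hardware co-optimization for deep learning. In 2021 I joined Houmo AI as a senior algorithm engineer and took part in the design of a digital computing-in-memory AI accelerator. In 2024 I joined Infini-AI as an innovative algorithm researcher. I enjoy getting outdoors: I took part in the Peking University Cycling Association’s 2014 summer expedition and the Peking University Mountaineering Association’s 2022 scientific expedition to Yengisar.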


  • Yuan Z, Shang Y, Zhou Y, et al. LLM Inference Unveiled: Survey and Roofline Model Insights. arXiv preprint arXiv:2402.16363, 2024.
  • Yue Y, Yuan Z, Duanmu H, et al. WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More. arXiv preprint arXiv:2402.12065, 2024. (co-first)
  • Wang H, Shang Y, Yuan Z, et al. QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning. arXiv preprint arXiv:2402.03666, 2024.
  • Yang D, He N, Hu X, et al. Post-training Quantization for Re-parameterization via Coarse & Fine Weight Splitting. Journal of Systems Architecture, 147: 103065, 2024.
  • Yuan Z, Shang Y, et al. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv 2023. (co-first)
  • Shang Y, Yuan Z, et al. PB-LLM: Partially Binarized Large Language Models. ICLR 2024. (co-first)
  • Shang Y, Yuan Z, et al. MIM4DD: Mutual Information Maximization for Dataset Distillation. NeurIPS 2023.
  • Yuan Z, Lin N, Liu J, et al. RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv preprint arXiv:2304.01089, 2023.
  • Niu L, Liu J, Yuan Z, et al. Improving Post-Training Quantization on Object Detection with Task Loss-Guided Lp Metric. arXiv preprint arXiv:2304.09785, 2023.
  • Yuan Z, Liu J, Wu J, et al. Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance. AdvML-Frontiers 2023.
  • Shang Y, Yuan Z, Xie B, et al. Post-training Quantization on Diffusion Models. CVPR 2023. (co-first)
  • Liu J, Niu L, Yuan Z, et al. PD-Quant: Post-Training Quantization based on Prediction Difference Metric. CVPR 2023. (corresponding author)
  • Han Y, Yuan Z, Pu Y, et al. Latency-aware Spatial-wise Dynamic Networks. NeurIPS 2022. (co-first)
  • Li X, Yuan Z, Guan Y, et al. Flatfish: a Reinforcement Learning Approach for Application-Aware Address Mapping. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2022. (co-first)
  • Li X, Wu B, Sun G, et al. Enabling High-Quality Uncertainty Quantification in a PIM Designed for Bayesian Neural Network. HPCA, 2022.
  • Yuan Z, Xue C, Chen Y, et al. PTQ4ViT: Post-Training Quantization Framework for Vision Transformers. European Conference on Computer Vision (ECCV), 2022. (co-first)
  • Yuan Z, Chen Y, Xue C, et al. PTQ-SL: Exploring the Sub-layerwise Post-training Quantization. arXiv preprint arXiv:2110.07809, 2021.
  • Yuan Z, Liu J, Li X, et al. NAS4RRAM: Neural Network Architecture Search for Inference on RRAM-based Accelerators. SCIENCE CHINA Information Sciences, 2021.
  • Yuan Z, Wu B, Sun G, et al. S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search. European Conference on Computer Vision (ECCV oral), 2020.
  • Yuan Z, Liu X, Wu B, et al. ENAS4D: Efficient Multi-stage CNN Architecture Search for Dynamic Inference. arXiv preprint, 2020.
  • Guan Y, Sun G, Yuan Z, et al. Crane: Mitigating Accelerator Under-utilization Caused by Sparsity Irregularities in CNNs. IEEE Transactions on Computers (TC), 2020.
  • Guan Y, Yuan Z, Sun G, et al. FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks. Asia and South Pacific Design Automation Conference (ASP-DAC), 2017.
  • Wu B, Liu Z, Yuan Z, et al. Reducing Overfitting in Deep Convolutional Neural Networks Using Redundancy Regularizer. International Conference on Artificial Neural Networks (ICANN), 2017.