About Me
I’m a Ph.D. candidate at Tsinghua University, advised by Jianyong Wang. I am also a Pre-Career Scholar at the Shanghai Innovation Institute. Since July 2023, I have been a research intern at Microsoft Research Asia, mentored by Li Dong.
My research interests span two main directions. The first focuses on the architecture and pre-training of LLMs, including attention mechanism design and scalable inference and generation for long-context modeling. The second centers on multimodal world models, with an emphasis on unified multimodal architectures and autoregressive video generation, as well as their applications in embodied intelligence and real-time video interaction.
Email: syt23@mails.tsinghua.edu.cn
Links: [GitHub] [Twitter] [Google Scholar]
Education
- Ph.D., Tsinghua University (2023/08 ~ )
- Undergraduate, Tsinghua University (2018/08 ~ 2023/07)
  - Computer Science and Technology (2020/08 ~ 2023/07)
  - Mathematics and Physics (2018/08 ~ 2020/07)
- Taiyuan No.5 Middle School (2015/08 ~ 2018/07)
  - Participated in the Physics Olympiad and achieved nothing
Selected Publications
Preprint
- Efficient Attention Mechanisms for Large Language Models: A Survey
  Yutao Sun*, Zhenyu Li*, Yike Zhang*, Tengyu Pan*, Bowen Dong*, Yuyi Guo, Jianyong Wang.
  arXiv:2507.19595, 2025.
- Rectified Sparse Attention
  Yutao Sun*, Tianzhu Ye*, Li Dong*, Yuqing Xia*, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei.
  arXiv:2506.04108, 2025.
  [pdf][code]
- Multimodal Latent Language Modeling with Next-Token Diffusion
  Yutao Sun*, Hangbo Bao*, Wenhui Wang*, Zhiliang Peng*, Li Dong*, Shaohan Huang, Jianyong Wang, Furu Wei.
  arXiv:2412.08635, 2024.
  [pdf][code]
- Retentive Network: A Successor to Transformer for Large Language Models
  Yutao Sun*, Li Dong*, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei.
  arXiv:2307.08621, 2023.
  [pdf][code]
Conference
- Differential Transformer
  Tianzhu Ye*, Li Dong*, Yuqing Xia*, Yutao Sun*, Yi Zhu, Gao Huang, Furu Wei.
  International Conference on Learning Representations (ICLR), Oral, 2025.
  [pdf][code]
- You Only Cache Once: Decoder-Decoder Architectures for Language Models
  Yutao Sun*, Li Dong*, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei.
  Neural Information Processing Systems (NeurIPS), Oral, 2024.
  [pdf][code]
- A Length-Extrapolatable Transformer
  Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei.
  Association for Computational Linguistics (ACL), Long paper, 2023.
  [pdf][code]
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
  Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei.
  Findings of Association for Computational Linguistics (Findings of ACL), Long paper, 2023.
  [pdf]
Talks
- (2025) The Revolution of Foundation Architecture at SII
- (2024) YOCO at Unify, SDU, and Danqi’s group at Princeton
- (2023) RetNet at BAAI, DAMO Academy, and HanLab at MIT
Honors & Awards
- (06/2023) Outstanding Graduate & Thesis, Tsinghua University
- (09/2022) Tang Jun-Yuan Scholarship, Tsinghua University
- (09/2020) Academic & Social Work Excellence Award, Tsinghua University
Teaching Experience
- Teaching Assistant in Data Mining (2025 Fall)
- Teaching Assistant in Object-Oriented Programming (2024 Spring)
- Teaching Assistant in Software Engineering (2022 Spring, 2022 Fall, 2023 Fall)