Publications - Zhihan's Home Page

2026

PreServe: Intelligent Management for LMaaS Systems via Hierarchical Prediction

Zhihan Jiang, Yujie Huang, Guangba Yu, Junjie Huang, Jiazhen Gu, Michael R. Lyu

ICSE'26 The IEEE/ACM International Conference on Software Engineering, Rio de Janeiro, Brazil, Apr 2026.

Language-Model-as-a-Service (LMaaS) platforms handle millions of daily requests and must meet low-latency, SLO, and efficiency goals, but conventional cloud managers falter under LMaaS’s dynamic, bursty workloads. We introduce a hierarchical-prediction management framework that pairs a coarse-grained service-workload predictor with a fine-grained request-load predictor to build per-instance load anticipators. By fusing long- and short-term forecasts, it proactively auto-scales resources and routes requests based on current and anticipated load, preventing under-/over-provisioning and instance load imbalancing.

PreServe: Intelligent Management for LMaaS Systems via Hierarchical Prediction

Zhihan Jiang, Yujie Huang, Guangba Yu, Junjie Huang, Jiazhen Gu, Michael R. Lyu

ICSE'26 The IEEE/ACM International Conference on Software Engineering, Rio de Janeiro, Brazil, Apr 2026.

2025

LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems

Zhihan Jiang, Jinyang Liu, Yichen Li, Haiyu Huang, Xiao He, Tieying Zhang^†, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025.

Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable log-based framework powered by LLMs for automated alert diagnosis.

LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems

Zhihan Jiang, Jinyang Liu, Yichen Li, Haiyu Huang, Xiao He, Tieying Zhang^†, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025.

iKnow: an Intent-Guided Chatbot for Cloud Operations with Retrieval-Augmented Generation

Junjie Huang, Yuedong Zhong, Guangba Yu, Zhihan Jiang^†, Minzhi Yan, Wenfei Luan, Tianyu Yang, Rui Ren, Michael R. Lyu

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025. 🏆 ACM SIGSOFT Distinguished Paper Award

While the sheer volume of operational documentation required for managing complex cloud services hinders efficient knowledge acquisition, Retrieval-Augmented Generation (RAG) offers a streamlined solution by retrieving relevant knowledge to generate concise, referenced answers. However, deploying a reliable RAG-based chatbot for cloud operation remains a challenge. In this experience paper, we first analyze the development and deployment of RAG-based chatbots for operational question answering (OpsQA) at a large-scale cloud vendor. Base on the findings, we propose iKnow, an intent-guided RAG-based chatbot that integrates intent detection, query rewriting tailored to each intent, and missing knowledge detection to enhance answer quality.

iKnow: an Intent-Guided Chatbot for Cloud Operations with Retrieval-Augmented Generation

Junjie Huang, Yuedong Zhong, Guangba Yu, Zhihan Jiang^†, Minzhi Yan, Wenfei Luan, Tianyu Yang, Rui Ren, Michael R. Lyu

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025. 🏆 ACM SIGSOFT Distinguished Paper Award

Automated Proactive Logging Quality Improvement for Large-Scale Codebases

Yichen Li, Jinyang Liu, Junsong Pu, Zhihan Jiang, Zhuangbin Chen, Xiao He, Tieying Zhang^†, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025.

High-quality logging is critical for the reliability of cloud services, yet the industrial process for improving it is typically manual, reactive, and unscalable. Existing automated tools inherit this reactive nature, failing to answer the crucial whether-to-log question and are constrained to simple logging statement insertion, thus addressing only a fraction of the real-world logging improvement. To address these gaps and cope with logging debt in large-scale codebases, we propose LogImprover, a framework powered by LLMs that automates proactive logging quality improvement and introduce two paradigm shifts: from reactive generation to proactive discovery, and from simple insertion to holistic logging patch generation.

Automated Proactive Logging Quality Improvement for Large-Scale Codebases

Yichen Li, Jinyang Liu, Junsong Pu, Zhihan Jiang, Zhuangbin Chen, Xiao He, Tieying Zhang^†, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025.

ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems

Junsong Pu, Yichen Li, Zhuangbin Chen^†, Jinyang Liu, Zhihan Jiang, Jianjun Chen, Rui Shi, Zibin Zheng, Tieying Zhang

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025.

Reliability management in cloud service systems is challenging due to the cascading effect of failures. Error wrapping, a practice prevalent in modern microservice development, enriches errors with context at each layer of the function call stack, constructing an error chain that describes a failure from its technical origin to its business impact. However, this also presents a significant traceability problem when recovering the complete error propagation path from the final log message back to its source. Existing approaches are ineffective at addressing this problem. To fill this gap, we present ErrorPrism for automated reconstruction of error propagation paths in production microservice systems by integrating static analysis and an LLM agent.

ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems

Junsong Pu, Yichen Li, Zhuangbin Chen^†, Jinyang Liu, Zhihan Jiang, Jianjun Chen, Rui Shi, Zibin Zheng, Tieying Zhang

ASE'25 The IEEE/ACM International Conference on Automated Software Engineering, Seoul, South Korea, Nov 2025.

LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms

Zhihan Jiang, Rui Ren, Guangba Yu^†, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

DSN'25 The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Naples, Italy, Jun 2025.

Multi-tenant large-scale LLM training platforms have been built to offer LLM training services, while performance issues occur frequently and can result in substantial resource wastage. The limited visibility from the perspective of platform providers impedes existing profiling methods and poses challenges to the performance monitoring and diagnosis of LLM training jobs. This paper proposes LLMPrism, the first black-box performance diagnosis solution for LLM training platforms by utilizing underlying network flow data and the distinct characteristics in the LLM training procedure. By progressively recognizing LLM training jobs, identifying their parallelism strategies, and reconstructing the training timelines, LLMPrism achieves non-intrusive, lightweight, and continuous monitoring of LLM training systems.

LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms

Zhihan Jiang, Rui Ren, Guangba Yu^†, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

DSN'25 The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Naples, Italy, Jun 2025.

No More Labelled Examples? An Unsupervised Log Parser with LLMs

Junjie Huang, Zhihan Jiang, Zhuangbin Chen^†, Michael R. Lyu

FSE'25 The ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, Jun 2025.

Log parsing is a critical prerequisite for many log analysis tasks. However, existing language model-based parsers often rely heavily on high-quality labeled examples to perform well, which limits their practicality in real-world scenarios. To overcome this limitation, we propose LUNAR, an unsupervised, LLM-based method for efficient and ready-to-use log parsing, which is based on the key insight that while LLMs struggle with direct log parsing, their performance can be significantly improved through comparative analysis of multiple logs that differ only in their parameter components.

No More Labelled Examples? An Unsupervised Log Parser with LLMs

Junjie Huang, Zhihan Jiang, Zhuangbin Chen^†, Michael R. Lyu

FSE'25 The ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, Jun 2025.

L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis

Zhihan Jiang, Junjie Huang, Guangba Yu^†, Zhuangbin Chen, Yichen Li, Renyi Zhong, Cong Feng, Yongqiang Yang, Zengyin Yang, Michael R. Lyu

FSE'25 The ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, Jun 2025.

The training process of Large Language Models (LLMs) requires substantial resources, as evidenced by scaling laws, which frequently leads to inevitable failures. In this paper, we present the first empirical study of LLM training failures on our production platform. Leveraging the obtained insights and the distinct cross-job, spatial, and temporal patterns present in LLM training logs, we propose L4, the first log-based large-scale LLM training failure diagnosis framework, which can automatically extract failure-indicating information (i.e., log events, nodes, stages, and iterations) from extensive training logs, thereby reducing manual effort and facilitating failure recovery.

L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis

Zhihan Jiang, Junjie Huang, Guangba Yu^†, Zhuangbin Chen, Yichen Li, Renyi Zhong, Cong Feng, Yongqiang Yang, Zengyin Yang, Michael R. Lyu

FSE'25 The ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, Jun 2025.

COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge

Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu^†, Michael R. Lyu

ICSE'25 The IEEE/ACM International Conference on Software Engineering, Ottawa, Ontario, Canada, Apr 2025.

Automatically identifying the root cause of runtime failures is critical for ensuring the reliability of distributed systems. However, prevailing automatic RCA approaches rely on comprehensive runtime monitoring data, which is often not fully available in issue platforms. To obtain more accurate and comprehensive RCA results, we propose COCA, a code knowledge enhanced RCA approach for issue reports.

COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge

Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu^†, Michael R. Lyu

ICSE'25 The IEEE/ACM International Conference on Software Engineering, Ottawa, Ontario, Canada, Apr 2025.

2024

Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis

Junjie Huang, Zhihan Jiang, Jinyang Liu, Yintong Huo, Jiazhen Gu, Zhuangbin Chen^†, Cong Feng, Hui Dong, Zengyin Yang, Michael R. Lyu

ISSRE'24 The International Symposium on Software Reliability Engineering, Tsukuba, Japan, Oct 2024.

Logs are crucial for maintaining online service systems, but manual investigation of logs by engineers is labor-intensive and prone to errors. We find that engineers typically prioritize two categories of log information for diagnosis: fault-indicating descriptions (FID) that highlight abnormal events, and fault-indicating parameters (FIP) that identify associated entities. Motivated by these findings, we propose Log4d, a two-stage approach with novel prompt-based tuning to automatically extract fault-indicating information from logs for fault diagnosis.

Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis

Junjie Huang, Zhihan Jiang, Jinyang Liu, Yintong Huo, Jiazhen Gu, Zhuangbin Chen^†, Cong Feng, Hui Dong, Zengyin Yang, Michael R. Lyu

ISSRE'24 The International Symposium on Software Reliability Engineering, Tsukuba, Japan, Oct 2024.

Exploring the Effectiveness of LLMs in Automated Logging Statement Generation: An Empirical Study

Yichen Li, Yintong Huo, Zhihan Jiang, Renyi Zhong, Pinjia He, Yuxin Su, Lionel C. Briand, Michael R. Lyu

TSE'24 IEEE Transactions on Software Engineering, 2024.

Despite advancements in natural language generation and programming language comprehension, the potential of large language models (LLMs) for generating logging statements remains unexplored. To fill the gap, we conduct the first study on LLMs for logging statement generation. We create a logging statement generation dataset and evaluate the effectiveness and generalization capabilities of 13 top-performing LLMs. Our empirical analysis reveals the limitations of current logging methods, highlights the promise of LLM-based logging tools, and offers actionable guidance for developing more practical models.

Exploring the Effectiveness of LLMs in Automated Logging Statement Generation: An Empirical Study

Yichen Li, Yintong Huo, Zhihan Jiang, Renyi Zhong, Pinjia He, Yuxin Su, Lionel C. Briand, Michael R. Lyu

TSE'24 IEEE Transactions on Software Engineering, 2024.

A Large-scale Evaluation for Log Parsing Techniques: How Far are We?

Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen^†, Jieming Zhu, Michael R. Lyu

ISSTA'24 The ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, Sep 2024.

Log parsing is essential for converting unstructured logs into structured data for automated analysis. Evaluating the characteristics and performance of various log parsers is crucial, however, the existing Loghub dataset is limited in scale and representativeness. We introduce Loghub-2.0, comprising 14 datasets with an average of 3.6 million logs each. Based on these datasets, we thoroughly re-evaluate 15 state-of-the-art log parsers in a more rigorous and practical setting, offering valuable insights.

A Large-scale Evaluation for Log Parsing Techniques: How Far are We?

Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen^†, Jieming Zhu, Michael R. Lyu

ISSTA'24 The ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, Sep 2024.

LILAC: Log Parsing using LLMs with Adaptive Parsing Cache

Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu^†, Michael R. Lyu

FSE'24 The ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, July 2024.

Log parsing serves as a prerequisite for various log analysis tasks, but the performance of current syntax-based and semantic-based parsers remains unsatisfactory. Leveraging large language models (LLMs) to overcome the limitations of existing log parsers is promising; however, it presents challenges related to specialization, consistency and efficiency. To address these practical issues, we propose LILAC, the first practical Log parsIng framework using LLMs with Adaptive parsing Cache.

LILAC: Log Parsing using LLMs with Adaptive Parsing Cache

Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu^†, Michael R. Lyu

FSE'24 The ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, July 2024.

Go Static: Contextualized Logging Statement Generation

Yichen Li, Yintong Huo^†, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Michael R. Lyu

FSE'24 The ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, July 2024.

Logging practices have been extensively studied to assist developers in writing logging statements. However, existing automatic logging methods with single-method contexts face three key limitations: limited static scope, inconsistent logging styles, and missing variables type information. To tackle these limitations, we propose SCLogger, the first approach to generate contextualized logging statements using large language models with inter-method static contexts.

Go Static: Contextualized Logging Statement Generation

Yichen Li, Yintong Huo^†, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Michael R. Lyu

FSE'24 The ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, July 2024.

TraceMesh: Scalable and Streaming Sampling for Distributed Traces

Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng^†

CLOUD'24 The IEEE International Conference on Cloud Computing, Shenzhen, China, July 2024. 🏆 Best Paper Award

Distributed tracing is a fundamental monitoring tool for cloud systems; however, it typically captures overlapping and redundant information. Existing tail-based trace samplers fall short of considering the high-dimensional and dynamic nature of trace data. To address these practical challenges, we introduce TraceMesh, a scalable and streaming sampler for distributed traces, which adapts to evolving trace features and dynamically samples uncommon traces.

TraceMesh: Scalable and Streaming Sampling for Distributed Traces

Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng^†

CLOUD'24 The IEEE International Conference on Cloud Computing, Shenzhen, China, July 2024. 🏆 Best Paper Award

FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems

Junjie Huang, Jinyang Liu, Zhuangbin Chen, Zhihan Jiang, Yichen Li, Jiazhen Gu^†, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

ICSE'24 The IEEE/ACM International Conference on Software Engineering, Software Engineering in Practice, Lisbon, Portugal, Apr 2024.

Postmortem analysis is essential for managing cloud system incidents, involving profiling incidents to classify them into unique fault patterns. Current manual approaches are labor-intensive and error-prone, resulting in only the most severe incidents being analyzed, which leads to a skewed fault pattern overview. To address these limitations, we propose an automated approach called FaultProfIT, for Fault Pattern Profiling of Incident Tickets, utilizing hierarchy-guided contrastive learning.

FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems

Junjie Huang, Jinyang Liu, Zhuangbin Chen, Zhihan Jiang, Yichen Li, Jiazhen Gu^†, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

ICSE'24 The IEEE/ACM International Conference on Software Engineering, Software Engineering in Practice, Lisbon, Portugal, Apr 2024.

2023

Prism: Revealing Hidden Functional Clusters of Massive Instances in Cloud Systems

Jinyang Liu*, Zhihan Jiang*, Jiazhen Gu, Junjie Huang, Zhuangbin Chen^†, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu (* equal contribution)

ASE'23 The IEEE/ACM International Conference on Automated Software Engineering, Kirchberg, Luxembourg, Sep 2023.

To improve observability of large-scale cloud systems, we propose to infer functional clusters, i.e., groups of instances having similar functionalities, to bridge the gap betwwen instance and service layer. Our pilot study demonstrates that instances having similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we propose a non-intrusive solution, Prism, to reveal functional clusters in cloud systems based on communication traces and performance metrics.

Prism: Revealing Hidden Functional Clusters of Massive Instances in Cloud Systems

Jinyang Liu*, Zhihan Jiang*, Jiazhen Gu, Junjie Huang, Zhuangbin Chen^†, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu (* equal contribution)

ASE'23 The IEEE/ACM International Conference on Automated Software Engineering, Kirchberg, Luxembourg, Sep 2023.