SC'25 Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs
Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Gregory Bauer, Brett Bode, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
📝 Citation | 📄 Paper
ICML'25 ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
Saurabh Jha, Rohan Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir O. Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Pavankumar Murali, Jae-wook Ahn, Debanjana Kar, Ameet Rahane, Carlos Fonseca, Amit Paradkar, Yu Deng, Pratibha Moogi, Prateeti Mohapatra, Naoki Abe, Chandrasekhar Narayanaswami, Tianyin Xu, Lav R. Varshney, Ruchi Mahindru, Anca Sailer, Laura Shwartz, Daby Sow, Nicholas C. M. Fuller, Ruchir Puri
Proceedings of the 42nd International Conference on Machine Learning
📝 Citation | 📄 Paper | 💻 Code | 📊 Slides | 🎙️ Talk
Spotlight + Oral Poster at ICML 2025
DSN'25 Characterizing Modern GPU Resilience and Impact in HPC Systems: A Case Study of A100 GPUs
Shengkun Cui, Archit Patke, Ziheng Chen, Aditya Ranjan, Hung Nguyen, Phuong Cao, Brett Bode, Gregory Bauer, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops
📝 Citation | 📄 Paper
ICS'25 INDIGO: Page Migration for Hardware Memory Disaggregation Across a Network
Archit Patke, Christian Pinto, Saurabh Jha, Haoran Qiu, Zbigniew Kalbarczyk, Ravishankar Iyer
Proceedings of the 39th ACM International Conference on Supercomputing
📝 Citation | 📄 Paper
DSN'24 Fault Localization Using Interventional Causal Learning for Cloud-Native Applications
Saurabh Jha, Jesus Rios, Frank Bagehorn, Larisa Shwartz, Naoki Abe
Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume
📝 Citation | 📄 Paper | 📢 Announcement | 🎤 Interview
Insights used by IBM Instana
DSN'24 iPrism: Characterize and Mitigate Risk by Quantifying Change in Escape Routes
Shengkun Cui, Saurabh Jha, Ziheng Chen, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper | 💻 Code
DSN'24 When Green Computing Meets Performance and Resilience SLOs
Haoran Qiu, Weichao Mao, Chen Wang, Saurabh Jha, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer
Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper
AIOps'24 Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer
Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024
📝 Citation | 💻 Code | 🔗 Preprint
AIOps'24 QLM: Queue Management for Large Language Model Serving
Archit Patke, Dhemath Reddy, Saurabh Jha, Christian Pinto, Haoran Qiu, Shengkun Cui, Chandra Narayanaswami, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024
📝 Citation | 💻 Code
Insights used by IBM watsonx | Adopted by vLLM and AIBrix
ATC'24 Power-aware Deep Learning Model Serving with µ-Serve
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer
Proceedings of the 2024 USENIX Annual Technical Conference
📝 Citation | 📄 Paper | 📊 Slides
SoCC'24 Queue Management for SLO-Oriented Large Language Model Serving
Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 13th Symposium on Cloud Computing
📝 Citation | 📄 Paper | 💻 Code
Integrated in AIBrix and vLLM.
CLOUD'24 SAM: Subseries Augmentation-based Meta-learning for Generalizing AIOps Model in Multi-Cloud Migration
Xi Yang, Paulito Palmes, Saurabh Jha, Bekir Turkkan, Gerard Vanloo, Frank Bagehorn, Chandra Narayanaswami, Larisa Shwartz, Naoki Abe, Yu Deng, Daby M. Sow
Proceedings of the 16th International Conference on Cloud Computing
📄 Paper
AAAI-24 Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization
Xi Yang, Rohan R. Arora, Saurabh Jha, Chandra Narayanaswami, Cheuk Lam, Jerrold Leichter, Yu Deng, Daby M. Sow
Thirty-Eighth Annual Conference on Innovative Applications of Artificial Intelligence
📝 Citation | 📄 Paper | 📊 Slides | 🎙️ Talk
Ongoing integration in IBM Turbonomic
IAAI'23 Fault Injection based Interventional Causal Learning for Distributed Applications
Qing Wang, Jesus Rios, Saurabh Jha, Karthikeyan Shanmugam, Frank Bagehorn, Xi Yang, Robert Filepp, Naoki Abe, Larisa Shwartz
Thirty-Fifth Annual Conference on Innovative Applications of Artificial Intelligence
📝 Citation | 📄 Paper
DSN'22 Exploiting Temporal Data Diversity for Detecting Safety-critical Faults in AV Compute Systems
Saurabh Jha, Shengkun Cui, T. Tsai, S. K. S. Hari, M. B. Sullivan, Zbigniew T. Kalbarczyk, Steve Keckler, Ravishankar K. Iyer
Proceedings of the 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper
CLOUD'22 Localizing and Explaining Faults in Microservices using Distributed Tracing
Jesus Rios, Saurabh Jha, Larisa Shwartz
Proceedings of the 15th International Conference on Cloud Computing
📄 Paper
COMPSYS'22 Evaluating hardware memory disaggregation under delay and contention
Archit Patke, Haoran Qiu, Saurabh Jha, Srikumar Venugopal, Michele Gazzetti, Christian Pinto, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 36th IEEE International Parallel & Distributed Processing Symposium Workshops

Best Presentation
ASE'22 WOLFFI: A fault injection platform for learning AIOps models
Frank Bagehorn, Jesus Rios, Saurabh Jha, Robert Filepp, Larisa Shwartz, Naoki Abe, Xing Yang
2022 37th IEEE/ACM International Conference on Automated Software Engineering
📝 Citation | 📄 Paper | 🎙️ Talk
ASPLOS'21 BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics
Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems
📝 Citation | 📄 Paper | 📊 Slides | 🎙️ Talk
ICS'21 Delay sensitivity-driven congestion mitigation for HPC systems
Archit Patke, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew Kalbarczyk, Ravishankar K. Iyer
Proceedings of the ACM International Conference on Supercomputing
📝 Citation | 📄 Paper
TOR'21 Data-Driven Application-Oriented Reliability Model of a High-Performance Computing System
Bentolhoda Jafary, Saurabh Jha, Lance Fiondella, Ravishankar K. Iyer
IEEE Transactions on Reliability
📝 Citation | 📄 Paper
WOSC'21 Is Function-as-a-Service a Good Fit for Latency-Critical Services?
Haoran Qiu, Saurabh Jha, Subho S. Banerjee, Archit Patke, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the Seventh International Workshop on Serverless Computing colocated with ACM/IFIP International Middleware Conference
📝 Citation | 📄 Paper | 💻 Code | 🎙️ Talk
ML4AD'21 Watch out for the risky actors: Assessing risk in dynamic environments for safe driving
Saurabh Jha, Yan Miao, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Workshop on Machine Learning for Autonomous Driving colocated with NeurIPS
📝 Citation | 📄 Paper | 🎙️ Talk
ISSRE'20 AV-FUZZER: Finding safety violations in autonomous driving systems
Guanpeng Li, Yiran Li, Saurabh Jha, T. Tsai, S. K. S. Hari, M. B. Sullivan, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the IEEE International Symposium on Software Reliability Engineering
📝 Citation | 📄 Paper | 💻 Code
Best Paper
OSDI'20 FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices
Haoran Qiu, Subho Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation
📝 Citation | 📄 Paper | 📊 Slides | 🎙️ Talk
SC'20 Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems
Saurabh Jha, Shengkun Cui, Subho Banerjee, Tianyin Xu, Jeremy Enos, Mike Showerman, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
📝 Citation | 📄 Paper | 💻 Code | 📢 Announcement | 🎤 Interview
Integrated with IBM Instana | Best Student Paper Finalist | Best Paper Finalist
ICML'20 Inductive Bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters
Subho Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 37th International Conference on Machine Learning
📝 Citation | 📄 Paper | 🎙️ Talk
DSN'20 ML-driven Malware that Targets AV Safety
Saurabh Jha, Shengkun Cui, Subho Banerjee, James Cyriac, T. Tsai, Zbigniew T. Kalbarczyk, Steve Keckler, Ravishankar K. Iyer
Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper
DSN'20 The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems
Rakesh Kumar, Saurabh Jha, Ashraf Mahgoub, Zbigniew T. Kalbarczyk, William Kramer, Ravishankar K. Iyer, Saurabh Bagchi
Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper
NSDI'20 Measuring Congestion in High-Performance Datacenter Interconnects
Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Benjamin Lim, Mike Showerman, Greg Bauer, Larry Kaplan, Zbigniew Kalbarczyk, William Kramer, Ravi Iyer
Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation
📝 Citation | 📄 Paper | 💻 Code | 📊 Slides | 📊 Data | 🎙️ Talk
HOTI'20 A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Mike Showerman, Eric Roman, Zbigniew Kalbarczyk, William Kramer, Ravi Iyer
Proceedings of the IEEE 26th Annual Symposium on High-Performance Interconnects
📝 Citation | 📄 Paper | 📊 Slides
DSN'19 ML-Based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection
Saurabh Jha, Subho Banerjee, T. Tsai, S. K. S. Hari, M. B. Sullivan, Zbigniew T. Kalbarczyk, Steve Keckler, Ravishankar K. Iyer
Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper
DSN'19 Towards a Bayesian Approach for Assessing Fault Tolerance of Deep Neural Networks
Subho Banerjee, James Cyriac, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Supplemental Volume
📝 Citation | 📄 Paper
DSN'18 Hands Off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data
Subho Banerjee, Saurabh Jha, James Cyriac, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
📝 Citation | 📄 Paper | 💻 Code
TDSC'18 Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters
Saurabh Jha, Valerio Formicola, Catello Di Martino, Mark Dalton, William Kramer, Zbigniew Kalbarczyk, Ravishankar K. Iyer
IEEE Transactions on Dependable and Secure Computing
📝 Citation | 📄 Paper
DSN'18 AVFI: Fault Injection for Autonomous Vehicles
Saurabh Jha, Subho Banerjee, James Cyriac, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops
📝 Citation | 📄 Paper | 💻 Code
CLUSTER'17 Holistic Measurement-Driven System Assessment
Saurabh Jha, Jim Brandt, Ann Gentile, Zbigniew T. Kalbarczyk, Greg Bauer, Jeremy Enos, Mike Showerman, Larry Kaplan, Brett Bode, Annette Greiner, Amanda Bonnie, Mike Mason, Ravishankar K. Iyer, William Kramer
Proceedings of the 2017 IEEE International Conference on Cluster Computing
📝 Citation | 📄 Paper
CUG'16 Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo
Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, William Kramer
2016 Cray User Group
📝 Citation | 📄 Paper
VLDB'15 Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach
Saurabh Jha, Bingsheng He, Mian Lu, Xuntao Cheng, Huynh Phung Huynh
Proceedings of the 2015 VLDB Endowment
📝 Citation | 📄 Paper | 💻 Code
FTXS'15 LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications
Catello Di Martino, Saurabh Jha, William Kramer, Zbigniew Kalbarczyk, Ravishankar K. Iyer
Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale colocated with HPDC 2015
📝 Citation | 📄 Paper
HPDC'13 P-HGRMS: A Parallel Hypergraph Based Root Mean Square Algorithm for Image Denoising
Tejaswi Agarwal, Saurabh Jha, Rajesh Kanna
22nd ACM Symposium on High-Performance Parallel and Distributed Computing
📄 Paper | 📊 Slides | 🖼️ Poster
Best Poster