SC'25 |
Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs. Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Gregory Bauer, Brett Bode, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. |
ICML'25 |
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks. Saurabh Jha, Rohan Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir O. Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Pavankumar Murali, Jae-wook Ahn, Debanjana Kar, Ameet Rahane, Carlos Fonseca, Amit Paradkar, Yu Deng, Pratibha Moogi, Prateeti Mohapatra, Naoki Abe, Chandrasekhar Narayanaswami, Tianyin Xu, Lav R. Varshney, Ruchi Mahindru, Anca Sailer, Larisa Shwartz, Daby Sow, Nicholas C. M. Fuller, Ruchir Puri. Proceedings of the 42nd International Conference on Machine Learning. Spotlight + Oral Poster at ICML 2025. |
DSN'25 |
Characterizing Modern GPU Resilience and Impact in HPC Systems: A Case Study of A100 GPUs. Shengkun Cui, Archit Patke, Ziheng Chen, Aditya Ranjan, Hung Nguyen, Phuong Cao, Brett Bode, Gregory Bauer, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops. |
ICS'25 |
INDIGO: Page Migration for Hardware Memory Disaggregation Across a Network. Archit Patke, Christian Pinto, Saurabh Jha, Haoran Qiu, Zbigniew Kalbarczyk, Ravishankar Iyer. Proceedings of the 39th ACM International Conference on Supercomputing. |
DSN'24 |
Fault Localization Using Interventional Causal Learning for Cloud-Native Applications. Saurabh Jha, Jesus Rios, Frank Bagehorn, Larisa Shwartz, Naoki Abe. Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume. Insights used by IBM Instana. |
DSN'24 |
iPrism: Characterize and Mitigate Risk by Quantifying Change in Escape Routes. Shengkun Cui, Saurabh Jha, Ziheng Chen, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
DSN'24 |
When Green Computing Meets Performance and Resilience SLOs. Haoran Qiu, Weichao Mao, Chen Wang, Saurabh Jha, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer. Proceedings of the 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
AIOps'24 |
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction. Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer. Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024. |
AIOps'24 |
QLM: Queue Management for Large Language Model Serving. Archit Patke, Dhemath Reddy, Saurabh Jha, Christian Pinto, Haoran Qiu, Shengkun Cui, Chandra Narayanaswami, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024. Insights used by IBM watsonx. Adopted by vLLM and AIBrix. |
ATC'24 |
Power-aware Deep Learning Model Serving with µ-Serve. Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer. Proceedings of the 2024 USENIX Annual Technical Conference. |
SoCC'24 |
Queue Management for SLO-Oriented Large Language Model Serving. Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 13th Symposium on Cloud Computing. Integrated in AIBrix and vLLM. |
CLOUD'24 |
SAM: Subseries Augmentation-based Meta-learning for Generalizing AIOps Model in Multi-Cloud Migration. Xi Yang, Paulito Palmes, Saurabh Jha, Bekir Turkkan, Gerard Vanloo, Frank Bagehorn, Chandra Narayanaswami, Larisa Shwartz, Naoki Abe, Yu Deng, Daby M. Sow. Proceedings of the 16th International Conference on Cloud Computing. |
AAAI-24 |
Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization. Xi Yang, Rohan R. Arora, Saurabh Jha, Chandra Narayanaswami, Cheuk Lam, Jerrold Leichter, Yu Deng, Daby M. Sow. Thirty-Eighth Annual Conference on Innovative Applications of Artificial Intelligence. Ongoing integration in IBM Turbonomic. |
IAAI'23 |
Fault Injection based Interventional Causal Learning for Distributed Applications. Qing Wang, Jesus Rios, Saurabh Jha, Karthikeyan Shanmugam, Frank Bagehorn, Xi Yang, Robert Filepp, Naoki Abe, Larisa Shwartz. Thirty-Fifth Annual Conference on Innovative Applications of Artificial Intelligence. |
DSN'22 |
Exploiting Temporal Data Diversity for Detecting Safety-critical Faults in AV Compute Systems. Saurabh Jha, Shengkun Cui, T. Tsai, S. K. S. Hari, M. B. Sullivan, Zbigniew T. Kalbarczyk, Steve Keckler, Ravishankar K. Iyer. Proceedings of the 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
CLOUD'22 |
Localizing and Explaining Faults in Microservices using Distributed Tracing. Jesus Rios, Saurabh Jha, Larisa Shwartz. Proceedings of the 15th International Conference on Cloud Computing. |
COMPSYS'22 |
Evaluating hardware memory disaggregation under delay and contention. Archit Patke, Haoran Qiu, Saurabh Jha, Srikumar Venugopal, Michele Gazzetti, Christian Pinto, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 36th IEEE International Parallel & Distributed Processing Symposium Workshops. Best Presentation. |
ASE'22 |
WOLFFI: A fault injection platform for learning AIOps models. Frank Bagehorn, Jesus Rios, Saurabh Jha, Robert Filepp, Larisa Shwartz, Naoki Abe, Xing Yang. 2022 37th IEEE/ACM International Conference on Automated Software Engineering. |
ASPLOS'21 |
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics. Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems. |
ICS'21 |
Delay sensitivity-driven congestion mitigation for HPC systems. Archit Patke, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew Kalbarczyk, Ravishankar K. Iyer. Proceedings of the ACM International Conference on Supercomputing. |
TOR'21 |
Data-Driven Application-Oriented Reliability Model of a High-Performance Computing System. Bentolhoda Jafary, Saurabh Jha, Lance Fiondella, Ravishankar K. Iyer. IEEE Transactions on Reliability. |
WOSC'21 |
Is Function-as-a-Service a Good Fit for Latency-Critical Services? Haoran Qiu, Saurabh Jha, Subho S. Banerjee, Archit Patke, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the Seventh International Workshop on Serverless Computing colocated with ACM/IFIP International Middleware Conference. |
ML4AD'21 |
Watch out for the risky actors: Assessing risk in dynamic environments for safe driving. Saurabh Jha, Yan Miao, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Workshop on Machine Learning for Autonomous Driving colocated with NeurIPS. |
ISSRE'20 |
AV-FUZZER: Finding safety violations in autonomous driving systems. Guanpeng Li, Yiran Li, Saurabh Jha, T. Tsai, S. K. S. Hari, M. B. Sullivan, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the IEEE International Conference on Software Reliability Engineering. Best Paper. |
OSDI'20 |
FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices. Haoran Qiu, Subho Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation. |
SC'20 |
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems. Saurabh Jha, Shengkun Cui, Subho Banerjee, Tianyin Xu, Jeremy Enos, Mike Showerman, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Integrated with IBM Instana. Best Student Paper Finalist. Best Paper Finalist. |
ICML'20 |
Inductive Bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters. Subho Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 37th International Conference on Machine Learning. |
DSN'20 |
ML-driven Malware that Targets AV Safety. Saurabh Jha, Shengkun Cui, Subho Banerjee, James Cyriac, T. Tsai, Zbigniew T. Kalbarczyk, Steve Keckler, Ravishankar K. Iyer. Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
DSN'20 |
The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems. Rakesh Kumar, Saurabh Jha, Ashraf Mahgoub, Zbigniew T. Kalbarczyk, William Kramer, Ravishankar K. Iyer, Saurabh Bagchi. Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
NSDI'20 |
Measuring Congestion in High-Performance Datacenter Interconnects. Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Benjamin Lim, Mike Showerman, Greg Bauer, Larry Kaplan, Zbigniew Kalbarczyk, William Kramer, Ravi Iyer. Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation. |
HOTI'20 |
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects. Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Mike Showerman, Eric Roman, Zbigniew Kalbarczyk, William Kramer, Ravi Iyer. Proceedings of the IEEE 26th Annual Symposium on High-Performance Interconnects. |
DSN'19 |
ML-Based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection. Saurabh Jha, Subho Banerjee, T. Tsai, S. K. S. Hari, M. B. Sullivan, Zbigniew T. Kalbarczyk, Steve Keckler, Ravishankar K. Iyer. Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
DSN'19 |
Towards a Bayesian Approach for Assessing Fault Tolerance of Deep Neural Networks. Subho Banerjee, James Cyriac, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume. |
DSN'18 |
Hands Off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data. Subho Banerjee, Saurabh Jha, James Cyriac, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. |
TDSC'18 |
Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. Saurabh Jha, Valerio Formicola, Catello Di Martino, Mark Dalton, William Kramer, Zbigniew Kalbarczyk, Ravishankar K. Iyer. IEEE Transactions on Dependable and Secure Computing. |
DSN'18 |
AVFI: Fault Injection for Autonomous Vehicles. Saurabh Jha, Subho Banerjee, James Cyriac, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops. |
CLUSTER'17 |
Holistic Measurement-Driven System Assessment. Saurabh Jha, Jim Brandt, Ann Gentile, Zbigniew T. Kalbarczyk, Greg Bauer, Jeremy Enos, Mike Showerman, Larry Kaplan, Brett Bode, Annette Greiner, Amanda Bonnie, Mike Mason, Ravishankar K. Iyer, William Kramer. Proceedings of the 2017 IEEE International Conference on Cluster Computing. |
CUG'16 |
Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo. Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, William Kramer. 2016 Cray User Group. |
VLDB'15 |
Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach. Saurabh Jha, Bingsheng He, Mian Lu, Xuntao Cheng, Huynh Phung Huynh. Proceedings of the 2015 VLDB Endowment. |
FTXS'15 |
LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications. Catello Di Martino, Saurabh Jha, William Kramer, Zbigniew Kalbarczyk, Ravishankar K. Iyer. Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale colocated with HPDC 2015. |
HPDC'13 |
P-HGRMS: A Parallel Hypergraph Based Root Mean Square Algorithm for Image Denoising. Tejaswi Agarwal, Saurabh Jha, Rajesh Kanna. 22nd ACM Symposium on High-Performance Parallel and Distributed Computing. Best Poster. |