AWE
Automatic Workload Evaluation (AWE) uses modern statistical machine learning (SML) techniques to understand and characterize complex workloads and their performance on distributed systems. Our goal is to predict several aspects of system performance simultaneously when the system is driven by a previously unseen workload. We use Kernel Canonical Correlation Analysis (KCCA) to predict message counts, running time, and disk operations for a database business-intelligence workload, after showing that simpler prediction techniques give poor results. Given two data spaces (in this case, the space of database query features and the space of measured performance characteristics of each query), KCCA finds maximally correlated subspaces of fixed dimension embedded in those spaces. We use these subspaces to predict the performance of previously unseen queries via interpolation. Our approach achieves predictions within 20% of measured values more than 80% of the time on a real customer workload, even in cases where the database’s built-in query optimizer gives poor estimates. We’re now working on applying this approach to predict the performance of Hadoop (i.e., MapReduce-style) batch jobs and of automatically tuned scientific codes on multicore parallel processors.
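The KCCA-based prediction described above can be illustrated with a minimal sketch. It is not the implementation used in the papers. The Gaussian kernel, the ridge-style regularizer `reg`, the kernel widths, and the nearest-neighbor averaging used as the interpolation step are all assumptions made for illustration, as are the function names. The sketch solves one common regularized dual form of KCCA through a singular value decomposition and uses only the query-side projection for prediction.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel between every row of A and every row of B."""
    sq_dist = ((A ** 2).sum(axis=1)[:, None]
               + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def center_train(K):
    """Center a square training kernel matrix in feature space."""
    return (K - K.mean(axis=0, keepdims=True)
            - K.mean(axis=1, keepdims=True) + K.mean())


def center_test(K_new, K_train):
    """Center a (new x train) kernel matrix using the training statistics."""
    return (K_new - K_new.mean(axis=1, keepdims=True)
            - K_train.mean(axis=0, keepdims=True) + K_train.mean())


def fit_kcca(X, Y, dim=2, sigma_x=1.0, sigma_y=1.0, reg=0.1):
    """Fit regularized KCCA on query features X (n x p) and metrics Y (n x q).

    All hyperparameters here are hypothetical defaults, not tuned values.
    """
    n = X.shape[0]
    Kx_raw = gaussian_kernel(X, X, sigma_x)
    Kx = center_train(Kx_raw)
    Ky = center_train(gaussian_kernel(Y, Y, sigma_y))
    Rx = Kx + reg * np.eye(n)
    Ry = Ky + reg * np.eye(n)
    # T = Rx^-1 Kx Ky Ry^-1. Its singular values are the regularized
    # correlation values; its singular vectors give the paired directions.
    left = np.linalg.solve(Rx, Kx @ Ky)
    T = np.linalg.solve(Ry, left.T).T  # valid because Ry is symmetric
    U, corr, _ = np.linalg.svd(T)
    alpha = np.linalg.solve(Rx, U[:, :dim])  # dual weights, query side
    return {"X": X, "Y": Y, "Kx_raw": Kx_raw, "sigma_x": sigma_x,
            "alpha": alpha, "train_proj": Kx @ alpha, "corr": corr[:dim]}


def predict(model, X_new, k=3):
    """Average the measured metrics of the k training queries that lie
    closest to each new query in the projected query space."""
    K_new = gaussian_kernel(X_new, model["X"], model["sigma_x"])
    proj = center_test(K_new, model["Kx_raw"]) @ model["alpha"]
    diffs = proj[:, None, :] - model["train_proj"][None, :, :]
    nearest = np.argsort(np.linalg.norm(diffs, axis=2), axis=1)[:, :k]
    return model["Y"][nearest].mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 4))                  # synthetic query features
    Y = np.column_stack([X[:, 0] ** 2 + X[:, 1],   # synthetic metrics
                         np.sin(X[:, 2]) + X[:, 3]])
    model = fit_kcca(X[:50], Y[:50])
    print(predict(model, X[50:], k=3))             # one row per unseen query
```

Averaging the neighbors' raw measurements sidesteps the problem of mapping a point in the projected performance space back to concrete metric values. It also means a prediction can never fall outside the range of values observed in training.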
- Students: Archana Ganapathi, Kristal Sauer
- Collaborators: Harumi Kuno, Umeshwar Dayal, Janet Wiener (HP Labs, a RAD Lab Affiliate Sponsor)
Recent papers: (PDF files and abstracts can be found here)
- Archana Ganapathi, Kaushik Datta, Armando Fox, David Patterson. Using Machine Learning to Auto-tune a Stencil Code on a Multicore Architecture. Submitted to SysML 2008.
- Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, David Patterson. Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning. Proc. ICDE 2009 (to appear).
More Detail:
My current work, also with HP, takes the problem of predicting the resource utilization of long-running database queries from query workload features and maps it onto an instance of Kernel Canonical Correlation Analysis (KCCA). While KCCA is a recent and fairly sophisticated SML technique, we found that simpler SML methods such as regression do a poor job of prediction, which motivated our investigation of KCCA. To our knowledge, no previous work attempts to predict the actual performance of a multi-query database workload. Using a real customer workload, our model predicts individual query running times to within 20% for over 85% of queries, outperforming a state-of-the-art commercial predictor and achieving R² ≥ 0.95 in simultaneously predicting utilization of multiple resources. On investigation, a main reason we outperform the commercial predictor is that the cardinality estimation errors that affect conventional predictors are “normalized out” by our SML-based prediction process. Thus we believe our approach represents a fundamental advance over other performance prediction methods. Ongoing work includes applying this approach to very large MapReduce workloads and interactive Web applications. In addition, we plan to use the KCCA models to drive synthetic workload generators. This would allow researchers to use realistic workload data synthesized from the KCCA models of real commercial workloads, without having to obtain sensitive or proprietary workload data directly.
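For readers who want to compute accuracy figures of the kind quoted above on their own predictions, the following is a small sketch. It assumes the usual definitions: relative error measured against the observed value, and R² as the coefficient of determination. The papers may define their metric slightly differently, and the function names and sample numbers are hypothetical.

```python
def fraction_within(predicted, actual, tolerance=0.20):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual)
               if abs(p - a) <= tolerance * abs(a))
    return hits / len(actual)


def r_squared(predicted, actual):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up running times in seconds, for illustration only.
    actual = [10.0, 20.0, 30.0, 40.0]
    predicted = [11.0, 19.0, 40.0, 41.0]
    print(fraction_within(predicted, actual))
    print(r_squared(predicted, actual))
```

With several metrics predicted at once, one straightforward choice is to evaluate each metric (running time, message count, disk operations) separately with these functions.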