Kent Academic Repository

Rigorous Benchmarking in Reasonable Time

Kalibera, Tomas, Jones, Richard E. (2013) Rigorous Benchmarking in Reasonable Time. In: ACM SIGPLAN International Symposium on Memory Management (ISMM 2013), pp. 63-74. ACM, New York (doi:10.1145/2464157.2464160) (KAR id:33611)

Abstract

Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs experimentation time. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only with training-size inputs. The results reported often lack proper estimates of variation and, when a small difference between two systems is claimed, are sometimes simply unreliable.

In contrast, we provide a statistically rigorous methodology for repetition and for summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform typically exhibits much less non-determinism than the worst cases reported in published corner-case studies. Second, repetition is most needed at the level where most of the uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment that is necessary and sufficient to obtain a given level of precision.
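To illustrate the multi-level idea, the sketch below (Python, with hypothetical timing data and a hypothetical function name) summarises each VM execution by its mean iteration time and reports a t-based confidence interval for the overall mean across executions. It only illustrates the general principle of repeating at the level where uncertainty arises; it is not the paper's estimator or cost model.

    # Illustrative sketch only, not the paper's formulas: a two-level
    # experiment (executions x iterations) summarised by per-execution
    # means, with a t-based confidence interval across executions.
    import math
    from statistics import mean, stdev
    from scipy.stats import t

    def ci_of_mean(times_per_execution, confidence=0.95):
        # times_per_execution: list of lists; each inner list holds the
        # steady-state iteration times (seconds) of one VM execution.
        exec_means = [mean(run) for run in times_per_execution]
        n = len(exec_means)
        grand_mean = mean(exec_means)
        sem = stdev(exec_means) / math.sqrt(n)          # standard error across executions
        half_width = t.ppf(1 - (1 - confidence) / 2, n - 1) * sem
        return grand_mean, half_width

    # Hypothetical data: 4 executions x 5 iterations each.
    times = [[1.02, 1.01, 1.03, 1.02, 1.01],
             [1.05, 1.04, 1.06, 1.05, 1.05],
             [0.99, 1.00, 0.98, 0.99, 1.00],
             [1.03, 1.02, 1.03, 1.04, 1.03]]
    m, h = ci_of_mean(times)
    print(f"mean = {m:.3f} s, 95% CI = [{m - h:.3f}, {m + h:.3f}]")

In the paper's setting, the per-level variance estimates are then balanced against the time each kind of repetition costs (per iteration, per execution, per build) via the cost model, which determines how many repetitions to run at each level.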

We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
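As a companion illustration, the sketch below (again Python, with hypothetical data) shows one simple way to obtain an effect-size style confidence interval for the ratio of mean execution times of two systems: a percentile bootstrap that resamples at the execution level. This bootstrap is only a stand-in to show how such an interval is read; it is not necessarily the construction used in the paper.

    # Illustrative sketch, not the paper's construction: a
    # percentile-bootstrap confidence interval for the ratio of mean
    # execution times of system A to system B, resampling executions.
    import random
    from statistics import mean

    def ratio_ci(times_a, times_b, confidence=0.95, resamples=10_000, seed=0):
        # times_a, times_b: lists of lists (executions x iterations) of
        # hypothetical timing data for systems A and B.
        rng = random.Random(seed)
        means_a = [mean(run) for run in times_a]
        means_b = [mean(run) for run in times_b]
        ratios = []
        for _ in range(resamples):
            sample_a = [rng.choice(means_a) for _ in means_a]  # resample executions
            sample_b = [rng.choice(means_b) for _ in means_b]
            ratios.append(mean(sample_a) / mean(sample_b))
        ratios.sort()
        alpha = 1 - confidence
        lower = ratios[int(alpha / 2 * resamples)]
        upper = ratios[min(int((1 - alpha / 2) * resamples), resamples - 1)]
        return lower, upper

    # An interval that excludes 1.0 suggests the measured difference is
    # larger than run-to-run variation alone would explain.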

NOTE: this version corrects the ISMM 2013 version

Item Type: Conference or workshop item (Paper)
DOI/Identification number: 10.1145/2464157.2464160
Projects: Garbage Collection for Multicore Systems
Uncontrolled keywords: Benchmarking methodology; statistical methods; DaCapo; SPEC CPU
Subjects: Q Science > QA Mathematics (inc. Computing science) > QA 76 Software, computer programming > QA76.76 Computer software
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Funders: EPSRC
Depositing User: Richard Jones
Date Deposited: 15 Apr 2013 15:48 UTC
Last Modified: 16 Nov 2021 10:11 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/33611

University of Kent Author Information

Kalibera, Tomas.


Jones, Richard E.

Creator's ORCID: https://orcid.org/0000-0002-8159-0297
