Evaluating the effectiveness of replication for tail-tolerance

Zhan Qiu, Juan F. Perez

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

11 Scopus citations

Abstract

Computing clusters (CCs) are a cost-effective, high-performance platform for computation-intensive scientific and engineering applications. A key challenge in managing CCs is to consistently achieve low response times. In particular, tail-tolerant methods aim to keep the tail of the response-time distribution short. In this paper, we explore concurrent replication with cancelling, a tail-tolerant approach that processes a request and its replicas concurrently, retrieves the result from the first replica that completes, and cancels all other replicas. We propose a stochastic model that accommodates any number of replicas and general processing and inter-arrival times, and that computes the response-time distribution. We show that replication can be very effective in keeping the response-time tail short, but that its benefits depend strongly on the processing-time distribution, on the CC utilization, and on the statistical characteristics of the arrival process. We also use the model to support the selection of the optimal number of replicas and a resource provisioning strategy that meets service-level objectives on the response-time percentiles.
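The trade-off described in the abstract can be illustrated with a small simulation. The sketch below is not the stochastic model proposed in the paper; it is a deliberately simplified stand-in that rests on several assumptions: the servers are split into fixed groups of `replicas` servers, all replicas of a request start at the same time on one group, arrivals are Poisson and split evenly across groups, and cancelling is instantaneous. Under those assumptions each group behaves like a single-server FCFS queue whose service time is the minimum of the replica processing times. The function name `simulate_p99`, its parameters, and the default values are hypothetical.

```python
import random


def exp_service(rng):
    """Assumed processing-time distribution: exponential with mean 1."""
    return rng.expovariate(1.0)


def simulate_p99(replicas, servers=12, arrival_rate=6.0,
                 n_requests=100_000, seed=1, sample_service=exp_service):
    """Estimate the 99th-percentile response time seen by one replica group.

    Simplifying assumptions (not taken from the paper): fixed groups of
    `replicas` servers, Poisson arrivals split evenly across groups, and
    instantaneous cancelling once the first replica completes.
    """
    if replicas < 1 or servers % replicas != 0:
        raise ValueError("servers must be a positive multiple of replicas")
    rng = random.Random(seed)
    groups = servers // replicas
    group_rate = arrival_rate / groups  # more replicas -> fewer, busier groups

    wait = 0.0  # queueing delay of the current request
    responses = []
    for _ in range(n_requests):
        # The request completes when its fastest replica completes.
        service = min(sample_service(rng) for _ in range(replicas))
        responses.append(wait + service)
        # Lindley recursion: queueing delay of the next request in this group.
        interarrival = rng.expovariate(group_rate)
        wait = max(0.0, wait + service - interarrival)

    responses.sort()
    return responses[int(0.99 * len(responses)) - 1]


if __name__ == "__main__":
    for r in (1, 2, 3):
        print(r, "replica(s): estimated p99 =", round(simulate_p99(r), 2))
```

With exponential processing times, the minimum over the replicas shrinks fast enough to offset the extra load each group receives, so the estimated tail shortens as replicas are added. Passing a constant sampler such as `lambda rng: 1.0` shows the opposite effect: replicas add load without shortening the processing time, which is consistent with the abstract's point that the benefit depends on the processing-time distribution and the utilization.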

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 443-452
Number of pages: 10
ISBN (Electronic): 9781479980062
DOI: https://doi.org/10.1109/CCGrid.2015.22
State: Published - Jan 1 2015
Externally published: Yes
Event: 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015 - Shenzhen, China
Duration: May 4 2015 to May 7 2015

Conference

Conference: 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015
Country: China
City: Shenzhen
Period: 5/4/15 to 5/7/15

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Networks and Communications
  • Software

Cite this

Qiu, Z., & Perez, J. F. (2015). Evaluating the effectiveness of replication for tail-tolerance. In Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015 (pp. 443-452). [7152510] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CCGrid.2015.22