Enhancing reliability and response times via replication in computing clusters

Zhan Qiu, Juan F. Perez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

Computing clusters have been widely deployed for scientific and engineering applications to support intensive computation and massive data operations. As applications and resources in a cluster are subject to failures, fault-tolerance strategies are commonly adopted, sometimes at the expense of additional delays in job response times or unnecessary increases in resource usage. In this paper, we explore concurrent replication with canceling, a fault-tolerance approach in which jobs and their replicas are processed concurrently, and the successful completion of either triggers the removal of the other copy. We propose a stochastic model to study how this approach affects the cluster's service level objectives (SLOs), particularly the offered response time percentiles. In addition to the expected gains in reliability, the proposed model allows us to determine the utilization regions in which introducing replication with canceling effectively reduces response times. Moreover, we show how this model can support resource provisioning decisions with reliability and response time guarantees.
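The mechanism named in the abstract, running a job and its replica side by side and removing whichever copy is still running once the other succeeds, can be illustrated with a minimal Python sketch. This is not the authors' stochastic model and says nothing about queueing or utilization; it only shows the cancel-on-first-success behavior for a single job. The names `run_copy` and `run_with_replicas`, the fixed service times, and the explicit `fails` flag are all hypothetical choices made for the illustration.

```python
import asyncio


async def run_copy(name, service_time, fails):
    """Hypothetical job copy: runs for service_time seconds, then succeeds or fails."""
    await asyncio.sleep(service_time)
    if fails:
        raise RuntimeError(f"{name} failed")
    return name


async def run_with_replicas(copies):
    """Run all copies concurrently and return the first successful result.

    Copies still running when one succeeds are canceled. If a copy fails,
    the remaining copies keep running (assumed behavior for this sketch).
    """
    tasks = {asyncio.create_task(run_copy(*c)) for c in copies}
    try:
        while tasks:
            done, tasks = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED
            )
            # Inspect every finished copy so no failure goes unobserved.
            winners = [t for t in done if t.exception() is None]
            if winners:
                return winners[0].result()
        raise RuntimeError("all copies failed")
    finally:
        # Cancel whatever is still pending and wait for the cancellations.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    # (name, service_time_seconds, fails) per copy; values are made up.
    copies = [("original", 0.05, False), ("replica", 0.01, False)]
    print(asyncio.run(run_with_replicas(copies)))
```

In this sketch a failed copy does not cancel the surviving one, which is where the reliability gain comes from, while the response time of the job is that of the fastest successful copy.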

Original language: English (US)
Title of host publication: 2015 IEEE Conference on Computer Communications, IEEE INFOCOM 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1355-1363
Number of pages: 9
Volume: 26
ISBN (Electronic): 9781479983810
DOI: https://doi.org/10.1109/INFOCOM.2015.7218512
State: Published - Aug 21 2015
Externally published: Yes
Event: 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015 - Hong Kong, Hong Kong
Duration: Apr 26 2015 to May 1 2015

Conference

Conference: 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015
Country: Hong Kong
City: Hong Kong
Period: 4/26/15 to 5/1/15

Fingerprint

Response time (computer systems)
Cluster computing
Fault tolerance
Stochastic models

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
  • Electrical and Electronic Engineering

Cite this

Qiu, Z., & Perez, J. F. (2015). Enhancing reliability and response times via replication in computing clusters. In 2015 IEEE Conference on Computer Communications, IEEE INFOCOM 2015 (Vol. 26, pp. 1355-1363). [7218512] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/INFOCOM.2015.7218512