Evaluating replication for parallel jobs: An efficient approach

Zhan Qiu; Juan F. Pérez

doi:10.1109/TPDS.2015.2496593

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.
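
The difference between the two replication levels described above can be illustrated with a small Monte Carlo sketch. This is not the stochastic model proposed in the article: it ignores queueing, failures, and correlation among replicas, and it assumes i.i.d. exponential subtask run times. The function names and parameters (`sample_times`, `job_level`, `task_level`, `simulate`, `n_tasks`, `n_jobs`) are hypothetical and chosen only for illustration.

```python
import random


def sample_times(n_tasks, rng, mean=1.0):
    """Draw one run time per subtask (assumption: i.i.d. exponential)."""
    return [rng.expovariate(1.0 / mean) for _ in range(n_tasks)]


def no_replication(a):
    """A parallel job finishes when its last subtask finishes."""
    return max(a)


def job_level(a, b):
    """Two full job replicas run concurrently; the first to complete wins."""
    return min(max(a), max(b))


def task_level(a, b):
    """Each subtask is replicated; the job waits for the slowest subtask,
    where each subtask finishes as soon as either of its replicas does."""
    return max(min(x, y) for x, y in zip(a, b))


def simulate(n_tasks=16, n_jobs=20000, seed=0):
    """Estimate the mean response time under each scheme (no queueing)."""
    rng = random.Random(seed)
    totals = {"none": 0.0, "job": 0.0, "task": 0.0}
    for _ in range(n_jobs):
        a = sample_times(n_tasks, rng)  # run times of the first replica
        b = sample_times(n_tasks, rng)  # run times of the second replica
        totals["none"] += no_replication(a)
        totals["job"] += job_level(a, b)
        totals["task"] += task_level(a, b)
    return {name: total / n_jobs for name, total in totals.items()}


if __name__ == "__main__":
    print(simulate())
```

For any pair of run-time vectors, `task_level(a, b)` is never larger than `job_level(a, b)`, which is never larger than `no_replication(a)`. Under these simplified assumptions the sketch therefore points in the same direction as the abstract's finding on response times. It says nothing about reliability or resource provisioning, which the article treats with its own model.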

Translated title of the contribution	Evaluación de la reproducción para jobs paralelos: Un enfoque eficiente
Original language	English (US)
Article number	7313012
Pages (from-to)	2288-2302
Number of pages	15
Journal	IEEE Transactions on Parallel and Distributed Systems
Volume	27
Issue number	8
DOIs	https://doi.org/10.1109/TPDS.2015.2496593
State	Published - Aug 1 2016
Externally published	Yes

All Science Journal Classification (ASJC) codes

Signal Processing
Hardware and Architecture
Computational Theory and Mathematics

Cite this

@article{a7b0006af0644f84a7aa067fe8a4ce88,

title = "Evaluating replication for parallel jobs: An efficient approach",

abstract = "Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.",

author = "Qiu, Zhan and P{\'e}rez, Juan F.",

year = "2016",

month = aug,

day = "1",

doi = "10.1109/TPDS.2015.2496593",

language = "English (US)",

volume = "27",

pages = "2288--2302",

journal = "IEEE Transactions on Parallel and Distributed Systems",

issn = "1045-9219",

publisher = "IEEE Computer Society",

number = "8",

}

TY - JOUR

T1 - Evaluating replication for parallel jobs

T2 - An efficient approach

AU - Qiu, Zhan

AU - Pérez, Juan F.

PY - 2016/8/1

Y1 - 2016/8/1

AB - Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.

UR - http://www.scopus.com/inward/record.url?scp=84978719052&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978719052&partnerID=8YFLogxK

U2 - 10.1109/TPDS.2015.2496593

DO - 10.1109/TPDS.2015.2496593

M3 - Article

AN - SCOPUS:84978719052

SN - 1045-9219

VL - 27

SP - 2288

EP - 2302

JO - IEEE Transactions on Parallel and Distributed Systems

JF - IEEE Transactions on Parallel and Distributed Systems

IS - 8

M1 - 7313012

ER -
