Evaluating replication for parallel jobs: An efficient approach

Translated title of the contribution: Evaluación de la reproducción para jobs paralelos: Un enfoque eficiente

Zhan Qiu, Juan F. Pérez

Research output: Contribution to journal (Article)

6 Citations (Scopus)

Abstract

Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.
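
The contrast between job-level and task-level replication in the abstract can be illustrated with a small sketch. This is not the paper's stochastic model: it assumes independent subtask failures with a common success probability `p`, i.i.d. exponential subtask durations, and no queueing, so utilization and correlated replica failures are ignored. The function names `reliability` and `mean_response_times` are hypothetical.

```python
import random


def reliability(p, n):
    """Probability that a job of n subtasks completes without re-processing.

    Assumes every subtask (and every replica) succeeds independently
    with probability p. This independence is an illustrative assumption.
    """
    no_replication = p ** n
    # Job-level: two full job replicas; the job succeeds if either replica
    # finishes all n subtasks.
    job_level = 1 - (1 - p ** n) ** 2
    # Task-level: each subtask has two replicas; a subtask succeeds if
    # either of its replicas does.
    task_level = (1 - (1 - p) ** 2) ** n
    return no_replication, job_level, task_level


def mean_response_times(n, trials=20000, seed=0):
    """Monte Carlo estimate of mean response time with two replicas.

    Subtask durations are i.i.d. exponential with rate 1 (an assumption);
    failures and queueing delays are not modeled.
    """
    rng = random.Random(seed)
    job_total = 0.0
    task_total = 0.0
    for _ in range(trials):
        a = [rng.expovariate(1.0) for _ in range(n)]
        b = [rng.expovariate(1.0) for _ in range(n)]
        # Job-level: each replica waits for its own slowest subtask;
        # the first replica to finish wins.
        job_total += min(max(a), max(b))
        # Task-level: each subtask takes the faster of its two replicas;
        # the job waits for the slowest subtask.
        task_total += max(min(x, y) for x, y in zip(a, b))
    return job_total / trials, task_total / trials


if __name__ == "__main__":
    print(reliability(0.99, 100))
    print(mean_response_times(10))
```

Under these simplifying assumptions, the task-level response time can never exceed the job-level one in any single trial, because the faster replica of each subtask finishes no later than that subtask in either full replica. The sketch does not account for the extra load that replication places on a shared cluster.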
Original language: English (US)
Article number: 7313012
Pages (from-to): 2288-2302
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 27
Issue number: 8
DOI: 10.1109/TPDS.2015.2496593
Status: Published - Aug 1, 2016
Externally published

Fingerprint

Stochastic models
Processing
Application programs

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

@article{a7b0006af0644f84a7aa067fe8a4ce88,
title = "Evaluating replication for parallel jobs: An efficient approach",
abstract = "Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.",
author = "Qiu, Zhan and P{\'e}rez, {Juan F.}",
year = "2016",
month = "8",
day = "1",
doi = "10.1109/TPDS.2015.2496593",
language = "English (US)",
volume = "27",
pages = "2288--2302",
journal = "IEEE Transactions on Parallel and Distributed Systems",
issn = "1045-9219",
publisher = "IEEE Computer Society",
number = "8",
}

Evaluating replication for parallel jobs: An efficient approach. / Qiu, Zhan; Pérez, Juan F.

In: IEEE Transactions on Parallel and Distributed Systems, Vol. 27, No. 8, 7313012, 01.08.2016, p. 2288-2302.

TY - JOUR

T1 - Evaluating replication for parallel jobs

T2 - An efficient approach

AU - Qiu, Zhan

AU - Pérez, Juan F.

PY - 2016/8/1

Y1 - 2016/8/1

AB - Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.

UR - http://www.scopus.com/inward/record.url?scp=84978719052&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978719052&partnerID=8YFLogxK

U2 - 10.1109/TPDS.2015.2496593

DO - 10.1109/TPDS.2015.2496593

M3 - Article

AN - SCOPUS:84978719052

VL - 27

SP - 2288

EP - 2302

JO - IEEE Transactions on Parallel and Distributed Systems

JF - IEEE Transactions on Parallel and Distributed Systems

SN - 1045-9219

IS - 8

M1 - 7313012

ER -