Assessing the impact of concurrent replication with canceling in Parallel Jobs

Zhan Qiu, Juan F. Pérez

Research output: Contribution to journal

4 Citations (Scopus)

Abstract

Parallel job processing has become a key feature of many software applications, e.g., in scientific computing. Parallelization allows these applications to exploit large resource pools, such as cloud or grid data centers. However, a job composed of a large number of parallel tasks will suffer a failure if any of its tasks fail, requiring reprocessing and additional delays. In this paper, we explore the effect that the replication of parallel jobs has on the job reliability and response time, as well as on resource utilization. The replication mechanism consists of concurrently processing replicas, at either the job or the task level, retrieving the results of the replica that finishes first, if any, and canceling any remaining replica in process. We propose a stochastic model that explicitly considers parallel job processing, replication at both the job and the task level, and handles general arrival processes. We develop a numerically-efficient algorithm to solve large-scale instances of the model and compute key performance metrics. We observe that the task cancellation mechanism offers an effective way of limiting the increase in resource utilization, allowing the use of replicas that not only increase the job reliability, but have the potential to reduce the response times.
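The replication-with-canceling mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' stochastic model or algorithm; it only mimics the mechanism with Python's asyncio. The names `replica` and `first_successful`, the service times, and the failure flags are hypothetical and chosen for illustration.

```python
import asyncio
from typing import Optional, Sequence, Tuple


async def replica(name: str, service_time: float, fails: bool) -> str:
    """Hypothetical replica: works for `service_time` seconds, then succeeds or fails."""
    await asyncio.sleep(service_time)
    if fails:
        raise RuntimeError(f"{name} failed")
    return name


async def first_successful(
    specs: Sequence[Tuple[str, float, bool]],
) -> Tuple[Optional[str], int]:
    """Run all replicas concurrently and return (winner, number_cancelled).

    The winner is the first replica to finish without failing, or None if
    every replica fails. Replicas still running at that point are cancelled.
    """
    pending = {asyncio.create_task(replica(*spec)) for spec in specs}
    winner: Optional[str] = None
    while pending and winner is None:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            # Retrieve the outcome of every finished replica; keep the first success.
            if task.exception() is None and winner is None:
                winner = task.result()
    cancelled = len(pending)
    for task in pending:
        task.cancel()  # cancel any replica still in process
    # Wait for the cancellations to take effect before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return winner, cancelled


if __name__ == "__main__":
    # Assumed scenario: r1 fails early, r2 succeeds, r3 is still running.
    specs = [("r1", 0.05, True), ("r2", 0.10, False), ("r3", 0.50, False)]
    print(asyncio.run(first_successful(specs)))
```

In this assumed scenario the failed replica is ignored, the result of the first successful replica is kept, and the slowest replica is cancelled rather than left to consume resources, which is the effect the abstract attributes to cancellation.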
Original language: English (US)
Article number: 7033635
Pages (from-to): 31-40
Number of pages: 10
Publication: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
Volume: 2015-February
Issue number: February
DOI: https://doi.org/10.1109/MASCOTS.2014.13
Status: Published - Jan 1 2015
Externally published
Event: 2014 22nd Annual IEEE International Symposium on Modeling, Analysis and Simulation of Computer, and Telecommunication Systems, MASCOTS 2014 - Paris
Duration: Sep 9 2014 to Sep 11 2014

Fingerprint

Replica
Replication
Concurrent
Processing
Response Time
Resources
Natural sciences computing
Stochastic models
Application programs
Scientific Computing
Data Center
Performance Metrics
Cancellation
Parallelization
Stochastic Model
Efficient Algorithms
Limiting
Grid
Software
Model

All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
  • Computer Networks and Communications
  • Software
  • Modeling and Simulation

Cite this

Qiu, Z., & Pérez, J. F. (2015). Assessing the impact of concurrent replication with canceling in Parallel Jobs. Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS, 2015-February(February), 31-40. [7033635]. https://doi.org/10.1109/MASCOTS.2014.13
Qiu, Zhan ; Pérez, Juan F. / Assessing the impact of concurrent replication with canceling in Parallel Jobs. In: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS. 2015 ; Vol. 2015-February, No. February. pp. 31-40.
@article{772c9a84a23b468ebfc0b118bc12a753,
title = "Assessing the impact of concurrent replication with canceling in Parallel Jobs",
abstract = "Parallel job processing has become a key feature of many software applications, e.g., in scientific computing. Parallelization allows these applications to exploit large resource pools, such as cloud or grid data centers. However, a job composed of a large number of parallel tasks will suffer a failure if any of its tasks fail, requiring reprocessing and additional delays. In this paper, we explore the effect that the replication of parallel jobs has on the job reliability and response time, as well as on resource utilization. The replication mechanism consists of concurrently processing replicas, at either the job or the task level, retrieving the results of the replica that finishes first, if any, and canceling any remaining replica in process. We propose a stochastic model that explicitly considers parallel job processing, replication at both the job and the task level, and handles general arrival processes. We develop a numerically-efficient algorithm to solve large-scale instances of the model and compute key performance metrics. We observe that the task cancellation mechanism offers an effective way of limiting the increase in resource utilization, allowing the use of replicas that not only increase the job reliability, but have the potential to reduce the response times.",
author = "Zhan Qiu and P{\'e}rez, {Juan F.}",
year = "2015",
month = "1",
day = "1",
doi = "10.1109/MASCOTS.2014.13",
language = "English (US)",
volume = "2015-February",
pages = "31--40",
number = "February",

}

Qiu, Z & Pérez, JF 2015, 'Assessing the impact of concurrent replication with canceling in Parallel Jobs', Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS, vol. 2015-February, no. February, 7033635, pp. 31-40. https://doi.org/10.1109/MASCOTS.2014.13

Assessing the impact of concurrent replication with canceling in Parallel Jobs. / Qiu, Zhan; Pérez, Juan F.

In: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS, Vol. 2015-February, No. February, 7033635, 01.01.2015, p. 31-40.

TY - JOUR

T1 - Assessing the impact of concurrent replication with canceling in Parallel Jobs

AU - Qiu, Zhan

AU - Pérez, Juan F.

PY - 2015/1/1

Y1 - 2015/1/1

N2 - Parallel job processing has become a key feature of many software applications, e.g., in scientific computing. Parallelization allows these applications to exploit large resource pools, such as cloud or grid data centers. However, a job composed of a large number of parallel tasks will suffer a failure if any of its tasks fail, requiring reprocessing and additional delays. In this paper, we explore the effect that the replication of parallel jobs has on the job reliability and response time, as well as on resource utilization. The replication mechanism consists of concurrently processing replicas, at either the job or the task level, retrieving the results of the replica that finishes first, if any, and canceling any remaining replica in process. We propose a stochastic model that explicitly considers parallel job processing, replication at both the job and the task level, and handles general arrival processes. We develop a numerically-efficient algorithm to solve large-scale instances of the model and compute key performance metrics. We observe that the task cancellation mechanism offers an effective way of limiting the increase in resource utilization, allowing the use of replicas that not only increase the job reliability, but have the potential to reduce the response times.

UR - http://www.scopus.com/inward/record.url?scp=84937827497&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84937827497&partnerID=8YFLogxK

U2 - 10.1109/MASCOTS.2014.13

DO - 10.1109/MASCOTS.2014.13

M3 - Conference article

AN - SCOPUS:84937827497

VL - 2015-February

SP - 31

EP - 40

IS - February

M1 - 7033635

ER -

Qiu Z, Pérez JF. Assessing the impact of concurrent replication with canceling in Parallel Jobs. Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS. 2015 Jan 1;2015-February(February):31-40. 7033635. https://doi.org/10.1109/MASCOTS.2014.13