Assessing the impact of concurrent replication with canceling in Parallel Jobs

Zhan Qiu, Juan F. Pérez

Research output: Chapter in Book/Report/Conference > Conference contribution

4 Citations (Scopus)

Abstract

Parallel job processing has become a key feature of many software applications, e.g., in scientific computing. Parallelization allows these applications to exploit large resource pools, such as cloud or grid data centers. However, a job composed of a large number of parallel tasks will suffer a failure if any of its tasks fail, requiring reprocessing and additional delays. In this paper, we explore the effect that the replication of parallel jobs has on the job reliability and response time, as well as on resource utilization. The replication mechanism consists of concurrently processing replicas, at either the job or the task level, retrieving the results of the replica that finishes first, if any, and canceling any remaining replica in process. We propose a stochastic model that explicitly considers parallel job processing, replication at both the job and the task level, and handles general arrival processes. We develop a numerically-efficient algorithm to solve large-scale instances of the model and compute key performance metrics. We observe that the task cancellation mechanism offers an effective way of limiting the increase in resource utilization, allowing the use of replicas that not only increase the job reliability, but have the potential to reduce the response times.
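The reliability argument in the abstract can be illustrated with a small back-of-the-envelope calculation. This is only a sketch under assumptions that are not taken from the paper: tasks are assumed to fail independently with a common success probability, and queueing, arrivals, and response times are ignored entirely, so this is not the stochastic model the authors propose. The function names and parameters (`p_task`, `n_tasks`, `replicas`) are hypothetical.

```python
def job_reliability(p_task: float, n_tasks: int) -> float:
    """Probability that a job with no replication succeeds.

    Assumption (not from the paper): tasks fail independently, and the
    job succeeds only if every one of its n_tasks tasks succeeds.
    """
    return p_task ** n_tasks


def job_level_replication(p_task: float, n_tasks: int, replicas: int) -> float:
    """Probability that at least one of `replicas` whole-job copies succeeds."""
    p_job = job_reliability(p_task, n_tasks)
    return 1.0 - (1.0 - p_job) ** replicas


def task_level_replication(p_task: float, n_tasks: int, replicas: int) -> float:
    """Probability that every task has at least one successful replica."""
    p_task_ok = 1.0 - (1.0 - p_task) ** replicas
    return p_task_ok ** n_tasks


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    p, n, r = 0.99, 100, 2
    print("no replication:  ", job_reliability(p, n))
    print("job-level, r=2:  ", job_level_replication(p, n, r))
    print("task-level, r=2: ", task_level_replication(p, n, r))
```

Under these simplifying assumptions, a job with many tasks is far less reliable than any single task, and both forms of replication raise the job's success probability. The calculation says nothing about resource utilization or response time, which is where the canceling mechanism studied in the paper comes into play.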

Original language: American English
Title of host publication: Proceedings - 2014 22nd Annual IEEE International Symposium on Modeling, Analysis and Simulation of Computer, and Telecommunication Systems, MASCOTS 2014
Publisher: IEEE Computer Society
Pages: 31-40
Number of pages: 10
Edition: February
ISBN (electronic): 9781479956104
Status: Published - Feb 5 2015
Externally published
Event: 2014 22nd Annual IEEE International Symposium on Modeling, Analysis and Simulation of Computer, and Telecommunication Systems, MASCOTS 2014 - Paris, France
Duration: Sep 9 2014 to Sep 11 2014

Publication series

Name: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
Number: February
Volume: 2015-February
ISSN (print): 1526-7539

Conference

Conference: 2014 22nd Annual IEEE International Symposium on Modeling, Analysis and Simulation of Computer, and Telecommunication Systems, MASCOTS 2014
Country/Territory: France
City: Paris
Period: 9/9/14 to 9/11/14

All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
  • Computer Networks and Communications
  • Software
  • Modeling and Simulation
