Tackling latency via replication in distributed systems

Zhan Qiu; Juan F. Pérez; Peter G. Harrison

doi:10.1145/2851553.2851562

Research output: Chapter in Book/Report › Conference contribution

6 Citations (Scopus)

Abstract

Consistently high reliability and low latency are twin requirements common to many forms of distributed processing; for example, server farms and mirrored storage access. To address them, we consider replication of requests with canceling – i.e. initiate multiple concurrent replicas of a request and use the first successful result returned, canceling all outstanding replicas. This scheme has been studied recently, but mostly for systems with a single central queue, while server farms exploit distributed resources for scalability and robustness. We develop an approximate stochastic model to determine the response time distribution in a system with distributed queues, and compare its performance against its centralized counterpart. Validation against simulation indicates that our model is accurate for not only the mean response time but also its quantiles, which are particularly relevant for deadline-driven applications. Further, we show that in the distributed setup, replication with canceling has the potential to reduce response times, even at relatively high utilization. We also find that it offers response times close to those of the centralized system, especially at medium-to-high request reliability. These findings support the use of replication with canceling as an effective mechanism for both fault- and delay-tolerance.

Translated title of the contribution	Tratamiento de la latencia mediante replicación en sistemas distribuidos
Original language	English (US)
Title of host publication	ICPE 2016 - Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering
Publisher	Association for Computing Machinery
Pages	197-208
Number of pages	12
ISBN (electronic)	9781450340809
DOI	https://doi.org/10.1145/2851553.2851562
Status	Published - Mar 12, 2016
Externally published	Yes
Event	7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016 - Delft, Netherlands. Duration: Mar 12, 2016 → Mar 16, 2016

Conference

Conference	7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016
Country/Territory	Netherlands
City	Delft
Period	3/12/16 → 3/16/16

ASJC Scopus subject areas

Software
Computer Science Applications
Hardware and Architecture

Access to document

10.1145/2851553.2851562

Cite this

@inproceedings{3b8093a700604f7a9627f09ab2f070ea,

title = "Tackling latency via replication in distributed systems",

abstract = "Consistently high reliability and low latency are twin requirements common to many forms of distributed processing; for example, server farms and mirrored storage access. To address them, we consider replication of requests with canceling - i.e. initiate multiple concurrent replicas of a request and use the first successful result returned, canceling all outstanding replicas. This scheme has been studied recently, but mostly for systems with a single central queue, while server farms exploit distributed resources for scalability and robustness. We develop an approximate stochastic model to determine the response-time distribution in a system with distributed queues, and compare its performance against its centralized counterpart. Validation against simulation indicates that our model is accurate for not only the mean response time but also its percentiles, which are particularly relevant for deadline-driven applications. Further, we show that in the distributed set-up, replication with canceling has the potential to reduce response times, even at relatively high utilization. We also find that it offers response times close to those of the centralized system, especially at medium-to-high request reliability. These findings support the use of replication with canceling as an effective mechanism for both fault- and delay-tolerance.",

author = "Qiu, Zhan and P{\'e}rez, {Juan F.} and Harrison, {Peter G.}",

year = "2016",

month = mar,

day = "12",

doi = "10.1145/2851553.2851562",

language = "English (US)",

pages = "197--208",

booktitle = "ICPE 2016 - Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering",

publisher = "Association for Computing Machinery",

address = "United States",

note = "7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016 ; Conference date: 12-03-2016 Through 16-03-2016",

}

Qiu, Z, Pérez, JF & Harrison, PG 2016, Tackling latency via replication in distributed systems. In ICPE 2016 - Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering. Association for Computing Machinery, pp. 197-208, 7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016, Delft, Netherlands, 3/12/16. https://doi.org/10.1145/2851553.2851562

TY - GEN

T1 - Tackling latency via replication in distributed systems

AU - Qiu, Zhan

AU - Pérez, Juan F.

AU - Harrison, Peter G.

PY - 2016/3/12

Y1 - 2016/3/12

AB - Consistently high reliability and low latency are twin requirements common to many forms of distributed processing; for example, server farms and mirrored storage access. To address them, we consider replication of requests with canceling - i.e. initiate multiple concurrent replicas of a request and use the first successful result returned, canceling all outstanding replicas. This scheme has been studied recently, but mostly for systems with a single central queue, while server farms exploit distributed resources for scalability and robustness. We develop an approximate stochastic model to determine the response-time distribution in a system with distributed queues, and compare its performance against its centralized counterpart. Validation against simulation indicates that our model is accurate for not only the mean response time but also its percentiles, which are particularly relevant for deadline-driven applications. Further, we show that in the distributed set-up, replication with canceling has the potential to reduce response times, even at relatively high utilization. We also find that it offers response times close to those of the centralized system, especially at medium-to-high request reliability. These findings support the use of replication with canceling as an effective mechanism for both fault- and delay-tolerance.

UR - http://www.scopus.com/inward/record.url?scp=85020211058&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85020211058&partnerID=8YFLogxK

U2 - 10.1145/2851553.2851562

DO - 10.1145/2851553.2851562

M3 - Conference contribution

AN - SCOPUS:85020211058

SP - 197

EP - 208

BT - ICPE 2016 - Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering

PB - Association for Computing Machinery

T2 - 7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016

Y2 - 12 March 2016 through 16 March 2016

ER -
