Smart Big Data Clusters

Perez Bernal, Juan Fernando (PI)

Project: Research Project

Description

Data science encompasses the handling, processing, and analysis of large amounts of information. It involves areas such as Big Data and data analytics, and it has grown in popularity because it offers applications, analysis tools and solutions in many fields. Although technology was the first sector to benefit, data science has also proven fundamental for sectors such as retail and marketing, and it has great potential in the agricultural sector and in public administration, among many others.
Data science is one of the specialization and research areas of the new MACC (Applied Mathematics and Computer Science) program at the Universidad del Rosario. The MACC Department is currently in the initial stages of developing this line of work for students and teachers.

This project has two purposes: to establish a software infrastructure (on local and remote computers) that allows students and teachers to experiment with various Big Data technologies; and to use this infrastructure to launch a research project at the Universidad del Rosario focused on developing smart Big Data clusters that adapt automatically to changes in traffic patterns.
Regarding the software infrastructure, a key activity of the project is setting up Big Data clusters on several technologies: Hadoop, Spark, Spark Streaming, Flink, and Memcached. These clusters will support several activities: they will provide test infrastructure for the nascent Big Data Hotbed (which started activities in the second half of 2018); they will support the development of new elective courses in Machine Learning, Big Data and Distributed Systems, to be offered as part of the specialization tracks of the MACC undergraduate program and in the MACC master's program (expected to start in 2020-1); and they will support outreach activities such as the Diploma in Data Science (to be offered for the second time in 2018-2). In addition to enabling the infrastructure, Big Data workshops will be held to familiarize students with these tools and to let them experiment with test clusters.

The Big Data clusters will be deployed on two hardware infrastructures: physical equipment currently located in the Big Data Lab, to which students and teachers have access; and remote computers in the public cloud (Amazon Web Services). This setup will accommodate different types of users and allow experiments at different scales (small and controlled on the physical equipment, medium and large in the cloud).
Both the development of the software infrastructure and the training of students and researchers will be essential to develop future research and consulting projects in the area of Big Data.
Once the first clusters are functional, the research stage will begin. In this stage, mathematical methods (based on probabilistic, statistical and optimization techniques) will be developed that allow the cluster to adapt intelligently to its environment, specifically to the level and type of traffic observed.

These methods become relevant since Big Data applications have migrated to serving online services (streaming), facing variable and uncertain traffic. Due to the complexity of these applications (multiple layers of software on a set of many hardware resources), it is not clear how the application should automatically adjust to changes in traffic, taking into account the quality of service and the costs of operation. Recent examples of these efforts are found in [1-4].
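
The trade-off between quality of service and operating cost can be illustrated with a minimal sketch. It is not the project's method: it assumes, purely for illustration, that a cluster behaves like an M/M/c queue (Poisson arrivals, exponential service times, c identical workers) and picks the smallest number of workers whose mean queueing delay meets a target. The function names and the parameters `arrival_rate`, `service_rate` and `max_wait` are hypothetical.

```python
import math


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving job has to wait in an M/M/c queue.

    offered_load is arrival_rate / service_rate (in Erlangs).
    """
    if offered_load >= servers:
        return 1.0  # unstable queue: every job ends up waiting
    # Erlang B recursion, which avoids large factorials and powers.
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    # Standard conversion from Erlang B to Erlang C.
    return servers * blocking / (servers - offered_load * (1.0 - blocking))


def mean_wait(servers: int, arrival_rate: float, service_rate: float) -> float:
    """Mean time a job spends queueing (excluding service) in an M/M/c queue."""
    offered_load = arrival_rate / service_rate
    if offered_load >= servers:
        return math.inf
    return erlang_c(servers, offered_load) / (servers * service_rate - arrival_rate)


def cheapest_cluster_size(arrival_rate, service_rate, max_wait, max_servers=1000):
    """Smallest worker count whose mean wait meets max_wait, or None.

    Assumes cost grows with the number of workers, so fewer is cheaper.
    """
    for servers in range(1, max_servers + 1):
        if mean_wait(servers, arrival_rate, service_rate) <= max_wait:
            return servers
    return None


if __name__ == "__main__":
    # Hypothetical traffic levels (jobs per second) for a fixed service rate.
    for rate in (5.0, 20.0, 80.0):
        print(rate, cheapest_cluster_size(rate, service_rate=1.0, max_wait=0.1))
```

Re-running such a calculation whenever the measured arrival rate changes is one simple way a cluster could resize itself. Real Big Data applications have several software layers and non-exponential workloads, which is why more elaborate methods are needed.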

The methods developed will be implemented computationally and tested in a simulated environment to measure their effectiveness. They will then be incorporated into the clusters and tested under realistic levels and types of traffic on both the physical and the remote infrastructure. The results of these experiments, and the proposed methods, will be documented in two research articles that will be submitted to international conferences and/or journals. The methods will also be implemented in software modules, which will be a further result of the project.
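
As a rough sketch of what testing a method in a simulated environment might look like, the following toy simulation compares a fixed-size cluster with a simple reactive scaling rule on a synthetic traffic trace. Everything here is an assumption made for illustration: the discrete-time backlog model, the names `simulate`, `static_policy` and `make_reactive_policy`, the `headroom` factor, and the trace itself. None of it comes from the project.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulationResult:
    max_backlog: float  # worst queue length seen (quality-of-service proxy)
    server_steps: int   # total servers x time steps (operating-cost proxy)


def simulate(traffic, policy, per_server_capacity, initial_servers=1):
    """Run a discrete-time backlog model over a traffic trace.

    traffic: jobs arriving in each time step.
    policy:  function (servers, arrivals, backlog) -> servers for the next step.
    """
    servers = initial_servers
    backlog = 0.0
    max_backlog = 0.0
    server_steps = 0
    for arrivals in traffic:
        # Jobs not served in this step carry over to the next one.
        backlog = max(0.0, backlog + arrivals - servers * per_server_capacity)
        max_backlog = max(max_backlog, backlog)
        server_steps += servers
        # The policy only sees what has already happened (reactive control).
        servers = max(1, policy(servers, arrivals, backlog))
    return SimulationResult(max_backlog, server_steps)


def static_policy(servers, arrivals, backlog):
    """Baseline: never change the cluster size."""
    return servers


def make_reactive_policy(per_server_capacity, headroom=1.2):
    """Size the cluster for the last observed demand plus some headroom."""
    def policy(servers, arrivals, backlog):
        demand = (arrivals + backlog) * headroom
        return math.ceil(demand / per_server_capacity)
    return policy


if __name__ == "__main__":
    # Synthetic trace: quiet period, traffic spike, quiet period.
    trace = [10] * 5 + [50] * 5 + [10] * 5
    print(simulate(trace, static_policy, per_server_capacity=10))
    print(simulate(trace, make_reactive_policy(10), per_server_capacity=10))
```

The two proxies make the trade-off visible: on this trace the reactive rule keeps the backlog lower during the spike but consumes more server time than the fixed single-server cluster.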

Size Matters: Improving the Performance of Small Files in Hadoop by Salman Niazi (KTH, Logical Clocks, RISE SICS); Jim Dowling, Seif Haridi (KTH); Mikael Ronström (Oracle); Jim Dowling (Logical Clocks).

[1] G. Mencagli, P. Dazzi, N. Tonci. SpinStreams: a Static Optimization Tool for Data Stream Processing Applications. Middleware, 2018.
[2] S. Esteves, H. Galhardas, L. Veiga. Adaptive Execution of Continuous and Data-intensive Workflows with Machine Learning. Middleware, 2018.
[3] J. Ortiz, B. Lee, M. Balazinska, J. Gehrke, J. L.

Commitments / Obligations

1. A group of students will have been trained in Big Data technologies.
2. A software infrastructure for experimenting with Big Data clusters will be in place.
3. New methods for the intelligent adaptation of Big Data applications will have been developed.
4. Two research articles on the developed methods will have been produced (written and submitted for review).
5. New software products implementing the developed methods will have been generated.

Status	Finished
Effective start/end date	10/1/20 → 10/1/21

UN Sustainable Development Goals

In 2015, UN member states agreed to 17 global Sustainable Development Goals (SDGs) to end poverty, protect the planet and ensure prosperity for all. This project contributes towards the following SDG(s):

Main Funding Source

Competitive Funds
Seed Capital

Location

Bogotá D.C.