Data Processing on Distributed Systems Storage Challenges
2021; Springer Nature; Language: English
10.1007/978-981-16-3637-0_56
ISSN 2190-3026
Authors: Mohamed Eddoujaji, Hassan Samadi, Mohamed Bohorma
Topic(s): Parallel Computing and Optimization Techniques
Abstract: Hadoop is an open-source software framework for storing data and running applications on clusters of commodity machines. It provides massive storage for all types of data, enormous processing power, and the capacity to handle a virtually unlimited amount of work. Written in Java, the framework is an Apache project sponsored by the Apache Software Foundation [Hadoop official site. http://hadoop.apache.org/]. Thanks to the MapReduce framework, it can process huge amounts of data: rather than moving the data across the network for processing, MapReduce moves the processing code to the data. No one can dismiss the results achieved with Hadoop or question its power and performance in managing large volumes of data of any type: structured, semi-structured, or unstructured [https://www.lebigdata.fr/hadoop]. However, its processing speed suffers when datasets are made up of ever smaller files. This “small files” problem has been well defined by the Hadoop community and by researchers, yet most proposed solutions and concepts address only the pressure exerted on NameNode memory. Newer approaches, such as quasi-random grouping of small heterogeneous files in different formats, do relieve the memory problem because the amount of metadata is considerably reduced; the real challenge companies face, however, is the performance of the Hadoop cluster when processing a very large number of small files.
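The file-grouping idea discussed in the abstract is commonly illustrated by packing many small HDFS files into a single container file. The sketch below shows one standard variant of that technique, not the paper's specific method: a directory of small files is merged into a Hadoop SequenceFile keyed by file name, so the NameNode tracks metadata for one large file instead of thousands of tiny ones. The paths /data/small-files and /data/merged.seq and the class name SmallFileMerger are illustrative assumptions.

// Hypothetical sketch: consolidate many small HDFS files into one SequenceFile
// (key = original file name, value = file contents) to reduce NameNode metadata.
import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small-files"); // assumed input directory
        Path merged   = new Path("/data/merged.seq");  // assumed output container

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(merged),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // this simple sketch does not recurse into sub-directories
                }
                // Read the whole file into memory; acceptable only because the files are small.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, buffer, conf, false);
                }
                // Key the record by the original file name so it remains addressable later.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(buffer.toByteArray()));
            }
        }
    }
}

Keying each record by file name keeps the small files individually addressable inside the container, and a MapReduce job can then read the single SequenceFile in large splits instead of launching one map task per tiny input file.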