Capítulo de livro

SCSI: Real-Time Data Analysis with Cassandra and Spark

2018; Springer International Publishing; Linguagem: Inglês

10.1007/978-981-13-0550-4_11

ISSN

2197-6503

Autores

Archana Chaudhari, Preeti Mulay,

Tópico(s)

Distributed and Parallel Computing Systems

Resumo

Abstract The dynamic progress in the nature of pervasive computing datasets has been main motivation for development of the NoSQL model. The devices having capability of executing “Internet of Things” (IoT) concepts are producing massive amount of data in various forms (structured and unstructured). To handle this IoT data with traditional database schemes is impracticable and expensive. The large-scale unstructured data required as the prerequisites for a preparing pipeline, which flawlessly consolidating the NoSQL storage model such as Apache Cassandra and a Big Data processing platform such as Apache Spark. The Apache Spark is the data-intensive computing paradigm, which allows users to write the applications in various high-level programming languages including Java, Scala, R, Python, etc. The Spark Streaming module receives live input data streams and divides that data into batches by using the Map and Reduce operations. This research presents a novel and scalable approaches called "Smart Cassandra Spark Integration (SCSI)” for solving the challenge of integrating NoSQL data stores like Apache Cassandra with Apache Spark to manage distributed systems based on varied platter of amalgamation of current technologies, IT enabled devices, etc., while eliminating complexity and risk. In this chapter, for performance evaluations, SCSI Streaming framework is compared with the file system-based data stores such as Hadoop Streaming framework. SCSI framework proved scalable, efficient, and accurate while computing big streams of IoT data.

Referência(s)
Altmetric
PlumX