MapReduce based improved quick reduct algorithm with granular refinement using vertical partitioning scheme
2019; Elsevier BV; Volume: 189; Language: English
DOI: 10.1016/j.knosys.2019.105104
ISSN: 1872-7409
Authors: Pandu Sowkuntla, P. S. V. S. Sai Prasad
Topic(s): Text and Document Classification Technologies
Abstract: In the last few decades, rough sets have evolved to become an essential technology for feature subset selection by way of reduct computation in categorical decision systems. In recent years, with the proliferation of MapReduce for distributed/parallel algorithms, several scalable reduct computation algorithms have been developed for large-scale decision systems using MapReduce. The existing MapReduce based reduct computation approaches use horizontal partitioning (division in object space) of the dataset across the nodes of the cluster, which requires a complicated shuffle and sort phase. In this work, we propose an algorithm, MR_IQRA_VP, which is designed using vertical partitioning (division in attribute space) of the dataset and has a simplified shuffle and sort phase in the MapReduce framework. MR_IQRA_VP is a distributed/parallel implementation of the Improved Quick Reduct Algorithm (IQRA_IG) and is implemented using the iterative MapReduce framework of Apache Spark. We have carried out an extensive experimental comparison on benchmark decision systems against existing horizontal partitioning based reduct computation algorithms. Through experimental analysis, along with theoretical validation, we have established that MR_IQRA_VP is suitable for, and scales to, datasets with a large attribute space and a moderate object space, which are prevalent in areas such as Bioinformatics and Web mining.

• A MapReduce based attribute reduction algorithm is proposed using Rough Set Theory.
• Uses a vertical partitioning scheme that divides the input data over the attribute space.
• Achieves a large reduction in the data transferred in the shuffle and sort phase of MapReduce.
• Introduces granular refinement in MapReduce based reduct computation.
• Scalable for datasets having a larger attribute space, which is useful in Bioinformatics.
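
The vertical partitioning idea described in the abstract can be illustrated with a small single-process sketch. This is not the authors' MR_IQRA_VP and it does not use Apache Spark: it is a simplified quick reduct loop that selects attributes by rough-set dependency (size of the positive region) only, and it omits whatever additional heuristics IQRA_IG applies. All function names (`positive_count`, `refine`, `vertical_partitions`, `map_partition`, `quick_reduct`) and the toy data are hypothetical. The point it tries to show is that when each partition holds whole attribute columns, a mapper can score its attributes locally and emit only one small `(attribute, score)` pair per attribute, while the current granules are refined incrementally rather than recomputed.

```python
from collections import defaultdict


def positive_count(granule_ids, column, decision):
    """Count objects lying in decision-pure granules after refining
    the current granules by one attribute column."""
    decisions = defaultdict(set)
    sizes = defaultdict(int)
    for g, v, d in zip(granule_ids, column, decision):
        decisions[(g, v)].add(d)
        sizes[(g, v)] += 1
    return sum(n for key, n in sizes.items() if len(decisions[key]) == 1)


def refine(granule_ids, column):
    """Granular refinement: split every granule by the attribute's values
    and relabel the resulting granules with consecutive integers."""
    relabel = {}
    return [relabel.setdefault((g, v), len(relabel))
            for g, v in zip(granule_ids, column)]


def vertical_partitions(columns, n_parts):
    """Assumed partitioning: deal whole attribute columns round-robin."""
    names = sorted(columns)
    return [{a: columns[a] for a in names[i::n_parts]} for i in range(n_parts)]


def map_partition(part, granule_ids, decision, selected):
    """'Map' step: score each unselected attribute held by this partition.
    Only small (attribute, score) pairs leave the partition."""
    return [(a, positive_count(granule_ids, col, decision))
            for a, col in part.items() if a not in selected]


def quick_reduct(columns, decision, n_parts=2):
    n = len(decision)
    constant = [0] * n  # a single granule containing every object
    parts = vertical_partitions(columns, n_parts)

    # Target: positive-region size when all attributes are used.
    all_ids = constant
    for name in sorted(columns):
        all_ids = refine(all_ids, columns[name])
    target = positive_count(all_ids, constant, decision)

    granule_ids, reduct = constant, []
    current = positive_count(granule_ids, constant, decision)
    while current < target:
        emitted = [pair for part in parts
                   for pair in map_partition(part, granule_ids, decision, reduct)]
        if not emitted:
            break
        # 'Reduce' step: best score wins, ties broken by attribute name.
        best, current = min(emitted, key=lambda pair: (-pair[1], pair[0]))
        reduct.append(best)
        granule_ids = refine(granule_ids, columns[best])  # shared with mappers
    return reduct


# Toy decision system: the decision is the XOR of attributes a and b.
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [0, 0, 0, 0]}
xor_decision = [0, 1, 1, 0]
print(quick_reduct(columns, xor_decision))  # ['a', 'b']
```

In a real cluster setting the granule labels and the decision column would have to be made available to every partition (for example by broadcasting), which is cheap only when the object space is moderate; that matches the regime the abstract identifies as suitable for vertical partitioning.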