Parallel Time Series Modeling - A Case Study of In-Database Big Data Analytics
2014; Springer Science+Business Media; Language: English
10.1007/978-3-319-13186-3_38
ISSN: 1611-3349
Authors: Hai Qian, Shengwen Yang, Rahul Iyer, Xixuan Feng, M. Wellons, Caleb Welton,
Topic(s): Data Management and Algorithms
Abstract: MADlib is an open-source library for scalable in-database analytics. In this paper, we present our parallel design of time series analysis and our implementation of ARIMA modeling in MADlib's framework. The algorithms for fitting time series models are intrinsically sequential, since any calculation for a specific time $t$ depends on the result from the previous time step $t-1$. Our solution parallelizes this computation by splitting the data into $n$ chunks. Since the model fitting involves multiple iterations, we use the results from the previous iteration as the initial values for each chunk in the current iteration. Thus the computation for each chunk of data does not depend on the results from the previous chunk in the same iteration. We further improve performance by redistributing the original data such that each chunk can be loaded into memory, minimizing communication overhead. Experiments show that our parallel implementation achieves good speed-up when compared to a sequential version of the algorithm and to R's default implementation in the "stats" package.
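The chunking idea in the abstract can be illustrated with a minimal Python sketch. It is not MADlib's implementation: it assumes a toy MA(1)-style recursion `e_t = x_t - theta * e_{t-1}`, holds `theta` fixed across iterations (a real fit would update the parameters each iteration), and all function names are hypothetical. Each chunk starts from the boundary residual produced by the previous iteration, so the chunks within one iteration do not depend on each other and could be processed in parallel.

```python
def ma1_residuals(x, theta, e0):
    """Sequential recursion e_t = x_t - theta * e_{t-1}, starting from e0."""
    out = []
    prev = e0
    for value in x:
        prev = value - theta * prev
        out.append(prev)
    return out


def split(x, n_chunks):
    """Split x into contiguous chunks of (almost) equal size."""
    size = (len(x) + n_chunks - 1) // n_chunks
    return [x[i:i + size] for i in range(0, len(x), size)]


def chunked_pass(chunks, theta, starts):
    """One iteration: each chunk is computed independently from starts[k].

    The list comprehension runs serially here; because no chunk reads
    another chunk's output from this iteration, it could run in parallel.
    """
    results = [ma1_residuals(c, theta, s) for c, s in zip(chunks, starts)]
    # Chunk k will start the next iteration from the last residual that
    # chunk k-1 produced in this iteration; the first chunk always uses 0.
    next_starts = [0.0] + [r[-1] for r in results[:-1]]
    return results, next_starts


def chunked_residuals(x, theta, n_chunks, n_iter):
    """Run n_iter (>= 1) chunked passes and return the flattened residuals."""
    chunks = split(x, n_chunks)
    starts = [0.0] * len(chunks)  # assumed initial guess for the first pass
    results = []
    for _ in range(n_iter):
        results, starts = chunked_pass(chunks, theta, starts)
    return [e for r in results for e in r]


if __name__ == "__main__":
    series = [1.0, 2.0, 0.5, -1.0, 3.0, 2.5, -0.5, 1.5, 0.0, 2.0]
    print(chunked_residuals(series, theta=0.5, n_chunks=4, n_iter=4))
```

With `theta` held fixed, the first chunk is exact after one pass, the second after two, and so on, so running as many passes as there are chunks reproduces the sequential result. In an iterative fit the passes coincide with the optimizer's iterations, which is why reusing the previous iteration's boundary values adds no extra sequential step.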