Artigo Acesso aberto Revisado por pares

An improved data stream summary: the count-min sketch and its applications

2004; Academic Press; Volume: 55; Issue: 1 Linguagem: Inglês

10.1016/j.jalgor.2003.12.001

ISSN

1090-2678

Autores

Graham Cormode, S. Muthukrishnan,

Tópico(s)

Algorithms and Data Compression

Resumo

We introduce a new sublinear space data structure-the Count-Min Sketch-for summarizing data streams.Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc.The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known -typically from 1/ε 2 to 1/ε in factor. IntroductionWe consider a vector a, which is presented in an implicit, incremental fashion.This vector has dimension n, and its current state at time t is aInitially, a is the zero vector, a i (0) = 0 for all i.Updates to individual entries of the vector are presented as a stream of pairs.The tth update is (i t , c t ), meaning that a it (t) = a it (t -1) + c t , and a i (t) = a i (t -1) for all i = i t .At any time t, a query calls for computing certain functions of interest on a(t).This setup is the data stream scenario that has emerged recently.Algorithms for computing functions within the data stream context need to satisfy the following desiderata.First, the space used by the algorithm should be small, at most polylogarithmic in n, the space required to represent a explicitly.Since the space is sublinear in data and input size, the data structures used by the algorithms to represent the input data stream is merely a summary-aka a sketch or synopsis [10])-of it; because of this compression, almost no function that one needs to compute on a can be done precisely, so some approximation is provably needed.Second, processing an update should be fast and simple; likewise, answering queries of a given type should be fast and have usable accuracy guarantees.Typically, accuracy guarantees will be made in terms of a pair of user specified parameters, ε and δ, meaning that the error in answering a query is within a factor of ε with probability δ.The space and update time will consequently depend on ε and δ; our goal will be limit this dependence as much as is possible.Many applications that deal with massive data, such as Internet traffic analysis and monitoring contents of massive databases, motivate this one-pass data stream setup.

Referência(s)