|
Abstract
Data streams, unbounded sets of continuous data, pose unique challenges towards issues of indexing
and membership queries. Queries over data streams of the nature, flagging stock-value fluctuations
in stock update streams, detecting news headlines of interest in RSS feeds, object lookup over in a
cache summary etc., need to consider large or unbounded data sets. Bloom filter, a probabilistic
data structures with finite memory requirements, is a useful data structure to index and lookup
stream of data items. The basic bloom filter mechanism has been extended to adapt to the
continuous indexing and querying requiring of unbounded data.
While all extensions consider
the unbounded constraint, we argue that another characteristic, data-importance, is an important
factor towards indexing and querying. The notion of data importance is related to application-
specific priority of different items of a data set—some stocks are more important that others, cost
of fetching objects from a cache are directly related to their size etc. As part of this work, we make
a case for the need to develop solutions that are importance-aware and towards this we propose
Importance -aware Bloom Filter (IBF). As part of IBF, we provide a set insertion and deletion
heuristics to make the bloom filter importance-aware.
Our comparison of IBF with other bloom
filter-based mechanisms shows that IBF performs very well—it has low false positives, and false
negatives that occur due to deletions decrease with data-importance. Our precision and recall
measurements that reflect the metric in terms of data-importance are also encouraging, and we
believe that IBF provides a practical framework to balance the application-specific requirements.
There exist numerous challenges in storing and querying a data tream. Our work focuses on
indexing an unbounded stream of data items to answer set membership queries. Since data streams
are continuous and unbounded, it is impractical to store the entire data during query processing,
especially so in finite-memory systems.
In such contexts, it is difficult to provide precise answers
to queries. To limit the amount data to be stored, or to be processed per data item, time-based
sliding window [1] techniques have been used for continuous queries (CQ) [1]. A continuous query
executes over successive instances of the sliding time-window. In our work, we consider a landmark
window, where one end of the time-window is fixed, for example start-of-day, and the other end
of the window is unspecified. With this setting, the system would index all or partial data items
and answer membership queries.
Another important observation that motivates our work is the notion of importance of data
items (or tuples). The following examples introduce data-importance and motivates the need to
be importance-aware during indexing, querying and measuring performance of query accuracy. Application usage-scenarios:
Scenario 1: Consider a portfolio of stocks and an importance related to each. Possible importance
functions can be based on individual or combinations of stock values, number of stocks, past and
predicted fluctuations etc. Over the period of a day, the value of each stock fluctuates. An user
is interested in knowing which stocks changed by more than 10% of its opening value. A less
accurate answer regarding less important stocks is tolerable, whereas an inaccurate answer on
more important stocks can result in sub-optimal actions to maximize portfolio value.
Scenario 2: A webcrawler starts crawling from a particular web page, collects links on that page,
and then crawls these links recursively, similar to a breadth-first-traversal of a graph. It is obvious
that a some links are popular and will appear on multiple web pages, e.g., a BBC news article
may appear on international news listing of several web-sites. Crawlers need to accurately identify
such popular links and avoid crawling them multiple times. An importance function in such cases
can be correlated with the popularity of a web page or a web-site. A crawlers expectation of
accuracy regarding the question of whether a page has been previously crawled can be related to
its importance— popular pages require more accurate answers.
Scenario 3: Consider a data structure associated with a cache that can be looked-up to determine
if an object is present in the cache. Objects stored in the cache are of different sizes and the cost to
fetch them from a remote server is a function of their size. Assuming the look-up service provides
a probabilistic answer, a requirement that aims to minimize cost would be less tolerable to false
negatives for larger objects as compared to objects with smaller sizes (less important objects).
These examples motivate us to look at schemes that exploit data importance during indexing
Project Done By Ravi Bhoraskar
<<
back |