hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-2571) Coprocessors: Minitables
Date Sat, 11 Apr 2015 01:16:12 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Andrew Purtell resolved HBASE-2571.
    Resolution: Later

> Coprocessors: Minitables
> ------------------------
>                 Key: HBASE-2571
>                 URL: https://issues.apache.org/jira/browse/HBASE-2571
>             Project: HBase
>          Issue Type: New Feature
>          Components: Coprocessors
>            Reporter: Andrew Purtell
> From http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf :
> {quote}
> Alberto Lerner and S. Muthukrishnan
> [...] Because of [BigTable] semantics and storing scheme, skipping N rows is not feasible
without actually reading them. Even finding the count of rows in a Bigtable at any point in
time can be done only probabilistically. On the bright side, since Bigtable does not provide
a relational query engine, we do not need to consider what are suitable sampling methods for
various relational operators (like joins) or take into account how sampling errors compound
with increasing levels of query composition. 
> _Uniform Random Sampling_
> Our sampling scheme extracts and presents a sample of a Bigtable's rows as if it were
a Bigtable itself, which we call a Minitable. The rationale here is that code written to run
against a Bigtable can run unchanged against a sample thereof. Our sampling is based on a
hash scheme. We pick a convenient hash function that maps the rowname space into a very large
keyspace (e.g., a ax+b mod p function, where p is as large as 2128). The rows falling into
the first fp keys where f is the relative sample size (it is a fraction), would belong in
the sample. Formally, we pick a hash function h : R −> 0..p and if h(r) E [0, fp−1],
then add r to the sample. It is easy to see that the expected size of the sample is f * 100%
of the Bigtable rowcount independent of the rowcount, and the probability that a particular
row r is in the sample is f, as desired. This hash-based sampling method supports maintenance
of the sample with each Bigtable mutation (insert, update, or deletion). Only the system may
forward relevant mutations from the Bigtable to the Minitable. Otherwise, the latter would
behave as just any other Bigtable: it could be backed up and even be replicated. We are currently
deploying Minitables in the repository of documents that the crawling system generates. Several
Minitables, each with a different sample factor, allow that system to compute aggregates much
faster and more often.
> _Biased Sampling_
> Uniform random sampling is quite useful but some scenarios require biased sampling methods.
We are currently working on one such extension that we call Mask Sampling. In this scheme,
the decision to select a row to the sample is still based on its rowname but now a user may
specify a mask m over it. The mask, which can be a regular expression that matches portions
of a rowname, is used to group rows together. Two rows belong to a same group if their masks
result in the same string. Mask sampling guarantees that if a group is selected to the sample,
that group will be adequately represented there, regardless of that group's relative size.
> {quote}
> Clearly minitables can be constructed on the fly by a coprocessor attached to the source

This message was sent by Atlassian JIRA

View raw message