accumulo-user mailing list archives

From Adam Fuchs <afu...@apache.org>
Subject Re: RowID design and Hive push down
Date Mon, 14 Sep 2015 22:45:35 GMT
Hi Roman,

What's the <payload> used for in your previous key design?

As I'm sure you've figured out, it's generally a bad idea to have a fully
unique hash in your key, especially if you're trying to support extensive
secondary indexing. What we've found is that it's not just the size of the
key but also its compressibility that matters. It's often better to use a
one-up counter of some sort, regardless of whether you're using a hex
encoding or a binary encoding. Due to the birthday problem [1], a uniformly
distributed hash needs roughly 2*log2(n) bits plus a safety margin to keep
collisions unlikely among n records, while a one-up id needs only about
log2(n) bits. So a one-up id generally takes less than half of the bytes of
such a hash, and it will compress much better. Twitter did something like
that in a distributed fashion that they called Snowflake [2]. Google also
published about high-performance timestamp oracles for transactions in
their Percolator paper [3].
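
To make the birthday-problem sizing concrete, here is a minimal Python
sketch (not part of the original thread). The function names, the record
count of 10**9, and the collision target of 1e-6 are illustrative
assumptions; the formulas are the standard birthday approximations.

```python
import math


def counter_bits(n: int) -> int:
    """Bits needed to give n records distinct one-up ids 0..n-1."""
    return max(1, (n - 1).bit_length())


def hash_bits(n: int, p: float) -> int:
    """Smallest hash width b (in bits) such that the birthday estimate
    n^2 / 2^(b+1) of any collision among n records is at most p."""
    return math.ceil(2 * math.log2(n) - 1 - math.log2(p))


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)) for the chance
    that at least two of n uniformly hashed records collide."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


if __name__ == "__main__":
    n, p = 10**9, 1e-6  # assumed record count and collision target
    print(counter_bits(n), hash_bits(n, p))  # 30 79
    # Assumed load: 65,536 records hashed into a 4-byte (32-bit) space.
    print(round(collision_probability(65_536, 32), 2))  # 0.39
```

Under those assumed numbers the counter needs 30 bits against 79 for the
hash. A 4-byte hash, like the one proposed in the quoted message below,
would reach roughly a 39% chance of at least one collision once 65,536
records share the same time and split prefix.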

Cheers,
Adam

[1] https://en.wikipedia.org/wiki/Birthday_problem
[2] https://github.com/twitter/snowflake
[3] http://research.google.com/pubs/pub36726.html


On Mon, Sep 14, 2015 at 2:47 PM, roman.drapeko@baesystems.com <
roman.drapeko@baesystems.com> wrote:

> Hi there,
>
> Our current rowid format is yyyyMMdd_payload_sha256(raw data). It works
> nicely, as we have a date and uniqueness guaranteed by the hash;
> unfortunately, the rowid is around 50-60 bytes per record.
>
> Requirements are the following:
>
> 1) Support Hive on top of Accumulo for ad-hoc queries
>
> 2) Query the original table by date range (e.g. rowID >= '20060101' AND
> rowID < '20060103'), both in code and in Hive
>
> 3) Additional queries by ~20 different fields
>
> Requirement 3) requires secondary indexes and, of course, because each
> RowID is 50-60 bytes, they become massive (99% of overall space) and
> really expensive to store.
>
> What we are looking to do is to reduce the index size by using a
> fixed-size RowID: {unixTime}{logicalSplit}{hash}, where unixTime is a
> 4-byte unsigned integer, logicalSplit is a 2-byte unsigned integer, and
> hash is 4 bytes, for 10 bytes overall.
>
> What is unclear to me is how the second requirement can be met in Hive,
> as to my understanding the built-in RowID push-down mechanism won't work
> with unsigned bytes.
>
> Regards,
>
> Roman
