accumulo-user mailing list archives

From "roman.drapeko@baesystems.com" <roman.drap...@baesystems.com>
Subject RowID design and Hive push down
Date Mon, 14 Sep 2015 18:47:10 GMT
Hi there,

Our current rowid format is yyyyMMdd_payload_sha256(raw data). It works nicely because it gives
us a date prefix and uniqueness guaranteed by the hash. Unfortunately, each rowid is around
50-60 bytes per record.

Requirements are the following:

1)      Support Hive on top of Accumulo for ad-hoc queries

2)      Query original table by date range (e.g. rowID >= '20060101' AND rowID < '20060103')
both in code and in Hive

3)      Additional queries by ~20 different fields

Requirement 3) requires secondary indexes, and because each RowID is 50-60 bytes, the indexes
become very large (99% of overall space) and really expensive to store.

What we are looking to do is reduce the rowid to a fixed-size format: {unixTime}{logicalSplit}{hash},
where unixTime is a 4-byte unsigned integer, logicalSplit is a 2-byte unsigned integer, and hash
is 4 bytes, for 10 bytes overall.
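
To make the proposed layout concrete, here is a minimal sketch of packing the three fields into
10 bytes. It assumes big-endian encoding, so that sorting the bytes lexicographically as unsigned
values (which is how Accumulo orders keys) matches the numeric order of unixTime. The class and
method names (RowIdSketch, encode, unixTimeOf, compareUnsigned) are hypothetical and not part of
any existing API.

```java
import java.nio.ByteBuffer;

final class RowIdSketch {
    // Packs {unixTime}{logicalSplit}{hash} into 10 bytes.
    // ByteBuffer is big-endian by default, which keeps byte order == numeric order.
    static byte[] encode(long unixTime, int logicalSplit, int hash) {
        if (unixTime < 0 || unixTime > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("unixTime must fit in 4 unsigned bytes");
        }
        if (logicalSplit < 0 || logicalSplit > 0xFFFF) {
            throw new IllegalArgumentException("logicalSplit must fit in 2 unsigned bytes");
        }
        return ByteBuffer.allocate(10)
                .putInt((int) unixTime)         // low 32 bits, read back as unsigned
                .putShort((short) logicalSplit) // low 16 bits, read back as unsigned
                .putInt(hash)                   // 4 bytes of hash (assumed to be precomputed)
                .array();
    }

    // Reads the leading 4 bytes back as an unsigned 32-bit value.
    static long unixTimeOf(byte[] rowId) {
        return ByteBuffer.wrap(rowId).getInt() & 0xFFFFFFFFL;
    }

    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

The point of the sketch is that a signed comparison of the same bytes would sort any unixTime
above 2^31 - 1 before smaller values, whereas the unsigned comparison keeps rows in time order.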

What is unclear to me is how the second requirement can be met in Hive: to my understanding,
the built-in RowID push down mechanism won't work with unsigned bytes?
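
For the "in code" half of requirement 2, a date range over the binary rowid can be expressed as a
pair of 4-byte prefixes. The sketch below only shows how the bounds would be derived; it assumes
dates are interpreted as midnight UTC and encoded big-endian, and the names (DateRangeSketch,
prefixFor) are hypothetical. Whether Hive can push down a predicate in this form is exactly the
open question and is not answered here.

```java
import java.nio.ByteBuffer;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class DateRangeSketch {
    // Returns the 4-byte big-endian prefix for midnight UTC on the given date.
    // Any rowid whose unixTime is >= this instant sorts at or after the prefix
    // under unsigned lexicographic byte ordering.
    static byte[] prefixFor(LocalDate date) {
        long seconds = date.atStartOfDay(ZoneOffset.UTC).toEpochSecond();
        if (seconds < 0 || seconds > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("date does not fit in 4 unsigned bytes");
        }
        return ByteBuffer.allocate(4).putInt((int) seconds).array();
    }
}
```

With this, the range 20060101 (inclusive) to 20060103 (exclusive) becomes the start bound
prefixFor(LocalDate.of(2006, 1, 1)) and the exclusive end bound prefixFor(LocalDate.of(2006, 1, 3)).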

Regards,
Roman




