accumulo-user mailing list archives

From "roman.drapeko@baesystems.com" <roman.drap...@baesystems.com>
Subject RE: RowID design and Hive push down
Date Mon, 14 Sep 2015 22:15:10 GMT
So the simplest solution appears to be representing the Unix epoch time as a hexadecimal string (+4
bytes) and doing the same..
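
As an illustration of the hex idea, here is a minimal sketch, assuming the epoch time is in seconds and fits in an unsigned 32-bit value. The class and method names are hypothetical. A zero-padded, fixed-width hex string sorts lexicographically in the same order as the number it encodes, so string-based range predicates keep working. It costs 8 characters instead of 4 raw bytes, hence the +4 bytes.

```java
public class HexEpochRowId {
    // Encode Unix epoch seconds (assumed to fit in an unsigned 32-bit value)
    // as exactly 8 lowercase hex characters, zero-padded on the left.
    static String hexEpoch(long epochSeconds) {
        if (epochSeconds < 0 || epochSeconds > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("epoch seconds out of unsigned 32-bit range");
        }
        return String.format("%08x", epochSeconds);
    }

    public static void main(String[] args) {
        long dayStart = 1136073600L; // 2006-01-01T00:00:00Z
        long nextDay = 1136160000L;  // 2006-01-02T00:00:00Z

        String lower = hexEpoch(dayStart);
        String upper = hexEpoch(nextDay);
        // A row written one hour into the day, with an illustrative suffix.
        String row = hexEpoch(dayStart + 3600) + "_blabla";

        // Fixed-width hex preserves numeric order under string comparison,
        // so the row falls inside [lower, upper).
        boolean inRange = row.compareTo(lower) >= 0 && row.compareTo(upper) < 0;
        System.out.println(lower + " <= " + row + " < " + upper + " : " + inRange);
    }
}
```

The fixed width matters: without zero padding, a shorter hex string could sort after a longer one that encodes a larger number.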


-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: 14 September 2015 22:37
To: Drapeko, Roman (UK Guildford)
Cc: user@accumulo.apache.org
Subject: Re: RowID design and Hive push down

Yes, the reason the simple approach below would work is that, before, you'd just operate on the day
boundary (as specified by the yyyyMMdd) and the suffix would just naturally fall into the
prefix range.

Some code might help draw it together. The comments should bridge the gap

https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/predicate/AccumuloRangeGenerator.java#L277
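
To make the day-boundary point concrete, here is a small self-contained sketch. It is not the handler's code: plain String comparison stands in for the lexicographic row ordering, and the helper names are hypothetical. It shows that any row beginning with the yyyyMMdd prefix sorts inside the range bounded by that prefix and the next one.

```java
public class DayPrefixRange {
    // True if row sorts within [start, end) under lexicographic order.
    static boolean inRange(String row, String start, String end) {
        return row.compareTo(start) >= 0 && row.compareTo(end) < 0;
    }

    // Hypothetical helper: bump the last character of the prefix by one to get
    // an exclusive upper bound for all rows that begin with the prefix.
    // Assumes a non-empty prefix whose last character is not Character.MAX_VALUE.
    static String followingPrefix(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        int lastIndex = prefix.length() - 1;
        char bumped = (char) (prefix.charAt(lastIndex) + 1);
        return prefix.substring(0, lastIndex) + bumped;
    }

    public static void main(String[] args) {
        String day = "20060101";
        String end = followingPrefix(day);

        // The suffix after the date does not matter: the row still sorts
        // between the day prefix and the following prefix.
        System.out.println(inRange("20060101_blabla", day, end));
        System.out.println(inRange("20060102_blabla", day, end));
    }
}
```

Note that the upper bound is a character increment, not calendar arithmetic: the bound for "20060131" is "20060132", which is not a real date but is still a valid exclusive end for the scan.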

roman.drapeko@baesystems.com wrote:
> Hi Josh,
>
> Thanks for the response.
>
> Well, I am not an expert in Accumulo (so I am looking for a clue on how to implement this
> while avoiding custom code as much as possible). I will try to extend my answer a little
> bit and explain what I don't understand.
>
> For example, if my rowID looks like this: 20060101_blabla
>
> I can query Hive with something like: select * from tbl where rowid >
> '20060101' and rowid < '20060102'. To my understanding, what's
> happening under the hood is that AccumuloPredicateHandler creates a
> Range('20060101', '20060102') that is used for scanning (?)
>
> Am I correct in saying that AccumuloPredicateHandler always creates a range that works
> with strings only, and that it's not possible to amend this logic?
>
> Regarding Java primitives - it can always be represented as byte[4]
>
> Roman
>
>
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: 14 September 2015 21:10
> To: user@accumulo.apache.org
> Cc: Drapeko, Roman (UK Guildford)
> Subject: Re: RowID design and Hive push down
>
> I'm not positive what you mean by the "in-built RowID push down
> mechanism won't work with unsigned bytes". Are you saying that you're
> trying to change your current rowID structure to a
> unixTime+logicalSplit+hash structure? And you're trying to evaluate the
> 3 listed requirements against the new form?
>
> First off, the Java primitives are signed, so you're going to be
> limited by that. Don't forget that.
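
A minimal sketch of the signedness caveat, assuming a 4-byte big-endian encoding and assuming rows are ordered by comparing bytes as unsigned values (class and method names are hypothetical). A timestamp above Integer.MAX_VALUE turns negative as a Java int, so a signed comparison orders it first, while an unsigned byte-wise comparison of the encoded form keeps the numeric order.

```java
import java.nio.ByteBuffer;

public class UnsignedOrder {
    // Big-endian 4-byte encoding of the low 32 bits of value.
    static byte[] encodeUnsigned32(long value) {
        return ByteBuffer.allocate(4).putInt((int) value).array();
    }

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        long small = 1136073600L; // fits in a signed int
        long large = 3000000000L; // exceeds Integer.MAX_VALUE

        // Signed view: the cast wraps large to a negative int, so it compares as smaller.
        System.out.println("signed says large < small: " + ((int) large < (int) small));

        // Unsigned byte-wise view of the encoded values keeps numeric order.
        int cmp = compareUnsignedBytes(encodeUnsigned32(small), encodeUnsigned32(large));
        System.out.println("unsigned bytes say small < large: " + (cmp < 0));
    }
}
```

Integer.compareUnsigned gives the same ordering on the int values directly, if the comparison happens in Java code rather than on the encoded bytes.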
>
> Have you seen accumulo.composite.rowid from
> https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration.
> Hypothetically, you can provide some logic which will do custom
> parsing on your row and generate a struct from the components in your row ID.
>
> Of interest might be:
>
> https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/serde/AccumuloRowSerializer.java
>
> https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/test/org/apache/hadoop/hive/accumulo/serde/TestAccumuloRowSerializer.java
>
> You could extend the AccumuloRowSerializer to parse the bytes of the
> rowId according to your own spec. I haven't explicitly tried this
> myself, but in theory, I think your problems are meant to be solved by
> this support. It will take a little bit of effort. Hive's LazyObject
> type system is not my favorite framework to work with. Referencing
> some of the HBaseStorageHandler code might also be worthwhile (as the
> two are very similar).
>
> - Josh
>
> roman.drapeko@baesystems.com wrote:
>> Hi there,
>>
>> Our current rowid format is yyyyMMdd_payload_sha256(raw data). It
>> works nicely as we have a date and uniqueness guaranteed by hash,
>> however unfortunately, rowid is around 50-60 bytes per record.
>>
>> Requirements are the following:
>>
>> 1) Support Hive on top of Accumulo for ad-hoc queries
>>
>> 2) Query original table by date range (e.g. rowID >= '20060101' AND rowID
>>    < '20060103') both in code and Hive
>>
>> 3) Additional queries by ~20 different fields
>>
>> Requirement 3) requires secondary indexes and of course because each
>> RowID is 50-60 bytes, they become super massive (99% of overall
>> space) and really expensive to store.
>>
>> What we are looking to do is to reduce index size to a fixed size:
>> {unixTime}{logicalSplit}{hash}, where unixTime is 4 bytes unsigned
>> integer, logicalSplit - 2 bytes unsigned integer, and hash is 4 bytes
>> - overall 10 bytes.
>>
>> What is unclear to me is how the second requirement can be met in Hive, as
>> to my understanding the built-in RowID push down mechanism won't work
>> with unsigned bytes?
>>
>> Regards,
>>
>> Roman
>>
>> Please consider the environment before printing this email. This
>> message should be regarded as confidential. If you have received this
>> email in error please notify the sender and destroy it immediately.
>> Statements of intent shall only become binding when confirmed in hard
>> copy by an authorised signatory. The contents of this email may
>> relate to dealings with other companies under the control of BAE
>> Systems Applied Intelligence Limited, details of which can be found
>> at http://www.baesystems.com/Businesses/index.htm.
