hive-user mailing list archives

From "peter.marron@baesystems.com" <peter.mar...@baesystems.com>
Subject RE: Accumulo Storage Manager
Date Mon, 21 Sep 2015 09:43:35 GMT
Hi Josh,

OK, I'm committed to looking into this problem as and when I get time.
I will do my best to try and raise JIRAs and submit some unit tests to reproduce.
Hopefully I'll be able to work on some fixes as well.

However, in the short term I would really like to know how the AccumuloStorageHandler
would/should/will actually store ARRAYs.
You mention the HBaseStorageHandler, so that might be a guide, but I haven't played much
with HBase, so that doesn't help me much in the short term.

I would assume that if we are storing an ARRAY of a fixed-length type (I'm limiting myself
to the binary representation here) then we would end up with just the binary values stored
sequentially.
So an array of INT with values 3, 23, 10 would be stored as \x00\x00\x00\x03\x00\x00\x00\x17\x00\x00\x00\x0a
and so on for all the other types.
This seems obvious enough, but I would like to check.
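
To make the assumed layout concrete, here is a minimal sketch that concatenates each INT
as 4 big-endian bytes. It is not taken from the Hive or Accumulo source; the class and
method names (FixedWidthIntArray, encode, decode, toHex) are hypothetical, and the
big-endian byte order is an assumption.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: NOT the AccumuloStorageHandler implementation.
final class FixedWidthIntArray {

    // Concatenate each int as 4 bytes. ByteBuffer is big-endian by default (assumed order).
    static byte[] encode(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Read the bytes back as consecutive 4-byte big-endian ints.
    static int[] decode(byte[] bytes) {
        if (bytes.length % Integer.BYTES != 0) {
            throw new IllegalArgumentException("length is not a multiple of 4");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] out = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    // Render bytes in \xNN notation for comparison by eye.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 3 = 0x03, 23 = 0x17, 10 = 0x0a
        System.out.println(toHex(encode(new int[] {3, 23, 10})));
    }
}
```

With this layout no delimiter or element count is needed: the number of elements is simply
the value length divided by the element width.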

But the real question is how is an ARRAY<STRING> stored? I can't really see a delimiter
being used, as you can't be sure that the delimiter doesn't occur in the data. So I would
assume that it would be something like this:

<length1><string1><length2><string2> ... <lastLength><lastString>

Is this correct?

If it is, how would the lengths be stored? As 4-byte integers?
Or with some variable-length encoding scheme?
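
As a sketch of the first option only, the following writes each string as a 4-byte
big-endian length followed by its bytes. This is an assumed format for illustration, not
what the AccumuloStorageHandler is known to do; the class name LengthPrefixedStrings, the
UTF-8 encoding, and the fixed 4-byte length are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: NOT the AccumuloStorageHandler implementation.
final class LengthPrefixedStrings {

    // Write <length><bytes> for each string; length is a 4-byte big-endian int (assumed).
    static byte[] encode(List<String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : values) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // UTF-8 is an assumption
            byte[] len = ByteBuffer.allocate(Integer.BYTES).putInt(utf8.length).array();
            out.write(len, 0, len.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    // Read <length><bytes> pairs until the input is exhausted.
    static List<String> decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        List<String> out = new ArrayList<>();
        while (buf.hasRemaining()) {
            int len = buf.getInt();
            byte[] utf8 = new byte[len];
            buf.get(utf8);
            out.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return out;
    }
}
```

Because each length is read before its string, values containing any would-be delimiter
(for example "a,b") and empty strings survive a round trip unchanged. A variable-length
integer for the length would save space for short strings at the cost of a slightly more
involved decoder.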

I know that I'm asking a lot here, as I should probably just look at the code and work it
out for myself, but if you do know and could let me know I'd be grateful.

Thanks,

Z

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: 13 September 2015 03:19
To: user@hive.apache.org
Subject: Re: Accumulo Storage Manager

So the binary parsing definitely seems wrong. Maybe two issues there: one being the inline
#binary not being recognized with the '*' map modifier and the second being the row failing
to parse.

I'd have to write a test to see how the HBaseStorageHandler works and see if I missed something
in handling all the types correctly. The AccumuloStorageHandler should be able to handle the
same kind of types that a native table can handle. So, I would call ARRAYs not being serialized
a bug as well.

Sorry you're running into this. If you could capture these in JIRA issues, that would make
it much easier to start working through them and get them fixed.

If you have the time and desire, trying to reproduce these failures in unit tests would also
be great :). The type handling can be a little difficult but there are likely some places
to start in the accumulo or hbase handler tests. At worst, we can start by writing a qtest
that will reproduce your errors using a full environment (Accumulo minicluster, etc).

peter.marron@baesystems.com wrote:
> Hi Josh,
>
> At this stage I don't know whether there's anything wrong with Hive or it's just user error.
> Perhaps if I go through what I have done you can see where the error lies.
> Unfortunately this is going to be wordy. Apologies in advance for the long email.
>
<snip>