hive-user mailing list archives

From Kennon Lee <ken...@brooklynpacket.com>
Subject Re: loading datafiles in s3
Date Wed, 29 Jun 2011 20:27:27 GMT
Thanks for the responses. Regarding the first question, I wasn't sure what
you meant by using ALTER TABLE statements to allow for non-prefixed
directory names. Don't you still have to name the directories with the
'blah=' part? For instance, if we do:

ALTER TABLE foo ADD PARTITION (dt='2011-06-29');

Doesn't this look for a directory called "dt=2011-06-29"?
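
For reference, ADD PARTITION also accepts a LOCATION clause, in which case the partition's data is read from the given path rather than from a "dt=..." directory under the table location. Below is a minimal Python sketch of generating such statements for the YYYY/MM/DD/HH layout discussed in this thread. The partition columns dt and hr, the helper name add_partition_statements, and the three-hour range are assumptions for illustration, and the table is assumed to have been created with matching PARTITIONED BY columns.

```python
from datetime import datetime, timedelta


def add_partition_statements(table, base, start, hours):
    """Yield one ALTER TABLE ... ADD PARTITION statement per hourly directory.

    `table`, `base`, and the dt/hr partition column names are hypothetical;
    adjust them to match the actual table definition.
    """
    for i in range(hours):
        ts = start + timedelta(hours=i)
        # Directory layout from the thread: <base>/YYYY/MM/DD/HH/
        location = f"{base}/{ts:%Y/%m/%d/%H}/"
        yield (
            f"ALTER TABLE {table} ADD PARTITION "
            f"(dt='{ts:%Y-%m-%d}', hr='{ts:%H}') "
            f"LOCATION '{location}';"
        )


if __name__ == "__main__":
    start = datetime(2011, 6, 29, 0)
    for stmt in add_partition_statements("foo", "s3://bucketname/dir", start, 3):
        print(stmt)
```

The printed statements can be saved to a file and run as part of the Hive script, which is roughly the "stack of alter table queries" approach mentioned below.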

On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov <igor@decide.com> wrote:

> I think the answer to 1 is No but you can confirm on the AWS EMR forum.
>
> The problem I've been having is that if you have x=foo in the prefix of
> your S3 path, EMR will try to use it as part of your partitioning key even
> if you don't want it.
> Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get
> confused. Sometimes it works, other times it complains that x is not part of
> your INSERT .. PARTITION(y) clause. I haven't quite figured out when and
> why.
>
>
> On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <
> patrick.christopher@hp.com> wrote:
>
>> allo,
>>
>> 1 dunno.  I generate my EMR scripts in a separate script so generating a
>> stack of ‘alter table…’ queries is easy for me
>>
>> 2 event_b will have a null value in column 4.
>>
>> 2 b (you didn’t ask) what happens with this row:
>>
>>   event_c user_id  france 500 afifthcolumn
>>
>> afifthcolumn will be truncated and you’ll have only event_c through 500 in
>> the row
>>
>> Pat
>>
>> *From:* Kennon Lee [mailto:kennon@tinyco.com]
>> *Sent:* Monday, June 27, 2011 5:50 PM
>> *To:* user@hive.apache.org
>> *Subject:* loading datafiles in s3
>>
>> Hello,
>>
>> We're using hive on amazon elastic mapreduce to process logs on s3, and I
>> had a couple of basic questions. Apologies if they've been answered
>> already. I gathered most info from the hive tutorial on amazon (
>> http://aws.amazon.com/articles/2855), as well as from skimming the hive
>> wiki pages, but I'm still very new to all of this. So, questions:
>>
>> 1) Is it possible to partition on directories that do not have the "key="
>> prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
>> and so ideally we could partition on that structure instead of adding "dt="
>> to every directory name. I found an old thread discussing this (
>> http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)
>> but couldn't find the actual syntax.
>>
>> 2) How does hive handle tab-delimited files where rows sometimes have
>> different column counts? For instance, if we are parsing an event log that
>> contains multiple events, some of which have more columns associated with
>> them:
>>
>> event_a        user_id        apple          300
>> event_b        user_id        cat
>>
>> If I define my hive table to have 4 columns, how will hive react to the
>> event_b row?
>>
>> Thanks!
>>
>
>
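
The outcome Pat describes for question 2 (missing trailing columns read as NULL, extra columns dropped) can be modelled in a few lines of Python. This only illustrates the described result for tab-delimited text rows; it is not Hive's actual parsing code, and read_row is a hypothetical helper.

```python
def read_row(line, num_columns, delimiter="\t"):
    """Split a delimited line and fit it to a fixed number of columns.

    Extra trailing fields are dropped; missing trailing fields become None
    (standing in for NULL).
    """
    fields = line.rstrip("\n").split(delimiter)
    fields = fields[:num_columns]                    # drop extra columns
    fields += [None] * (num_columns - len(fields))   # pad missing columns
    return fields


if __name__ == "__main__":
    rows = [
        "event_a\tuser_id\tapple\t300",
        "event_b\tuser_id\tcat",
        "event_c\tuser_id\tfrance\t500\tafifthcolumn",
    ]
    for row in rows:
        print(read_row(row, 4))
```

With a 4-column table, the event_b row ends with a None in column 4 and the event_c row loses afifthcolumn, matching the answers quoted above.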
