hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Christopher, Pat" <patrick.christop...@hp.com>
Subject RE: loading datafiles in s3
Date Thu, 30 Jun 2011 22:28:11 GMT
no.  you can specify the location of a partition as you go.  Hive will use the value provided
in the partition value clause as the value of the partition column but it doesn't have to
be in the file name.  Ex:

  ALTER TABLE foo ADD PARITION (dt='2011-02-01') SET LOCATION 's3n://bucket/key/subkey/';

then under subkey you can have as many keys/files as you want and all of those files will
be added as part of that partition.  Now on the table side of things, lets say foo is:

  CREATE TABLE foo (
  allofthedata string
  )
  partitioned by (dt string)

This creates a table with this structure:

  allofthedata   string
  dt                     string

Yes!  dt is a column.  Its kind of a pain in the ass.  But, hive will then set the 'dt' column
to the value supplied in the alter table statement for all rows that are imported as part
of that partition load.

tl;dr?

If you specifiy the full location of the directory you wish to add, your data will be added
correctly.

pat

From: Kennon Lee [mailto:kennon@brooklynpacket.com]
Sent: Wednesday, June 29, 2011 1:27 PM
To: user@hive.apache.org
Subject: Re: loading datafiles in s3

Thanks for the responses. Regarding the first question, I wasnt sure what you meant by using
ALTER TABLE statements to allow for non-prefixed directory names. Don't you still have to
name the directories with the 'blah=' part? For instance, if we do:

ALTER TABLE foo ADD PARTITION (dt='2011-06-29');

Doesnt this look for a directory called "dt=2011-06-29"?
On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov <igor@decide.com<mailto:igor@decide.com>>
wrote:
I think the answer to 1 is No but you can confirm on the AWS EMR forum.

The problem I've been having is that if you have x=foo in the prefix of your S3 path, EMR
will try to use it as part of your partitioning key even if you don't want it.
Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get confused. Sometimes
it works, other times it complains that x is not part of your INSERT .. PARTITION(y) clause.
I haven't quite figured out when and why.

On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <patrick.christopher@hp.com<mailto:patrick.christopher@hp.com>>
wrote:
allo,
1 dunno.  I generate my EMR scripts in a separate script so generating a stack of 'alter table...'
queries is easy for me
2 event_b will have a null value in column 4.
2 b ( you didn't ask) what happens with this row:

  event_c user_id  france 500 afifthcolumn

afifthcolumn will be truncated and you'll have only event_c through 500 in the row

Pat

From: Kennon Lee [mailto:kennon@tinyco.com<mailto:kennon@tinyco.com>]
Sent: Monday, June 27, 2011 5:50 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: loading datafiles in s3

Hello,
We're using hive on amazon elastic mapreduce to process logs on s3, and I had a couple basic
questions. Apologies if they've been answered already-- I gathered most info from the hive
tutorial on amazon (http://aws.amazon.com/articles/2855), as well as from skimming the hive
wiki pages, but I'm still very new to all of this. So, questions:

1) Is it possible to partition on directories that do not have the "key=" prefix? Our logs
are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2 and so ideally we could partition
on that structure instead of adding "dt=" to every directory name. I found an old thread discussing
this (http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>)
but couldnt find the actual syntax.

2) How does hive handle tab-delimited files where rows sometimes have different column counts?
For instance, if we are parsing an event log that contains multiple events, some of which
have more columns associated with them:

event_a        user_id        apple          300
event_b        user_id        cat

If i define my hive table to have 4 columns, how will hive react to the event_b row?

Thanks!




Mime
View raw message