hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Havens <timhav...@gmail.com>
Subject Re: Please HELP: HIVE alter table add new partition to schema...
Date Tue, 19 Jun 2012 16:58:02 GMT
Thanks for the 'small files' heads up.  The current LOGDATE file is about
7.2GB Gzipped.  So I don't think we'll be running into many small files
when that's split into 24 hours.  But thanks for the reminder about 'small
files'.

I guess I'd rather just reload the 2500 logs or so...than have to jump
through hoops...

On Tue, Jun 19, 2012 at 10:45 AM, Mark Grover <mgrover@oanda.com> wrote:

> I agree.
> It would be best if you wrote a script that iterates through each
> leaf-level partition of your existing table (logdate='A', source='B',
> datacenter='C', hostname='D')
> and populate new leaf-level partitions in the new table (logdate='A',
> source='B', datacenter='C', hostname='D', loghour).
>
> By leaf-level partition, I am referring to the partition column that
> appears the last in the list of partition columns.
>
> Technically, you can do dynamic partitioning in 1 query with something
> like:
> FROM test_table src
> INSERT OVERWRITE TABLE dest PARTITION(logdate, source, datacenter,
> hostname, loghour)
>   SELECT ts, exec_time, domain_id, domain_name, logdate, source,
> datacenter, hostname, loghour;
>
> However, as far as I understand, you wouldn't be able to take advantage of
> the existing partitioning in your source table (test_table). If you would
> like to take advantage of existing partitioning, you would have to issue a
> series of queries like this:
> FROM test_table src
> INSERT OVERWRITE TABLE dest PARTITION(logdate='A', source='B',
> datacenter='C', hostname='D', loghour)
>   SELECT ts, exec_time, domain_id, domain_name, logdate, source,
> datacenter, hostname, loghour where logdate='A' and source='B' and
> datacenter='C' and hostname='D';
>
> Just as a side note, whenever you think of your table partitioning, keep
> in mind to not overdo it. Creating more partitions could lead to a lot of
> small files on HDFS which reduces the performance of your Hadoop cluster. A
> couple people have talked about this small files problem:
>
> http://arunxjacob.blogspot.ca/2011/04/hdfs-file-size-vs-allocation-other.html
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
>
> In general, you would like your file sizes to be atleast of HDFS block
> size, most likely a small multiple of the block size. If you do find that
> you are running into the small files problem, there are other ways to get
> around like bucketing.
>
> Good luck!
> Mark
>
> ----- Original Message -----
> From: "Edward Capriolo" <edlinuxguru@gmail.com>
> To: user@hive.apache.org
> Sent: Tuesday, June 19, 2012 11:12:48 AM
> Subject: Re: Please HELP: HIVE alter table add new partition to schema...
>
> You can not change the partition columns. I would use a dynamic
> partition insert to select all the data from the original table into
> the new table.
>
> On 6/19/12, Tim Havens <timhavens@gmail.com> wrote:
> > So...I have a table that has thousands of files, and Billions of rows
> > related it.
> >
> > Lets make this a simple table:
> >
> > CREATE TABLE test_table (
> >     ts BIGINT,
> >     exec_time DOUBLE,
> >     domain_id BIGINT,
> >     domain_name STRING,
> > )
> > PARTITIONED BY (logdate STRING, source STRING, datacenter STRING,
> > hostname STRING)
> > ROW FORMAT DELIMITED
> > FIELDS TERMINATED BY '\t'
> > COLLECTION ITEMS TERMINATED BY '\001'
> > MAP KEYS TERMINATED BY '\002'
> > LINES TERMINATED BY '\n'
> > STORED AS TextFile;
> >
> > So...what I need to do is ADD a partition to the PARTITIONED BY spec
> > above....
> >
> > The partitioned by column I want to add is 'loghour STRING'.
> >
> > I can't seem to find any way to accomplish actually adding a NEW
> > PARTITION COLUMN in the 'PARTITIONED BY' spec, without completely
> > recreating and reloading the table.
> >
> > What's the correct way of adding to the partition schema and new
> > column like 'loghour STRING'.
> >
> > I'm not trying to add an entry into the table DATA, I'm trying to add
> > a completely new PARTITIONED BY Column...
> >
>



-- 
"The whole world is you. Yet you keep thinking there is something else." -
Xuefeng Yicun 822-902 A.D.

Tim R. Havens
Google Phone: 573.454.1232
ICQ: 495992798
ICBM:  37°51'34.79"N   90°35'24.35"W
ham radio callsign: NW0W

Mime
View raw message