hive-user mailing list archives

From Sanjay Subramanian <Sanjay.Subraman...@wizecommerce.com>
Subject Re: single output file per partition?
Date Wed, 21 Aug 2013 19:15:58 GMT
Hi

I tried the file crusher with LZO, but it does not work. LZO is correctly configured in production,
and my jobs run daily using LZO compression.

I like Crusher, so I will look into why it's not working. Thanks to Edward, the code is there to tweak
and test locally :-)


sanjay


From: Stephen Sprague <spragues@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, August 21, 2013 12:07 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: single output file per partition?

I see.  I'll have to punt then.  However, there is an after-the-fact file crusher that Ed Capriolo
wrote a while back:  https://github.com/edwardcapriolo/filecrush  YMMV.


On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov <igor@decide.com> wrote:
Using a single bucket per partition seems to create a single reducer, which is too slow.
I also tried enforcing the small-files merge, but that didn't work. I still got multiple output files.

Creating a temp table and then "combining" the multiple files into one using a simple select
* is the only option that seems to work. It's odd that I have to create the temp table but
I don't see a workaround.
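
A rough sketch of the two-step approach described here. The table names (events_src, events_tmp, events_out), the columns id and payload, and the split size are hypothetical; whether each partition ends up as exactly one file depends on data volume and Hive version.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: write to a staging table with the same partitioning.
-- Multiple files per partition are acceptable at this stage.
INSERT OVERWRITE TABLE events_tmp PARTITION (x, y, z)
SELECT id, payload, x, y, z
FROM events_src;

-- Step 2: re-read the staging table with a combining input format,
-- so that small files are grouped into fewer, larger splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Assumed value: allow up to 1 GB per split.
SET mapred.max.split.size=1073741824;

INSERT OVERWRITE TABLE events_out PARTITION (x, y, z)
SELECT * FROM events_tmp;
```

The second insert is a plain SELECT *, so it runs without a reduce phase and the number of output files follows the number of combined splits.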


On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <spragues@gmail.com> wrote:
hi igor,
lots of ideas there!  I can't speak for them all but let me confirm first that "cluster by
X into 1 bucket" didn't work?  I would have thought that would have done it.
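
For reference, the single-bucket idea could be sketched roughly as follows. The table and column names (events_src, events_out, id, payload) are hypothetical, and hive.enforce.bucketing is assumed to be needed, as it is on Hive versions before 2.0.

```sql
-- Hypothetical target table: one bucket per partition.
CREATE TABLE events_out (id STRING, payload STRING)
PARTITIONED BY (x STRING, y STRING, z STRING)
CLUSTERED BY (id) INTO 1 BUCKETS;

-- Make Hive honor the bucket count when writing.
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_out PARTITION (x, y, z)
SELECT id, payload, x, y, z
FROM events_src;
```

With bucketing enforced, the reducer count follows the bucket count, which is consistent with the single slow reducer reported in the reply above.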




On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <igor@decide.com> wrote:
What's the best way to enforce a single output file per partition?

INSERT OVERWRITE TABLE <table>
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would force a single reducer
per partition, but that didn't work. I still got multiple files per partition.

Do I have to use a single reduce task? With a few TB of data that's probably not a good idea.

My current idea is to create a temp table with the same partitioning structure, insert into
that table first, and then select * from that table into the output table. With combineinputformat=true
that should work, right?

Or should I make Hive merge output files instead? (using hive.merge.mapfiles) Will that work
with a partitioned table?
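
For the merge option, the relevant settings look roughly like this. The size values are illustrative assumptions, not recommendations, and as reported in the reply above, enabling the merge did not produce single files in this case.

```sql
-- Merge small files at the end of map-only jobs.
SET hive.merge.mapfiles=true;
-- Merge small files at the end of map-reduce jobs.
SET hive.merge.mapredfiles=true;
-- Target size of each merged file (assumed value, in bytes).
SET hive.merge.size.per.task=256000000;
-- Trigger the merge when the average output file size is below this (assumed value, in bytes).
SET hive.merge.smallfiles.avgsize=128000000;
```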

Thanks!
igor




