hive-user mailing list archives

From Dan Y
Subject Need a smart way to delete the first row of my data
Date Wed, 07 Mar 2012 16:00:58 GMT

I have huge gzipped files whose header row I need to drop before
loading them into a Hive table.

Right now, my process is:
1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)
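
For illustration, steps 1-3 could be collapsed into one streaming pass, so
the uncompressed data is never written to disk. This is only a minimal
sketch in Python using the standard gzip module. The function name
drop_header_gz and the file paths are hypothetical, and it assumes the
header is exactly the first newline-terminated line. The data is still
decompressed and recompressed, so any speedup would come from skipping the
intermediate file, not from avoiding the gzip work itself.

```python
import gzip
import shutil


def drop_header_gz(src_path, dst_path, compresslevel=1):
    """Copy a gzipped text file to dst_path, leaving out its first line.

    compresslevel=1 mirrors the `gzip -1` setting from step 3.
    """
    with gzip.open(src_path, "rb") as src, \
         gzip.open(dst_path, "wb", compresslevel=compresslevel) as dst:
        src.readline()  # read and discard the header row
        # Stream the remaining bytes in 1 MiB chunks.
        shutil.copyfileobj(src, dst, 1024 * 1024)
```

Usage might look like drop_header_gz("data.gz", "data_noheader.gz"), with
the output file then used for the Hive table in step 4.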

I am trying to find a way to speed up this process. Ideally, it would
involve loading the data into Hive as a first step and then deleting the
first row, to avoid the unzip/rezip steps.

Any ideas would be appreciated!

