hadoop-common-user mailing list archives

From Amr Awadallah <...@cloudera.com>
Subject Re: problem w/ data load
Date Mon, 03 May 2010 21:12:17 GMT
Yep, Hive will work fine if you point it to the .gz file.

Just note, though, that if this is one large .gz file it will only use one
mapper (gzip files are not splittable), so the load will not be parallelized.
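
If parallelism matters, one workaround is to re-split the archive into several
smaller .gz files before loading, so each part gets its own mapper. A rough
shell sketch, with hypothetical local and HDFS paths (big_file.gz, /test/parts):

# decompress, split into ~1M-line chunks, then recompress each chunk
gunzip -c big_file.gz | split -l 1000000 - part_
gzip part_*
# upload all parts into one HDFS directory; LOAD DATA INPATH can then take
# the directory, and Hive gets one mapper per .gz part
hadoop dfs -mkdir /test/parts
hadoop dfs -put part_*.gz /test/parts/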

-- amr

On 5/3/2010 11:29 AM, Edward Capriolo wrote:
> On Mon, May 3, 2010 at 2:00 PM, Susanne Lehmann <susanne.lehmann@metamarketsgroup.com> wrote:
>
>    
>> Hi Tom,
>>
>> Yes. I store the file in HDFS with a .gz extension. Do I need to
>> somehow tell Hive that it is a compressed file?
>>
>> Best,
>> Susanne
>>
>> PS: Thanks for the tip about the list; I will use the other list for
>> further questions if necessary. I wasn't sure which one to use.
>>
>> On Mon, May 3, 2010 at 9:52 AM, Tom White <tom@cloudera.com> wrote:
>>      
>>> Hi Susanne,
>>>
>>> Hadoop uses the file extension to detect that a file is compressed. I
>>> believe Hive does too. Did you store the compressed file in HDFS with
>>> a .gz extension?
>>>
>>> Cheers,
>>> Tom
>>>
>>> BTW It's best to send Hive questions like these to the hive-user@ list.
>>>
>>> On Sun, May 2, 2010 at 11:22 AM, Susanne Lehmann <susanne.lehmann@metamarketsgroup.com> wrote:
>>>        
>>>> Hi,
>>>>
>>>> I want to load data from HDFS into Hive; the data is in compressed files.
>>>> The data is stored in flat files, and the delimiter is ^A (Ctrl-A).
>>>> As long as I use decompressed files, everything works fine. Since
>>>> Ctrl-A is the default delimiter, I don't even need to specify it.
>>>> I do the following:
>>>>
>>>>
>>>> hadoop dfs -put /test/file new
>>>>
>>>> hive>   DROP TABLE test_new;
>>>> OK
>>>> Time taken: 0.057 seconds
>>>> hive>     CREATE TABLE test_new(
>>>>     >         bla  int,
>>>>     >         bla            string,
>>>>     >         etc
>>>>     >         bla      string);
>>>> OK
>>>> Time taken: 0.035 seconds
>>>> hive>  LOAD DATA INPATH "/test/file" INTO TABLE test_new;
>>>> Loading data to table test_new
>>>> OK
>>>> Time taken: 0.063 seconds
>>>>
>>>> But if I do the same with the compressed file, it no longer works. I
>>>> have tried many different table definitions with the delimiter
>>>> specified, but none of them work. The load itself succeeds, but the
>>>> data is always NULL, so I conclude there is a delimiter problem.
>>>>
>>>>   Any help is greatly appreciated!
>>>>
>>>>          
>>>        
>>      
> If your file is a text file that is simply gzipped, you create your table
> as normal:
>
> create table XXXX (...) stored as textfile;
>
> If your file is a sequence file using block compression (gzip), you use:
>
> create table XXXX (...) stored as sequencefile;
>
>    
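
For the plain-gzipped flat-file case discussed in this thread, an illustrative
end-to-end session might look like the sketch below. The column names and
paths are made up (the table name follows the thread's test_new example); the
two things it shows are keeping the .gz extension on the uploaded file and
spelling out the Ctrl-A delimiter (written as '\001' in Hive), even though
Ctrl-A is also the default:

hadoop dfs -put /local/path/file.gz /test/file.gz

hive> CREATE TABLE test_new (
    >   id    INT,
    >   name  STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
    > STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/test/file.gz' INTO TABLE test_new;

If the extension and delimiter are handled correctly, a quick SELECT * FROM
test_new LIMIT 10 should return real values rather than rows of NULLs.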
