hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: compressed/encrypted file
Date Wed, 04 Jun 2008 22:52:55 GMT
Haijun,

On Jun 4, 2008, at 3:45 PM, Haijun Cao wrote:

>
> Miles, thanks.
>
> "If your inputs to maps are compressed, then you don't get any automatic
> assignment of mappers to your data: each gzipped file gets assigned a
> mapper." <--- this is the case I am talking about.
>

With the current compression codecs available in Hadoop (zlib/gzip/lzo),
it is not possible to split a compressed file and process it in parallel.
However, once we get bzip2 working, we could split up the files as you
are describing...
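Arun's point can be demonstrated outside Hadoop with a short stand-alone
sketch (plain Python, stdlib only; the data and offsets here are purely
illustrative): a gzip/zlib stream carries decompressor state from byte 0
onward, so a "split" that starts mid-file has no valid header and cannot
be decoded independently — which is exactly why these codecs force one
mapper per whole file.

```python
import gzip
import zlib

# An in-memory gzip stream, standing in for a compressed file on DFS.
data = b"line\n" * 10000
compressed = gzip.compress(data)

# Decompressing from the start of the stream works fine.
assert gzip.decompress(compressed) == data

# But a "split" beginning mid-stream cannot be decompressed on its own:
# the slice has no gzip header, and the codec state depends on all the
# bytes that came before the offset.
try:
    # wbits=31 tells zlib to expect a gzip-wrapped stream.
    zlib.decompress(compressed[len(compressed) // 2:], wbits=31)
    splittable = True
except zlib.error:
    splittable = False

print(splittable)  # False: a mid-file offset is not a usable split point
```

Block-oriented formats like bzip2 differ precisely here: each compressed block can be located and decoded independently, so a reader can start at a block boundary inside the file.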

Arun

> Haijun
>
>
> -----Original Message-----
> From: milesosb@gmail.com [mailto:milesosb@gmail.com] On Behalf Of Miles Osborne
> Sent: Wednesday, June 04, 2008 3:07 PM
> To: core-user@hadoop.apache.org
> Subject: Re: compressed/encrypted file
>
> You can compress / decompress at many points:
>
> --prior to mapping
>
> --after mapping
>
> --after reducing
>
> (I've been experimenting with all these options; we have been crawling
> blogs every day since Feb, and we store compressed sets of posts on DFS)
>
> If your inputs to maps are compressed, then you don't get any automatic
> assignment of mappers to your data: each gzipped file gets assigned a
> mapper.
>
> But otherwise, it is all pretty transparent.
>
> Miles
>
> 2008/6/4 Haijun Cao <haijun@kindsight.net>:
>
>>
>> If a file is compressed and encrypted, is it still possible to split it
>> and run mappers in parallel?
>>
>> Do people compress their files stored in Hadoop? If yes, how do you go
>> about processing them in parallel?
>>
>> Thanks
>> Haijun
>>
>
>
>
> -- 
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
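Miles's point above — whole compressed files as the unit of work, one
mapper per gzipped file — can be sketched outside Hadoop with stdlib
Python. The file names and line counts below are hypothetical stand-ins
for the compressed sets of blog posts he mentions; a thread pool plays
the role of the mapper slots.

```python
import gzip
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    # Each "mapper" receives a whole gzip file: since decompression must
    # begin at byte 0, the file itself is the unit of parallelism.
    with gzip.open(path, "rt") as f:
        return sum(1 for _ in f)

# Hypothetical compressed input files (stand-ins for files on DFS).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, f"posts-{i}.gz")
    with gzip.open(path, "wt") as f:
        f.write("post\n" * (100 * (i + 1)))
    paths.append(path)

# One task per file, mirroring one-mapper-per-gzipped-file assignment.
with ThreadPoolExecutor() as pool:
    counts = list(pool.map(count_lines, paths))

print(counts)  # [100, 200, 300]
```

The trade-off Miles describes follows directly: parallelism is capped by the number of files, so many moderately sized compressed files parallelize better than one huge one.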

