hadoop-common-user mailing list archives

From Adam Shook <ash...@clearedgeit.com>
Subject RE: Hadoop--store a sequence file in distributed cache?
Date Fri, 12 Aug 2011 13:06:39 GMT
If you are looking for performance gains, then reading these files once during the setup()
call in your Mapper and storing them in some data structure like a Map or a List should
give you benefits.  Having to open and close the files during each map() call will incur a
lot of unneeded I/O.

You have to be conscious of your Java heap size, though, since you are basically storing the
files in RAM. If your files are a few MB in size as you said, then it shouldn't be a problem.
If the amount of data you need to store won't fit, consider using HBase as a solution to
get access to the data you need.

But as Joey said, you can put whatever you want in the Distributed Cache -- as long as you
have a reader for it.  You should have no problems using SequenceFile.Reader.
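
A minimal sketch of that setup()-time caching, assuming the org.apache.hadoop.mapreduce API
and Text keys and values (your actual Writable types, and the lookup done in map(), will
differ):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Side data loaded once per task, not once per record.
    private final Map<String, String> sideData = new HashMap<String, String>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        if (cached == null) {
            return; // nothing was placed in the cache
        }
        for (Path p : cached) {
            // The cached copies live on the task's local disk, so open them
            // with the local file system, not HDFS.
            SequenceFile.Reader reader = new SequenceFile.Reader(
                    FileSystem.getLocal(conf), p, conf);
            try {
                Text key = new Text();
                Text value = new Text();
                while (reader.next(key, value)) {
                    sideData.put(key.toString(), value.toString());
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Check against the in-memory map instead of reopening files here.
        String match = sideData.get(value.toString());
        if (match != null) {
            context.write(value, new Text(match));
        }
    }
}
```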

-- Adam


-----Original Message-----
From: Joey Echeverria [mailto:joey@cloudera.com] 
Sent: Friday, August 12, 2011 6:28 AM
To: common-user@hadoop.apache.org; Sofia Georgiakaki
Subject: Re: Hadoop--store a sequence file in distributed cache?

You can use any kind of format for files in the distributed cache, so
yes, you can use sequence files. They should be faster to parse than
most text formats.
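
For completeness, registering the sequence file in the driver might look like the sketch
below; the HDFS path and job name are made up for illustration:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class LookupJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lookup-job");
        job.setJarByClass(LookupJobDriver.class);
        // Register the first job's sequence-file output (hypothetical path)
        // so each task gets a local copy before the mappers start.
        DistributedCache.addCacheFile(
                new URI("/user/sofia/job1/part-00000"), job.getConfiguration());
        // ... set mapper class, input/output paths, then
        // job.waitForCompletion(true)
    }
}
```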

-Joey

On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki
<geosofie_tuc@yahoo.com> wrote:
> Thank you for the reply!
> In each map(), I need to open, read, and close these files (more than 2 in the general
> case, and maybe up to 20 or more) in order to make some checks. Considering the huge
> amount of data in the input, making all these file operations on HDFS will kill the
> performance! So I think it would be better to store these files in the Distributed Cache,
> so that the whole process would be more efficient -- I guess this is the point of using
> the Distributed Cache in the first place!
>
> My question is whether I can store sequence files in the Distributed Cache and handle
> them using e.g. the SequenceFile.Reader class, or whether I should only keep regular
> text files in the Distributed Cache and handle them using the usual Java API.
>
> Thank you very much
> Sofia
>
> PS: The files have small size, a few KB to few MB maximum.
>
>
>
> ________________________________
> From: Dino Kečo <dino.keco@gmail.com>
> To: common-user@hadoop.apache.org; Sofia Georgiakaki <geosofie_tuc@yahoo.com>
> Sent: Friday, August 12, 2011 11:30 AM
> Subject: Re: Hadoop--store a sequence file in distributed cache?
>
> Hi Sofia,
>
> I assume that the output of the first job is stored on HDFS. In that case I would
> read the file directly from the Mappers without using the distributed cache. Putting
> the file into the distributed cache would add one more copy operation to your
> process.
>
> Thanks,
> dino
>
>
> On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki
> <geosofie_tuc@yahoo.com>wrote:
>
>> Good morning,
>>
>> I would like to store some files in the distributed cache so that they can be
>> opened and read from the mappers.
>> The files are produced by another Job and are sequence files.
>> I am not sure if that format is suitable for the distributed cache, as the
>> files in the distributed cache are stored and read locally. Should I change the
>> format of the files in the previous Job, maybe make them Text Files, and read
>> them from the Distributed Cache using the simple Java API?
>> Or can I still handle them in the usual way we use sequence files, even
>> if they reside in the local directory? Performance is extremely important
>> for my project, so I don't know what the best solution would be.
>>
>> Thank you in advance,
>> Sofia Georgiakaki



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
