hadoop-common-user mailing list archives

From "GOEKE, MATTHEW (AG/1000)" <matthew.go...@monsanto.com>
Subject RE: Hadoop--store a sequence file in distributed cache?
Date Fri, 12 Aug 2011 16:05:56 GMT
Sofia, correct me if I am wrong, but Mike, I think this thread was about using the output of
a previous job (in this case already in sequence file format) as in-memory join data for
another job.

Side note: does anyone know the rule of thumb on file size when using the distributed
cache vs. just reading from HDFS (join data, not binary files)? I always thought that having
a setup phase in a mapper read directly from HDFS was asking for trouble and that you should
always distribute to each node, but I am hearing more and more people say to just read directly
from HDFS for larger files, to avoid the I/O cost of the distributed cache.

Matt

-----Original Message-----
From: Ian Michael Gumby [mailto:michael_segel@hotmail.com] 
Sent: Friday, August 12, 2011 10:54 AM
To: common-user@hadoop.apache.org
Subject: RE: Hadoop--store a sequence file in distributed cache?


This whole thread doesn't make a lot of sense.

If your first m/r job creates the sequence files, which you then use as input files to your
second job, you don't need to use distributed cache since the output of the first m/r job
is going to be in HDFS.
(Dino is correct on that account.)

Sofia replied saying that she needed to open and close the sequence file to access the data
in each Mapper.map() call. 
Without knowing more about the specific app, Ashook is correct that you could read the file
in Mapper.setup() and then access it in memory.
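
As a rough sketch of that approach (not from this thread: the new-API mapper below, the
made-up "join.data.path" property, and the Text key/value types are all assumptions):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<Text, Text, Text, Text> {
  private final Map<String, String> joinData = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // "join.data.path" is a made-up property; the driver would set it to the
    // first job's output file on HDFS.
    Path side = new Path(conf.get("join.data.path"));
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), side, conf);
    try {
      Text k = new Text();
      Text v = new Text();
      // One pass per task, instead of an open/read/close per map() call.
      while (reader.next(k, v)) {
        joinData.put(k.toString(), v.toString());
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    String match = joinData.get(key.toString()); // pure in-memory lookup
    if (match != null) {
      context.write(key, new Text(match));
    }
  }
}

For files of a few KB to a few MB, as Sofia describes below, the resulting map fits
comfortably in the task heap.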
Joey is correct that you can put anything in the distributed cache, but you don't want to put
an HDFS file into the distributed cache. The distributed cache is a tool for taking something
from your job and distributing it to each task node as a local copy. It does have a bit of
overhead.


A better example is if you're distributing binary objects that you want on each node, e.g. a
C++ .so file that you want to call from within your Java m/r job.
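
Roughly, with the pre-0.21 DistributedCache API (the HDFS paths below are made up):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "cache-demo");
    // Register files in the driver; the framework copies each one to the
    // local disk of every node that runs a task for this job.
    DistributedCache.addCacheFile(
        new URI("/user/sofia/job1/part-00000"), job.getConfiguration());
    // The binary case: a native .so ships the same way.
    DistributedCache.addCacheFile(
        new URI("/libs/libnative.so"), job.getConfiguration());
    // ... set input/output/mapper as usual, then:
    // job.waitForCompletion(true);
  }
}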

If you're not using all of the data in the sequence file, what about using HBase?


> From: ashook@clearedgeit.com
> To: common-user@hadoop.apache.org
> Date: Fri, 12 Aug 2011 09:06:39 -0400
> Subject: RE: Hadoop--store a sequence file in distributed cache?
> 
> If you are looking for performance gains, then possibly reading these files once
> during the setup() call in your Mapper and storing them in some data structure like
> a Map or a List will give you benefits. Having to open/close the files during each
> map call means a lot of unneeded I/O.
> 
> You have to be conscious of your Java heap size though, since you are basically
> storing the files in RAM. If your files are a few MB in size as you said, then it
> shouldn't be a problem. If the amount of data you need to store won't fit, consider
> using HBase as a solution to get access to the data you need.
> 
> But as Joey said, you can put whatever you want in the Distributed Cache -- as long
> as you have a reader for it. You should have no problems using the
> SequenceFile.Reader.
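> 
> For instance, a sketch inside Mapper.setup() (assuming Text keys and values, and
> that the driver already registered the file via DistributedCache.addCacheFile;
> imports are org.apache.hadoop.filecache.DistributedCache plus the usual conf, fs
> and io ones):
> 
>   private final Map<String, String> joinData = new HashMap<String, String>();
> 
>   @Override
>   protected void setup(Context context) throws IOException, InterruptedException {
>     Configuration conf = context.getConfiguration();
>     Path[] cached = DistributedCache.getLocalCacheFiles(conf);
>     // The cached copy is an ordinary local file, so open it via the local FS.
>     SequenceFile.Reader reader =
>         new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
>     try {
>       Text k = new Text();
>       Text v = new Text();
>       while (reader.next(k, v)) {
>         joinData.put(k.toString(), v.toString()); // heap-resident: watch the size
>       }
>     } finally {
>       reader.close();
>     }
>   }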
> 
> -- Adam
> 
> 
> -----Original Message-----
> From: Joey Echeverria [mailto:joey@cloudera.com] 
> Sent: Friday, August 12, 2011 6:28 AM
> To: common-user@hadoop.apache.org; Sofia Georgiakaki
> Subject: Re: Hadoop--store a sequence file in distributed cache?
> 
> You can use any kind of format for files in the distributed cache, so
> yes you can use sequence files. They should be faster to parse than
> most text formats.
> 
> -Joey
> 
> On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki
> <geosofie_tuc@yahoo.com> wrote:
> > Thank you for the reply!
> > In each map(), I need to open-read-close these files (more than 2 in the general
> > case, and maybe up to 20 or more), in order to make some checks. Considering the
> > huge amount of data in the input, making all these file operations on HDFS will
> > kill the performance! So I think it would be better to store these files in the
> > Distributed Cache, so that the whole process would be more efficient - I guess
> > this is the point of using the Distributed Cache in the first place!
> >
> > My question is whether I can store sequence files in the Distributed Cache and
> > handle them using e.g. the SequenceFile.Reader class, or whether I should only
> > keep regular text files in the Distributed Cache and handle them using the usual
> > Java API.
> >
> > Thank you very much
> > Sofia
> >
> > PS: The files have small size, a few KB to a few MB maximum.
> >
> >
> >
> > ________________________________
> > From: Dino Kečo <dino.keco@gmail.com>
> > To: common-user@hadoop.apache.org; Sofia Georgiakaki <geosofie_tuc@yahoo.com>
> > Sent: Friday, August 12, 2011 11:30 AM
> > Subject: Re: Hadoop--store a sequence file in distributed cache?
> >
> > Hi Sofia,
> >
> > I assume that the output of the first job is stored on HDFS. In that case I would
> > read the file directly from the mappers without using the distributed cache. If
> > you put the file into the distributed cache, that would add one more copy
> > operation to your process.
> >
> > Thanks,
> > dino
> >
> >
> > On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki
> > <geosofie_tuc@yahoo.com> wrote:
> >
> >> Good morning,
> >>
> >> I would like to store some files in the distributed cache, in order to be
> >> opened and read from the mappers.
> >> The files are produced by another job and are sequence files.
> >> I am not sure if that format is proper for the distributed cache, as the
> >> files in the distr. cache are stored and read locally. Should I change the
> >> format of the files in the previous job, maybe make them text files, and read
> >> them from the distr. cache using the simple Java API?
> >> Or can I still handle them the usual way we use sequence files, even
> >> if they reside in a local directory? Performance is extremely important
> >> for my project, so I don't know what the best solution would be.
> >>
> >> Thank you in advance,
> >> Sofia Georgiakaki
> 
> 
> 
> -- 
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
> 
 		 	   		  