hadoop-common-user mailing list archives

From "Shimi K" <shimi....@gmail.com>
Subject Re: Caching frequently map input files
Date Mon, 11 Feb 2008 08:03:10 GMT
You misunderstood me. I do not want maps across different nodes to share
their cache. I am not looking for data replication across nodes. Right now
the amount of data I have in those files can fit into RAM. I want to use
Hadoop to search those files in parallel. This way I will reduce the amount
of time it would take me to search all the files on one machine. Even if the
amount of data grows beyond the RAM of a single machine, I have no problem
adding additional machines to the cluster in order to keep the search fast.

90% of my jobs will search all the files in the cluster! I want to make
sure that each node does not waste time loading (from disk) the same file
that it already loaded for the previous job.


On Feb 11, 2008 9:10 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:

> Hi,
> I totally missed what you wanted to convey. What you want is for the
> maps (the tasks) to be able to share their caches across jobs. In
> Hadoop each task is a separate JVM, so sharing caches across tasks means
> sharing across JVMs, and over time at that (i.e. making the cache a
> separate, higher-level entity), which I think is not possible.
> What you can do is increase the file size, i.e. concatenate the files
> before uploading them to the DFS. I don't think that will affect your
> algorithm in any way. Wherever you need to maintain the file boundaries
> you could do something like this in the concatenated file:
> filename1 : data1
> filename2 : data2
> ..
> and the map will take care of such hidden structures.
> So the only way I can think of is to increase the file size through
> concatenation.
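> Something along these lines might work for the map. This is just a
> rough, untested sketch against the org.apache.hadoop.mapred API, with
> made-up class names and a placeholder string standing in for the real
> search:
>
> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
>
> // Recovers the hidden "filename : data" structure from the concatenated
> // file, assuming one such record per input line.
> public class ConcatenatedFileMapper extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, Text> {
>
>   public void map(LongWritable offset, Text line,
>                   OutputCollector<Text, Text> output, Reporter reporter)
>       throws IOException {
>     String record = line.toString();
>     int sep = record.indexOf(" : ");
>     if (sep < 0) {
>       return;                         // malformed line, skip it
>     }
>     String filename = record.substring(0, sep);
>     String data = record.substring(sep + " : ".length());
>     // Placeholder search; emit the original filename as the key so a
>     // match can still be attributed to the file it came from.
>     if (data.indexOf("needle") >= 0) {
>       output.collect(new Text(filename), new Text(data));
>     }
>   }
> }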
> Amar
> Shimi K wrote:
> > I chose Hadoop more for the distributed computation than for its
> > support for huge files, and my files do fit into memory.
> > I have a lot of small files and my system needs to search for something
> > in those files very fast. I figured I could distribute the files on a
> > Hadoop cluster and then use the distributed computation to do the search
> > in parallel on as many files as possible. This way I would be able to
> > return a result faster than if I had used one machine.
> >
> > Is there a way to tell which files are in memory?
> >
> >
> > On Feb 10, 2008 10:33 PM, Ted Dunning <tdunning@veoh.com> wrote:
> >
> >
> >> But if your files DO fit into memory, then the datanodes that have
> >> copies of the blocks of your file will probably still have them in
> >> memory, and since maps are typically data local, you will benefit as
> >> much as possible.
> >>
> >>
> >> On 2/10/08 11:17 AM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:
> >>
> >>
> >>>> Does Hadoop cache frequently used (LRU/MRU) map input files? Or does
> >>>> it load files from disk each time a file is needed, no matter whether
> >>>> it was the same file that was required by the last job on the same
> >>>> node?
> >>>>
> >>>>
> >>> There is no concept of caching input files across jobs.
> >>>
> >>> Hadoop is geared towards dealing with _huge_ amounts of data which
> >>> don't fit into memory anyway... and hence doing it across jobs is
> >>> moot.
> >>>
> >>
> >
> >
