hadoop-common-user mailing list archives

From "Shimi K" <shimi....@gmail.com>
Subject Re: Caching frequently map input files
Date Mon, 11 Feb 2008 08:03:10 GMT
You misunderstood me. I do not want maps across different nodes to share
their cache, and I am not looking for data replication across nodes. Right
now the amount of data in those files fits into RAM. I want to use Hadoop to
search those files in parallel; this way I reduce the time it would take to
search all the files on one machine. Even if the amount of data grows beyond
the RAM of a single machine, I have no problem adding additional machines to
the cluster in order to make the search faster.

90% of my jobs will search all the files in the cluster! I want to make sure
that each node does not waste time loading (from disk) the same file that it
already loaded for the previous job.

Shimi
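
For concreteness, the parallel search Shimi describes could be run as a
map-only job whose mapper emits every matching line. This is only a rough
sketch against the old org.apache.hadoop.mapred API; the configuration key
search.term and the class name SearchMapper are made up for the example, and
map.input.file is assumed to carry the path of the file backing the current
split.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only search over text input: emit every line that contains the term.
public class SearchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String term;                  // what to search for
  private Text fileTag = new Text();    // which file this split came from

  public void configure(JobConf job) {
    // "search.term" is a made-up configuration key for this sketch;
    // "map.input.file" is assumed to hold the current split's file path.
    term = job.get("search.term", "");
    fileTag.set(job.get("map.input.file", "unknown"));
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    if (line.toString().indexOf(term) >= 0) {
      out.collect(fileTag, line);       // key = source file, value = match
    }
  }
}

Run with JobConf.setNumReduceTasks(0), the matches are written straight out
of the maps, one wave of maps per job over the same set of files.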


On Feb 11, 2008 9:10 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:

> Hi,
> I totally missed what you wanted to convey. What you want is for the
> maps (the tasks) to be able to share their caches across jobs. In
> Hadoop each task is a separate JVM, so sharing caches across tasks means
> sharing across JVMs, and that too over time (i.e., making the cache a
> separate, higher-level entity), which I think would not be possible.
> What you can do is increase the file size, i.e., concatenate the files
> before uploading them to the DFS. I don't think that will affect your
> algorithm in any sense. Wherever you need to maintain the file boundary
> you could do something like
> concatenated file:
> filename1 : data1
> filename2 : data2
> ..
> and have the map take care of such hidden structures.
> So the only way I can think of is to increase the file size through
> concatenation. [A rough sketch of this packing step appears after the
> quoted thread below.]
> Amar
> Shimi K wrote:
> > I chose Hadoop more for the distributed computation than for the
> > support of huge files, and my files do fit into memory.
> > I have a lot of small files and my system needs to search for something
> > in those files very fast. I figured I can distribute the files on a
> > Hadoop cluster and then use the distributed computation to do the search
> > in parallel on as many files as possible. This way I would be able to
> > return a result faster than if I had used one machine.
> >
> > Is there a way to tell which files are in memory?
> >
> >
> > On Feb 10, 2008 10:33 PM, Ted Dunning <tdunning@veoh.com> wrote:
> >
> >
> >> But if your files DO fit into memory, then the datanodes that have
> >> copies of the blocks of your file will probably still have them in
> >> memory, and since maps are typically data-local, you will benefit as
> >> much as possible.
> >>
> >>
> >> On 2/10/08 11:17 AM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:
> >>
> >>
> >>>> Does Hadoop cache frequently used (LRU/MRU) map input files? Or does
> >>>> it load files from disk each time a file is needed, no matter whether
> >>>> it was the same file that was required by the last job on the same
> >>>> node?
> >>>>
> >>>>
> >>> There is no concept of caching input files across jobs.
> >>>
> >>> Hadoop is geared towards dealing with _huge_ amounts of data that
> >>> don't fit into memory anyway... and hence caching across jobs is moot.
> >>>
> >>
> >
> >
>
>
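
The packing Amar suggests above could look roughly like the sketch below,
with one substitution: instead of a raw "filename : data" text layout, the
small files are written into a single SequenceFile keyed by the original
file name, which keeps the file boundaries while still producing the one
larger file he recommends. The class name FilePacker and the argument
handling are made up for this example; the FileSystem and SequenceFile calls
are standard Hadoop API.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file under an input directory into one SequenceFile,
// keyed by the original file name, so maps can still see file boundaries.
public class FilePacker {
  public static void main(String[] args) throws Exception {
    Path inputDir = new Path(args[0]);   // directory of small files
    Path packed = new Path(args[1]);     // output SequenceFile on the DFS

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) continue;
        byte[] data = new byte[(int) stat.getLen()];
        InputStream in = fs.open(stat.getPath());
        try {
          IOUtils.readFully(in, data, 0, data.length);
        } finally {
          in.close();
        }
        // Key = original file name, value = full file contents.
        writer.append(new Text(stat.getPath().getName()),
                      new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}

A job that reads the packed file through SequenceFileInputFormat then hands
each map a (file name, file contents) pair, so the per-file boundary
survives without any parsing inside the map.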
