hadoop-common-user mailing list archives

From Amar Kamat <ama...@yahoo-inc.com>
Subject Re: Caching frequently map input files
Date Mon, 11 Feb 2008 07:10:49 GMT
I totally missed what you wanted to convey. What you want is for the
maps (the tasks) to be able to share their caches across jobs. In
Hadoop, each task runs in a separate JVM, so sharing caches across tasks
means sharing across JVMs, and across time as well (i.e., making the
cache a separate, higher-level entity), which I think is not possible.
What you can do is increase the file size: before uploading to the
DFS, concatenate the files. I don't think that will affect your algorithm
in any way. Wherever you need to maintain file boundaries, you could
structure the concatenated file something like
filename1 : data1
filename2 : data2
and have the map take care of such hidden structure.
So the only way I can see is to increase the file size through concatenation.
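The boundary scheme above can be sketched as follows. This is a minimal, self-contained illustration, not Hadoop code: the class name and the "name TAB length NEWLINE" header format are invented for the example (in practice, Hadoop's SequenceFile is the usual container for packing many small files into one large one).

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConcatDemo {

    // Pack many small files into one byte stream. Each record is a
    // "name<TAB>length<NEWLINE>" header followed by the raw file bytes,
    // so every original file boundary can be recovered later.
    static byte[] concat(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            String header = e.getKey() + "\t" + e.getValue().length + "\n";
            out.write(header.getBytes(StandardCharsets.UTF_8));
            out.write(e.getValue());
        }
        return out.toByteArray();
    }

    // Recover the original (name -> bytes) mapping from the concatenated
    // stream, the way a map task would re-split its input.
    static Map<String, byte[]> split(byte[] blob) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
        while (in.available() > 0) {
            // Read the header line up to the newline.
            StringBuilder header = new StringBuilder();
            int c;
            while ((c = != '\n' && c != -1) {
                header.append((char) c);
            }
            String[] parts = header.toString().split("\t");
            // Read exactly the declared number of payload bytes.
            byte[] data = new byte[Integer.parseInt(parts[1])];
            in.readFully(data);
            files.put(parts[0], data);
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("filename1", "data1".getBytes(StandardCharsets.UTF_8));
        files.put("filename2", "data2".getBytes(StandardCharsets.UTF_8));
        // Round-trip: concatenate, then recover the boundaries.
        Map<String, byte[]> back = split(concat(files));
        for (Map.Entry<String, byte[]> e : back.entrySet()) {
            System.out.println(e.getKey() + " : "
                + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

A length-prefixed header like this avoids any escaping of the file contents; the trade-off is that the file is no longer splittable at arbitrary offsets, which is exactly why SequenceFile adds periodic sync markers.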
Shimi K wrote:
> I chose Hadoop more for the distributed computation than for the support for
> huge files, and my files do fit into memory.
> I have a lot of small files, and my system needs to search for something in
> those files very fast. I figured I could distribute the files on a Hadoop
> cluster and then use distributed computation to do the search in
> parallel on as many files as possible. This way I would be able to return a
> result faster than if I had used one machine.
> Is there a way to tell which files are in memory?
> On Feb 10, 2008 10:33 PM, Ted Dunning <tdunning@veoh.com> wrote:
>> But if your files DO fit into memory, then the datanodes that have copies
>> of the blocks of your file will probably still have them in memory, and
>> since maps are typically data-local, you will benefit as much as possible.
>> On 2/10/08 11:17 AM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:
>>>> Does Hadoop cache (frequently used/LRU/MRU) map input files? Or does
>>>> it load files from disk each time a file is needed, even if the same
>>>> file was required by the last job on the same node?
>>> There is no concept of caching input files across jobs.
>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>> don't fit into memory anyway... and hence doing it across jobs is moot.
