hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Caching frequently map input files
Date Mon, 11 Feb 2008 08:17:32 GMT

Are you trying to do ad hoc real-time queries or batches of queries?

On 2/11/08 12:03 AM, "Shimi K" <shimi.eng@gmail.com> wrote:

> You misunderstood me. I do not want maps across different nodes to share
> their caches. I am not looking for data replication across nodes. Right now
> the amount of data I have in those files can fit into RAM. I want to use
> Hadoop to search those files in parallel. This way I will reduce the amount
> of time it takes to search all the files on one machine. Even if the
> amount of data grows beyond the RAM of a single machine, I have no
> problem adding additional machines to the cluster to make the search
> faster.
> 90% of my jobs will search all the files in the cluster! I want to
> make sure that each node will not waste time loading (from disk) the
> same file it already loaded for the previous job.
> Shimi
> On Feb 11, 2008 9:10 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>> Hi,
>> I totally missed what you wanted to convey. What you want is for the
>> maps (the tasks) to be able to share their caches across jobs. In
>> Hadoop each task is a separate JVM, so sharing caches across tasks means
>> sharing across JVMs, and over time as well (i.e. making the cache a
>> separate, higher-level entity), which I think would not be possible.
>> What you can do is increase the file size, i.e. concatenate the files
>> before uploading them to the DFS. I don't think that will affect your
>> algorithm in any way. Wherever you need to maintain a file boundary you
>> could do something like
>> concatenated file:
>> filename1 : data1
>> filename2 : data2
>> ..
>> and have the map take care of such hidden structures.
>> So the only way I see is to increase the file size through concatenation.
>> Amar
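[Editor's note: the concatenate-with-boundaries idea above can be sketched as below. The marker string and helper names are hypothetical, not from the thread, and a real scheme would have to pick a marker guaranteed never to occur in the data (or use length-prefixed records).]

```python
import io

# Hypothetical boundary marker; a real implementation must guarantee it
# never appears at the start of a data line.
MARKER = "::file::"

def concatenate(files):
    """Join many small (filename, text) pairs into one large blob,
    preserving file boundaries with a header line per file."""
    out = io.StringIO()
    for name, data in files:
        out.write(f"{MARKER}{name}\n")
        out.write(data)
        if not data.endswith("\n"):
            out.write("\n")
    return out.getvalue()

def split(blob):
    """Recover the original (filename, text) pairs -- the 'hidden
    structure' a map task would reconstruct while scanning its split."""
    result, name, lines = [], None, []
    for line in blob.splitlines(keepends=True):
        if line.startswith(MARKER):
            if name is not None:
                result.append((name, "".join(lines)))
            name, lines = line[len(MARKER):].rstrip("\n"), []
        else:
            lines.append(line)
    if name is not None:
        result.append((name, "".join(lines)))
    return result
```

With a handful of 100 KB files this yields one DFS file large enough to fill a block, so each map processes many logical files per split instead of one tiny file each.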
>> Shimi K wrote:
>>> I chose Hadoop more for the distributed computation than for its
>>> support for huge files, and my files do fit into memory.
>>> I have a lot of small files and my system needs to search for something
>>> in those files very fast. I figured I could distribute the files on a
>>> Hadoop cluster and then use the distributed computation to do the
>>> search in parallel on as many files as possible. This way I would be
>>> able to return a result faster than if I had used one machine.
>>> Is there a way to tell which files are in memory?
>>> On Feb 10, 2008 10:33 PM, Ted Dunning <tdunning@veoh.com> wrote:
>>>> But if your files DO fit into memory, then the datanodes that have
>>>> copies of the blocks of your file will probably still have them in
>>>> memory, and since maps are typically data-local, you will benefit as
>>>> much as possible.
>>>> On 2/10/08 11:17 AM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:
>>>>>> Does Hadoop cache frequently-used/LRU/MRU map input files? Or does
>>>>>> it load files from disk each time a file is needed, no matter
>>>>>> whether it was the same file that was required by the last job on
>>>>>> the same node?
>>>>> There is no concept of caching input files across jobs.
>>>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>>>> don't fit into memory anyway... and hence doing it across jobs is
>>>>> moot.
