hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Caching frequently map input files
Date Mon, 11 Feb 2008 08:03:28 GMT

This is beginning to sound like you are trying to do something that is not a
very good match for hadoop's map-reduce framework at all.  It may be that HDFS
would be very helpful to you, but map-reduce may not be so much help.

Here are a few questions about your application that will help determine
whether it is a good map-reduce candidate:

A) first and most importantly, is your program batch oriented, or is it
supposed to respond quickly to random requests?  If it is batch oriented,
then it is likely that map-reduce will help.  If it is intended to respond
to random requests, then it is unlikely to be a match.

B) do you intend to have a very large number of small files (large is
>1 million files, very large is greater than 10 million), or are your files
very small (small is less than 10MB or so, very small is less than 1MB)?  If
you need a very large number of files, you need to redesign your problem or
look for a different file store.  If you are working with very small files
that nevertheless fit into memory, then you may need to concatenate files
together to get larger files.
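To illustrate the concatenation idea, here is a minimal, Hadoop-free sketch in Python (in practice you would likely use something like Hadoop's SequenceFile; the function names here are hypothetical): many small files are packed into one large file, with an index of (offset, length) pairs so each original record stays addressable.

```python
def pack_files(paths, out_path):
    """Concatenate many small files into one large file.

    Returns an index mapping each original path to its
    (offset, length) within the packed file, so individual
    files remain addressable after packing.
    """
    index = {}
    offset = 0
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            out.write(data)
            index[path] = (offset, len(data))
            offset += len(data)
    return index


def read_packed(packed_path, index, path):
    """Read one original file's bytes back out of the packed file."""
    offset, length = index[path]
    with open(packed_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

The same pattern is what formats like SequenceFile give you with compression and splittability on top: one big file that map tasks can stream through, instead of millions of tiny files that overwhelm the namenode.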

C) how long a program startup time can you allow?  Hadoop's map-reduce is
oriented mostly around the batch processing of very large data sets which
means that a fairly lengthy startup time is acceptable and even desirable if
it allows faster overall throughput.  If you can't stand a startup time of
10 seconds or so, then you need a non-map-reduce design.

D) if you need real-time queries, can you use hbase?

Based on what you have said so far, it sounds like you either have batch
oriented input for relatively small batches of inputs (less than 100,000 or
so) or you have a real-time query requirement.  In either case, you may need
to have a program that runs semi-permanently. If you have such a program,
then keeping track of what data is already in memory is pretty easy and
using HDFS as a file store could be really good for your application.  One
way to get such a long-lived program is to simply use hbase (if you can).
If that doesn't work for you, you might try using the map-file structure or
lucene to implement your own long-running distributed search system.
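As a sketch of the "keeping track of what data is already in memory" idea in such a long-lived process (purely illustrative Python, with hypothetical names; the backing reads could just as well go to HDFS as to the local filesystem), a small LRU cache of file contents is enough:

```python
from collections import OrderedDict


class FileCache:
    """Tiny LRU cache of file contents for a long-running
    search process, so files already held in memory are not
    re-read on every query."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # path -> bytes, in LRU order

    def get(self, path):
        if path in self._cache:
            # Cache hit: mark as most recently used and return.
            self._cache.move_to_end(path)
            return self._cache[path]
        # Cache miss: read from the file store and remember it.
        with open(path, "rb") as f:
            data = f.read()
        self._cache[path] = data
        if len(self._cache) > self.max_entries:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return data

    def cached_paths(self):
        """Answer 'which files are in memory?' for this process."""
        return list(self._cache)
```

This also answers the "is there a way to tell which files are in memory?" question for a semi-permanent process: the process itself knows, because it is doing its own caching rather than relying on whatever happens to be in the datanodes' page cache.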

If you can be more specific about what you are trying to do, you are likely
to get better answers.

On 2/10/08 10:05 PM, "Shimi K" <shimi.eng@gmail.com> wrote:

> I chose Hadoop more for the distributed calculation than for the support for
> huge files, and my files do fit into memory.
> I have a lot of small files, and my system needs to search for something in
> those files very fast. I figured I could distribute the files on a Hadoop
> cluster and then use the distributed calculation to do the search in
> parallel on as many files as possible. This way I would be able to return a
> result faster than if I had used one machine.
> Is there a way to tell which files are in memory?
> Is there a way to tell which files are in memory?
> On Feb 10, 2008 10:33 PM, Ted Dunning <tdunning@veoh.com> wrote:
>> But if your files DO fit into memory then the datanodes that have copies
>> of
>> the blocks of your file will probably still have them in memory and since
>> maps are typically data local, you will benefit as much as possible.
>> On 2/10/08 11:17 AM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:
>>>> Does Hadoop cache frequently used (LRU/MRU) map input files? Or does it
>>>> load files
>>>> from the disk each time a file is needed, no matter if it was the
>>>> same file
>>>> that was required by the last job on the same node?
>>> There is no concept of caching input files across jobs.
>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>> don't fit into memory anyway... and hence doing it across jobs is moot.
