hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Parallell maps
Date Thu, 02 Jul 2009 18:50:04 GMT
On Thu, Jul 2, 2009 at 7:49 AM, Marcus Herou <marcus.herou@tailsweep.com>wrote:

> Anyway what do you mean that if you HAVE to load it into a db ? How the
> heck
> are my users gonna access our search engine without accessing a populated
> index or DB ?

It is very common for hadoop to process data in files and leave results in
files.  Since hadoop is not suitable for interactive processes, it matters
little how the data is kept in the middle of the process.

If your use-case requires that normal users see the results in a database,
then exporting final results to a database may be necessary.  If so, then a
final bulk load is by far the best solution for this.

If you use-case requires that normal users be able to search the data using
text-like searches, then a much better option is to do a final map-reduce
job to spread your data across many shards that are each collected in a
reducer.  Each reducer can then build a Lucene index to be served using
something like Katta.  This allows you to maintain bulk-style operations
until the very end and does not require any database operations at all.

If you need to use data *from* a database, then it is usually a much better
option to dump the database to HDFS and either use the side-data mechanism
to share the data to all mappers or reducers, or just write a map-reduce
program to move the data to the right place.

It is almost never a good idea to have map-reduce programs make random
queries to a database.  If they even probe 1% of your database, you are
probably much better off just using map-reduce to move the data to the right
place instead.

If you have a combined case where data from a map-reduce should be in the
database and then used by a subsequent map-reduce program, you should
consider optimizing away the database dump and leave the data in hadoop for
the next map-reduce step.  It is just fine to load it to the database each
time, but don't waste your effort by dumping it again.

Finally, if you were to repurpose those 50 DB shards, you would be able to
have a whomping fast hadoop cluster.

>  I've been thinking about this in many use-cases and perhaps I am thinking
> totally wrong but I tend to not be able to implement a shared-nothing arch
> all over the chain it is simply not possible. Sometimes you need a DB,

What is a database?  Is it just a bunch of records?  If so, you can store it
in a file.

Do you really think you need random access?  Try doing a join in map-reduce

> sometimes you need an index, sometimes a distributed cache, sometimes a
> file
> on local FS/NFS/glusterFS (yes I prefer classical mount points over HDFS
> due
> to the non wrapper characteristict of HDFS and it's speed, random access IO
> that is).

If you are thinking about random access I/O, then you aren't generally
thinking about map-reduce.  The *whole* point is that you don't need most
random accesses in a batch setting.

Take the class terabyte update problem.  You have 1TB of data in 100B
records.  You need to update 1% of them.

Option 1 (traditional, random access):

Assume 10 ms seek, 10ms rotation, 100MB/s transfer for your data.  You have
10^12 / 10^2 = 10^10 records.  You need to update 10^8 of them.  Each update
costs 10ms seek, 10 ms rotation to read, 10 ms rotation to write = 30 ms.
10^8 of these require 3 Ms = 35 days.

Option 2 (map-reduce, sequential transfer):

Read the entire database sequentially and write the entire database back,
substituting the changed records.  Assuming the same disk as before and
assuming that we read the data in 1GB chunks, we have 1000 reads, 1000
writes to do.  Each one costs 1 seek and then 10^9B / 10^8B/s = 10 seconds
to read dn the same to write.  Total time is 20 ks = 5.5 hours.

You can argue that option 2 is doing 100x more work than it should, but you
have to account for the fact that it still runs more than 100 times faster.

This example is a cartoon, but is surprisingly realistic.  Whenever you say
random access, I think that you are paying four orders of magnitude more in
costs than you should.

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message