hadoop-common-user mailing list archives

From Uri Shani <SH...@il.ibm.com>
Subject Re: Parallel maps
Date Thu, 02 Jul 2009 15:08:22 GMT
Hi Marcus,
If you need a database to serve as an intermediary to your customers, then
you can do a massive load of the end results into the DB, as I suggested.
You can also avoid that and work with files only. Then you need to serve
your customers via an API over your HDFS files. Each query will start an
M/R job to get the results.
I don't know the response time, yet it is a viable approach to search.
There are also the Hadoop DBs: HBase and Hive. Pig may also have SQL-like
features in the future, and there is the JAQL project: http://www.jaql.org/.
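
To make the job-per-query idea concrete, here is a minimal sketch against the
0.20-era org.apache.hadoop.mapred API; the /results output path, the
query.string property and the simple substring match are only placeholders
for whatever the real serving layer would do:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class QueryJobLauncher {

    // Stand-in mapper: keeps only the lines that contain the query string.
    public static class MatchMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private String query;

        public void configure(JobConf job) {
            query = job.get("query.string", "");
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            if (line.toString().contains(query)) {
                out.collect(new Text(query), line);
            }
        }
    }

    // Launch one map-only job per incoming query and return its HDFS output
    // directory; the serving API then reads the part-* files under that path.
    public static Path runQuery(String query, String inputDir) throws IOException {
        JobConf conf = new JobConf(QueryJobLauncher.class);
        conf.setJobName("query-job");
        conf.set("query.string", query);

        conf.setMapperClass(MatchMapper.class);
        conf.setNumReduceTasks(0);              // map-only: results go straight to HDFS
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        Path out = new Path("/results/query-" + System.currentTimeMillis());
        FileInputFormat.setInputPaths(conf, new Path(inputDir));
        FileOutputFormat.setOutputPath(conf, out);

        JobClient.runJob(conf);                 // blocks until the job completes
        return out;
    }
}

The API layer would then stream the part-* files under the returned path back
to the customer.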

- Uri



From: Marcus Herou <marcus.herou@tailsweep.com>
To: common-user@hadoop.apache.org
Date: 02/07/2009 05:50 PM
Subject: Re: Parallel maps



Hi.

Yep, I reckon this. As I said, I have created a sharded DB solution where all
data is currently spread across 50 shards. I totally agree that one should
try to emit data to the OutputCollector, but when one has the data in memory
I do not want to throw it away and re-read it into memory again later on,
after the job is done, by chained Hadoop jobs... In our stats system we do
exactly this: almost no DB anywhere in the complex chain, just reducing and
counting like crazy :)
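
For readers who have not seen the chained-jobs pattern, a minimal sketch of
two stages where the second simply reads the files the first wrote (the
/stats/* paths and job names are made up; real mapper/reducer classes would
replace the identity defaults):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class StatsChain {
    public static void main(String[] args) throws Exception {
        // Stage 1: reduce/count the raw data and write the result to HDFS.
        JobConf stage1 = new JobConf(StatsChain.class);
        stage1.setJobName("stats-stage1");
        FileInputFormat.setInputPaths(stage1, new Path("/stats/raw"));
        FileOutputFormat.setOutputPath(stage1, new Path("/stats/stage1"));
        JobClient.runJob(stage1);   // blocks until stage 1 is done

        // Stage 2: reads exactly the files stage 1 wrote, so nothing has to
        // be held in memory between the two jobs.
        JobConf stage2 = new JobConf(StatsChain.class);
        stage2.setJobName("stats-stage2");
        FileInputFormat.setInputPaths(stage2, new Path("/stats/stage1"));
        FileOutputFormat.setOutputPath(stage2, new Path("/stats/final"));
        JobClient.runJob(stage2);
    }
}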

There is another pro to following your example, and that is that you risk
less during the actual job, since you do not have to create a DB pool of
thousands of connections. And the jobs do not affect the live production
DB = good.

Anyway, what do you mean by "if you HAVE to load it into a DB"? How the heck
are my users going to access our search engine without accessing a populated
index or DB?
I've been thinking about this in many use cases and perhaps I am thinking
totally wrong, but I tend not to be able to implement a shared-nothing
architecture all over the chain; it is simply not possible. Sometimes you
need a DB, sometimes you need an index, sometimes a distributed cache,
sometimes a file on a local FS/NFS/GlusterFS (yes, I prefer classical mount
points over HDFS due to the non-wrapper characteristics of HDFS and its
speed, random-access IO that is).

I will make a mental note to adapt and be more Hadoopish :)

Cheers

//Marcus

On Thu, Jul 2, 2009 at 4:27 PM, Uri Shani <SHANI@il.ibm.com> wrote:

> Hi,
> Whenever you try to access a DB in parallel you need to design it for that.
> This means, for instance, that you ensure that each of the parallel tasks
> inserts into a distinct "partition" or table in the database to avoid
> conflicts and failures. Hadoop does the same in its reduce tasks - each
> reduce gets ALL the records that are needed to do its function. So there
> is a hash mapping keys to these tasks. You need to follow the same
> principle when linking DB partitions to your Hadoop tasks.
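
A small illustration of that hash mapping, assuming one DB shard per reduce
task (the 50-shard figure comes from earlier in the thread; the class itself
is not from the original mail):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route every key to a fixed reduce task so that each reduce task only ever
// touches one DB shard/partition (here: 50 shards, numReduceTasks = 50).
public class ShardPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
        // no per-job setup needed for this sketch
    }

    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Same hash => same reduce task => same DB shard, so two parallel
        // tasks never insert into the same partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

With conf.setPartitionerClass(ShardPartitioner.class) and
conf.setNumReduceTasks(50), each reduce task only ever needs a connection to
its own shard.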
>
> In general, think about how to avoid DB inserts from Hadoop. Rather, create
> your results in files.
> At the end of the process, you can - if this is what you HAVE to do -
> massively load all the records into the database using efficient loading
> utilities.
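
One possible shape of that final loading step, sketched with MySQL's
LOAD DATA as the bulk loader; the /stats/final directory, the results_table
target and the connection string are illustrative only:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkLoad {
    // After the M/R job has finished, pull its part-* files out of HDFS and
    // hand them to the database's bulk loader in one go, instead of doing
    // row-by-row inserts from inside the tasks.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Class.forName("com.mysql.jdbc.Driver");
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://dbhost/stats?allowLoadLocalInfile=true",
                "user", "secret");
        Statement stmt = db.createStatement();

        for (FileStatus part : fs.listStatus(new Path("/stats/final"))) {
            if (!part.getPath().getName().startsWith("part-")) continue;
            Path local = new Path("/tmp/" + part.getPath().getName());
            fs.copyToLocalFile(part.getPath(), local);
            // Default TextOutputFormat records are key<TAB>value lines.
            stmt.execute("LOAD DATA LOCAL INFILE '" + local.toUri().getPath()
                    + "' INTO TABLE results_table FIELDS TERMINATED BY '\\t'");
        }
        stmt.close();
        db.close();
    }
}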
>
> If you need the DB to communicate among your tasks, meaning that you need
> the inserts to be readily available for other threads to select, then it
> is obviously the wrong medium for such sharing and you need to look at
> other solutions to share consistent data among Hadoop tasks. For instance,
> ZooKeeper, etc.
>
> Regards,
> - Uri
>
>
>
> From: Marcus Herou <marcus.herou@tailsweep.com>
> To: common-user <common-user@hadoop.apache.org>
> Date: 02/07/2009 12:13 PM
> Subject: Parallel maps
>
>
>
> Hi.
>
> I've noticed that Hadoop spawns parallel copies of the same task on
> different hosts. I've understood that this is meant to improve the
> performance of the job by prioritizing fast-running tasks. However, since
> we connect to databases in our jobs, this leads to conflicts when
> inserting, updating and deleting data (duplicate keys etc). Yes, I know I
> should consider Hadoop as a "Shared Nothing" architecture, but I really
> must connect to databases in the jobs. I've created a sharded DB solution
> which scales as well, or I would be doomed...
>
> Any hints on how to disable this feature or how to reduce the impact of it?
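
The duplicate task attempts described here are Hadoop's speculative
execution; a minimal sketch of switching it off per job (MyJob is a
placeholder for the job's main class, and the property names are the
0.20-era ones):

import org.apache.hadoop.mapred.JobConf;

// Disable speculative execution so no duplicate map/reduce attempts
// hit the database.
JobConf conf = new JobConf(MyJob.class);
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);
// Equivalent configuration properties:
//   mapred.map.tasks.speculative.execution    = false
//   mapred.reduce.tasks.speculative.execution = false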
>
> Cheers
>
> /Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/


