hadoop-common-user mailing list archives

From Chuck Lam <chuck....@gmail.com>
Subject Re: Parallell maps
Date Thu, 02 Jul 2009 09:55:52 GMT
You're probably seeing speculative execution. You can turn it off
for mappers and reducers, respectively, by setting
mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution to false.
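The two properties above can be set cluster-wide in mapred-site.xml (or per job). A minimal sketch, using the property names exactly as given in this message (they apply to the old mapred API of that era):

```xml
<!-- mapred-site.xml: disable speculative execution for maps and reduces -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```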

Another thing to watch out for is that a task can fail halfway through,
and Hadoop will then re-run it from the beginning. So it's still possible
for the same DB queries to be executed multiple times.
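Because of retries, the safest approach is to make the DB writes themselves idempotent, so a re-run task overwrites rather than duplicates. A minimal in-memory sketch of that upsert pattern (the class name and data are illustrative, not from the thread; in a real job the same idea maps to keying writes on a primary key, e.g. SQL's INSERT ... ON DUPLICATE KEY UPDATE):

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent-write sketch: key every record by its primary key, so a
// duplicate write from a retried or speculative task overwrites the
// existing row instead of inserting a second one.
public class IdempotentSink {
    private final Map<String, String> table = new HashMap<>();

    // Upsert: insert-or-overwrite by primary key.
    public void upsert(String primaryKey, String value) {
        table.put(primaryKey, value);
    }

    public int rowCount() {
        return table.size();
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.upsert("url:42", "count=7");
        sink.upsert("url:42", "count=7"); // retried task repeats the write
        System.out.println(sink.rowCount()); // prints 1, not 2
    }
}
```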



On Thu, Jul 2, 2009 at 2:13 AM, Marcus Herou<marcus.herou@tailsweep.com> wrote:
> Hi.
>
> I've noticed that Hadoop spawns parallel copies of the same task on
> different hosts. I understand that this is meant to improve the performance
> of the job by prioritizing fast-running tasks. However, since our jobs
> connect to databases, this leads to conflicts when inserting, updating, or
> deleting data (duplicate keys etc.). Yes, I know I should treat Hadoop as a
> "Shared Nothing" architecture, but I really must connect to databases in the
> jobs. I've created a sharded DB solution which scales well, or I would be
> doomed...
>
> Any hints on how to disable this feature, or how to reduce its impact?
>
> Cheers
>
> /Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
>
