hadoop-common-user mailing list archives

From Marcus Herou <marcus.he...@tailsweep.com>
Subject Parallel maps
Date Thu, 02 Jul 2009 09:11:14 GMT

I've noticed that Hadoop spawns parallel copies of the same task on
different hosts. I understand that this is meant to improve the performance
of the job by prioritizing fast-running tasks. However, since our jobs
connect to databases, this leads to conflicts when inserting, updating, and
deleting data (duplicate keys etc.). Yes, I know I should consider Hadoop a
"Shared Nothing" architecture, but I really must connect to databases in the
jobs. I've created a sharded DB solution which scales as well or I would be

Any hints on how to disable this feature, or how to reduce its impact?
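
To make the conflict concrete, here is a minimal sketch of what happens when
two attempts of the same task write the same rows. It is plain Java, not
Hadoop code: KeyedTable, SpeculativeWriteSketch and runAttempt are
hypothetical names, and the in-memory map only stands in for a database
table with a primary key. It assumes each attempt produces the same
deterministic rows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a database table with a primary key.
class KeyedTable {
    private final Map<String, Integer> rows = new HashMap<>();

    // Plain INSERT semantics: a second write of the same key fails.
    void insert(String key, int value) {
        if (rows.containsKey(key)) {
            throw new IllegalStateException("duplicate key: " + key);
        }
        rows.put(key, value);
    }

    // Upsert semantics: rewriting a key with the same value is harmless.
    void upsert(String key, int value) {
        rows.put(key, value);
    }

    int size() {
        return rows.size();
    }
}

class SpeculativeWriteSketch {
    // One task attempt: writes the same deterministic rows on every run.
    static void runAttempt(KeyedTable table, boolean idempotent) {
        String[] keys = {"user:1", "user:2"};
        for (int i = 0; i < keys.length; i++) {
            if (idempotent) {
                table.upsert(keys[i], i);
            } else {
                table.insert(keys[i], i);
            }
        }
    }

    public static void main(String[] args) {
        // Two attempts with plain inserts: the second one collides.
        KeyedTable plain = new KeyedTable();
        runAttempt(plain, false);
        try {
            runAttempt(plain, false);
        } catch (IllegalStateException e) {
            System.out.println("plain insert failed: " + e.getMessage());
        }

        // Two attempts with upserts: the table ends up in the same state.
        KeyedTable safe = new KeyedTable();
        runAttempt(safe, true);
        runAttempt(safe, true);
        System.out.println("rows after two idempotent attempts: " + safe.size());
    }
}
```

The sketch only covers writes that can be expressed as an upsert on a
deterministic key. Deletes and non-deterministic updates would need a
different approach.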



Marcus Herou CTO and co-founder Tailsweep AB
