hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Sammer <esam...@cloudera.com>
Subject Re: Can I use hadoop/map-reduce for this usecase?
Date Mon, 19 Apr 2010 19:49:24 GMT
On Sun, Apr 18, 2010 at 10:29 AM, Colin Yates <colin.yates@gmail.com> wrote:
> Many thanks for an excellent and thorough answer - cloudera certainly looks
> very interesting as a developer resource.
> I have a few more questions if that is OK.
> So I have a directory filled with .CSV files representing each dump of data,
> or alternatively I have a single SequenceFiles.  How do I ensure a job is
> executed on as many nodes (i.e. all) as possible?  If I set a HDFS
> replication factor of N, does that restrict N parallel jobs?

Replication is not what matters for parallelization. Hadoop will
process many "chunks" of one file in parallel. If you use
SequenceFiles, Hadoop will understand how to split this up and work on
it in parallel. The same is true for text / csv files. You should
check out Tom White's excellent book: Hadoop the Definitive Guide.

> I want to get as many jobs running parallel as possible.
> The most common map/reduce job will be the equivalent of:
> "select the aggregation of some columns from this table where columnX=this
> and columnY=that group by columnP"
> where columnX, columnY and columnP change from job to job :)
> I am absolutely trying to get the most out of divide and conquer - tens, or
> even hundreds of little virtual machines (I take your point about
> virtualisation) throwing their CPUs at this job.

Again, virtualization will almost certainly hurt performance rather
than help it.

> My second question is the number of jobs per node.  If I run this on a
> (non-virtual) machine with 2 quad-core CPUs (each hyper-threaded), so 16
> cores, that (simplistically) means that there could be at least 16 map
> and/or reduce jobs in parallel on each machine.  How do I ensure this, or
> will hadoop do this automatically?  My only motivation for using VMs was to
> achieve the best utilisation of CPU.  (I get life isn't as simple as this
> and the point about IO wait and CPU wait etc.)

The maximum number of allowed tasks per node is a user configured
parameter. Hadoop will trust you do this correctly. You configure map
and reduce slots per node and Hadoop assigns tasks to run in those
slots. The number of tasks one can run per node is based on the type
of tasks and machine configuration. If you're jobs are CPU heavy,
you'll want more cores per disk. IO heavy jobs will want the opposite.
All tasks will consume RAM as they're running in JVMs. The baseline
recommended by Cloudera can be found at
in detail.

Again, avoid virtualization in data intensive processing systems like
this. If you still don't believe me, ask yourself this - would you
increase your Oracle 10g capacity by taking the same hardware and
splitting it up into smaller machines? If anyone answers yes to this,
email me off list. I have a bridge to sell you. ;)

HTH and good luck!
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

View raw message