hadoop-mapreduce-user mailing list archives

From juber patel <juberpa...@gmail.com>
Subject hadoop without disk i/o
Date Fri, 30 Jul 2010 03:52:48 GMT

Is it possible to use hadoop and not use disk i/o, apart from the
initial input?

I am asking on the assumption that disk i/o is the bottleneck in
overall processing, even more than network access, if you are on a
dedicated, high-speed cluster. (Does anyone have experience to
confirm or reject this assumption?)

I know that my program's logic does not require disk access after
initial input. I don't even require sorting, but would like to combine
the mapper output to reduce its size. This output is fed to another
job/standalone program where it is interpreted meaningfully. I know
this job/standalone program could be the reducer, but I don't want to
spend time in sorting, especially involving disk spills. It is not

Does anyone have a suggestion for this scenario? Is there something
like NetworkInputFormat? Is there a way to start the reduce phase as
mapper output starts coming in? I am thinking in terms of blocking
queues, without disk access but with hadoop's fault tolerance, input
splitting etc.
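
To make the blocking-queue idea concrete, here is a rough plain-Java sketch of the data flow I have in mind. It uses no Hadoop classes at all, and every name in it (StreamingCombineSketch, run, END) is made up for illustration. A producer stands in for the mapper and pushes records into a bounded in-memory queue, while a consumer combines them in a hash map as they arrive, with no sorting and no disk spill. It says nothing about how fault tolerance or input splitting would be handled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration only: plain Java, no Hadoop APIs, all names hypothetical.
class StreamingCombineSketch {
    // Sentinel telling the consumer that no more records are coming.
    // Assumes no real record equals this value.
    private static final String END = "\u0000END";

    // The calling thread plays the mapper and feeds a bounded queue; a second
    // thread combines records per key as they arrive, entirely in memory.
    static Map<String, Integer> run(List<String> records) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        Map<String, Integer> counts = new HashMap<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String key = queue.take(); !key.equals(END); key = queue.take()) {
                    counts.merge(key, 1, Integer::sum); // combine without sorting
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        for (String record : records) {
            queue.put(record); // blocks while the queue is full
        }
        queue.put(END);
        consumer.join(); // after join(), the consumer's updates are visible here
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(Arrays.asList("a", "b", "a", "c", "a")));
    }
}
```

The bounded queue is the point of the sketch: the producer slows down when the consumer falls behind, so nothing has to be buffered on disk.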

thanks in advance,

