hadoop-common-dev mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Platform MapReduce - Enterprise Features
Date Tue, 13 Sep 2011 12:20:25 GMT

As an aside, if you ask for the white paper you get a PDF that
exaggerates the limits of Hadoop.


Mostly focusing on a critique of the scheduler -which MR-279 will fix in
Hadoop 0.23- the paper says:

   "It is designed to be used by IT departments that
   have an army of developers to help fix any issues they en-

I don't believe this. Cloudera and Hortonworks will do this for a fee 
-as will Platform. In most organisations the R&D effort doesn't go into 
the Hadoop codebase; it goes into writing the analysis code, which is 
why things like Pig and Hive help -they make it easier.

   "Their (Cloudera's) distribution is based on open source
   which is still an unproven large-scale enterprise full stack
   solution. There are many shortcomings in the open source
   distribution,  including the workload management capa-

   Other open source commercial distributions are
   emerging, with IBM and EMC entering the marketplace.
   However, all of these offerings are based on open source
   code and inevitably inherit the strengths and weaknesses
   of that code base and architectural design. "

Ted will point out that MapR's MR engine isn't so limited, as will the 
Brisk people, while Arun will view that statement in the past tense. 
Doug and Tom will pick up on the word "unproven" too. Which enterprises 
plan to have Hadoop clusters bigger than Yahoo!'s or Facebook's?

Furthermore, as Platform only swaps in their own scheduler, leaving the 
filesystem alone, it's a bit weak to critique the architecture of the 
open source distro. Not a way to make friends -or get your bug fixes in. 
Nor, indeed, a basis on which to promise better scalability.

"Therefore they cannot meet the enterprise-class requirements for “big
data” problems as already mentioned."

This is daft. The only thing Platform brings to the table is a scheduler 
that works with "legacy" grid workloads and a console to see what's 
going on. I don't see that being tangibly more enterprise-class than the 
existing JT -which does persist after an outage. With HDFS underneath, a 
new scheduler doesn't even remove the filesystem SPOFs, so the only way 
to get an HA cluster is to swap in a premium filesystem.

The other thing the marketing blurb gets wrong is its claim that Hadoop 
only works with one distributed file system. Not so. You can read in and 
out of any filesystem, file:// being a handy one that works with NFS 
mount points too.
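As a small illustration of that point (the property name is the 0.20-era Hadoop configuration key; the NFS path is made up), a cluster can even be pointed at the local filesystem as its default:

```xml
<!-- core-site.xml fragment: make the local filesystem the default
     instead of HDFS. The /mnt/nfs path used below is hypothetical. -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>
```

Equally, no configuration change is needed at all: `hadoop fs -ls file:///mnt/nfs/exports` reads straight off the mount point, and a single job can mix schemes, taking its input from file:// and writing its output to hdfs://.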

Overall, a disappointing white paper: all it can do to criticise open 
source Hadoop is spread fear about the number of developers you need to 
maintain it, and the limitations of the Hadoop scheduler versus their 
scheduler -that being the only thing that differs between the Platform 
product and the full OSS release.

I missed a talk at the local university by a Platform sales rep last 
month, though I did get to offend one of the authors from the Condor 
team instead [1], by pointing out that all grid schedulers contain a 
major assumption: that storage access times are constant across your 
cluster. That holds if you can pay for something like GPFS, but you 
don't get 50TB of GPFS storage for $2500, which is what adding 25*2TB 
SATA drives would cost if you stuck them on your compute nodes -or 
$7500 for a fully replicated 50TB at HDFS's default 3x replication. 
That's why I'm not a fan of grid systems -the costs of storage and 
networking aren't taken into account. Then there are the availability 
issues with the larger filesystems, which are a topic for another day.
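The storage arithmetic above can be spelled out. The $100-per-drive figure is an assumption implied by the $2500 / 25-drive numbers in the text, and 3x is HDFS's default replication factor:

```python
# Back-of-the-envelope cost of local SATA storage on compute nodes,
# using the figures from the text: 25 x 2TB drives for $2500.
drives = 25
drive_tb = 2
drive_cost = 100                          # USD, implied by $2500 / 25

raw_tb = drives * drive_tb                # 50 TB of raw capacity
raw_cost = drives * drive_cost            # $2500, unreplicated

replication = 3                           # HDFS default replication
replicated_cost = replication * raw_cost  # $7500 for a replicated 50 TB

print(raw_tb, raw_cost, replicated_cost)  # 50 2500 7500
```

The same money buys nowhere near 50 TB on a SAN-backed parallel filesystem, which is the cost asymmetry the paragraph above is pointing at.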

I look forward to them giving a talk at any forthcoming London HUG event 
and will try to do a follow-on talk introducing MR-279 and arguing in 
favour of an OSS solution because the turnaround time on defects is faster.


[1] Miron Livny, facing the camera, two to the left of Sergey Melnik 
-who has the camera, and is an author of Dremel: http://flic.kr/p/akUzE7
