hadoop-common-user mailing list archives

From Ricky Ho <...@adobe.com>
Subject Best practices of using Hadoop
Date Thu, 27 Nov 2008 17:37:04 GMT
I am trying to get answers to these kinds of questions, as they come up frequently:

1) What kinds of problems fit Hadoop best, and what kinds do not?

2) What is the dark side of Hadoop, where other parallel processing models (e.g. MPI, TupleSpace,
etc.) fit better?

3) What is the demarcation point between choosing a Hadoop model and a multi-threaded
shared-memory model?

4) Given that we can partition and replicate an RDBMS table, we can make it as big as we like
and spread the workload across machines. Why isn't that good enough for scalability? Why do we need
BigTable or HBase, which require adopting a new data model?

5) Is there a general methodology that can transform any algorithm into the map/reduce form?

6) How would one choose between Hadoop Java, Hadoop Streaming, and PIG? It looks like if a problem
can be solved in one, it can be solved in the others. If so, PIG is more attractive because it
offers higher-level semantics.

I would appreciate it if anyone who has faced these decisions could share their thoughts.

