hadoop-common-user mailing list archives

From "Chris Dyer" <redp...@umd.edu>
Subject Re: scaling experiments on a static cluster?
Date Wed, 12 Mar 2008 22:44:59 GMT
Thanks-- that should work.  I'll follow up with the cluster
administrators to see if I can get this to happen.  To rebalance the
file storage, can I just set the replication factor using "hadoop dfs"?
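Something like this is what I had in mind (untested, and the path and
replication factors below are just placeholders):

  # bump replication up so blocks get copied onto the newly added nodes,
  # then drop it back to the normal factor once that has propagated
  bin/hadoop dfs -setrep -R 6 /user/chris/data
  bin/hadoop dfs -setrep -R 3 /user/chris/data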
Chris

On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <tdunning@veoh.com> wrote:
>
>  What about just taking down half of the nodes and then loading your data
>  into the remainder?  Should take about 20 minutes each time you remove nodes
>  but only a few seconds each time you add some.  Remember that you need to
>  reload the data each time (or rebalance it if growing the cluster) to get
>  realistic numbers.
>
>  My suggested procedure would be to take all but 2 nodes down, and then
>  (rough sketch after the list):
>
>  - run test
>  - double number of nodes
>  - rebalance file storage
>  - lather, rinse, repeat
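>
>  Roughly, per iteration (an untested sketch; the jar, class, and paths are
>  placeholders for whatever you actually run):
>
>    # on each node being added for the next run
>    bin/hadoop-daemon.sh start datanode
>    bin/hadoop-daemon.sh start tasktracker
>
>    # reload the input so its blocks are spread over the nodes now in use
>    bin/hadoop dfs -rmr /user/chris/input
>    bin/hadoop dfs -put /local/input /user/chris/input
>
>    # run the timing job
>    bin/hadoop jar mrjob.jar MyDriver /user/chris/input /user/chris/out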
>
>
>
>
>  On 3/12/08 3:28 PM, "Chris Dyer" <redpony@umd.edu> wrote:
>
>  > Hi Hadoop mavens-
>  > I'm hoping someone out there will have a quick solution for me.  I'm
>  > trying to run some very basic scaling experiments for a rapidly
>  > approaching paper deadline on a 0.16.0 Hadoop cluster that has ~20 nodes
>  > with 2 procs/node.  Ideally, I would want to run my code on clusters
>  > of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
>  > The problem is that I am not able to reconfigure the cluster (in the
>  > long run, i.e., before a final version of the paper, I assume this
>  > will be possible, but for now it's not).  Setting the number of
>  > mappers/reducers does not seem to be a viable option, at least not in
>  > the trivial way, since the physical layout of the input files makes
>  > hadoop run a different number of tasks than I request (most of my
>  > jobs consist of multiple MR steps, the initial one always running on a
>  > relatively small data set, which fits into a single block, and
>  > therefore the Hadoop framework does honor my task number request on
>  > the first job-- but during the later ones it does not).
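>  > (For reference, the block layout that drives this can be inspected with
>  > something like the following; the path is a placeholder and I may have
>  > the exact flags wrong:
>  >
>  >   bin/hadoop fsck /user/chris/step2-input -files -blocks -locations
>  >
>  > Each block generally becomes at least one map task, so the requested
>  > task count is only a hint once the input spans multiple blocks.)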
>  >
>  > My questions:
>  > 1) can I get around this limitation programmatically?  I.e., is there
>  > a way to tell the framework to only use a subset of the nodes for DFS
>  > / mapping / reducing?
>  > 2) if not, what statistics would be good to report if I can only have
>  > two data points -- a legacy "single-core" implementation of the
>  > algorithms and a MapReduce version running on the full cluster?
>  >
>  > Thanks for any suggestions!
>  > Chris
>
>
