hbase-user mailing list archives

From James Chang <james.bigd...@gmail.com>
Subject Re: Heterogeneous cluster
Date Sat, 08 Dec 2012 15:17:11 GMT
Hi JM,

     I have been thinking about the same issue; in my
opinion, option 2 is preferable.

     By the way, I saw you mentioned that you
have built a "LoadBalancer", could you kindly
share some detailed info about it?

Best Regards.
James Chang

Jean-Marc Spaggiari wrote on Saturday, December 8, 2012:

> Hi,
> Here is the situation.
> I have a heterogeneous cluster with 2-core, 4-core and 8-core
> servers. The performance of these different servers allows them to
> handle different sizes of load. So far, I have built a LoadBalancer
> which balances the regions over those servers based on their
> performance, and it’s working quite well: the RowCounter went down
> from 11 minutes to 6 minutes. However, I can still see tasks running
> on some servers while accessing data on other servers, which
> overwhelms the bandwidth and slows down the process, since some
> 2-core servers are assigned to count rows hosted on 8-core servers.
> I’m looking for a way to “force” the tasks to run on the servers where
> the regions are assigned.
> I first tried to reject the tasks in the Mapper setup method when the
> data was not local, to see if the tracker would assign them to another
> server. It didn’t: the tasks just fail and are mostly not re-assigned.
> I tried IOException, RuntimeException and InterruptedException, with
> no success.
> So now I have 3 possible options.
> The first one is to move from MapReduce to a Coprocessor
> Endpoint. Running locally on the RegionServer, it accesses only the
> local data, and I can manually reject anything that is not local.
> Therefore it achieves my needs, but it’s not my preferred option,
> since I would like to keep the MR features.
> The second option is to tell Hadoop where the tasks should be
> assigned. Should that be done by HBase? By Hadoop? I don’t know.
> Where? I don’t know either. I have started to look at JobTracker and
> JobInProgress code but it seems it will be a big task. Also, doing
> that will mean I will have to re-patch the distributed code each time
> I upgrade the version, and I will have to redo everything when I
> move from 1.0.x to 2.x…
> The third option is to not process the task if the data is not local.
> I mean, in the map method, simply have an “if (!local) return;” right
> at the beginning and do nothing. This will not work for things like
> RowCount, since all the entries are required, but it might work for
> some of my use cases where I don’t necessarily need all the data to
> be processed. It will not be efficient, though: the task will still
> scan the entire region.
> My preferred option is definitely the 2nd one, but it also seems to
> be the most difficult one. The third one is very easy to implement:
> it needs 2 lines to see if the data is local. But it doesn’t work for
> all the scenarios, and is more of a dirty fix. The coprocessor option
> might be doable too, since I already have all the code from my
> MapReduce jobs, so it might be an acceptable option.
> I’m wondering if anyone has already faced this situation and worked
> on something; if not, do you have any other ideas/options to propose,
> or can someone point me to the right classes to look at to implement
> option 2?
> Thanks,
> JM
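
JM’s first option, pushing the count into a Coprocessor Endpoint so each
RegionServer reads only its own data, boils down to a scatter/gather
pattern: every region counts its own rows with no remote reads, and the
client sums the partial results. A minimal toy sketch of that pattern in
plain Java (class and method names are illustrative, not the HBase
coprocessor API):

```java
import java.util.Collection;
import java.util.List;

// Toy model of an endpoint-style rowcount: each "region" counts only
// the rows it hosts, and the "client" aggregates the partial counts.
public class EndpointStyleCount {
    // One region's work: purely local, no network reads involved.
    static long countRegion(List<String> rows) {
        return rows.size();
    }

    // The client side: fan out to every region, sum the partial counts.
    static long countTable(Collection<List<String>> regions) {
        long total = 0;
        for (List<String> region : regions) {
            total += countRegion(region);
        }
        return total;
    }
}
```

The real endpoint would run `countRegion` inside each RegionServer and
ship only the numbers back, which is why it keeps all reads data-local.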
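His second option, telling Hadoop where tasks should be assigned, is
close to what the input-split location hints already try to do:
TableInputFormat reports each region’s server through the split’s
locations, but the JobTracker treats that as a preference, not a
constraint. A toy sketch (illustrative names, not the Hadoop scheduler)
of a greedy assigner that honors a split’s preferred host while it still
has a free slot, and falls back to any host otherwise:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of locality-preferring task assignment. Each split names
// the host serving its region; the scheduler grants the preferred host
// while it has free map slots, otherwise any host with capacity.
public class LocalityScheduler {
    // splitToHost: split name -> host serving that split's region.
    // slotsPerHost: host -> number of free map slots.
    // Returns: split name -> host it was assigned to.
    static Map<String, String> assign(Map<String, String> splitToHost,
                                      Map<String, Integer> slotsPerHost) {
        Map<String, String> assignment = new LinkedHashMap<>();
        Map<String, Integer> slots = new HashMap<>(slotsPerHost);
        for (Map.Entry<String, String> e : splitToHost.entrySet()) {
            String preferred = e.getValue();
            String chosen = null;
            if (slots.getOrDefault(preferred, 0) > 0) {
                chosen = preferred; // data-local assignment
            } else {
                // Preferred host is full: fall back to any free host
                // (this is the non-local case JM wants to avoid).
                for (Map.Entry<String, Integer> h : slots.entrySet()) {
                    if (h.getValue() > 0) {
                        chosen = h.getKey();
                        break;
                    }
                }
            }
            if (chosen != null) {
                slots.put(chosen, slots.get(chosen) - 1);
                assignment.put(e.getKey(), chosen);
            }
        }
        return assignment;
    }
}
```

Making assignment strictly local would mean removing the fallback branch
above, which is roughly the patch to the JobTracker/JobInProgress code
path that the email estimates to be a big task.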
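The third option’s “if (!local) return;” can be sketched concretely.
Assuming a hypothetical `isLocal()` helper that compares the host the
task runs on against the host serving the row’s region (in a real job
that would come from the TableSplit’s region location), the map-side
skip looks like this:

```java
import java.util.Map;

// Sketch of option 3: skip rows whose region lives on another server.
// Hostnames and the isLocal() helper are illustrative assumptions.
public class LocalOnlyCount {
    // Compare short hostnames so "rs1" matches "rs1.example.com".
    static boolean isLocal(String taskHost, String regionHost) {
        String a = taskHost.split("\\.")[0];
        String b = regionHost.split("\\.")[0];
        return a.equalsIgnoreCase(b);
    }

    // Count only the rows hosted on taskHost; everything else is
    // skipped exactly like "if (!local) return;" in a map() method.
    static long countLocal(String taskHost, Map<String, String> rowToHost) {
        long count = 0;
        for (String regionHost : rowToHost.values()) {
            if (!isLocal(taskHost, regionHost)) continue; // the "dirty fix"
            count++;
        }
        return count;
    }
}
```

As the email notes, this fixes the bandwidth problem but not the cost:
the task still scans every row, local or not, before discarding it.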
