mahout-user mailing list archives

From Jerry Ye <jerr...@yahoo-inc.com>
Subject Re: About GBDT support
Date Thu, 15 Sep 2011 18:16:24 GMT
Hi Jake,
This should work as well, since the main pain points I had with MapReduce were the lack of
persistent memory between iterations and the lack of speedy communication.  Along the same lines,
there is also an MPI application master in the works for Hadoop YARN/NextGen.

However, integrating either Giraph or MPI into Mahout would require some thought.

- jerry

On Sep 15, 2011, at 9:44 AM, Jake Mannix wrote:

This is an old thread to resurrect from the dead, but I wonder if training GBDT on a Pregel-like
architecture (for example, Apache Giraph (http://incubator.apache.org/giraph), which
loads the data set into memory in long-lived Hadoop Mappers, then communicates via Hadoop
RPC between nodes while doing BSP iterations) would allow for "speedy communication of optimal
split points"?  The advantage of Giraph is that it runs on vanilla Hadoop: no new schedulers
or job trackers.

  -jake

On Thu, Mar 24, 2011 at 8:47 PM, Jerry Ye <jerryye@yahoo-inc.com> wrote:
One of the authors of the Yahoo paper here. One of the main points of our paper was that
an algorithm like GBDT needs a speedy way of communicating optimal split points and subsequently
the partitioning of the samples.  As of right now, communication in MapReduce is essentially
done by reading/writing files on HDFS.  This was exactly the reason why our pure MapReduce
implementation of the algorithm was so slow.  In order to reduce communication costs, we looked
into using MPI and wrote a launcher for it on top of Hadoop (mostly to utilize existing clusters).
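The split-point communication described above typically reduces to summing per-worker gradient histograms; with MPI that sum is a single allreduce. A hand-rolled, single-process sketch (the function names are mine, not the paper's actual code):

```python
# Sketch of the histogram-merge pattern behind "speedy communication":
# each worker bins the gradients of its own samples, one allreduce sums
# the histograms, and every worker can then score splits on the global
# totals.  Writing each histogram to HDFS and reading it back between
# jobs is what made the pure MapReduce version slow.

NUM_BINS = 4

def local_histogram(samples, lo, hi):
    """Per-bin gradient sums for one worker's (feature, gradient) pairs."""
    hist = [0.0] * NUM_BINS
    for x, grad in samples:
        b = min(int((x - lo) / (hi - lo) * NUM_BINS), NUM_BINS - 1)
        hist[b] += grad
    return hist

def allreduce(histograms):
    """Element-wise sum across workers (a stand-in for MPI_Allreduce)."""
    return [sum(col) for col in zip(*histograms)]

workers = [
    [(0.5, -1.0), (1.5, -1.0)],  # worker 0: (feature value, gradient)
    [(2.5, 1.0), (3.5, 1.0)],    # worker 1
]
global_hist = allreduce(local_histogram(w, 0.0, 4.0) for w in workers)
print(global_hist)  # → [-1.0, -1.0, 1.0, 1.0]
```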

Google's approach was quite different: it trained approximate models, while ours produced exactly
the same model as the single-machine implementation.  Like ours, their approach also deviated
greatly from a standard MapReduce job, and they had to write what was essentially another
job scheduler.

Recently, there was a NIPS workshop on distributed computing (http://lccc.eecs.berkeley.edu/)
with two other variants of GBDT from Microsoft and Washington University.  The general message,
however, was that we need something better than writing to HDFS for communication, and all
of the solutions used MPI or some other way of communicating between nodes directly.

It would be great to have GBDT in Mahout, but it's not obvious how to fit a fast and scalable
implementation into the existing framework.

- jerry

On 3/24/11 8:00 PM, "Bai, Gang" <dev@baigang.net> wrote:

This post is probably OT since it's essentially about implementing an
algorithm for Mahout. :-)

By the nature of boosting, GBDT is hard to parallelize across the whole ensemble,
so it typically requires multiple sequential stages. Efficiency is also a pressing
issue: a distributed version of the algorithm should focus on tackling huge amounts
of data, which is impossible for a traditional in-memory implementation.
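That sequential structure is visible in the basic GBDT loop: stage t fits a new tree to the residuals left by stages 0..t-1 (the negative gradients, for squared loss), so stages cannot run in parallel; only the work inside a stage can. A toy version with depth-1 stumps, illustrative only:

```python
# Toy gradient boosting with decision stumps on squared loss.
# Stages are inherently sequential: stage t needs the residuals
# left by stages 0..t-1.  Parallelism has to live inside fit_stump.

def fit_stump(xs, residuals):
    """Best single-threshold regressor on the current residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left) +
               sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gbdt(xs, ys, stages=10, lr=0.5):
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(stages):               # sequential over stages
        residuals = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + lr * tree(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

model = gbdt([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
```

The distributed question is entirely about `fit_stump`: finding the best split over data too large for one machine, every stage, without paying a full HDFS round-trip each time.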

Both Yahoo and Google have articles addressing this problem. Yahoo's method* is
less prominent IMHO. It mainly deals with scalability, i.e. partitioning
the data both horizontally (in the sample dimension) and vertically (in the
feature dimension), but leaves the communication costs unresolved, which
makes it less efficient than a sequential algorithm. As for Google's
method**, I have only skimmed the article and have not found time to read it
in detail. It's generally fancier, presenting the algorithm as well as
addressing several engineering issues.
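To make the two partitioning schemes concrete (a rough sketch, not Yahoo's actual code): horizontal partitioning gives each worker a subset of the samples with all features; vertical gives each worker all samples but a subset of the features.

```python
# Illustration of horizontal (by sample) vs vertical (by feature)
# partitioning of a samples-by-features matrix across workers.

def horizontal(matrix, workers):
    """Each worker gets a slice of the rows, all columns."""
    n = len(matrix)
    step = -(-n // workers)  # ceiling division
    return [matrix[i:i + step] for i in range(0, n, step)]

def vertical(matrix, workers):
    """Each worker gets all rows, a slice of the columns."""
    m = len(matrix[0])
    step = -(-m // workers)
    return [[row[j:j + step] for row in matrix]
            for j in range(0, m, step)]

data = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
print(horizontal(data, 2))  # → [[[1, 2, 3, 4]], [[5, 6, 7, 8]]]
print(vertical(data, 2))    # → [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
```

With vertical partitioning each worker can evaluate splits for its own features locally, but the chosen split's sample partition still has to be broadcast to everyone, which is where the communication cost lands.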

I don't know how to start, or propose to start, the implementation of a
particular algorithm, but I recommend this one. I'm happy to assist on this
project.

* Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009.
** PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce.
VLDB 2009.

Thanks,
-BaiGang

On Tue, Mar 22, 2011 at 10:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I don't know of anybody currently planning to implement GBDTs.  A scalable
> implementation would probably be a good thing to have in Mahout.
>
> Decision tree methods can, however, be difficult to implement in a scalable
> way.
>
> How do you propose to do so?
>
>
> On Tue, Mar 22, 2011 at 6:24 AM, Bai, Gang <dev@baigang.net> wrote:
>
>> Hi all,
>>
>> Is there any plan to implement Gradient Boosted Decision Trees in Mahout?
>> I think it's highly anticipated because of its extensive application in
>> web data mining.
>>
>> Best regards,
>> -BaiGang
>>
>
>



