Date: Wed, 17 Jun 2009 10:45:30 +0000 (GMT)
From: deneche abdelhakim
Subject: [GSOC] Thoughts about Random forests map-reduce implementation
To: mahout-dev@lucene.apache.org

As we talked about in the discussion referenced below (A), I'm considering two ways to implement a distributed map-reduce builder.

Starting from the reference implementation, the easiest approach is the following (a rough sketch of the mapper side follows the list):

* the data is distributed to the slave nodes using the DistributedCache
* each mapper loads the data into memory in JobConfigurable.configure()
* each tree is built by one mapper
* the job doesn't really need any input data, so it may be possible to implement our own InputFormat that generates InputSplits from the configuration parameters (the number of trees)
* the mapper uses DecisionTree.toString() to output the tree as a String
* the main program rebuilds the forest by calling DecisionTree.parse(String) on each tree
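To make the idea concrete, here is a minimal sketch of the mapper side, written against the old org.apache.hadoop.mapred API. The Data, DataLoader and DecisionTree classes and the configuration key below are placeholders standing in for the reference implementation's data loading and tree-building code; the exact names and signatures are my assumptions, not the actual Mahout API:

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * One mapper = one tree. The whole training set is shipped to every
 * slave node through the DistributedCache and loaded once in configure().
 * The input key identifies the tree to build; the value is unused.
 */
public class TreeBuildingMapper extends MapReduceBase
    implements Mapper<LongWritable, NullWritable, LongWritable, Text> {

  private Data data;   // placeholder type from the reference implementation
  private long seed;

  @Override
  public void configure(JobConf job) {
    try {
      // the dataset file was added to the cache by the driver
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      data = DataLoader.load(cached[0]);           // assumed helper
      seed = job.getLong("mahout.rf.seed", 1L);    // assumed parameter name
    } catch (IOException e) {
      throw new IllegalStateException("cannot load the cached dataset", e);
    }
  }

  @Override
  public void map(LongWritable treeId, NullWritable ignored,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    // build a single tree on a bootstrap sample, then serialize it as text
    DecisionTree tree = DecisionTree.build(data, seed + treeId.get());
    output.collect(treeId, new Text(tree.toString()));
  }
}

The driver would add the dataset to the DistributedCache, set the number of trees, and once the job finishes read the map output back and call DecisionTree.parse(String) on each line to assemble the forest.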
Pros:
* it is the easiest implementation because the reference code can be used as-is, and the distributed implementation automatically benefits from future optimizations of the reference code

Cons:
* because it is based on the reference implementation, it will be very slow on large datasets. For example, with half of the KDD 99 dataset (a file of about 350 MB), building a single tree took more than 12 hours (in fact I stopped the program after 12 hours) on a Core 2 2 GHz laptop with 3 GB of RAM!
* if the slave nodes have many cores, it would be interesting to run parallel mappers on every slave node to exploit multi-threading. But since I haven't found a way to share an in-memory variable between mappers, each mapper must load its own copy of the data into memory!

Important:
* the memory usage of the reference implementation will probably change once the Information Gain computation is optimized; in that case an out-of-core approach could become viable (a sketch of the naive IG computation is given below)

So I'm asking, what should I do next?
* go on with this implementation
* optimize the IG computation
* consider the "Distributed Implementation B" (see (A)), which handles BIG datasets and should not need any IG optimization, but whose code has nearly nothing in common with the reference implementation

(A) http://markmail.org/message/mgap2nuhnl4kokeu
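For reference, this is roughly what the "IG computation" looks like in a naive form: the gain of a candidate split is the entropy of the parent partition minus the weighted entropy of its children. This is only an illustration of where the time presumably goes (re-deriving these class counts for every candidate split), not the reference implementation's actual code:

/** Information gain of a candidate split, given the class counts of the
 *  parent partition and of each child partition. */
static double informationGain(int[] parentCounts, int[][] childCounts) {
  int total = sum(parentCounts);
  double gain = entropy(parentCounts);
  for (int[] child : childCounts) {
    gain -= (double) sum(child) / total * entropy(child);
  }
  return gain;
}

/** Shannon entropy (base 2) of a class-count histogram. */
static double entropy(int[] counts) {
  int total = sum(counts);
  double h = 0.0;
  for (int c : counts) {
    if (c > 0) {
      double p = (double) c / total;
      h -= p * Math.log(p) / Math.log(2);
    }
  }
  return h;
}

static int sum(int[] counts) {
  int s = 0;
  for (int c : counts) {
    s += c;
  }
  return s;
}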