Date: Thu, 7 Oct 2010 05:13:01 +0100
Subject: Re: default tree builder
From: deneche abdelhakim
To: dev@mahout.apache.org

> Does this refer to the random forest stuff?

Yes, and the DefaultTreeBuilder is used by all implementations
(sequential, InMem and Partial).

> Andrey has a question about the Random Forest code. (hurray for him for
> being the first outsider to beat on it)

Hurray =D

I will make the fix and create a patch as soon as possible. I wonder if
Andrey could give us the data that led to this bug, just to make sure it's
not another hidden bug that's causing the infinite recursion.

On Wed, Oct 6, 2010 at 10:22 PM, Ted Dunning wrote:
> Andrey has a question about the Random Forest code. (hurray for him for
> being the first outsider to beat on it)
>
> This looks like a simple and fairly obvious fix, but I can't afford enough
> attention to verify.
>
> Deneche?
>
> ---------- Forwarded message ----------
> From: Andrey Gusev
> Date: Wed, Oct 6, 2010 at 1:41 PM
> Subject: default tree builder
> To: Ted Dunning
>
>
> Hi Ted,
>
> Unrelated to KCCA discussion. I have some pretty positive results with
> bagging of DefaultTreeBuilder-id3 like implementation. However, there
> seems to be an issue with building possibly running into infinite recursion,
> i.e. stack overflow.
> Basically when you have few features, it is possible
> that all results left to split cannot be split according to any attribute,
> so in my cases line 107 of DefaultTreeBuilder keeps being called with an
> identical set. More precisely, loSubset is empty and hiSubset contains a few
> instances which cannot be split. I modified this into LimitDepthTreeBuilder
> with something like:
>
>
>    public Node build(Random rng, Data data) {
>        return inner_build(rng, data, 0);
>    }
>
>    public Node inner_build(Random rng, Data data, int depth) {
>
>        if (selected == null) {
>            selected = new boolean[data.getDataset().nbAttributes()];
>        }
>
>        if (data.isEmpty()) {
>            return new Leaf(-1);
>        }
>        if (isIdentical(data)) {
>            return new Leaf(data.majorityLabel(rng));
>        }
>        if (data.identicalLabel()) {
>            return new Leaf(data.get(0).label);
>        }
>        // SFDC change to prevent stack overflow
>        if (depth > MAX_DEPTH) {
>            return new Leaf(data.majorityLabel(rng));
>        }
> ...
>
> }
>
> I know mahout-0.4 is pretty close to freeze (or past it). But this is a simple
> enough change and with a high enough MAX_DEPTH (say 100) should not have much
> functional impact unless there is actually infinite recursion. Or it could
> default to not use it but have a setter for that. Let me know what you
> think.
>
> Thanks,
>
> Andrey
>
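
For readers who want to see the failure mode in isolation, here is a minimal, self-contained Java sketch of the situation described in the thread. It is not Mahout code: the class DepthLimitSketch, its Node type, the mean-threshold split rule and the classify helper are all hypothetical stand-ins, and MAX_DEPTH = 100 is just the value suggested above. Rows that share the same feature value but carry different labels always land in the "hi" subset, so without the depth check the recursion would never end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only; none of these types are Mahout classes. */
final class DepthLimitSketch {
    static final int MAX_DEPTH = 100; // assumed cap, the value suggested in the thread

    /** Hypothetical tree node: either a leaf with a label or a split with two children. */
    static final class Node {
        final double threshold; // split point (meaningless in leaves)
        final int label;        // prediction (meaningful in leaves only)
        final Node lo;
        final Node hi;

        Node(int label) { this(Double.NaN, label, null, null); }
        Node(double threshold, Node lo, Node hi) { this(threshold, -1, lo, hi); }
        private Node(double threshold, int label, Node lo, Node hi) {
            this.threshold = threshold;
            this.label = label;
            this.lo = lo;
            this.hi = hi;
        }
        boolean isLeaf() { return lo == null && hi == null; }
    }

    /** Each row is {feature, label}. Splits on the mean feature value (an assumed rule). */
    static Node build(List<int[]> rows, int depth) {
        if (rows.isEmpty()) {
            return new Node(-1);
        }
        if (identicalLabel(rows)) {
            return new Node(rows.get(0)[1]);
        }
        // Depth guard: without it, unsplittable rows recurse until the stack overflows.
        if (depth > MAX_DEPTH) {
            return new Node(majorityLabel(rows));
        }
        double mean = 0;
        for (int[] r : rows) {
            mean += r[0];
        }
        mean /= rows.size();

        List<int[]> lo = new ArrayList<>();
        List<int[]> hi = new ArrayList<>();
        for (int[] r : rows) {
            (r[0] < mean ? lo : hi).add(r); // identical features: lo stays empty
        }
        return new Node(mean, build(lo, depth + 1), build(hi, depth + 1));
    }

    static boolean identicalLabel(List<int[]> rows) {
        for (int[] r : rows) {
            if (r[1] != rows.get(0)[1]) {
                return false;
            }
        }
        return true;
    }

    /** Most frequent label; ties go to the smallest label (the thread's code uses an rng). */
    static int majorityLabel(List<int[]> rows) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] r : rows) {
            counts.merge(r[1], 1, Integer::sum);
        }
        int best = -1;
        int bestCount = -1;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount || (e.getValue() == bestCount && e.getKey() < best)) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Walks from the root to a leaf and returns that leaf's label. */
    static int classify(Node node, int feature) {
        while (!node.isLeaf()) {
            node = feature < node.threshold ? node.lo : node.hi;
        }
        return node.label;
    }

    public static void main(String[] args) {
        // Same feature value, conflicting labels: no split can separate these rows.
        List<int[]> rows = Arrays.asList(new int[]{5, 0}, new int[]{5, 1}, new int[]{5, 1});
        System.out.println(classify(build(rows, 0), 5));
    }
}
```

In this sketch the guard only fires on data that cannot be split, which matches the point made above that a high enough cap should have little functional impact otherwise.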