Subject: Re: Token filtering and LDA quality
From: John Conwell <turbocodr@gmail.com>
To: user@mahout.apache.org
Date: Tue, 24 Jan 2012 15:41:33 -0800
Hey Jake,

Thanks for the tips. That will definitely help. One more question: do you
know if topic model quality will be affected by document length? I'm
thinking of lengths ranging from tweets (~20 words), to emails (hundreds of
words), to whitepapers (thousands of words), to books (boatloads of words).
Roughly what lengths would degrade topic model quality? I would think
tweets would be pretty poor, but what about longer docs? Should they be
segmented into sub-documents?

Thanks,
JohnC

On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix wrote:

> Hi John,
>
> I'm not an expert in the field, but I have done a bit of work building
> topic models with LDA, and here are some of the "tricks" I've used:
>
> 1) Yes, remove stop words; in fact, remove all words occurring in more
> than (say) half (or, more conservatively, 90%) of your documents, as
> they'll be noise and will just dominate your topics.
>
> 2) More features is better, if you have the memory for it (note that
> Mahout's LDA currently holds numTopics * numFeatures in memory in the
> mapper tasks, which means you are usually bounded to a few hundred
> thousand features, maybe up as high as a million, currently). So don't
> stem, and throw in commonly occurring (or, more importantly,
> high-log-likelihood) bigrams and trigrams as independent features.
>
> 3) Violate the underlying assumption of LDA that you're modeling "token
> occurrences": weight your vectors not as "tf" but as "tf*idf", which
> makes rarer features more prominent and ends up making your topics look
> a lot nicer.
>
> Those are the main tricks I can think of right now.
>
> If you're using Mahout trunk, try the new LDA impl:
>
>   $MAHOUT_HOME/bin/mahout cvb0 --help
>
> It operates on the same kind of input as the last one (i.e. a corpus
> which is a SequenceFile).
>
> -jake
>
> On Tue, Jan 24, 2012 at 12:14 PM, John Conwell wrote:
>
> > I'm trying to find out if there are any standard best practices for
> > document tokenization when prepping your data for LDA, in order to get
> > a higher-quality topic model, and to understand how the feature space
> > affects topic model quality.
> >
> > For example, will the topic model be "better" with a richer feature
> > space from not stemming terms, or is it better to have a more
> > normalized feature space by applying stemming?
> >
> > Is it better to filter out stop words, or keep them in?
> >
> > Is it better to include bi- and/or tri-grams of highly correlated
> > terms in the feature space?
> >
> > In essence, what characteristics of the feature space that LDA uses
> > for input will create a higher-quality topic model?
> >
> > Thanks,
> > JohnC

-- 
Thanks,
John C
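For the archive: the three tricks quoted above (document-frequency
filtering, n-gram features, tf*idf weighting) can be sketched in a few
lines of plain Python. This is a toy illustration, not Mahout's
implementation: the function names are made up, and it adds all bigrams
rather than selecting only high-log-likelihood ones as Mahout's
seq2sparse would.

```python
# Toy sketch of the three preprocessing tricks (NOT Mahout's code):
# drop terms that occur in too many documents, add bigrams as extra
# features, and weight the remaining features by tf*idf.
import math
from collections import Counter

def tokenize(doc, max_ngram=2):
    """Lowercase unigrams plus bigrams joined with '_' (made-up scheme)."""
    words = doc.lower().split()
    tokens = list(words)
    if max_ngram >= 2:
        tokens += ["_".join(pair) for pair in zip(words, words[1:])]
    return tokens

def build_tfidf(corpus, max_df_fraction=0.5):
    """Return one tf*idf-weighted Counter (sparse vector) per document."""
    token_lists = [tokenize(d) for d in corpus]
    n_docs = len(corpus)
    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Trick 1: drop near-stop-words occurring in > max_df_fraction of docs.
    vocab = {t for t, c in df.items() if c / n_docs <= max_df_fraction}
    vectors = []
    for tokens in token_lists:
        tf = Counter(t for t in tokens if t in vocab)
        # Trick 3: tf*idf instead of raw counts (one common idf variant).
        vectors.append(
            Counter({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
        )
    return vectors

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vecs = build_tfidf(docs)
```

With three documents and max_df_fraction=0.5, "the" (in all three docs)
and "sat" (in two) are filtered out, while document-specific features like
"mat" or the bigram "cat_chased" survive with weight tf * ln(3/df).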