From: Erick Erickson
To: java-user@lucene.apache.org
Subject: Re: easy way to figure out most common tokens?
Date: Wed, 15 Aug 2012 12:29:34 -0600
In-Reply-To: <502BE07D.2050003@gmail.com>

I don't see how you could without indexing everything first, since you
can't know what the most frequent terms are until you've processed all
your documents.

If you know these terms in advance, it seems like you could just call
them stopwords and use the common stopword processing.

If you have to examine your corpus in the first place, it seems like you
could do something with term frequencies to extract the most common terms
from your index, then re-index all your data with those terms as stopwords.

Best,
Erick

On Wed, Aug 15, 2012 at 11:46 AM, Shaya Potter wrote:
> Is there an easy way to figure out the most common tokens and then remove
> those tokens from the documents?
>
> Use case: imagine one is indexing a mailing list (such as this java-user)
> and is extracting all e-mail addresses in the messages and adding them to a
> doc.
>
> What that means is that there will be a lot of
>
> java-user-unsubscribe@lucene.apache.org
> java-user-help@lucene.apache.org
>
> due to that being in the signature of each email.
>
> While the best approach might be to not put it in the index in the first
> place, I'm wondering if there's a good way to process the index after the
> fact to remove these types of entries.
>
> thanks.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
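
A rough sketch of the two-pass idea Erick describes in this message (find the most common terms, then treat them as stopwords on re-index), written as plain Java rather than against the Lucene API. The class name CommonTokens, both method names, and the minFraction cutoff are made up for illustration. In a real setup the counts would presumably come from the index's term statistics and the filtering from an analyzer's stopword set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: finds tokens that occur in a
// large share of documents so they can be treated as stopwords.
final class CommonTokens {

    // Pass 1: return every token whose document frequency (the number of
    // documents containing it at least once) is >= minFraction * docs.size().
    static Set<String> findCommon(List<List<String>> docs, double minFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so each token counts once per document.
            for (String token : new HashSet<>(doc)) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        double threshold = minFraction * docs.size();
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() >= threshold) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    // Pass 2: drop the common tokens from one document, as a stopword
    // filter would when the data is re-indexed.
    static List<String> strip(List<String> doc, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : doc) {
            if (!stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Choosing the cutoff is a judgment call: set it too low and frequent real correspondents get dropped along with the signature addresses.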