Date: Tue, 16 Feb 2010 10:03:07 -0800
Subject: Re: n-gram over-representation?
From: Jake Mannix
To: mahout-user@lucene.apache.org

Drew,

Did you pick your whitelist using the LLR score? What kind of over-representation are you trying to prune out? DF will certainly help you remove "too common" bigrams, but that's not what you're looking for, is it?

-jake

On Feb 16, 2010 8:29 AM, "Drew Farris" wrote:

I have a collection of about 800k bigrams from a corpus of 3.7m documents that I'm in the process of working with. I'm looking to determine an appropriate subset of these to use as features for both an ML and an IR application. Specifically, I'm considering white-listing a subset of these to use as features when building a classifier and, separately, as terms when building an index and doing query parsing.

As part of the earlier collocation discussion, Ted mentioned that tests for over-representation could be used to identify dubious members of such a set. Does anyone have any pointers to discussions of how such a test could be implemented?

Thanks,
Drew