From: Bogdan Vatkov
To: mahout-user@lucene.apache.org
Date: Sat, 2 Jan 2010 20:27:17 +0200
Subject: Re: Stopwords work for Solr but not for Mahout

Thanks for the Luke hint, I will try it out. In the meantime I noticed something else that is very strange. I ran k-means on 23K+ docs with 50 clusters, and the top-term lists of the clusters look odd: for roughly 90% of the top terms I get words that I barely recognize.

I did a quick check on one particular term, which sounded strange anyway and which appears among the top terms of 9 of the 50 clusters, and found that it has "doc freq" = 2 in the Solr dictionary. How is this even possible? With 23,000 docs, how can a term with a doc freq of only 2 end up as a top term in 9 clusters?

I definitely did something wrong. Do you have an idea what that could be?