Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm
Reply-To: mahout-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <3E300DFB-FED0-4752-98A8-33B0EC3D21B2@apache.org>
References: <87c998321001021751g275d5c0axa366e84116535849@mail.gmail.com> <0280C025-D0F1-4AC1-9330-3A38C7573796@apache.org> <3E300DFB-FED0-4752-98A8-33B0EC3D21B2@apache.org>
Date: Sun, 3 Jan 2010 16:08:20 +0200
Subject: Re: Stopwords not working as expected
From: Bogdan Vatkov
To: mahout-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=00504502cd6515069c047c4323ae

--00504502cd6515069c047c4323ae
Content-Type: text/plain; charset=ISO-8859-1

Yesterday I had issues mapping cluster results to dictionary entries: it
turned out I was using a different dictionary, so the resulting clusters
showed really strange results. Once I fixed all the commands, input/output
files, etc., I got very good results from a clustering point of view (the
clusters are quite correct given the input documents). Unfortunately, the
clusters consist mostly of words that I would like to stop, words that I
placed in stopwords.txt in Solr (and then re-indexed, restarted Solr, etc.).

Where do you suggest I debug the vector creation? It seems Solr respects the
stopwords, but the vector creation (and then the clustering) does not.

On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll wrote:

>
> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
>
> > I have a stopwords.txt file with 1200+ words. I did not understand the
> > point about stemming - do you mean my stopwords are somehow ignored due
> > to some stemming?
>
> No, stopword removal happens before stemming, so it is possible that a word
> that was not stopped was then stemmed to a stopword.
>
> I thought you said yesterday you got it straightened out.
>
> >
> > On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll wrote:
> >
> >> Are you sure you have stopwords and it is not the result of stemming some
> >> other word?
> >>
> >> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
> >>
> >>> my Solr config is like the default one:
> >>>
> >>> stored="true"/>
> >>>
> >>> positionIncrementGap="100">
> >>>
> >>>
> >>> ignoreCase="true"
> >>> words="stopwords.txt"
> >>> enablePositionIncrements="true"
> >>> />
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>
> >>> language="English"
> >>> protected="protwords.txt"/>
> >>>
> >>>
> >>>
> >>> ignoreCase="true" expand="true"/>
> >>> ignoreCase="true"
> >>> words="stopwords.txt"
> >>> enablePositionIncrements="true"
> >>> />
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>
> >>> language="English"
> >>> protected="protwords.txt"/>
> >>>
> >>>
> >>
> >>
> >
> >
> > --
> > Best regards,
> > Bogdan
>

--
Best regards,
Bogdan

--00504502cd6515069c047c4323ae--
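
The ordering issue raised in the thread (stopword removal runs before stemming, so a word that passes the stop filter can still be stemmed into a form that is on the stopword list) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: toy_stem is a hypothetical suffix stripper, not the stemmer Solr uses, and the stopword set and sample text are made up. The leaked_stopwords helper is one possible way to check the terms used for vector creation against the stopword list.

```python
def toy_stem(token):
    """Hypothetical stemmer: strips a few English suffixes (not Solr's stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, stopwords):
    """Lowercase, drop stopwords, then stem: stop filter first, stemmer second."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in stopwords]
    return [toy_stem(t) for t in kept]


def leaked_stopwords(terms, stopwords):
    """Return the analyzed terms that still match an entry in the stopword list."""
    return sorted({t for t in terms if t in stopwords})


# Made-up stopword list and input text, for illustration only.
stopwords = {"the", "regard", "best"}
terms = analyze("Best regards the cluster results", stopwords)
print(terms)
print(leaked_stopwords(terms, stopwords))
```

In this sketch "regards" is not in the stopword list, so it survives the stop filter and is then stemmed to "regard", which is in the list. If a check like this over the real dictionary reports stopwords that cannot be explained by stemming, that would suggest the stopword list is not being applied during vector creation at all.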