MIME-Version: 1.0
In-Reply-To: <32D1486C-DCB0-4593-8ECE-BE6F5CECE012@apache.org>
References: <56747AB3-8E9C-4B77-A610-100CBC8F0737@apache.org> <32D1486C-DCB0-4593-8ECE-BE6F5CECE012@apache.org>
Date: Sat, 2 Jan 2010 18:34:53 +0200
Subject: Re: Stopwords work for Solr but not for Mahout
From: Bogdan Vatkov
To: mahout-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=00504502cd65654d0f047c31114c

--00504502cd65654d0f047c31114c
Content-Type: text/plain; charset=ISO-8859-1

I re-indexed, but I cannot find a way to use the VectorDumper with the
dictionary. I am using Mahout 0.2 rather than the latest trunk code, because
trunk was not compiling and I had to fall back to the older code.

On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll wrote:

> I assume you re-indexed and you used the VectorDumper (along with the
> dictionary) to dump out the Vectors that were converted and verified no stop
> words?
>
> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>
> > this is my Solr config:
> >
> > stored="true"/>
> >
> > and the type text is as configured by default:
> >
> > positionIncrementGap="100">
> >
> > ignoreCase="true"
> > words="stopwords.txt"
> > enablePositionIncrements="true"
> > />
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >
> > language="English"
> > protected="protwords.txt"/>
> >
> > ignoreCase="true" expand="true"/>
> > ignoreCase="true"
> > words="stopwords.txt"
> > enablePositionIncrements="true"
> > />
> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >
> > language="English"
> > protected="protwords.txt"/>
> >
> > and I have entered quite a few stopwords in the stopwords.txt file
> >
> > my SolrToMahout.sh file:
> >
> > #!/bin/bash
> > set -x
> > cd /store/dev/inst/mahout-0.2
> > java -classpath \
> >   /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
> >   org.apache.mahout.utils.vectors.lucene.Driver \
> >   --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
> >   --field msg_body \
> >   --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >
> > Best regards,
> > Bogdan
> >
> > On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll wrote:
> >
> >> What do the relevant pieces of your Solr setup look like and how are you
> >> invoking the Lucene driver?
> >>
> >> -Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

--
Bogdan Vatkov
email: bogdan.vatkov@gmail.com

--00504502cd65654d0f047c31114c--
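
If VectorDumper is not usable, the check Grant asks about could perhaps be approximated by comparing the dictionary written by --dictOut with stopwords.txt directly. The following is only a rough sketch under stated assumptions, not a Mahout or Solr API: it assumes the dictionary is a tab-delimited text file with the term in the first column, that stopwords.txt holds one word per line with '#' comment lines, and that both files are UTF-8. All function names here (load_stopwords, load_dictionary_terms, leaked_stopwords, check_files) are hypothetical helpers.

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Parse stop-word lines: one word per line, '#' lines are comments.

    Words are lowercased because the quoted config sets ignoreCase="true".
    """
    words = set()
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            words.add(word.lower())
    return words


def load_dictionary_terms(lines: Iterable[str], delimiter: str = "\t") -> List[str]:
    """Return the first delimited column of each non-empty, non-'#' line.

    ASSUMPTION: the dictionary file keeps the term in the first column.
    Any other leading line (for example a count) is kept as if it were a term.
    """
    terms = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        terms.append(line.split(delimiter, 1)[0].strip())
    return terms


def leaked_stopwords(terms: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Return the sorted, de-duplicated terms that match a stop word (case-insensitive)."""
    return sorted({term for term in terms if term.lower() in stopwords})


def check_files(dictionary_path: str, stopwords_path: str) -> List[str]:
    """Load both files (assumed UTF-8) and return dictionary terms that are stop words."""
    with open(stopwords_path, encoding="utf-8") as handle:
        stopwords = load_stopwords(handle)
    with open(dictionary_path, encoding="utf-8") as handle:
        terms = load_dictionary_terms(handle)
    return leaked_stopwords(terms, stopwords)
```

Calling check_files with the --dictOut path from the script above and the path to stopwords.txt would return the overlapping terms; an empty list suggests the stop words were removed. Since the quoted analyzer chain applies the stemmer after stop-word removal, a stemmed term may coincide with a stop word, so a hit is a hint to investigate rather than proof that filtering failed.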