Date: Thu, 6 Nov 2008 01:13:23 -0200
From: Rafael Cunha de Almeida
To: java-user@lucene.apache.org
Subject: Re: Benchmarking my indexer
Message-Id: <20081106011323.cef6984d.almeidaraf@gmail.com>
In-Reply-To: <20081102210656.a9e33c12.almeidaraf@gmail.com>

On Sun, 2 Nov 2008 21:06:56 -0200
Rafael Cunha de Almeida wrote:

> On Sun, 2 Nov 2008 07:11:20 -0500
> Grant Ingersoll wrote:
>
> > On Nov 1, 2008, at 1:39 AM, Rafael Cunha de Almeida wrote:
> >
> > > Hello,
> > >
> > > I did an indexer that parses some files and indexes them using
> > > lucene. I want to benchmark the whole thing, so I'd like to count
> > > the tokens being indexed so I can calculate the average number of
> > > indexed tokens per second. Is there a way to count the number of
> > > tokens on a document?
> >
> > I think you would have to add a "CountingTokenFilter", that you write
> > and manage as you add documents. Or, you could just take the total #
> > of tokens / by the number of docs and use the average. That can be
> > obtained w/o writing a new TokenFilter.
>
> How would I obtain the total number of tokens on an index?
> I couldn't find that statistic anywhere. I looked for it on the
> IndexWriter, IndexReader and IndexSearcher classes. Is there maybe
> some tool I'd run on an index or something like that?

I'm using PerFieldAnalyzerWrapper, so I tried writing the following
Analyzer to count the tokens:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Analyzer;
    import java.io.Reader;

    public class CounterAnalyzer extends PerFieldAnalyzerWrapper {
        public CounterAnalyzer(Analyzer a) {
            super(a);
        }

        // Wrap whatever stream the per-field analyzer produces.
        public TokenStream tokenStream(String field, Reader reader) {
            return new CounterFilter(super.tokenStream(field, reader));
        }
    }

The CounterFilter is implemented as:

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import java.io.IOException;

    public final class CounterFilter extends TokenFilter {
        public CounterFilter(TokenStream in) {
            super(in);
        }

        // Print one "1" per token that passes through the filter.
        public Token next(Token t) throws IOException {
            assert t != null;
            Token nt = input.next(t);
            if (nt == null)
                return null;
            System.out.println("1");
            return nt;
        }
    }

My idea was to pipe the output to an awk script that would count the
number of 1s. But, to my surprise, the tokenStream method of the
analyzer wasn't even called during indexing.

Could someone explain how I should count the number of tokens in an
index?
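
For the tokens-per-second figure itself, the count could also be kept in the process instead of printing a "1" per token and summing with awk. Below is a minimal sketch of that idea in plain Java with no Lucene dependency. The class name CountingTokens, the method tokensPerSecond, and the use of Iterator<String> as a stand-in for a token stream are all assumptions made for illustration; none of them is Lucene API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical stand-in for a counting token filter: wraps a token
 * source and counts every token that is handed downstream.
 */
final class CountingTokens implements Iterator<String> {
    private final Iterator<String> input;  // upstream token source
    private final AtomicLong counter;      // may be shared by many streams

    CountingTokens(Iterator<String> input, AtomicLong counter) {
        this.input = input;
        this.counter = counter;
    }

    @Override
    public boolean hasNext() {
        return input.hasNext();
    }

    @Override
    public String next() {
        String token = input.next();  // throws NoSuchElementException when exhausted
        counter.incrementAndGet();    // count only tokens actually returned
        return token;
    }

    /** Average throughput; elapsedNanos must be positive. */
    static double tokensPerSecond(long tokens, long elapsedNanos) {
        return tokens / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong();
        long start = System.nanoTime();
        Iterator<String> tokens = new CountingTokens(
                Arrays.asList("benchmarking", "my", "indexer").iterator(), counter);
        while (tokens.hasNext()) {
            tokens.next();  // an indexer would consume the token here
        }
        long elapsed = Math.max(1L, System.nanoTime() - start);
        System.out.println(counter.get() + " tokens, "
                + tokensPerSecond(counter.get(), elapsed) + " tokens/s");
    }
}
```

Applied to the CounterFilter above, the System.out.println("1") call would become an increment of a shared counter, read once after indexing finishes and divided by the elapsed time.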