lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rafael Cunha de Almeida <almeida...@gmail.com>
Subject Re: Benchmarking my indexer
Date Thu, 06 Nov 2008 03:13:23 GMT
On Sun, 2 Nov 2008 21:06:56 -0200
Rafael Cunha de Almeida <almeidaraf@gmail.com> wrote:

> On Sun, 2 Nov 2008 07:11:20 -0500
> Grant Ingersoll <gsingers@apache.org> wrote:
> 
> > 
> > On Nov 1, 2008, at 1:39 AM, Rafael Cunha de Almeida wrote:
> > 
> > > Hello,
> > >
> > > I did an indexer that parses some files and indexes them using  
> > > lucene. I
> > > want to benchmark the whole thing, so I'd like to count the tokens
> > > being indexed so I can calculate the average number of indexed tokens
> > > per second. Is there a way to count the number of tokens on a  
> > > document?
> > 
> > I think you would have to add a "CountingTokenFilter", that you write  
> > and manage as you add documents.  Or, you could just take the total #  
> > of tokens / by the number of docs and use the average.  That can be  
> > obtained w/o writing a new TokenFilter.
> 
> How would I obtain the total number of tokens on an index? I couldn't
> find that statistic anywhere. I looked for it on IndexWritter,
> IndexReader and IndexSearcher classes. Is there maybe some tool I'd run
> on a index or something like that?

I'm using PerFieldAnalyzerWrapper, so I tried writing the following
Analyzer to count the tokens:

	import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
	import org.apache.lucene.analysis.TokenStream;
	import org.apache.lucene.analysis.Analyzer;

	import java.io.Reader;


	public class CounterAnalyzer extends PerFieldAnalyzerWrapper
	{
	    public CounterAnalyzer(Analyzer a)
	    {
		super(a);
	    }

	    public TokenStream tokenStream(String field, Reader reader)
	    {
		return new CounterFilter(super.tokenStream(field,
reader)); }
	}

the CounterFilter is implemented as:
	import org.apache.lucene.document.NumberTools;

	import org.apache.lucene.analysis.TokenFilter;
	import org.apache.lucene.analysis.Token;
	import org.apache.lucene.analysis.TokenStream;


	import java.io.IOException;

	public final class CounterFilter extends TokenFilter
	{
	    public CounterFilter(TokenStream in)
	    {
		super(in);
	    }

	    public Token next(Token t) throws IOException
	    {
		assert t != null;

		Token nt = input.next(t);
		if (nt == null)
		    return null;
		System.out.println("1");
		return nt;
	    }

	}

my idea was to pipe the output to an awk script that would count the
number of 1s. But, to my surprise, the tokenStream method of the
analyzer wasn't even called during the indexing. Could someone instruct
me into how should I count the number of tokens on an index?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message