lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: StandardTokenizer is slowing down highlighting a lot
Date Wed, 25 Jul 2007 11:29:12 GMT
I would be very interested. I have been playing around with ANTLR to see 
if it is any faster than JavaCC, but I haven't seen great gains in my 
simple tests. I had not considered trying JFlex.

I am sure a faster StandardAnalyzer would be greatly appreciated. 
StandardAnalyzer appears to be widely used and horrendously slow. Even 
better would be a StandardAnalyzer whose individual recognizers could be 
enabled or disabled. For example, dropping NUM recognition from the 
current StandardAnalyzer, if you don't need it, gains about 25% in speed.
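
To make the enable/disable idea concrete, here is a rough sketch. It does not use Lucene, JavaCC, or JFlex at all: it is a plain java.util.regex toy, and the class name, the Recognizer enum, and the three patterns are all made up for illustration (they only loosely mimic the EMAIL, NUM and ALPHANUM token types). It says nothing about the actual speed-up, it only shows how switching a recognizer off could change what gets matched.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy tokenizer whose recognizers can be switched on or off (hypothetical, not Lucene code). */
class ConfigurableTokenizer {

    /** Hypothetical token classes; declaration order doubles as match priority. */
    enum Recognizer {
        EMAIL("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+"),
        NUM("[0-9]+(?:[.,][0-9]+)*"),
        ALPHANUM("[A-Za-z0-9]+");

        final String regex;

        Recognizer(String regex) {
            this.regex = regex;
        }
    }

    private final Set<Recognizer> enabled;
    private final Pattern pattern;

    ConfigurableTokenizer(Set<Recognizer> enabled) {
        if (enabled.isEmpty()) {
            throw new IllegalArgumentException("at least one recognizer must be enabled");
        }
        this.enabled = EnumSet.copyOf(enabled);
        // Build one alternation that contains only the enabled recognizers,
        // each wrapped in a named group so the matched type can be reported.
        StringBuilder sb = new StringBuilder();
        for (Recognizer r : this.enabled) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(r.name()).append('>').append(r.regex).append(')');
        }
        this.pattern = Pattern.compile(sb.toString());
    }

    /** Returns tokens as "TYPE:text" strings; unmatched characters are skipped. */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            for (Recognizer r : enabled) {
                if (m.group(r.name()) != null) {
                    tokens.add(r.name() + ":" + m.group());
                    break;
                }
            }
        }
        return tokens;
    }
}
```

With all three recognizers enabled, "3.14" comes out as one NUM token. With NUM left out, for example new ConfigurableTokenizer(EnumSet.of(Recognizer.EMAIL, Recognizer.ALPHANUM)), the same input falls through to ALPHANUM and is split into "3" and "14", and the matcher has one alternative less to try at every position.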

- Mark

Stanislaw Osinski wrote:
>>
>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>> limited by JavaCC speed. You cannot shave much more performance out of
>> the grammar as it is already about as simple as it gets.
>
>
> JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years
> ago :) switched to JFlex, which for roughly the same grammar would sometimes
> be up to 10x (!) faster. You can have a look at our JFlex specification at:
>
> http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup
>
> This one seems more complex than the StandardAnalyzer's but it's much faster
> anyway.
>
> If anyone is interested, I could prepare a JFlex based Analyzer equivalent
> (to the extent possible) to current StandardAnalyzer, which might offer nice
> indexing and highlighting speed-ups.
>
> Best,
>
> Staszek
>
