lucene-dev mailing list archives

From Wolfgang Hoschek <whosc...@lbl.gov>
Subject Re: [Performance] Streaming main memory indexing of single strings
Date Wed, 27 Apr 2005 01:47:58 GMT
I've uploaded slightly improved versions of the fast MemoryIndex  
contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585  
along with another contrib - PatternAnalyzer.
  	
For a quick overview without downloading code, there's javadoc for it all at
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux  
project. But from the preliminary discussion on the list some time ago  
I gathered there'd be some wider interest, hence I prepared the  
contribs for the community. What would be the next steps for taking  
this further, if any?

Thanks,
Wolfgang.

/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String
 * rather than a {@link java.io.Reader}, that can flexibly separate on a
 * regular expression {@link Pattern} (with behaviour identical to
 * {@link String#split(String)}), and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 * <p>
 * If you are unsure what exactly a regular expression should look like,
 * consider prototyping by simply trying various expressions on some test
 * texts via {@link String#split(String)}. Once you are satisfied, give that
 * regex to PatternAnalyzer. Also see <a target="_blank"
 * href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular
 * Expression Tutorial</a>.
 * <p>
 * This class can be considerably faster than the "normal" Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain, as in this
 * stemming example:
 * <pre>
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English");
 * </pre>
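
The prototyping workflow described in the javadoc can be tried with nothing but the JDK. The following is a minimal sketch, not the PatternAnalyzer implementation: the class name SplitTokenizerSketch, the tokenize helper, the separator regex \W+ and the stop-word set are all assumptions chosen for illustration. It first tries a regex with String.split, then applies the same regex as a compiled Pattern while lower-casing and dropping stop words in a single pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: a JDK-only stand-in, not the contributed PatternAnalyzer.
public class SplitTokenizerSketch {

    /**
     * Splits text on the separator pattern, lower-cases each piece, and
     * drops empty pieces and stop words, all in one loop.
     */
    static List<String> tokenize(String text, Pattern separator, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : separator.split(text)) {
            String token = raw.toLowerCase(Locale.ROOT);
            // A leading separator yields an empty first piece; skip it.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "James is running round in the woods";

        // Step 1: prototype the regex directly with String.split.
        System.out.println(Arrays.toString(text.split("\\W+")));

        // Step 2: reuse the same regex as a compiled Pattern.
        // The stop-word set here is a hypothetical example.
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "in", "the"));
        System.out.println(tokenize(text, Pattern.compile("\\W+"), stopWords));
    }
}
```

Compiling the Pattern once and reusing it avoids recompiling the regex for every string, which String.split(String) may otherwise do on each call.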



On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:

> I've now got the contrib code cleaned up, tested and documented into a  
> decent state, ready for your review and comments.
> Consider this a formal contrib (Apache license is attached).
>
> The relevant files are attached to the following bug ID:
>
> 	http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
>
> For a quick overview without downloading code, there's some javadoc at
> http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html
>
> There are several small open issues listed in the javadoc and also  
> inside the code. Thoughts? Comments?
>
> I've also got small performance patches for various parts of Lucene  
> core (not submitted yet). Taken together they lead to substantially  
> improved performance for MemoryIndex, and most likely also for Lucene  
> in general. Some of them are more involved than others. I'm now  
> figuring out how much performance each of these contributes and how to  
> propose potential integration - stay tuned for some follow-ups to  
> this.
>
> The code as submitted would certainly benefit a lot from said patches,  
> but they are not required for correct operation. It should work out of  
> the box (currently only on 1.4.3 or lower). Try running
>
> 	cd lucene-cvs
> 	java org.apache.lucene.index.memory.MemoryIndexTest
>
> with or without custom arguments to see it in action.
>
> Before turning to a performance patch discussion, at this point I'd
> be most interested in folks giving it a spin, comments on the
> API, or any other issues.
>
> Cheers,
> Wolfgang.
>
> On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
>
>> On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
>>
>>>
>>> On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
>>>> By the way, by now I have a version against 1.4.3 that is 10-100  
>>>> times faster (i.e. 30000 - 200000 index+query steps/sec) than the  
>>>> simplistic RAMDirectory approach, depending on the nature of the  
>>>> input data and query. From some preliminary testing it returns  
>>>> exactly what RAMDirectory returns.
>>>
>>> Awesome.  Using the basic StringIndexReader I sent?
>>
>> Yep, it's loosely based on the empty skeleton you sent.
>>
>>>
>>> I've been fiddling with it a bit more to get other query types.   
>>> I'll add it to the contrib area when it's a bit more robust.
>>
>> Perhaps we could merge up once I'm ready and put that into the  
>> contrib area? My version now supports tokenization with any analyzer  
>> and it supports any arbitrary Lucene query. I might make the API for  
>> adding terms a little more general, perhaps allowing arbitrary  
>> Document objects if that's what other folks really need...
>>
>>>
>>>> As an aside, is there any work going on to potentially support  
>>>> prefix (and infix) wild card queries ala "*fish"?
>>>
>>> WildcardQuery supports wildcard characters anywhere in the string.   
>>> QueryParser itself restricts expressions that have leading wildcards  
>>> from being accepted.
>>
>> Any particular reason for this restriction? Is this simply a current  
>> parser limitation or something inherent?
>>
>>> QueryParser supports wildcard characters in the middle of strings no  
>>> problem though.  Are you seeing otherwise?
>>
>> I meant an infix query such as "*fish*"
>>
>> Wolfgang.
>>
>>
>> -----------------------------------------------------------------------
>> Wolfgang Hoschek                  |   email: whoschek@lbl.gov
>> Distributed Systems Department    |   phone: (415)-533-7610
>> Berkeley Laboratory               |   http://dsd.lbl.gov/~hoschek/
>> -----------------------------------------------------------------------
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>



