lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: [Performance] Streaming main memory indexing of single strings
Date Wed, 27 Apr 2005 02:08:33 GMT
Wolfgang,

You have provided a superb set of patches!  I'm in awe of the extensive  
documentation you've done.

There is nothing further you need to do, but be patient while we  
incorporate it into the contrib area somewhere.  Your PatternAnalyzer  
could fit into the contrib/analyzers area nicely.  I'm not quite sure  
where to put MemoryIndex - maybe it deserves to stand on its own in a  
new contrib area?  Or does it make sense to put this into misc (still  
in sandbox/misc)?  Or where?

	Erik

On Apr 26, 2005, at 9:47 PM, Wolfgang Hoschek wrote:

> I've uploaded slightly improved versions of the fast MemoryIndex  
> contribution to  
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with  
> another contrib - PatternAnalyzer.
>  	
> For a quick overview without downloading code, there's javadoc for it all at
> http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html
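For readers who haven't opened the attachment yet, basic MemoryIndex usage looks roughly like the following sketch, based on the javadoc above (the exact signatures may still shift while the contrib is being finalized):

```java
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MemoryIndexSketch {
    public static void main(String[] args) {
        // Index a single string into a transient, heap-only index.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "James is running round in the woods",
                new SimpleAnalyzer());

        // Query it directly; search() returns a relevance score,
        // 0.0f meaning no match.
        Query query = new TermQuery(new Term("content", "running"));
        float score = index.search(query);
        System.out.println(score > 0.0f ? "match" : "no match");
    }
}
```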
>
> I'm happy to maintain these classes externally as part of the Nux  
> project. But from the preliminary discussion on the list some time ago  
> I gathered there'd be some wider interest, hence I prepared the  
> contribs for the community. What would be the next steps for taking  
> this further, if any?
>
> Thanks,
> Wolfgang.
>
> /**
>  * Efficient Lucene analyzer/tokenizer that preferably operates on a String
>  * rather than a {@link java.io.Reader}, that can flexibly separate on a
>  * regular expression {@link Pattern}
>  * (with behaviour identical to {@link String#split(String)}),
>  * and that combines the functionality of
>  * {@link org.apache.lucene.analysis.LetterTokenizer},
>  * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
>  * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
>  * {@link org.apache.lucene.analysis.StopFilter} into a single  
> efficient
>  * multi-purpose class.
>  * <p>
>  * If you are unsure what exactly a regular expression should look like,
>  * consider prototyping by simply trying various expressions on some test
>  * texts via {@link String#split(String)}. Once you are satisfied, give that
>  * regex to PatternAnalyzer. Also see <a target="_blank"
>  * href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular
>  * Expression Tutorial</a>.
>  * <p>
>  * This class can be considerably faster than the "normal" Lucene tokenizers.
>  * It can also serve as a building block in a compound Lucene
>  * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in
>  * this stemming example:
>  * <pre>
>  * PatternAnalyzer pat = ...
>  * TokenStream tokenStream = new SnowballFilter(
>  *     pat.tokenStream("content", "James is running round in the woods"),
>  *     "English");
>  * </pre>
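The prototyping advice in the javadoc can be tried without Lucene at all, since plain String#split uses the same pattern semantics PatternAnalyzer will apply when tokenizing:

```java
public class SplitPrototype {
    public static void main(String[] args) {
        // Try a candidate pattern on a test text first: split on runs
        // of whitespace, then lowercase each token.
        String text = "James is running round in the woods";
        String[] tokens = text.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i].toLowerCase());
        }
        // Once the pattern behaves as desired, hand the same regex
        // to PatternAnalyzer.
    }
}
```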
>
>
>
> On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:
>
>> I've now got the contrib code cleaned up, tested and documented into  
>> a decent state, ready for your review and comments.
>> Consider this a formal contrib (Apache license is attached).
>>
>> The relevant files are attached to the following bug ID:
>>
>> 	http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
>>
>> For a quick overview without downloading code, there's some javadoc at
>> http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html
>>
>> There are several small open issues listed in the javadoc and also  
>> inside the code. Thoughts? Comments?
>>
>> I've also got small performance patches for various parts of Lucene  
>> core (not submitted yet). Taken together they lead to substantially  
>> improved performance for MemoryIndex, and most likely also for Lucene  
>> in general. Some of them are more involved than others. I'm now  
>> figuring out how much performance each of these contributes and how  
>> to propose potential integration - stay tuned for some follow-ups to  
>> this.
>>
>> The code as submitted would certainly benefit a lot from said  
>> patches, but they are not required for correct operation. It should  
>> work out of the box (currently only on 1.4.3 or lower). Try running
>>
>> 	cd lucene-cvs
>> 	java org.apache.lucene.index.memory.MemoryIndexTest
>>
>> with or without custom arguments to see it in action.
>>
>> Before turning to a performance patch discussion I'd at this point rather
>> be most interested in folks giving it a spin, comments on the API, or any
>> other issues.
>>
>> Cheers,
>> Wolfgang.
>>
>> On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
>>
>>> On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
>>>
>>>>
>>>> On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
>>>>> By the way, by now I have a version against 1.4.3 that is 10-100  
>>>>> times faster (i.e. 30000 - 200000 index+query steps/sec) than the  
>>>>> simplistic RAMDirectory approach, depending on the nature of the  
>>>>> input data and query. From some preliminary testing it returns  
>>>>> exactly what RAMDirectory returns.
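For context, the "simplistic RAMDirectory approach" being used as the baseline looks roughly like this per index+query step, using the Lucene 1.4 API (a sketch of the baseline being compared against, not the MemoryIndex code itself):

```java
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class RAMDirectoryBaseline {
    public static void main(String[] args) throws Exception {
        // One full index+query cycle: writer setup, segment creation,
        // and searcher setup are the overhead MemoryIndex avoids.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("content", "James is running round in the woods"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("content", "running")));
        System.out.println(hits.length() > 0 ? "match" : "no match");
        searcher.close();
    }
}
```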
>>>>
>>>> Awesome.  Using the basic StringIndexReader I sent?
>>>
>>> Yep, it's loosely based on the empty skeleton you sent.
>>>
>>>>
>>>> I've been fiddling with it a bit more to get other query types.
>>>> I'll add it to the contrib area when it's a bit more robust.
>>>
>>> Perhaps we could merge up once I'm ready and put that into the  
>>> contrib area? My version now supports tokenization with any analyzer  
>>> and it supports any arbitrary Lucene query. I might make the API for  
>>> adding terms a little more general, perhaps allowing arbitrary  
>>> Document objects if that's what other folks really need...
>>>
>>>>
>>>>> As an aside, is there any work going on to potentially support  
>>>>> prefix (and infix) wild card queries ala "*fish"?
>>>>
>>>> WildcardQuery supports wildcard characters anywhere in the string.   
>>>> QueryParser itself restricts expressions that have leading  
>>>> wildcards from being accepted.
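As noted, the restriction lives in QueryParser rather than in WildcardQuery itself, so constructing the query programmatically sidesteps it (a sketch; be aware that a leading wildcard forces a scan over the whole term dictionary, which is why parsers tend to disallow it by default):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcardSketch {
    public static void main(String[] args) {
        // QueryParser rejects "*fish", but WildcardQuery accepts
        // leading wildcards when built directly.
        Query prefix = new WildcardQuery(new Term("content", "*fish"));
        Query infix  = new WildcardQuery(new Term("content", "*fish*"));
        System.out.println(prefix + " / " + infix);
    }
}
```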
>>>
>>> Any particular reason for this restriction? Is this simply a current  
>>> parser limitation or something inherent?
>>>
>>>> QueryParser supports wildcard characters in the middle of strings  
>>>> no problem though.  Are you seeing otherwise?
>>>
>>> I meant an infix query such as "*fish*"
>>>
>>> Wolfgang.
>>>
>>>
>>> -----------------------------------------------------------------------
>>> Wolfgang Hoschek                  |   email: whoschek@lbl.gov
>>> Distributed Systems Department    |   phone: (415)-533-7610
>>> Berkeley Laboratory               |   http://dsd.lbl.gov/~hoschek/
>>> -----------------------------------------------------------------------
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>



