lucene-dev mailing list archives

From Wolfgang Hoschek <>
Subject Re: [Performance] Streaming main memory indexing of single strings
Date Sat, 16 Apr 2005 17:17:29 GMT
On Apr 16, 2005, at 2:58 AM, Erik Hatcher wrote:

> On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote:
>>> So, all the text analyzed is in a given field... that means that 
>>> anything in the Query not associated with that field has no bearing 
>>> on whether the text matches or not, correct?
>> Right, it has no bearing. A query wouldn't specify any fields, it 
>> just uses the implicit default field name.
> Cool.  My questions regarding how to deal with field names is 
> obviously more an implementation detail under the covers of the 
> match() method than how you want to use it.  In a general sense, 
> though, it's necessary to deal with default field name, queries that 
> have non-default-field terms, and the analysis process.

Right, I'd just like to first assess rough overall efficiency before 
tying up some loose ends.

>> (: An XQuery that finds all books authored by James that have 
>> something to do with "fish", sorted by relevance :)
>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>> declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene 
>> query goes here :)
> Note that "fish*~" is not a valid query expression :)

Perhaps the Lucene QueryParser should throw an exception then. 
Currently 1.4.3 accepts the expression as is without grumbling...

> (I love how XQuery uses smiley emoticons for comments)  BTW, I have a 
> strong vested interest in seeing a fast and scalable XQuery engine in 
> the open source world.  I've toyed with eXist some - it was not stable 
> or scalable enough for my needs.  Lots of Wolfgangs in the XQuery 
> world :)

If you're looking for an XML DB for managing and querying large 
persistent data volumes, Nux/Saxon will disappoint you. If, on the 
other hand, you're looking for a very fast XQuery engine inserted into 
a processing pipeline working with many small to medium sized XML 
documents (such as messages in a scalable message queue or network 
router) then you might be pleased.

>> for $book in /books/book[author="James" and lucene:match(string(.), 
>> $query) > 0.0]
>> let $score := lucene:match(string($book), $query)
>> order by $score descending
>> return (<score>{$score}</score>, $book)
> Could you avoid calling match() twice here?

That's no problem for two reasons:
1) The XQuery optimizer rewrites the query into an optimized expression 
tree eliminating redundancies, etc. If for some reason this isn't 
feasible or legal then
2) There's a smart cache between the XQuery engine and the lucene 
invocation that returns results in O(1) for Lucene queries that have 
already been seen/processed before. It caches (queryString,result), 
plus parsed Lucene queries, plus the Lucene index data structure for 
any given string text (which currently is a simple RAMDirectory but 
could be whatever data structure we come up with as part of this 
exercise - class StringIndex or some such). This works so well that I have to 
disable the cache to avoid getting astronomically good figures on 
artificial benchmarks.
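
To make point 2 concrete, here is a minimal sketch of such a memoizing cache. It is not the actual Nux code: the class name MatchCache, the pluggable scorer function and the LRU eviction policy are all assumptions made for illustration, and it caches only the (text, queryString) -> score pairs, not parsed queries or index structures.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical sketch: memoizes match scores keyed on (text, queryString). */
final class MatchCache {
    private final BiFunction<String, String, Float> scorer; // stand-in for the real Lucene call
    private final Map<List<String>, Float> scores;
    private int misses = 0;

    MatchCache(int maxEntries, BiFunction<String, String, Float> scorer) {
        this.scorer = scorer;
        // An access-ordered LinkedHashMap gives simple LRU eviction.
        this.scores = new LinkedHashMap<List<String>, Float>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<List<String>, Float> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached score if this pair was seen before, else computes and stores it. */
    float match(String text, String queryString) {
        List<String> key = Arrays.asList(text, queryString);
        Float cached = scores.get(key); // expected constant-time hash lookup
        if (cached == null) {
            misses++;
            cached = scorer.apply(text, queryString);
            scores.put(key, cached);
        }
        return cached;
    }

    /** Number of times the scorer actually had to run. */
    int misses() {
        return misses;
    }
}
```

With something like this in place, the second lucene:match() call in the query above hits the cache instead of re-running the search, provided both calls pass the same string and query.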

>> some skeleton:
>> 	private static final String FIELD_NAME = "content"; // or whatever - it doesn't matter
>> 	public Query parseQuery(String expression) throws ParseException {
>> 		QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
>> 		return parser.parse(expression);
>> 	}
>> 	private Document createDocument(String content) {
>> 		Document doc = new Document();
>> 		doc.add(Field.UnStored(FIELD_NAME, content));
>> 		return doc;
>> 	}
> This skeleton code doesn't really apply to the custom IndexReader 
> implementation.  There is a method to return a document from 
> IndexReader, which I did not implement yet in my sample - it'd be 
> trivial though.  I don't think you'd need to get a Lucene Document 
> object back in your use case, but for completeness I will add that to 
> my implementation.

Right, it was just to outline that the value of FIELD_NAME doesn't 
really matter.

>>> There is still some missing trickery in my StringIndexReader - it 
>>> does not currently handle phrase queries as an implementation of 
>>> termPositions() is needed.
>>> Wolfgang - will you take what I've done the extra mile and implement 
>>> what's left (frequency and term position)?  I might not revisit this 
>>> very soon.
>> I'm not sure I'll be able to pull it off, but I'll see what I can do. 
>> If someone more competent would like to help out, let me know... 
>> Thanks for all the help anyway, Erik and co, it is greatly 
>> appreciated!
> If you can build an XQuery engine, you can hack in some basic Java 
> data structures that keep track of word positions and frequency :)

There's a learning curve ahead of me, not having worked at that 
low a level with Lucene before :-)
Mark Harwood sent me some good but somewhat unfinished code he wrote 
previously for similar scenarios. I'll look into merging his pieces and 
your skeleton.
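
As a rough idea of the "basic Java data structures" involved, here is a minimal sketch that records, for a single string, each term's positions (and hence its frequency) and answers exact phrase lookups from them. It is neither Mark's code nor Erik's StringIndexReader: the class name StringTermIndex, the naive lowercase/non-word-character tokenization and the method names are all assumptions for illustration, and a real version would take its tokens from a Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-term positions and frequencies for one string. */
final class StringTermIndex {
    // term -> ascending list of token positions at which the term occurs
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    StringTermIndex(String text) {
        int pos = 0;
        // Naive tokenization (assumption): lowercase, split on non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue; // a leading separator yields an empty token
            postings.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
    }

    /** Positions of the term, or an empty list if it does not occur. */
    List<Integer> positions(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    /** Term frequency is simply the number of recorded positions. */
    int freq(String term) {
        return positions(term).size();
    }

    /** True if the terms occur at consecutive positions, in the given order. */
    boolean containsPhrase(String... terms) {
        if (terms.length == 0) return false;
        for (int start : positions(terms[0])) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = positions(terms[i]).contains(start + i);
            }
            if (ok) return true;
        }
        return false;
    }
}
```

The position lists are the kind of information a termPositions() implementation would have to expose for phrase queries to work. The sorted map is chosen here only so that terms can be enumerated in order.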

By now I'm quite confident this can be done reasonably efficiently. BTW, 
I have some small performance patches for FastCharStream and in various 
other places, but I'll hold off proposing those until our exercise is 
done and the real merits/drawbacks of those patches can be better 

> I'll tinker with it some more for fun in the near future, but anyone 
> else is welcome to flesh out the missing pieces.

Thanks again for kindly helping out!

Wolfgang Hoschek                  |   email:
Distributed Systems Department    |   phone: (415)-533-7610
Berkeley Laboratory               |
