lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: New Lucene QueryParser
Date Wed, 03 Jan 2007 17:31:36 GMT

>
> Looks like interesting stuff Mark, but why did you make everything so
> configurable (syntax-wise)?  IMO, there is a lot of value to
> standards, and doing things like changing the precedence of operators
> isn't necessarily a good thing :-)
>
I made it so configurable because I needed to implement a certain query 
language at work, but I think that the language is not that great. I 
don't like most of the choices in it. I needed something though, and it 
was going to require a lot of work...not only did we need arbitrary 
mixing of boolean and proximity operators, but we needed the sentence 
and paragraph proximity as well as the thesaurus expansion. We also have 
many people who ask for one offs that only apply to their setup, like 
NEAR being an operator that is really within 10. All of this was not 
something I could guarantee that I could do (I just entered the 
workforce), and I certainly didn't have time at work with everything 
else I needed to do for this project I am working on. I wasn't going to 
put so much free time into a parser that I did not like though. So I 
made it very configurable so that it could be configured into the parser 
I needed while still being the parser I wanted.

> Did you ever get a chance to look at Paul's surround language? (I've
> never had the chance to dive into it myself)
>
I have looked into Paul's parser and it is a very nice piece of work. 
Unfortunately, I needed to duplicate a very specific syntax. Also, 
Paul's parser would not give me sentence and paragraph proximity or 
<i>arbitrary</i> connecting of boolean and prox operators. That brings 
me back to why this is so configurable: a big reason is to be able to 
simulate a syntax that a customer may be familiar with and want 
retained. I think that order of operations should be standard too, but I 
see no problem with the standard at someones site being different than 
the standard I use for another site. Some people may want/need proximity 
to bind tighter than ANDNOR, while others might need/want the reverse. 
Being too configurable has it's draw backs, but I am attempting to 
create an alternative parser, not a QueryParser replacement. Choose the 
best weapon for the job ;)

>> Query-time thesaurus expansion / General token to query expansion :
>> Takes advantage of a general find/replace feature, "expand" might map to
>> "(expander | expanded)" ... or any other valid syntax.
>
> The QueryParser does this instead of TokenFilters?
> Is it based on static configuration?
>
I do not use TokenFilters as it does not fit my requirements (I think). 
Right now, a hashmap is used to map a token to replacement syntax. A 
queryparser is generated from a parserfactory. The parserfactory takes a 
configuration class. When you get a queryparser from the factory you can 
choose to inherit the config from the factory, or you can just set the 
options and configuration directly on the parser. I did this because I 
have a need for a base configuration to a common syntax that individual 
accounts than want to be able to tweak to their needs.

The queryparser is a two pass system. The first pass does not 
tokenize...it does query expansion and preps the suggested query (the 
suggested query must be suggested in the syntax the query was typed in, 
and without expansion). I had worried about speed when I made the 2 pass 
decision, but it has allowed me great flexibility, and with my testing 
so far I have had 0 speed problems.

By the way, I have recently tested the paragraph/sentence proximity 
searching (mark within 4 sentences of dog) on a 300k doc index (docs 
8-20k) and the perceived speed was as fast as a normal one or two work 
boolean search (not a very scientific test :))

A problem with the paragraph/sentence proximity search right now is that 
if there is only 1 doc in the index the proximity search will wrap. I am 
sure this can be fixed.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message