lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean O'Connor <>
Subject Re: SpanQuery parser? Update (ugly hack inside...)
Date Fri, 04 Nov 2005 23:32:09 GMT
I'm posting this primarily hoping to give back a tiny bit to a very 
helpful community. More likely however, someone else will open my eyes 
to an easier approach than what I outline below...

I've come up with a very ugly conversion approach from regular Query 
objects into SpanQuery objects. I then use the converted SpanQuery to 
get span positions (currently both token #, and start/end position). In 
effect, I have highlighting for simple queries with a very inefficient 
approach (yea for me!).

The goal(s) I am trying to accomplish is rather specific I think, so I 
imagine the use of my hacking is rather limited (i.e. just to me).

At the moment my code:

    * parses the search text (i.e. user entered query)
    * rewrites the resulting query to expand wildcards and such against
    * calls a recursive conversion function with very basic conversion
          o TermQuery -> SpanTerm
          o PhraseQuery -> SpanNear
          o others in progress as time permits

Currently, I only process simple query strings like:
"blue green yellow" => SpanOrQuery
"luce* acti*" => SpanOrQuery with wild cards expanded
    e.g.: lucene lucent action acting ... all or'ed together in a 
braindead fashion
"luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and 
SpanNear (no slop)
    er, hopefully you get the picture, I'm not up to showing a vector of 
this one... :-)

I would be happy to discuss my approach if there is anyone interested. I 
assume I am pretty much alone in finding this ineffecient approach 
useful. For me, it is the functionality that overrides perfomance 
issues. I have something which can take user search strings and do hit 
highlighting for the exact hit found. This is really only useful for 
"termA near 'some phrase'" at the moment, but might become more advanced 
in the next 2-3 months.


Paul Elschot wrote:

>On Thursday 20 October 2005 00:40, Sean O'Connor wrote:
>>    I have user entered search commands which I want to convert to 
>>SpanQueries. I have seen in the book "Lucene in Action" that no parser 
>>existed at time of publication, but there was someone working on a 
>>SpanQuery parser. Can anyone point me to that code, or provide any 
>>    I want to use SpanQueries for their detail on the number of hits 
>>from a query, and more importantly, the location (position start and 
>>end) of each hit. My application requires me to do precise hit 
>>highlighting.  I also need to perform calculations on the number of hits 
>>per document, as well as per query (sum of document hits).
>You may want to use the getSpans() method of SpanQuery and operate
>on the result directly.
>>    It is fairly critical I highlight the hits, and only the hits. From 
>>what I've read SpanQueries (with dumpSpans) is a better approach than 
>>using 'regular' queries. I _think_ regular queries currently use a 
>>highlighter which shows all terms highlighted. This can give more 
>>highlighting than actual hits (i.e false positives).
>>    So, that being said, should I stick with SpanQueries? Is there any 
>>current work on a parser to convert a string, or regular (Token, 
>>Boolean, Phrase, Prefix,...) query into a SpanQuery?
>>    I have written some very duct tape-ish code which will convert basic 
>>booleanOR and prefix queries into SpanQueries. I just realized I'm in 
>>deeper water than I expected when I tried converting my first query 
>>string containing several boolean queries, AND a phrase query. So now I 
>>am looking to either help an existing effort, or just continue with my 
>>own hacking.
>Have a look at the surround query parser in the svn trunk:
>There is also some code that does highlighting based on Spans,
>but I don't know where that is. Hopefully someone else can point you at that.
>Paul Elschot
>To unsubscribe, e-mail:
>For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message