lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: wildcard and span queries
Date Wed, 11 Oct 2006 18:30:58 GMT
Erick - what about using getSpans() from the SpanQuery that is
generated? That should give you what you're after, I think.
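
A rough, untested sketch of what I mean - this assumes the generated
query is (or rewrites to) a SpanQuery, and the names spanQuery and
reader are mine, not anything the surround parser gives you:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.Spans;

    // Each call to Spans.next() advances to the next match, so the
    // per-document tally is exactly "how many times did the query
    // match on this page".
    Map hitsPerPage = new HashMap();          // doc id -> match count
    Spans spans = spanQuery.getSpans(reader);
    while (spans.next()) {
        Integer doc = new Integer(spans.doc());
        Integer count = (Integer) hitsPerPage.get(doc);
        hitsPerPage.put(doc, count == null
            ? new Integer(1)
            : new Integer(count.intValue() + 1));
    }

One thing to decide up front: whether overlapping spans should count
once or twice - which is really the same "what counts as a hit"
ambiguity you raise below.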

	Erik


On Oct 11, 2006, at 2:17 PM, Erick Erickson wrote:

> Problem 3482:
>
> I'm probably close to being able to start work. Except...
>
> How to count hits with SrndQuery? Or, more generally, with arbitrary
> wildcards and boolean operators?
>
> So, say I've indexed a book by page. That is, each page is a
> document. I know a particular page matches my query because the
> SrndQuery found it. Now, I want to answer the question "How many
> times did the query match on this page?"
>
> Take a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?,
> k?i?f, h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine
> adding a NOT or three, a nested pair of OR clauses and..... No,
> DON'T tell me that that probably wouldn't match any pages anyway
> <G>....
>
> Anyway, I want to answer "how many times does this occur on the
> page?" Another way of asking this, I suppose, is "how many terms
> would be highlighted on that page?", but I don't think highlighting
> helps. And I'm aware that the question "how many times does 'this'
> occur" is ambiguous, especially when we add the NOT case in......
>
> I can think of a couple of approaches:
> 1> Get down and dirty with the terms. That is, examine the term
> position vectors and compare all the nasty details of where they
> occur, combined with, say, RegexTermEnum, and go at it. This is
> fairly ugly, especially with nested queries. But I can do it,
> especially if we limit the complexity of the query or define the
> hit count more simply.
> 2> Get clever with a regex: fetch the text of the page and see how
> many times the regex matches. I'd imagine that the regex will
> be...er...unpleasant.
> 2a> Use simpler regex expressions for each term, assemble the list
> of match positions, and count.
> 2b> Isn't this really just using TermDocs as it was meant to be
> used, combined with RegexTermEnum?
> 2c> Since the number of regex matches on a particular page is much
> smaller than the number of regex matches over the entire index,
> does anyone have a feel for whether <2a> or <2b> is easier/faster?
> For <2a>, I'm analyzing a page with a regex. For <2b>, Lucene has
> already done the pattern matching, but I'm reading a bunch of
> different TermDocs......
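
For <2b>, a rough, untested sketch of the per-term counting. I've
used the core WildcardTermEnum since the patterns above only use
'?'; contrib's RegexTermEnum should slot in the same way. Note this
counts raw term occurrences and does NOT enforce the 20w proximity
constraint - which is why getSpans() looks cleaner to me:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;
    import org.apache.lucene.search.WildcardTermEnum;

    // Count occurrences of one wildcard pattern (e.g. "ma?l") on one
    // page (docId): enumerate the index terms matching the pattern,
    // then sum each term's frequency in that document.
    int countOnPage(IndexReader reader, String field, String pattern,
                    int docId) throws java.io.IOException {
        int total = 0;
        WildcardTermEnum terms =
            new WildcardTermEnum(reader, new Term(field, pattern));
        try {
            do {
                Term t = terms.term();
                if (t == null) break;        // no more matching terms
                TermPositions tp = reader.termPositions(t);
                try {
                    if (tp.skipTo(docId) && tp.doc() == docId) {
                        total += tp.freq();  // occurrences on this page
                    }
                } finally {
                    tp.close();
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return total;
    }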
>
> Fortunately, for this application, I only care about the hits per
> page for a single book at a time. I do NOT have to create a list of
> all hits on all pages for all books that have any match.
>
> Thanks
> Erick
>
> On 10/9/06, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>> Doron:
>>
>> Thanks for the suggestion, I'll certainly put it on my list,
>> depending upon what the PM decides. This app is genealogy research,
>> and users *can* put in their own wildcards...
>>
>> This is why I love this list... lots of smart people giving me
>> suggestions I never would have thought of <G>...
>>
>> Thanks
>> Erick
>>
>> On 10/9/06, Doron Cohen <DORONC@il.ibm.com> wrote:
>> >
>> > "Erick Erickson" <erickerickson@gmail.com> wrote on 09/10/2006  
>> 13:09:21:
>> > > ... The kicker is that what we are indexing is
>> > > OCR data, some of which is pretty trashy. So you wind up with
>> > "interesting"
>> > > words in your index, things like rtyHrS. So the whole question of
>> > allowing
>> > > very specific queries on detailed wildcards (combined with  
>> spans) is
>> > under
>> > > discussion. It's not at all clear to me that there's any value  
>> to the
>> > end
>> > > users in the capability of, say, two character prefixes. And,  
>> it's an
>> > easy
>> > > rule that "prefix queries must specify at least 3 non-wildcard
>> > > characters"....
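
(An aside on that "easy rule" - a hypothetical guard, name and all
made up, that enforces it before you build the query:)

    // Hypothetical: require at least 3 non-wildcard leading
    // characters, so "ma?l" is rejected but "cart?" is allowed.
    static boolean wildcardAllowed(String term) {
        int q = term.indexOf('?');
        int s = term.indexOf('*');
        int first = (q < 0) ? s : (s < 0 ? q : Math.min(q, s));
        return first < 0 || first >= 3;
    }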
>> >
>> > Erick, I may be off course here, but, FWIW, have you considered
>> > n-gram indexing/search for a degree of fuzziness to compensate
>> > for OCR errors..?
>> >
>> > For a four-word query you would probably get ~20 tokens
>> > (bigrams?) - no matter what the index size is. You would then
>> > probably want to score higher by LA (lexical affinity - query
>> > terms appearing close to each other in the document) - and I am
>> > not sure to what degree a span query (made of n-gram terms) would
>> > serve that, because (1) all terms in the span need to be there
>> > (well, I think :-); and (2) you would like to increase doc score
>> > for close-by terms only for close-by query n-grams.
>> >
>> > So there might not be a ready-to-use solution in Lucene for this,
>> > but perhaps this is a more robust direction to try than the
>> > wildcard approach - I mean, if users want to type a wildcard
>> > query, it is their right to do so, but for the application logic
>> > this does not seem the best choice.
>> >
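
To make the arithmetic concrete - a toy illustration of the
tokenization Doron describes (plain character bigrams; not a Lucene
API, the method is made up):

    import java.util.ArrayList;
    import java.util.List;

    // Split a word into character bigrams.  An OCR-garbled word still
    // shares most of its grams with the intended word, which is where
    // the fuzziness comes from.
    static List bigrams(String word) {
        List grams = new ArrayList();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // bigrams("mail") -> [ma, ai, il]; four words of five or six
    // letters each give roughly the ~20 tokens Doron mentions.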
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

