lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: wildcards and spans
Date Wed, 02 Aug 2006 16:15:18 GMT
Hey! I'm actually looking all on my own. Anyway, 2<b> gives me
TooManyClauses. It looks like what I want is to use
IndexReader.termPositions to aggregate the offsets of all the wildcard terms
on a per-document basis, then walk the lists for proximity. Something
like...

for each wildcard term wt
   for each WildCardTermEnum wet
        for each termdoc aggregate the positions by doc ID with other terms
from wt.


Now I have, for each original wildcard term a list of all doc IDs and all
positions that any term matching the wildcard occupies. For any doc that
appears in all lists, compare the positions for proximity and, if proximity
is met, add it to my filter.

And away we go. Of course, I have no idea what the speed here is, but I
guess that's what testing is for.

Am I on the right path?

Erick


On 8/2/06, Erick Erickson <erickerickson@gmail.com> wrote:
>
> I'm back, with another flavor of wildcards. What direction would you point
> a poor boy who's project lead wants wildcard queries and spans? Here's the
> problem....
>
> I cannot use any of the classes that throw a "TooManyClauses" exception (
> e.g. SpanRegexQuery or SpanNearQuery with, say WildCardQuery). The corpus
> is big enough that this is guaranteed to be thrown. So, currently I'm using
> a filter for wildcard queries, populating it via WildcardTermEnum and
> TermDocs... Works like a champ. But I don't see how to combine this with
> spans...
>
> It seems to me that spans are incompatible with filters, they're just
> different beasts. I see no way incorporate spans and filters without doing
> actual work myself. So, it seems I'm left with several alternatives.
>
> 1> figure it out when creating the filter. Conceptually, for each document
> find the offsets of the terms I want to span, and find out if the distance
> between them fits my criteria and only add the doc to the filter if the
> distance is within my parameters.
>
> 2> Look at the docs returned by the current filtered process and, for each
> doc returned,
>   a> don't add if it doesn't fit my span criteria by examining the term
> positions.
>   b> re-query with a wildcard span, restricted by doc ID. I *think* that
> by restricting the query by (lucene) doc_id I'll be able to avoid the "too
> many clauses" issue. Assuming that I remember correctly and that the
> most-restrictive clause is honored when trying this....
>
> guys, feel free to hop in here with just the names of the classes I really
> want to pay attention to <G>....
>
> I know this is scanty info, what I'm looking for is a very quick
> pointer.... What I'm especially looking for is "Just use the
> contrib/JustWhatYouWanted class" <G> although I poked around and didn't see
> anything...
>
> Thanks
> Erick
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message