lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: 1:n queries again
Date Wed, 12 Nov 2008 14:46:30 GMT
It's entirely unclear to me whether facets could help, since I haven't used
them, I've
seen these mentioned on the SOLR user list, it may bear investigating.

To expand on Stefan's point. I think his solution will work for you quite
well, but
there are a couple of tricks....

The first thing to understand is that (This won't compile, but you get the
idea)

doc = new Document
doc.add("field1", "word1 word2 word3")
doc.add("field1", "word4 word5")
IndexWriter.addDocument(doc)

is perfectly legal. The single document added will have all 5 words in
"field1". But
here's the trick. If you provide your own analyzer (a trivial analyzer built
from one
of the standard ones?) that returns a number other than 1 (say 10) from
getPositionIncrementGap the "distance" between word3 and word4 will be
10 rather than 1. But the distance between word1 and word2 will be 1 as
will the distance between word2 and word3, as will the distance between
word4 and word5

How does this help, you ask? Well, SpanQuery is your friend (PhraseQuery
might work just as well in this case). Because you can now ask that all your
terms have < 10 "holes". For instance, if you made a phrase like
"word1 word2"~5 it would match, as would "word1"~5 or just word1

"word1 word3"~5 would NOT match since there  other tokens between

"word3 word4"~5 would NOT match since the distance between them is
greater than 5

Note that using 10 is arbitrary, you probably really want to use something
much
larger, say 100 larger than the maximum number of terms you expect. The only
thing you need to watch at all is that the total length of all the terms and
all
the gaps doesn't exceed MAX_INT (MAX_INT / 2? I don't know whether the
integers are signed).....

What's really happening here is that the "gap" is taking the place of your
delimiters and you're making use of Phrase/SpanQuery characteristics
to return what you want.

Of course I may have completely mis-read your problem, but I'm sure you'll
let us know if that's the case <G>.


BTW, if this isn't a typo, you probably need SpanQuery since you can
specify order not being important:
attName:"startDelimiter myterm2 myterm1 endDelimiter"...should also match

Did you really mean to have myterm2 in front of myterm1?

Best
Erick

On Wed, Nov 12, 2008 at 8:58 AM, Christian Reuschling <
christian.reuschling@gmail.com> wrote:

> Hello Friends,
>
> In order to offer some simple 1:n matching, currently we create several,
> counted
> attributes and expand our queries that we search inside each attribute,
> e.g.:
>
> Query 'attName:myTerm'  => Query 'attName1:myTerm attName2:myTerm'
>
> This is not the fastest way, and sometimes not easy to handle - also we
> have to
> consider the 1:n attributes during indexing, and must remember the highest
> 'n'
> for query expansion. We get very big queries.
>
>
> Currently I have some other scenario in mind, but I'm not sure how I can
> achieve
> this. The idea is to write all n datasets into one attribute, with a
> specialized
> start and end delimiter term, e.g.:
>
> document entry for attName:
> "startDelimiter myterm1 myterm2 endDelimiter startDelimiter myterm3 myterm4
> endDelimiter"
>
> When I look to this, it would go somehow into the direction of a
> PhraseQuery,
> where I can search e.g. for
>
> attName:"startDelimiter myterm1 myterm2 endDelimiter"
> but the query
> attName:"startDelimiter myterm1 myterm4 endDelimiter"
>
> would not match.
>
> The only thing that lacks now is that the queries
> attName:"startDelimiter myterm1 endDelimiter"
> attName:"startDelimiter myterm2 myterm1 endDelimiter"
>
> also should match - which of course isn't possible with the current
> PhraseQuery
> implementation.
>
> Best would be some construct like attName:"startDelimiter (myterm1 myterm2)
> endDelimiter"
>
> Whereby the stuff inside the brackets would be a standard BooleanQuery, but
> only
> applied inside the range of the delimiters. Is this somehow possible, or do
> I
> have to write my own Query implementation - and what would be the best way
> in this case.
>
>
> Thanks in advance
>
> Christian Reuschling
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message