lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christian Reuschling <christian.reuschl...@gmail.com>
Subject Re: 1:n queries again
Date Wed, 12 Nov 2008 15:17:35 GMT
Hello Erick,

thank you very much for this interesting idea - but I'm not sure that the
SpanQuery will make every aspect I search for.

I think the lack is that in the case of a PhraseQuery (and I think also in
the case of the SpanQuery, but I'm not sure about yet), every term must appear
inside the phrase, it is some kind of 'must' for every term.

I search for a 'should' - so the behaviour should be exactly the same as
BooleanQuery does, but only in one dataset (maybe represented as extra field
entry with an incremented PositionIncrementGap)

In this context, it also was no typo with term2 in front of term1

At the end, I want to know a score for the overlapping between two term lists,
so in the case the index entry is

> doc = new Document
> doc.add("field1", "word1 word2 word3")
> doc.add("field1", "word4 word5")
> IndexWriter.addDocument(doc)

also a query like field1:"word3 NewNotExistingWord word1"~5
should match.

So the semantic of this (hypothetic) query
"starDelimiter (word1 notExistingWord word3) endDelimiter"

would make it...but it is a good hint with the PositionIncrementGap. Maybe
there is a possibility to combine this with BooleanQuery?




Erick Erickson schrieb:
> It's entirely unclear to me whether facets could help, since I haven't used
> them, I've
> seen these mentioned on the SOLR user list, it may bear investigating.
> 
> To expand on Stefan's point. I think his solution will work for you quite
> well, but
> there are a couple of tricks....
> 
> The first thing to understand is that (This won't compile, but you get the
> idea)
> 
> doc = new Document
> doc.add("field1", "word1 word2 word3")
> doc.add("field1", "word4 word5")
> IndexWriter.addDocument(doc)
> 
> is perfectly legal. The single document added will have all 5 words in
> "field1". But
> here's the trick. If you provide your own analyzer (a trivial analyzer built
> from one
> of the standard ones?) that returns a number other than 1 (say 10) from
> getPositionIncrementGap the "distance" between word3 and word4 will be
> 10 rather than 1. But the distance between word1 and word2 will be 1 as
> will the distance between word2 and word3, as will the distance between
> word4 and word5
> 
> How does this help, you ask? Well, SpanQuery is your friend (PhraseQuery
> might work just as well in this case). Because you can now ask that all your
> terms have < 10 "holes". For instance, if you made a phrase like
> "word1 word2"~5 it would match, as would "word1"~5 or just word1
> 
> "word1 word3"~5 would NOT match since there  other tokens between
> 
> "word3 word4"~5 would NOT match since the distance between them is
> greater than 5
> 
> Note that using 10 is arbitrary, you probably really want to use something
> much
> larger, say 100 larger than the maximum number of terms you expect. The only
> thing you need to watch at all is that the total length of all the terms and
> all
> the gaps doesn't exceed MAX_INT (MAX_INT / 2? I don't know whether the
> integers are signed).....
> 
> What's really happening here is that the "gap" is taking the place of your
> delimiters and you're making use of Phrase/SpanQuery characteristics
> to return what you want.
> 
> Of course I may have completely mis-read your problem, but I'm sure you'll
> let us know if that's the case <G>.
> 
> 
> BTW, if this isn't a typo, you probably need SpanQuery since you can
> specify order not being important:
> attName:"startDelimiter myterm2 myterm1 endDelimiter"...should also match
> 
> Did you really mean to have myterm2 in front of myterm1?
> 
> Best
> Erick
> 
> On Wed, Nov 12, 2008 at 8:58 AM, Christian Reuschling <
> christian.reuschling@gmail.com> wrote:
> 
>> Hello Friends,
>>
>> In order to offer some simple 1:n matching, currently we create several,
>> counted
>> attributes and expand our queries that we search inside each attribute,
>> e.g.:
>>
>> Query 'attName:myTerm'  => Query 'attName1:myTerm attName2:myTerm'
>>
>> This is not the fastest way, and sometimes not easy to handle - also we
>> have to
>> consider the 1:n attributes during indexing, and must remember the highest
>> 'n'
>> for query expansion. We get very big queries.
>>
>>
>> Currently I have some other scenario in mind, but I'm not sure how I can
>> achieve
>> this. The idea is to write all n datasets into one attribute, with a
>> specialized
>> start and end delimiter term, e.g.:
>>
>> document entry for attName:
>> "startDelimiter myterm1 myterm2 endDelimiter startDelimiter myterm3 myterm4
>> endDelimiter"
>>
>> When I look to this, it would go somehow into the direction of a
>> PhraseQuery,
>> where I can search e.g. for
>>
>> attName:"startDelimiter myterm1 myterm2 endDelimiter"
>> but the query
>> attName:"startDelimiter myterm1 myterm4 endDelimiter"
>>
>> would not match.
>>
>> The only thing that lacks now is that the queries
>> attName:"startDelimiter myterm1 endDelimiter"
>> attName:"startDelimiter myterm2 myterm1 endDelimiter"
>>
>> also should match - which of course isn't possible with the current
>> PhraseQuery
>> implementation.
>>
>> Best would be some construct like attName:"startDelimiter (myterm1 myterm2)
>> endDelimiter"
>>
>> Whereby the stuff inside the brackets would be a standard BooleanQuery, but
>> only
>> applied inside the range of the delimiters. Is this somehow possible, or do
>> I
>> have to write my own Query implementation - and what would be the best way
>> in this case.
>>
>>
>> Thanks in advance
>>
>> Christian Reuschling
>>
>>
> 


Mime
View raw message