lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Phrase query vs span query
Date Wed, 22 Feb 2006 00:31:14 GMT

Your "Aim of the Query formation" got truncated, so it's not entirely
clear what you are looking for. If the general idea is that you want
searches for a phrase like "quick brown fox" to only match when the words
"quick", "brown" and "fox" all appear in the same section in the specified
order, and you want documents in which the phrase appears more than once to
be scored higher, then a simple PhraseQuery with a high slop factor and
"inOrder=true" should work fine ... the key being that your slop value
needs to be at least as big as the largest section size you can have, and
less than the gap you put between sections.
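To make the slop/gap constraint concrete, here is a rough plain-Java sketch (no Lucene classes involved). The names SectionSlop, sectionStarts, chooseSlop and matchesInOrder are made up for illustration, and matchesInOrder is only a simplified model of in-order proximity matching (it counts the unmatched positions between the first and last term), not Lucene's actual matcher. As far as I know the inOrder flag itself is an argument of the SpanNearQuery constructor, while PhraseQuery only takes a slop.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of Lucene: models the position layout only.
class SectionSlop {

    /** Start position of each section when `gap` empty positions separate consecutive sections. */
    static int[] sectionStarts(int[] sizes, int gap) {
        int[] starts = new int[sizes.length];
        int next = 0;
        for (int i = 0; i < sizes.length; i++) {
            starts[i] = next;
            next += sizes[i] + gap; // skip the section itself plus the artificial gap
        }
        return starts;
    }

    /** Smallest slop with largestSection <= slop < gap, or empty if no such slop exists. */
    static OptionalInt chooseSlop(int[] sizes, int gap) {
        int largest = 0;
        for (int s : sizes) {
            largest = Math.max(largest, s);
        }
        return largest < gap ? OptionalInt.of(largest) : OptionalInt.empty();
    }

    /**
     * Simplified in-order proximity test (an assumption, not Lucene's code):
     * positions must be strictly increasing, and the number of positions
     * inside the match that are not query terms must not exceed slop.
     */
    static boolean matchesInOrder(int[] positions, int slop) {
        if (positions.length == 0) {
            return false;
        }
        for (int i = 1; i < positions.length; i++) {
            if (positions[i] <= positions[i - 1]) {
                return false;
            }
        }
        int unmatched = positions[positions.length - 1] - positions[0] - (positions.length - 1);
        return unmatched <= slop;
    }

    public static void main(String[] args) {
        int[] sizes = {100, 1500, 10}; // section sizes from the example in the question
        System.out.println("gap 1000: " + chooseSlop(sizes, 1000));
        System.out.println("gap 2000: " + chooseSlop(sizes, 2000));
    }
}
```

If I'm reading the constraint right, the example in the question (a 1500-term section with a gap of 1000) leaves no valid slop at all, so the gap would have to be larger than the biggest section you ever expect to index.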

I have no idea if it will be faster/slower than a span query, but it's a
little simpler because you don't need to use artificial section boundary
tokens.

If you want to tweak how much the score is influenced by the proximity of
the words in the query, vs the frequency of the phrases in the docs, see
my recent posting about the use of tf in Similarity -- which I think is
accurate since nobody replied and said I was wrong...

http://www.nabble.com/Similarity-Usage%3A-tf%28int%29-vs-tf%28float%29-p2981283.html
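As a rough illustration of that proximity-vs-frequency trade-off, here is a plain-Java sketch. It assumes the formulas I believe DefaultSimilarity uses, sloppyFreq(distance) = 1 / (distance + 1) and tf(freq) = sqrt(freq), and it assumes the sloppy phrase scorer sums sloppyFreq over all matches in a document before applying tf. The class ProximityTf and the method phraseTf are made-up names, and nothing here calls Lucene itself.

```java
// Hypothetical stand-alone model; it only mirrors the assumed default formulas.
class ProximityTf {

    /** Assumed default: a looser match (larger distance) contributes less frequency. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    /** Assumed default: term-frequency factor grows with the square root of the frequency. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** tf factor for one document, given the distance of each sloppy phrase match in it. */
    static float phraseTf(int[] matchDistances) {
        float freq = 0f;
        for (int d : matchDistances) {
            freq += sloppyFreq(d);
        }
        return tf(freq);
    }

    public static void main(String[] args) {
        // One exact match versus four loose matches (distance 3 each).
        System.out.println(phraseTf(new int[] {0}));
        System.out.println(phraseTf(new int[] {3, 3, 3, 3}));
    }
}
```

Under those assumptions one exact match and four matches at distance 3 end up with the same tf factor, so overriding sloppyFreq or tf in your own Similarity subclass is where you would shift the balance toward proximity or toward frequency.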


: Date: Tue, 21 Feb 2006 17:45:12 -0600
: From: Rajesh Munavalli <findmath@gmail.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Phrase query vs span query
:
: I am trying to adopt Lucene for a special IR system. The following scenario
: is an approximation of what I am trying to do. Please bear with me if some
: things don't make sense. I need some suggestions on formulating queries for
: the following scenario.
:
: Each document consists of a set of fields (standard in lucene). But in my
: case, the field is somewhat different as explained below.
:
: Field:
: ---------
: Each field consists of a set of conceptual sections. These sections are
: separated by N (say 1000) index positions but are in the same field.
: Sizes of sections vary and do not have any lower or upper bound on the
: number of terms they may contain.
:
: Ex: Let's say Field "contents" has
: <section 1 of 100 terms><gap of 1000 term positions><section 2 of 1500
: terms><gap of 1000 term positions><gap of 1000 term positions><section 3
: of 10 terms>
:
: NOTE: At index time, I am assuming I somehow know how to form these
: sections.
:
: Typical Query:
: ---------------------
: Consists of 15 to 30 query terms. In other words, these query terms
: represent a conceptual section.
:
: Aim of the Query formation:
: ----------------------------------------
: I want to rank the documents proportional to the number of query terms
: appearing in the SAME SECTION and IN ORDER. Documents containing terms with
: the
:
: My Questions:
: ---------------------
: Considering the structure of the fields/documents and the number of query
: terms.
:
: (1) Is there an effective way of formulating a query with the existing query
: types in Lucene?
:
: (2) After considering the way different queries work and their limitations,
: I think forming phrase/span queries of groups of query terms
: might approximate the rankings I am expecting. In that case which of the
: following queries will perform better (in terms of QUERY SPEED and RANKING)
:               (a) phrase query with a certain slop factor
:               (b) span query
:
: Thanks,
:
: Rajesh Munavalli
:



-Hoss

