Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Message-ID: <3FD0BFB2.5070401@lucene.com>
Date: Fri, 05 Dec 2003 09:26:10 -0800
From: Doug Cutting <cutting@lucene.com>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1) Gecko/20031114
MIME-Version: 1.0
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Position increment (tokens, DocumentWriter), max field length
References: <1F93B54C-26D5-11D8-8DAC-000393A564E6@ehatchersolutions.com>
 <200312042226.35948.tatu@hypermall.net>
In-Reply-To: <200312042226.35948.tatu@hypermall.net>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Tatu Saloranta wrote:
> I have a question related to the way position increment is handled in
> DocumentWriter's invertDocument (main tokenization/indexing method). It does
> the following:
> 
>             for (Token t = stream.next(); t != null; t = stream.next()) {
>               position += (t.getPositionIncrement() - 1);
>               addPosition(fieldName, t.termText(), position++);
>               if (position > maxFieldLength) break;
>             }
> 
> If I'm not mistaken, this means that the maxFieldLength comparison counts
> "holes" in the token sequence. Such behaviour might be problematic,
> especially if such holes are used to mark sentence/paragraph boundaries (to
> reduce score or avoid hits for phrase queries), which was discussed recently.
> Also, since that count is saved in the index, such holes "bloat" the
> perceived document size, and thus reduce the document's relative weight.
>
> It'd be easy to fix this to only count tokens (I can provide a patch if so),
> but I wanted to make sure I don't misunderstand something fundamental here.

I think that sounds like a reasonable fix.  The field length is also 
used for length normalization, and ignoring holes seems like the right 
thing there too.  I don't think field length is used anywhere else, so 
such a change shouldn't break anything.
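
As a rough sketch of the difference (the class and method names below
are hypothetical, and a token is reduced to its position increment;
this is not the actual DocumentWriter code), the first method mirrors
the quoted loop and the second counts only tokens, while still
honoring the increments when assigning positions:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the loop in DocumentWriter.invertDocument.
final class FieldLengthSketch {

    /** Positions written for one field, plus the field length recorded. */
    static final class Result {
        final List<Integer> positions = new ArrayList<>();
        int length;
    }

    /** Quoted behavior: the limit and the recorded length include holes. */
    static Result countingHoles(int[] increments, int maxFieldLength) {
        Result r = new Result();
        int position = 0;
        for (int inc : increments) {
            position += inc - 1;
            r.positions.add(position++);
            if (position > maxFieldLength) break;
        }
        r.length = position;
        return r;
    }

    /** Suggested fix: positions still honor increments, only tokens are counted. */
    static Result countingTokens(int[] increments, int maxFieldLength) {
        Result r = new Result();
        int position = 0;
        int length = 0;
        for (int inc : increments) {
            position += inc - 1;
            r.positions.add(position++);
            if (++length > maxFieldLength) break;
        }
        r.length = length;
        return r;
    }
}
```

With increments {1, 1, 100, 1} and a limit of 10, countingHoles stops
after the third token and records a length of 102, whereas
countingTokens indexes all four tokens and records a length of 4.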

Speaking of sentence/paragraph boundaries: the following is a proposal
I just wrote.  It also includes the PhraseQuery change I alluded to in
my message yesterday to Erik.  If this proposal is accepted, then I'll
have funding to develop it ASAP.

Does this sound like a reasonable approach to the problem?

The proposal adds no cost for folks who don't use the new features.
When folks do use them, their index gets larger, but sentence-,
paragraph-, and section-type searches are exact and as efficient as
ordinary phrase searches.

Using position-increment is still possible: one can, e.g., increment by
100 for each sentence, 1,000 for each paragraph, and 10,000 for each
section.  This does not increase index size nearly as much, but
searches are not exact, and are slightly slower, since they must
specify non-zero slop (slop=0 queries use a faster algorithm).
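
One way the increment-based approximation can fail is sketched below.
This is only an illustration under stated assumptions: the class,
method, and constant names are hypothetical, and slop is simplified to
"maximum distance between two term positions", which is not exactly
how Lucene scores sloppy phrases.

```java
// Hypothetical sketch of marking sentence boundaries with a position gap.
final class IncrementGapSketch {
    static final int SENTENCE_GAP = 100;

    /**
     * Assigns a position to every word.  The first word of each sentence
     * after the first gets increment SENTENCE_GAP; all others get 1.
     */
    static int[][] positions(int[] sentenceLengths) {
        int[][] result = new int[sentenceLengths.length][];
        int position = -1;
        for (int s = 0; s < sentenceLengths.length; s++) {
            result[s] = new int[sentenceLengths[s]];
            for (int w = 0; w < sentenceLengths[s]; w++) {
                position += (s > 0 && w == 0) ? SENTENCE_GAP : 1;
                result[s][w] = position;
            }
        }
        return result;
    }

    /** Approximate "same sentence" test: positions closer than the gap. */
    static boolean withinSlop(int p1, int p2) {
        return Math.abs(p1 - p2) < SENTENCE_GAP;
    }
}
```

For a 3-word sentence followed by a 150-word sentence, the last word
of the first sentence (position 2) and the first word of the second
(position 102) are correctly kept apart, but the first and last words
of the long sentence (positions 102 and 251) are also treated as not
being in the same sentence.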

Which approach is preferred?  Would it be confusing to support both?

Note that the PhraseQuery feature for relative term positions does not
depend on the levels feature, and could easily be added right now.

I hope this is not too confusing...

Doug

------

MATCH WITHIN SENTENCE, PARAGRAPH, SECTION, ETC.

The Token class has a field called "type", which analyzers use to
distinguish lexical types, but which is currently ignored when
indexing.  We change this so that token types can name "levels".  Each
token type can be declared to define a unique level.  There is also a
default level for all undeclared token types.  When indexing a token
stream, a position counter is maintained for each level.  Tokens at
non-default levels are not indexed; they simply increment their
level's counter.  Tokens at the default level are indexed and also
increment the default counter.

Consider the following example.

Types "sent" and "para" are declared as levels.  Types "word" and
"abbr" are undeclared, and hence default.

The first two lines below are a sample sequence of token texts and
types (the "para" token has no text).  The next three lines are the
values of the counters when indexing this sequence of tokens.

Tokens:

   text:      The start     .   The   USA           .   The   end     .
   type:     word  word  sent  word  abbr  para  sent  word  word  sent

Counters:

   default:     0     1     1     2     3     3     3     4     5     5
   sent:        0     0     0     1     1     1     1     2     2     2
   para:        0     0     0     0     0     0     1     1     1     1

Thus the word "the" appears at "default" level positions 0, 2, and 4,
at "sent" level positions 0, 1, and 2, and at "para" level positions
0, 0, and 1.
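
The counter scheme in this example can be sketched as follows.  This
is only an illustration: LevelCounterSketch, its positions method, and
the "default" map key are hypothetical names, not existing or proposed
Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-level position counters.
final class LevelCounterSketch {

    /**
     * Returns the positions, at the requested level ("default" or a
     * declared level), of every default-level token whose text equals
     * term, ignoring case.
     */
    static List<Integer> positions(String[] texts, String[] types,
                                   Set<String> levels,
                                   String term, String level) {
        Map<String, Integer> counters = new HashMap<>();
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            if (levels.contains(types[i])) {
                // Level token: not indexed, only advances its own counter.
                counters.merge(types[i], 1, Integer::sum);
            } else {
                // Default token: indexed at the current counter values,
                // then advances the default counter.
                if (texts[i].equalsIgnoreCase(term)) {
                    result.add(counters.getOrDefault(level, 0));
                }
                counters.merge("default", 1, Integer::sum);
            }
        }
        return result;
    }
}
```

Run on the token sequence above with "sent" and "para" declared as
levels, this yields positions 0, 2, 4 for "the" at the default level,
0, 1, 2 at "sent", and 0, 0, 1 at "para".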

The following method is used to declare a level:

   IndexWriter.addLevel(String field, String type);

Note that this must be done each time an IndexWriter is constructed.
If a level is declared when some documents are added but not when
others are, or if indexes with different declared levels are merged,
then documents indexed without the level declared will have all of
their positions at that level set to zero.

An index maintains, for each field, the known levels, accessed through:

   public String[] IndexReader.positionedLevels(String field);

The stream of positions at each level is accessed with:

   public TermPositions IndexReader.getPositions(Term term, String level);

Note that the index size increases substantially for each level added.
Each additional level adds around 50% of the default-only index size.
Thus an index with two additional levels would be around 200% of the
size of an index with no additional levels, i.e. about twice as large.

PhraseQuery's constructor takes a level for the match:

   public PhraseQuery(String level);

Phrase queries are also permitted to explicitly specify the position
of each term in the phrase.  For example, if two phrase query terms
are both at position=0, with phrase slop=0, then they must occur at
exactly the same position.  And if they're at position=0 and
position=1 respectively, with phrase slop=0, then they must occur
adjacently, as an exact phrase.

   public void PhraseQuery.add(Term term, int position);

If, for example, sentence boundaries are at level=sent, then a phrase
query at level=sent with slop=0 and position=0 for all terms would
require that all query terms are in the same sentence.
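
A minimal sketch of slop=0 matching with explicit term positions,
assuming the positions of each query term at the chosen level have
already been read for one document.  LevelPhraseSketch and matches are
hypothetical names; this is not how PhraseQuery is implemented.

```java
import java.util.List;

// Hypothetical sketch of exact phrase matching with explicit positions.
final class LevelPhraseSketch {

    /**
     * Exact (slop=0) match: true if some base position b exists such
     * that term i occurs at b + offsets[i] for every i.  Entry i of
     * termPositions holds the positions of term i at the query's level.
     */
    static boolean matches(List<List<Integer>> termPositions, int[] offsets) {
        if (termPositions.isEmpty()) return false;
        for (int p : termPositions.get(0)) {
            int base = p - offsets[0];
            boolean all = true;
            for (int i = 1; i < offsets.length && all; i++) {
                all = termPositions.get(i).contains(base + offsets[i]);
            }
            if (all) return true;
        }
        return false;
    }
}
```

Using the earlier example at level=sent with every offset 0, "the"
(positions 0, 1, 2) and "start" (position 0) match, while "start" and
"USA" (position 1) do not.  At the default level with offsets 0 and 1,
"the" (0, 2, 4) followed by "USA" (3) matches as an exact phrase.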

------

