lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yonik Seeley <ysee...@gmail.com>
Subject Re: Spans, appended fields, and term positions
Date Sun, 20 Nov 2005 17:00:32 GMT
> Does it make sense to add an IndexWriter setting to
> specify a default position increment gap to use when multiple fields
> are added in this way?

Per-field might be nice...

The good news is that Analyzer is an abstract class, and not an
Interface, so we could add something to it without breaking existing
analyzers. (a benefits of classes over interfaces that rarely get
mentioned).

int Analyzer.getPositionIncrementGap(String field)
or getMultiValuedFieldGap(String field)

But what might be even more powerful is to leave everything up to the
analyzer, where you could choose to do a big position increment,
generate a special token, or anything else one might think of.

You can't do this right now in the analyzer because of a lack of info
(you don't know if you are on the first field or a subsequent one. 
One could always add a big position increment at the start of every
field, but I suspect that would blow up the index size.  Another way
is to give more context info to the Analyzer:

Analyzer.analyzer.tokenStream(fieldName, reader, flags)
  where one of the flags could be REPEATED_FIELD or something.

Analyzer.getPositionIncrementGap(String field) is certainly the
simplest option if all you want to do is have a gap, and I guess these
different options aren't mutually exclusive.


-Yonik
Now hiring -- http://forms.cnet.com/slink?231706


On 11/20/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> I'm working on building a custom highlighter for a client, which may
> eventually be generalizable.  In my work, I've come across some
> issues I'd like to discuss.  One issue is of appended fields allowing
> querying across boundaries.  For example, if I index two fields with
> the same name:
>
>          doc.add(new Field("repeated", "this is a repeated field -
> first instance", Field.Store.YES,
>                  Field.Index.TOKENIZED));
>          doc.add(new Field("repeated", "this is a repeated field -
> second instance", Field.Store.YES,
>                  Field.Index.TOKENIZED));
>
> A query, of course by design, can span across those two field
> instances.  This query, against an index with only the above single
> document in it, shows the effect:
>
>      SpanTermQuery first = new SpanTermQuery(new Term("repeated",
> "first"));
>      SpanTermQuery second = new SpanTermQuery(new Term("repeated",
> "second"));
>      SpanNearQuery wrapped = new SpanNearQuery(new SpanQuery[]
> { first, second}, 7, true);
>
>      IndexSearcher searcher = new IndexSearcher(reader);
>      Hits hits = searcher.search(wrapped);
>      assertEquals(1, hits.length());
>
> So my first question is how could this match be prevented?
> Technically if the second "this" has a large position increment then
> there would be no match.  But how could I achieve that large position
> increment?  Does it make sense to add an IndexWriter setting to
> specify a default position increment gap to use when multiple fields
> are added in this way?  And likely a setting on a per-field basis to
> specify an increment offset to use for that individual field.  There
> isn't a way an Analyzer itself could address this situation, is there?
>
> Highlighting is quite a challenging endeavor!  Spans certainly
> provides a lot of help, but in the appended field scenario, the
> Spans.start() and .end() goes across the field boundary, so it
> requires, in my case with the text coming from stored field values,
> cleverness in how to highlight in order to keep field instances
> separate.
>
> So, should we make some changes in allowing offsets to be controllable?
>
> Thanks,
>      Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message