lucene-java-user mailing list archives

From Stephen Wu <...@ohsu.edu>
Subject FieldMaskingSpanQuery and statistics
Date Wed, 15 Apr 2015 04:53:35 GMT
In the documentation for FieldMaskingSpanQuery, it says:

"Note: as getField() returns the masked field, scoring will be done using the Similarity and
collection statistics of the field name supplied, but with the term statistics of the real
field. This may lead to exceptions, poor performance, and unexpected scoring behavior."

I assume it was implemented this way because the anticipated use case involved very short
fields, where collection statistics and idf matter little because the queries are essentially
boolean.

However, we've given a lot of thought to how we could include linguistic annotations alongside
the original text, and we're looking at separate fields + FieldMaskingSpanQuery to do the
trick. (The idea is to create "annotation" fields whose token positions are aligned with those
of the tokenized text. FieldMaskingSpanQuery then lets us search both text and annotations as
if they were at the same token position in the same field. We've considered payloads, synonyms,
and a few other things, but have not really been satisfied.)

For this to be scientifically interesting, though, we need the collection statistics to remain
consistent with the original "annotation" field; we would also like to ensure that these
statistics and all SpanQuery descendants work with LMDirichletSimilarity.
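
To illustrate why the source of the collection statistics matters for a language-model
similarity, here is a minimal sketch in plain Java. It is not Lucene code. It assumes a
Dirichlet-smoothed score of the form log(1 + freq / (mu * p(t|C))) + log(mu / (docLen + mu)),
clamped at zero, with p(t|C) estimated as (totalTermFreq + 1) / (fieldTokens + 1). The class
name, the method names, and every number below are made up for illustration. The document
length is held fixed so that only the collection statistics differ between the two cases.

```java
// Hypothetical illustration only: plain arithmetic, no Lucene classes involved.
final class MaskedStatsSketch {

    /** Assumed collection model: add-one smoothed probability of the term in a field. */
    static double collectionProbability(long totalTermFreq, long fieldTokens) {
        return (totalTermFreq + 1.0) / (fieldTokens + 1.0);
    }

    /** Assumed Dirichlet-smoothed score for one term in one document, clamped at zero. */
    static double dirichletScore(double freq, double docLen, double mu, double pCollection) {
        double score = Math.log(1.0 + freq / (mu * pCollection))
                     + Math.log(mu / (docLen + mu));
        return Math.max(score, 0.0);
    }

    public static void main(String[] args) {
        double mu = 2000.0;                    // smoothing parameter (illustrative value)
        long termOccurrences = 5_000L;         // annotation term count, from the annotation field
        long annotationFieldTokens = 2_000_000L;   // made-up size of the annotation field
        long textFieldTokens = 20_000_000L;        // made-up size of the masked (text) field

        // Term and collection statistics both taken from the annotation field.
        double pConsistent = collectionProbability(termOccurrences, annotationFieldTokens);
        // Term statistics from the annotation field, collection statistics from the text field.
        double pMixed = collectionProbability(termOccurrences, textFieldTokens);

        double consistent = dirichletScore(2.0, 300.0, mu, pConsistent);
        double mixed = dirichletScore(2.0, 300.0, mu, pMixed);

        System.out.println("consistent statistics: " + consistent);
        System.out.println("mixed statistics:      " + mixed);
    }
}
```

With these made-up numbers the mixed case divides the term count by a much larger field, so
the term looks rarer than it really is in the annotation field and its score is inflated.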

Any idea how to implement a FieldMaskingSpanQuery that gets collection statistics right?

Many thanks for any help on the issue.

stephen
P.S. Has anyone made progress on allowing indexes to store word lattices, preserving the graphs
that are produced with TokenFilters?