lucene-java-user mailing list archives

From "Jack Krupansky"
Subject Re: Boolean and SpanQuery: different results
Date Thu, 13 Dec 2012 17:00:58 GMT
Can you provide some examples of terms that don't work and the index token 
stream they fail on?

Make sure that the Analyzer you are using doesn't do any magic on the 
indexed terms - your query term is unanalyzed. Perhaps multiple distinct 
index terms are being analyzed to the same, unexpected term.
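
To illustrate the point about unanalyzed query terms, here is a minimal plain-Java sketch. It does not use Lucene at all: analyze() is a hypothetical stand-in for whatever the index-time Analyzer does (here it only lowercases), and java.util.regex stands in for RegexpQuery, whose regular-expression syntax differs. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class AnalyzedTermCheck {

    // Hypothetical stand-in for an index-time analyzer: it only lowercases.
    static String analyze(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Returns the raw tokens whose *indexed* form fully matches the
    // unanalyzed regular expression.
    static List<String> matchingTokens(List<String> rawTokens, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String raw : rawTokens) {
            if (pattern.matcher(analyze(raw)).matches()) {
                hits.add(raw);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Haus", "haus", "HAUS", "Maus");
        // The query is not analyzed, so the capitalised pattern finds nothing.
        System.out.println(matchingTokens(raw, "Haus")); // []
        // Three distinct surface forms collapse to the same indexed term.
        System.out.println(matchingTokens(raw, "haus")); // [Haus, haus, HAUS]
    }
}
```

If the real Analyzer normalises tokens in a similar way, a pattern written against the surface form can miss documents, or match more of them than expected.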

-- Jack Krupansky

-----Original Message----- 
From: Carsten Schnober
Sent: Thursday, December 13, 2012 10:49 AM
Subject: Boolean and SpanQuery: different results

I'm following Grant's advice on how to combine BooleanQuery and SpanQuery.

The strategy is to perform a BooleanQuery, get the document ID set and
perform a SpanQuery restricted to those documents. The purpose is to
retrieve Spans for different terms so that I can extract their
respective payloads separately, with the precondition that all of the
(possibly multiple) terms occur within the documents. My code looks like this:

/* reader and terms are class variables and have been declared final
before */
IndexReader reader = ...;
List<String> terms = ...;

/* perform the BooleanQuery and store the document IDs in a BitSet */
BitSet bits = new BitSet(reader.maxDoc());
AllDocCollector collector = new AllDocCollector();
BooleanQuery bq = new BooleanQuery();
for (String term : terms)
  bq.add(new RegexpQuery(new Term(config.getFieldname(), term)), Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(bq, collector);
for (ScoreDoc doc : collector.getHits())
  bits.set(doc.doc);

/* get the spans for each term separately */
for (String term : terms) {
  String payloads = retrieveSpans(term, bits);
  // process and print payloads for term ...
}

String retrieveSpans(String term, BitSet bits) throws IOException {
  StringBuilder payloads = new StringBuilder();
  Map<Term, TermContext> termContexts = new HashMap<>();
  Spans spans;
  SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(
      new RegexpQuery(new Term("text", term))).rewrite(reader);

  for (AtomicReaderContext atomic : reader.leaves()) {
    spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
    while (spans.next()) {
      // extract and store payloads in 'payloads' StringBuilder
    }
  }
  return payloads.toString();
}

This construction seemed to work fine at first, but I noticed a
disturbing behaviour: for many terms, a BooleanQuery fed with only one
RegexpQuery matches a larger number of documents than the SpanQuery
constructed from the same RegexpQuery.
When the BooleanQuery contains only one RegexpQuery, the two numbers should
be identical; when multiple queries are added to the BooleanQuery, the
SpanQuery should return an equal or larger number of results. This behaviour
is reliably reproducible, even after re-indexing, but not for all tokens.
Does anyone have an explanation for that?
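
One way to narrow such a mismatch down is to list exactly which documents are matched by the BooleanQuery but yield no span. The plain-Java sketch below only shows the set difference. It assumes both BitSets have already been filled with document IDs from the same ID space. The class name, method name and sample IDs are hypothetical and not part of the code above.

```java
import java.util.BitSet;

public class DocSetDiff {

    // Returns the IDs that are set in 'first' but not in 'second'.
    // Neither input is modified.
    static BitSet onlyInFirst(BitSet first, BitSet second) {
        BitSet diff = (BitSet) first.clone();
        diff.andNot(second);
        return diff;
    }

    public static void main(String[] args) {
        // Made-up sample data: documents matched by the BooleanQuery ...
        BitSet booleanDocs = new BitSet();
        for (int doc : new int[] {0, 3, 5, 8}) {
            booleanDocs.set(doc);
        }
        // ... and documents for which at least one span was found.
        BitSet spanDocs = new BitSet();
        for (int doc : new int[] {0, 5}) {
            spanDocs.set(doc);
        }
        System.out.println(onlyInFirst(booleanDocs, spanDocs)); // {3, 8}
    }
}
```

The documents in the difference are the ones whose index token stream is worth inspecting for the affected term.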


Institut für Deutsche Sprache |
Projekt KorAP                 |
Tel. +49-(0)621-43740789      |
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

