lucene-solr-user mailing list archives

From Robust Links <>
Subject phrase query in solr 4
Date Fri, 24 Oct 2014 17:51:03 GMT

We are trying to upgrade our index from 3.6.1 to 4.9.1, and I want to confirm
whether our existing indexing strategy is still valid. The statistics of the
raw corpus are:

- 4.8 billion tokens in the entire corpus

- 13 million documents

We have three requirements:

1) We want to index and search all tokens in a document (i.e., we do not
rely on external stores).

2) We need search to be fast, and we are willing to pay for it with longer
indexing time and a larger index.

3) We need to search n-grams of 3 tokens or fewer (i.e., unigrams, bigrams,
and trigrams) as fast as possible.
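
To make requirement (3) concrete, here is a minimal Python sketch of the terms
we need to be searchable for a given token sequence. It is only an
illustration, not part of our Solr configuration: the function name
`word_ngrams` and the sample tokens are made up for this example, and it
assumes the text has already been split into tokens.

```python
def word_ngrams(tokens, max_n=3):
    """Return every contiguous n-gram of 1..max_n tokens, in position order."""
    grams = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            # Skip n-grams that would run past the end of the token list.
            if start + n <= len(tokens):
                grams.append(" ".join(tokens[start:start + n]))
    return grams


if __name__ == "__main__":
    # Hypothetical sample input; each printed line is one searchable term.
    for gram in word_ngrams(["please", "divide", "this", "sentence"]):
        print(gram)
```

Each token position contributes up to three terms, so indexing every n-gram
this way yields roughly three times as many terms as tokens, which is the
index-size cost we are willing to accept under requirement (2).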

To satisfy (1), our 3.6.1 index used the default
<maxFieldLength>2147483647</maxFieldLength> in solrconfig.xml to specify the
total number of tokens to index in an article. In Solr 4 we are specifying it
via the tokenizer in the analyzer chain:

 <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"

To satisfy (2) and (3), our 3.6.1 index used the following
ShingleFilterFactory in the analyzer chain:

<filter class="solr.ShingleFilterFactory" outputUnigrams="true"

This was based on this thread:

The open questions we are trying to understand now are:

1) Is shingling still the best strategy for phrase (n-gram) search, given
our requirements above?

2) If not, what would be a better strategy?

Thank you in advance for your help.

