lucene-solr-user mailing list archives

From Robust Links <pey...@robustlinks.com>
Subject phrase query in solr 4
Date Mon, 27 Oct 2014 12:20:43 GMT
Hi

We are trying to upgrade our index from 3.6.1 to 4.9.1, and I wanted to
check whether our existing indexing strategy is still valid. The statistics
of the raw corpus are:

- 4.8 billion tokens in the entire corpus

- 13 million documents


We have 3 requirements


1) we want to index and search all tokens in a document (i.e. we do not
rely on external stores)

2) we need search time to be fast, and we are willing to pay for it with
longer indexing time and a larger index,

3) we need to search n-grams of 3 tokens or fewer (i.e., unigrams, bigrams
and trigrams) as fast as possible.


To satisfy (1), in the solrconfig.xml of our 3.6.1 index we used the default
<maxFieldLength>2147483647</maxFieldLength> to specify the total number of
tokens to index in an article. In Solr 4 we are specifying it via the
tokenizer in the analyzer chain:


<tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>


To satisfy (2) and (3), in our 3.6.1 index we used the following
ShingleFilterFactory in the analyzer chain:


<filter class="solr.ShingleFilterFactory" outputUnigrams="true"
maxShingleSize="3"/>
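
To make the effect of this filter concrete, here is a minimal Python sketch of the terms we expect it to produce for a short token stream. The shingles helper and the sample tokens are hypothetical and only mimic the filter's documented behavior (space-separated word n-grams up to maxShingleSize, with unigrams included); it is not Lucene's implementation and it ignores position increments.

```python
def shingles(tokens, max_size=3, output_unigrams=True):
    """Return word n-grams (shingles) for a list of tokens.

    Hypothetical helper: approximates ShingleFilterFactory with
    outputUnigrams and maxShingleSize, using a space as separator.
    """
    min_size = 1 if output_unigrams else 2
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            # Only emit shingles that fit inside the token stream.
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    # Sample tokens are made up for illustration.
    for term in shingles(["fast", "phrase", "search", "index"]):
        print(term)
```

Each trigram, bigram and unigram becomes a single indexed term, so a phrase of up to three tokens can be matched as one term lookup instead of a positional phrase match. The cost is an index with close to three times as many term occurrences.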


This was based on this thread:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200808.mbox/%3C856ac15f0808161539p54417df2ga5a6fdfa35889851@mail.gmail.com%3E


The open questions we are trying to understand now are:


1) whether shingling is still the best strategy for phrase (ngram) search
given our requirements above?

2) if not, what would be a better strategy?


Thank you in advance for your help.


Peyman
