lucene-solr-user mailing list archives

From Yao Ge <yao...@gmail.com>
Subject Faceting on text fields
Date Thu, 04 Jun 2009 16:01:45 GMT

I am indexing a database with over 1 million rows. Two of the fields contain
unstructured text, but the size of each field is limited (256 characters).

I came up with the idea of visualizing the text fields as a text cloud by
turning the two text fields into facets. The font weight and size of each
facet value (word) are derived from the facet counts. I used a simpler field
type so that no stemming is applied to these facet values:
    <fieldType name="word" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
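
To illustrate how the font size of each word could be derived from the facet counts, here is a minimal client-side sketch. The function name `font_sizes`, the pixel range, and the log scaling are all assumptions made for illustration; they are not part of Solr or of my actual setup.

```python
import math


def font_sizes(facet_counts, min_px=10, max_px=36):
    """Map facet counts to font sizes in pixels using a log scale.

    Hypothetical helper: `facet_counts` is a dict of word -> count, as a
    client might build it from a facet response.
    """
    # Ignore zero counts, since log(0) is undefined.
    logs = {w: math.log(c) for w, c in facet_counts.items() if c > 0}
    if not logs:
        return {}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    sizes = {}
    for word, value in logs.items():
        # If every count is equal, fall back to the midpoint size.
        ratio = (value - lo) / span if span else 0.5
        sizes[word] = round(min_px + ratio * (max_px - min_px))
    return sizes
```

A log scale is used here only because word frequencies tend to be heavily skewed; a linear scale would work the same way with `math.log` removed.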

The facet query is considerably slower compared to facets on structured
database fields (with highly repeated values). What I found interesting is
that even after I constrained the search results to just a few hundred hits
using other facets, these text facets are still very slow.
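
For concreteness, a request of the kind described above could be built as in the sketch below. The field names `notes_words` and `category` and the base URL are hypothetical; `facet`, `facet.field`, `facet.limit`, `facet.mincount`, `fq`, and `rows` are standard Solr request parameters, but the values shown are assumptions rather than my exact query.

```python
from urllib.parse import urlencode


def facet_params(text_field, filters, limit=100):
    """Build Solr parameters that facet on one text field (hypothetical helper)."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                  # only facet counts are needed
        ("facet", "true"),
        ("facet.field", text_field),
        ("facet.limit", str(limit)),    # top N words for the cloud
        ("facet.mincount", "1"),        # skip words absent from the result set
    ]
    # Each constraint from another facet becomes its own filter query.
    params.extend(("fq", f) for f in filters)
    return params


def facet_url(base_url, params):
    """Append the encoded parameters to the select handler URL."""
    return base_url + "?" + urlencode(params)
```

For example, `facet_url("http://localhost:8983/solr/select", facet_params("notes_words", ["category:books"]))` yields a URL that narrows the results with one filter query and then facets on the text field.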

I understand that text fields are not good candidates for faceting, as they
can contain a very large number of unique values. However, why is it still
slow after my matching documents are reduced to hundreds? Is it because the
whole filter is cached (regardless of the matching docs) and my filter cache
is not large enough to fit the whole list?

The following is my filterCache setting:
     <filterCache class="solr.LRUCache" size="5120" initialSize="512"
                  autowarmCount="128"/>

Lastly, what I really want is to give users a chance to visualize and
filter on the top relevant words in the free-text fields. Are there
alternatives to the facet field approach? Term vectors? I could do
client-side processing based on the top N (say 100) hits for this, but that
is my last option.
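
The client-side fallback mentioned above might look roughly like the following sketch. It assumes the top N hits have already been fetched as a list of dicts; the function name `top_words` and the field name `notes` are hypothetical, and the whitespace split plus lowercasing only approximates the analyzer chain shown earlier.

```python
from collections import Counter


def top_words(docs, field, stopwords=frozenset(), limit=100):
    """Count in how many of the given docs each word occurs (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        tokens = str(doc.get(field, "")).lower().split()
        # Use a set so a word counts once per document, like a facet count.
        counts.update({t for t in tokens if t not in stopwords})
    return counts.most_common(limit)
```

Called as `top_words(hits, "notes", stopwords={"the", "a"})`, this returns (word, document count) pairs, but only for the fetched hits and not for the whole result set.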