lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..
Date Thu, 27 Jan 2011 15:32:23 GMT
Tokenization is fine with facets; that caution is about, say, faceting
on the tokenized body of a document, where you potentially have
a huge number of unique tokens.

But if there is a controlled number of distinct values, you shouldn't have
to do anything except index to a tokenized field. I'd remove stemming,
WordDelimiterFilterFactory, etc. though; in fact, I'd probably just go with
WhitespaceTokenizerFactory and, maybe, LowerCaseFilterFactory.

But if you have a huge number of unique values, it doesn't matter whether
they are tokenized or strings; it'll still be a problem.

One note: when faceting for the first time on a newly started Solr instance,
the caches have to be filled, so the *first* query will be slower. Measure
subsequent queries instead.
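
A minimal sketch of that measurement, not a definitive benchmark: it times the
same query several times and reports the first (cold) run separately from the
later (warm) ones. The names time_queries and fake_facet_query are hypothetical;
fake_facet_query is only a stand-in for whatever code sends your real facet
request to Solr.

```python
import statistics
import time


def time_queries(run_query, repeats=5):
    """Call run_query `repeats` times.

    Returns (first_seconds, later_seconds), where later_seconds is a list
    holding the timings of every call after the first.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings[0], timings[1:]


def fake_facet_query():
    # Hypothetical placeholder: replace with the HTTP request that runs
    # your faceted search against Solr.
    sum(range(1000))


first, later = time_queries(fake_facet_query, repeats=5)
print("first (cold) query: %.6f s" % first)
print("median warm query:  %.6f s" % statistics.median(later))
```

Comparing the median of the later runs against the first run shows how much of
the cost is cache warm-up rather than steady-state faceting.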

Best
Erick

On Thu, Jan 27, 2011 at 9:09 AM, Dennis Schafroth <dennis@indexdata.com> wrote:

> Hi,
>
> I'm pretty new to Solr coding, but I'm looking for hints about how (if not
> already done) to implement a PatternTokenizer that would index this into
> multi-valued fields of solr.StrField for faceting. Ex.
>
> Water -- Irrigation ; Water -- Sewage
>
> should be tokenized into
>
> Water
> Irrigation
> Sewage
>
> in multi-valued non-tokenized fields due to performance. I could do it from
> the outside, but I would like to use this as an opportunity to learn about Solr.
>
> It "works" as I want with the PatternTokenizerFactory when I am using
> solr.TextField, but not when I am using the non-tokenized solr.StrField. But
> according to what I have read, facet performance is better on non-tokenized
> fields. We need better performance on our faceted searches on these
> multi-valued fields. (25 million documents, three multi-valued facets)
>
> I would also need a filter that filters out identical values, as the
> feeds have redundant data as shown above.
>
> Can anyone point me in the right direction?
>
> cheers,
> :-Dennis
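
For the "do it from the outside" option mentioned in the quoted message, the
split and de-duplication could be sketched roughly as below, before the values
are sent to a multi-valued solr.StrField. This is an illustration under
assumptions, not the poster's code: the function name split_facet_values and
the delimiter pattern are hypothetical, and the pattern assumes that " -- " and
" ; " are the only separators in the feed.

```python
import re

# Assumed delimiters: "--" between levels and ";" between headings,
# each with optional surrounding whitespace.
DELIMITERS = re.compile(r"\s*(?:--|;)\s*")


def split_facet_values(raw):
    """Split a raw subject string into distinct values, keeping first-seen order."""
    seen = set()
    values = []
    for part in DELIMITERS.split(raw):
        part = part.strip()
        # Skip empty fragments and values already emitted for this document.
        if part and part not in seen:
            seen.add(part)
            values.append(part)
    return values


print(split_facet_values("Water -- Irrigation ; Water -- Sewage"))
# prints ['Water', 'Irrigation', 'Sewage']
```

Each returned value would then be added as one value of the multi-valued field,
so the field itself can stay non-tokenized.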
