lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Multiple search analyzers on the same field type possible?
Date Fri, 14 Oct 2011 11:36:29 GMT
Hmmmm....

A couple of things.
1> Have you looked at alternate stemmers? The Porter stemmer is rather
     aggressive. Perhaps a less aggressive stemmer would suit your
     internal users.
2> Try a few things, but if you can't solve it reasonably quickly,
     go back to your internal customer and explain the costs of
     fixing this. Really. You're jumping through hoops because
     results "did not please my internal customer". Can they
     quantify their objections? Or is this just looking at the
     results for random searches and guessing at relevance?
     If the latter, you really, really, really need to get them to
     quantify their objections and I bet you'll find that they can't.
     And you'll forever be trying to tweak results to please
     how they feel about it today. Which will be different from
     how they felt about *the exact same results* yesterday.
     You can go around this loop forever.

     We've (programmers in general) done a rather poor job
     historically of laying out the *costs* of fixing things to
     suit a customer and allowing the various stake-holders
     to make rational decisions. We say "Sure, that can be done"
     and leave out "but it will take a month when we won't
     be able to do X, Y, or Z, and requires more hardware".
     There, rant done....

3> I suppose you could think about writing your own filter that
     added the original token and the stemmed token.
     Something like the SynonymFilter but instead of alternate
     versions of the word, you'd have the stemmed version
     and the original version at the same position. Or maybe
     you have the stemmed version and then the original
     version with a special ending character (say $) appended.
     Then you'd have to write a query-time
     analysis chain (or a query parser?) that somehow
     knew enough to use the stemmed or original word (plus $)
     in the query. But I admit I haven't thought this through
     at all. There'd have to be some parameter you passed
     through with the query that controlled whether the
     regular stemming process happened or not... And I
     don't know offhand how that'd work.

     Or reverse that. Append $ to all the stemmed variants.
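
     A rough sketch of idea 3>, in plain Python rather than Lucene
     code, might look like the block below. Everything in it is an
     assumption made for illustration: toy_stem is a hypothetical
     stand-in for a real stemmer, the token stream is modelled as
     (term, position) pairs, and the `stemming` flag stands in for the
     query parameter mentioned above. It shows only the shape of the
     idea, not how a real TokenFilter or query parser would be written.

```python
from typing import Callable, Iterator, List, Tuple

Token = Tuple[str, int]  # (term, position in the field)


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_tokens(
    text: str,
    stem: Callable[[str], str] = toy_stem,
    marker: str = "$",
) -> Iterator[Token]:
    """Emit the stemmed term and the marked original at the same position."""
    for position, word in enumerate(text.lower().split()):
        yield (stem(word), position)      # matches stemmed queries
        yield (word + marker, position)   # matches exact (unstemmed) queries


def query_terms(
    query: str,
    stemming: bool,
    stem: Callable[[str], str] = toy_stem,
    marker: str = "$",
) -> List[str]:
    """Pick the stemmed or the marked-original form for each query word."""
    words = query.lower().split()
    if stemming:
        return [stem(w) for w in words]
    return [w + marker for w in words]


if __name__ == "__main__":
    print(list(index_tokens("Running dogs")))
    print(query_terms("dogs", stemming=True))
    print(query_terms("dogs", stemming=False))
```

     Note that this indexes two terms for every word, so it would
     likely make the index larger, which matters given the storage
     concern in the original question.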

But really, before going there (which I admit would be more
fun than arguing with your customer), try one of the less
aggressive stemmers. Or see if your other stake-holders
would be better served by not stemming at all. Or....

Best
Erick


On Fri, Oct 14, 2011 at 3:22 AM, Victor <scanner598@yahoo.co.uk> wrote:
> Hi Erick,
>
> I work for a very big library and we store huge amounts of data. Indexing
> some of our collections can take days and the index files can get very big.
> We are a non-profit organisation, so we want to provide maximum service to
> our customers but at the same time we are bound to a fixed budget and want
> to keep costs as low as possible (including disk space). Our customers vary
> from academic people that want to do very precise searches to common users
> who want to search in a more general way. The library now wants to implement
> some form of stemming, but we have had one demo in the past with a stemmer
> that returned results that did not please my internal customer (another
> department).
>
> So my wish list looks like this:
>
> 1) Implement stemming
> 2) Give the end user the possibility to turn stemming on or off for their
> searches
> 3) Have maximum control over the stemmer without the need to reindex if we
> change something there
> 4) Prevent the need for more storage (to keep the operations people happy)
>
> So far I have been able to satisfy 1, 2 and 3. I am using a synonyms list at
> query time to apply my stemming. The synonym list I build as follows:
>
> a) load a library (a text file with 1 word per line)
> b) remove stop words from the list
> c) link words that have the same stem
>
> Bullet c) is a little bit more sophisticated, because I do not link words
> that are already part of a pre-defined synonym list that contains
> exceptions.
>
> All this I do to keep maximum control over the behaviour of the stemmer.
> Since this is a demo and it will be used to convince other people in my
> organisation that stemming could be worth implementing, I need to be able to
> adjust its behaviour quickly.
>
> So far processing speed has not been an issue, but disk storage has.
> Generally, at index time we remove as few tokens as possible and our objects
> are complete books, newspapers (from 1618 until 1995), etc. So you can
> imagine that our indexes get very, very big.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Multiple-search-analyzers-on-the-same-field-type-possible-tp3417898p3420946.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
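
The synonym-list construction described in steps a) to c) of the quoted message could be sketched roughly as follows. This is not Victor's actual code: the function name, the toy stemmer, and the sample words are all hypothetical, and the output simply uses comma-separated groups of equivalent words, one group per line.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_groups(
    words: Iterable[str],
    stop_words: Set[str],
    exceptions: Set[str],
    stem: Callable[[str], str] = toy_stem,
) -> List[str]:
    """Group words sharing a stem, skipping stop words and exceptions."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for raw in words:                      # a) one word per input line
        word = raw.strip().lower()
        if not word or word in stop_words:  # b) drop stop words
            continue
        if word in exceptions:             # c) leave predefined exceptions alone
            continue
        groups[stem(word)].add(word)
    # Only stems with at least two distinct words produce a synonym line.
    return [
        ", ".join(sorted(members))
        for _, members in sorted(groups.items())
        if len(members) > 1
    ]


if __name__ == "__main__":
    sample = ["dog", "dogs", "the", "cat", "cats", "news"]
    for line in build_stem_groups(sample, {"the"}, {"news"}):
        print(line)
```

Because the groups are rebuilt from a word list and applied at query time, changing the stemmer or the exception list only requires regenerating this file, not reindexing, which is the property wish 3 asks for.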
