lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Performance implications of unanalyzed content
Date Fri, 16 Apr 2004 10:05:59 GMT
On Apr 16, 2004, at 2:59 AM, Magnus Johansson wrote:
> Hi
>
> I'm developing an application using Lucene where I need to
> be able to both search using a stemmer and sometimes using
> "exact" search.
>
> I see two ways of doing this:
>
> 1. Use two indexes. One using a stemming analyzer and one using
>    a SimpleAnalyzer
>
> 2. Use duplicate fields. One field with stemmed content and
>    one with unstemmed content. (Perhaps the field CONTENT will
>    become CONTENT and CONTENT_RAW.)
>
> I'm leaning towards option 2. However, I'm interested in any performance
> implications. If I understand correctly, Lucene keeps separate
> term dictionaries for each field. So besides the index growing larger
> (which might affect caching), searching the index with duplicate fields
> won't be any slower when I only query on the CONTENT field.
>
> Is this correct?

I wouldn't concern yourself with performance at this stage.  Granted,
here in Lucene Land performance is key, but Lucene will be plenty fast
in either of these scenarios.  You say "sometimes" for toggling between
exact and stemmed.  If your requirement were "always" both, you could
leverage another option: have a custom analyzer place the stemmed and
exact terms at the same term position (set the position increment to
zero for the stemmed words).
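
To make the same-position idea concrete, here is a minimal sketch in
plain Java. It deliberately avoids Lucene's own classes, so every name
in it (StackedTokenSketch, Token, toyStem, analyze, positions) is
hypothetical, and the "stemmer" is a toy suffix stripper standing in
for a real one. It only shows the bookkeeping: the exact term advances
the position by one, and its stem is emitted with an increment of zero
so both land on the same position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class StackedTokenSketch {

    /** A term plus its distance from the previous token's position. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Hypothetical toy stemmer: strips a trailing "ing" or "s". Not a real algorithm. */
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Emits each exact term with increment 1, then its stem (if different) with increment 0. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(new Token(raw, 1));
            String stem = toyStem(raw);
            if (!stem.equals(raw)) {
                tokens.add(new Token(stem, 0)); // same position as the exact term
            }
        }
        return tokens;
    }

    /** Resolves increments into absolute positions, starting at 0. */
    static Map<String, List<Integer>> positions(List<Token> tokens) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            result.computeIfAbsent(t.term, k -> new ArrayList<>()).add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions(analyze("Searching indexes")));
    }
}
```

Because the exact and stemmed forms share a position, a single field
answers both kinds of query, which is why this only fits the "always
both" case and not the toggle you describe.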

But since you need to toggle between exact and stemmed, I'd opt for #2 
as well.
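
For what it's worth, option #2 can be pictured with a small toy model
in plain Java. This is not how Lucene stores anything internally; it is
only a sketch, assuming one term map per field, to show that a query
against one field never looks at the other field's terms. The field
names CONTENT and CONTENT_RAW come from your message; the class name,
methods, and toy stemmer are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DualFieldIndexSketch {
    static final String STEMMED_FIELD = "CONTENT";
    static final String RAW_FIELD = "CONTENT_RAW";

    // field name -> (term -> ids of documents containing that term)
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    /** Hypothetical toy stemmer: strips a trailing "ing" or "s". Not a real algorithm. */
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Indexes the same text twice: exact terms in one field, stems in the other. */
    void addDocument(int docId, String text) {
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (raw.isEmpty()) {
                continue;
            }
            post(RAW_FIELD, raw, docId);
            post(STEMMED_FIELD, toyStem(raw), docId);
        }
    }

    private void post(String field, String term, int docId) {
        fields.computeIfAbsent(field, f -> new HashMap<>())
              .computeIfAbsent(term, t -> new TreeSet<>())
              .add(docId);
    }

    /** Picks the field per query; the stemmed field needs the query term stemmed too. */
    Set<Integer> search(String word, boolean exact) {
        String term = word.toLowerCase(Locale.ROOT);
        String field = exact ? RAW_FIELD : STEMMED_FIELD;
        if (!exact) {
            term = toyStem(term);
        }
        return fields.getOrDefault(field, Collections.emptyMap())
                     .getOrDefault(term, Collections.emptySet());
    }
}
```

The toggle then becomes a per-query choice of field name, and the cost
of the duplicate field is paid in index size rather than in the lookup
itself.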

	Erik

