lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Proposal about Version API "relaxation"
Date Thu, 15 Apr 2010 18:02:56 GMT
On 04/15/2010 01:50 PM, Earwin Burrfoot wrote:
>> First, the index format. IMHO, it is a good thing for a major release to be
>> able to read the prior major release's index. And the ability to convert it
>> to the current format via optimize is also good. Whatever is decided on this
>> thread should take this seriously.
>>      
> Optimize is a bad way to convert to current.
> 1. conversion is not guaranteed, optimizing already optimized index is a noop
> 2. it merges all your segments. if you use BalancedSegmentMergePolicy,
> that destroys your segment size distribution
>
> Dedicated upgrade tool (available both from command-line and
> programmatically) is a good way to convert to current.
> 1. conversion happens exactly when you need it, conversion happens for
> sure, no additional checks needed
> 2. it should leave all your segments as is, only changing their format
>
>    
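The difference between the two conversion paths described above can be sketched with a toy model. This is only an illustration in Python, not Lucene's API: `Segment`, `optimize`, `upgrade`, and `CURRENT_FORMAT` are hypothetical names, and a segment is reduced to a document count plus a format version.

```python
from dataclasses import dataclass, replace
from typing import List

CURRENT_FORMAT = 3  # hypothetical "current" on-disk format number


@dataclass(frozen=True)
class Segment:
    doc_count: int
    format_version: int


def optimize(segments: List[Segment]) -> List[Segment]:
    """Toy optimize: merge everything into one segment in the current format."""
    if len(segments) <= 1:
        # Already optimized: nothing is rewritten, so an old format survives.
        return list(segments)
    total = sum(s.doc_count for s in segments)
    return [Segment(total, CURRENT_FORMAT)]


def upgrade(segments: List[Segment]) -> List[Segment]:
    """Toy upgrade: rewrite each segment's format, keep the segment layout."""
    return [replace(s, format_version=CURRENT_FORMAT) for s in segments]
```

In this model, optimizing a single old-format segment returns it unchanged, while upgrading converts it; and upgrading a multi-segment index keeps the per-segment sizes that optimizing would collapse into one.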
>> It is my observation, though possibly not correct, that core only has
>> rudimentary analysis capabilities, handling English very well. To handle
>> other languages well "contrib/analyzers" is required. Until recently it did
>> not get much love. There have been many bw compat breaking changes (though
>> w/ version one can probably get the prior behavior). IMHO, most of
>> contrib/analyzers should be core. My guess is that most non-trivial
>> applications will use contrib/analyzers.
>>      
> I counter - most non-trivial applications will use their own analyzers.
> The more modules - the merrier. You can choose precisely what you need.
>    
By and large, an analyzer is a simple wrapper around a tokenizer and some
filters. Are you suggesting that most non-trivial apps write their own
tokenizers and filters?

I'd find that hard to believe. For example, I don't know enough Chinese,
Farsi, Arabic, Polish, ... to come up with anything better than what
Lucene already provides to tokenize, stem, or filter them.
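
To make the "simple wrapper" point concrete, here is a minimal sketch. It is written in Python for brevity and is not Lucene's Java API; `Analyzer`, `whitespace_tokenizer`, `lowercase_filter`, and `stop_filter` are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Tokenizer = Callable[[str], Iterator[str]]
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Split text into tokens on runs of whitespace."""
    return iter(text.split())


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Lowercase every token."""
    for token in tokens:
        yield token.lower()


def stop_filter(stop_words: Iterable[str]) -> TokenFilter:
    """Build a filter that drops the given stop words."""
    stop = frozenset(stop_words)

    def apply(tokens: Iterable[str]) -> Iterator[str]:
        for token in tokens:
            if token not in stop:
                yield token

    return apply


class Analyzer:
    """An analyzer here is just one tokenizer plus an ordered filter chain."""

    def __init__(self, tokenizer: Tokenizer, filters: Iterable[TokenFilter]):
        self.tokenizer = tokenizer
        self.filters = list(filters)

    def analyze(self, text: str) -> List[str]:
        stream: Iterable[str] = self.tokenizer(text)
        for token_filter in self.filters:
            stream = token_filter(stream)  # each filter wraps the previous stream
        return list(stream)
```

The wrapper itself is trivial. The hard, language-specific work lives in the tokenizer and filters, which is the part most applications would rather reuse than write.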

>    
>> Our user base are those with ancient,
>> underpowered laptops in 3-rd world countries. On those machines it might
>> take 10 minutes to create an index and during that time the machine is
>> fairly unresponsive. There is no opportunity to "do it in the background."
>>      
> Major Lucene releases (feature-wise, not version-wise) happen like
> once in a year, or year-and-a-half.
> Is it that hard for your users to wait ten minutes once a year?
>    
I said that was for one index. Multiply that by the number of books
available (300+) and yes, it is too much to ask. Even if only a small
subset is indexed, say 30, that's around 5 hours of waiting.
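
The arithmetic, as a small Python sketch: the 10 minutes per index and the counts of 30 and 300 books come from the figures above, while the function name and the assumption that every index takes the same time are only for illustration.

```python
MINUTES_PER_INDEX = 10  # rough time to build one index on a slow laptop


def reindex_hours(num_indexes: int, minutes_per_index: float = MINUTES_PER_INDEX) -> float:
    """Total wall-clock hours to rebuild num_indexes indexes one after another."""
    return num_indexes * minutes_per_index / 60


print(reindex_hours(30))   # small subset of books: 5.0 hours
print(reindex_hours(300))  # full set of 300 books: 50.0 hours
```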

What is under consideration here is how often compatibility breaks. Some
are suggesting breaking more often than once a year.

DM


