lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Promoting new analysis components
Date Thu, 09 Feb 2012 01:49:52 GMT
On Wed, Feb 08, 2012 at 05:04:56PM +0100, Nick Wellnhofer wrote:
> On 23/12/2011 04:18, Marvin Humphrey wrote:
>> Now that EasyAnalyzer is in, I think we should promote the use of all the
>> improvements Nick has made to the analysis chain.
>>
>>    * Swap in EasyAnalyzer for PolyAnalyzer, Normalizer for CaseFolder, and
>>      StandardTokenizer for RegexTokenizer everywhere we can.
>
> Done.

Excellent!  Lots of great-looking commits coming through.  The revisions to
the tutorial looked sane; I figured that would be the trickiest part.
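
For reference, the swap looks roughly like this in user code.  This is a
sketch based on the documented constructors, not an excerpt from the commits;
the variable names are only illustrative.

```perl
use strict;
use warnings;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Analysis::Normalizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Analysis::StandardTokenizer;

# Before: PolyAnalyzer with the "language" shortcut, which chains
# CaseFolder, RegexTokenizer and SnowballStemmer.
my $old = Lucy::Analysis::PolyAnalyzer->new( language => 'en' );

# After: EasyAnalyzer, which chains StandardTokenizer, Normalizer and
# SnowballStemmer.
my $new = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );

# The same chain spelled out by hand, for cases where a custom chain is
# needed.  PolyAnalyzer's "analyzers" parameter is unaffected by the
# proposed deprecation of "language".
my $custom = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [
        Lucy::Analysis::StandardTokenizer->new,
        Lucy::Analysis::Normalizer->new,
        Lucy::Analysis::SnowballStemmer->new( language => 'en' ),
    ],
);
```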

>>    * Deprecate the "language" parameter to PolyAnalyzer#new.
>>
>> By "deprecate", I mean:
>>
>>    * Open a JIRA issue so that a suitably titled entry ends up in the CHANGES
>>      file.
>>    * Mark the "language" param as "deprecated" in the PolyAnalyzer docs.
>>
>> We don't have a strong deprecation mechanism available to us right now, so I
>> think that's the best we can do.
>
> I just noticed that I removed the "language" parameter from the  
> PolyAnalyzer docs, but I can revert that part of my commit and mark it  
> as deprecated.
>
> Regarding the JIRA issue: I couldn't find a good issue type for  
> deprecations. "Task" seems the most appropriate to me.

I agree, there's no good answer, so +1 for "Task".

>> It's not important that any of these changes happen before 0.3.0.  The docs
>> changes can happen at any time, and the parameter deprecation only allows the
>> simplification of a single class (PolyAnalyzer itself).  It would also be nice
>> to switch most test cases to use the new Analyzers, but that can also happen
>> at any time.
>
> The tests have been converted, too.

Lookin' good!

>> In contrast, here are a couple changes we should *not* make prior to 0.3.0,
>> because they have index compatibility implications:
>>
>>    * Change Lucy::Simple to use EasyAnalyzer instead of PolyAnalyzer.
>
> I've done that now.

After reviewing the Lucy::Simple code, I realized that we can avoid breaking
compat with only a few extra lines.

  * If the index exists during new(), extract the schema and type from what's
    on disk.
  * Otherwise, create a new EasyAnalyzer for the type.

That way, we avoid a schema conflict crash when indexes built by Lucy::Simple
prior to 0.4.0 are read by 0.4.0 or above.
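
A minimal sketch of that logic, assuming a standalone helper rather than the
real Lucy::Simple internals.  The name schema_and_type() is hypothetical, and
the sketch relies on Lucy::Simple using a single field type for every field.

```perl
use strict;
use warnings;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Index::IndexReader;
use Lucy::Plan::FullTextType;
use Lucy::Plan::Schema;

# Hypothetical helper: returns ( $schema, $type ) for a Lucy::Simple-style
# index at $path.
sub schema_and_type {
    my ( $path, $language ) = @_;

    # Opening a reader may throw (for instance if the directory does not
    # exist), so guard it with eval and treat failure as "no index".
    my $reader = eval { Lucy::Index::IndexReader->open( index => $path ) };

    if ( $reader && @{ $reader->seg_readers } ) {
        # Existing index: reuse whatever schema and type are on disk, so an
        # index built with the old analyzer keeps working.
        my $schema = $reader->get_schema;
        my $fields = $schema->all_fields;
        if (@$fields) {
            return ( $schema, $schema->fetch_type( $fields->[0] ) );
        }
    }

    # No usable index: start fresh with an EasyAnalyzer-based type.
    my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => $language );
    my $type     = Lucy::Plan::FullTextType->new( analyzer => $analyzer );
    return ( Lucy::Plan::Schema->new, $type );
}
```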

>>    * Implement CaseFolder as a subclass of Normalizer.
>
> This has yet to be done. We could also mark the CaseFolder as deprecated  
> and remove it completely later.

The cost for keeping CaseFolder around in its current form is high, because it
is tied into a perlapi function and thus needs a per-host implementation. (The
perlapi function's name broke in late Perl 5.15 releases, which was a PITA to
troubleshoot.)  In contrast, the cost for keeping CaseFolder around is small
if it becomes a subclass of Normalizer.

However, CaseFolder and Normalizer presumably have slightly different case
mappings, thus the subclassing change is a back compat break.  It shouldn't be
a horrible break (depending on how close the mappings are) because it will
only affect search-time, screwing up the results only for terms which contain
code points whose mapping has changed.
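
To illustrate the kind of difference at stake, here is a pure-Perl sketch
that does not touch Lucy at all.  It assumes, for the sake of the example,
that the old behavior resembles lc() and the new one resembles full Unicode
case folding; the actual mappings used by CaseFolder and Normalizer would
need to be checked.  Note that fc() requires Perl 5.16 or later.

```perl
use strict;
use warnings;
use utf8;
use feature 'fc';    # fc() is available from Perl 5.16 onwards

my $word = "Straße";

# Simple lowercasing leaves the sharp s alone.
my $lowercased = lc $word;    # "straße"

# Full case folding expands the sharp s to "ss".
my $folded = fc $word;        # "strasse"

# A term indexed one way and queried the other way would not match.
my $differs = $lowercased ne $folded;

binmode STDOUT, ':encoding(UTF-8)';
print "lc: $lowercased, fc: $folded, differs: ", ( $differs ? "yes" : "no" ), "\n";
```

Terms without such code points come out identical under both mappings, which
is why only a subset of searches would be affected.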

I don't think we should outright remove CaseFolder without a really good
reason, because that will force almost all of our users to change their code
and then reindex from scratch.  But a subtle compat break might be OK,
especially since you can update all the indexed documents in place after
upgrading and only suffer slightly degraded search results during that window.

Marvin Humphrey
