lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Wed, 20 Jan 2010 20:26:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802973#action_12802973 ]

Hoss Man commented on SOLR-1677:
--------------------------------


bq. I think I am slightly offended with some of your statements about 'subjective opinion
of the Lucene Community' and 'they should do relevancy testing which use some language-specific
stemmer whose behavior changed in a small but significant way'.

That was not at all my intention; I'm sorry about that.  I was in fact trying to speak entirely
in generalities and theoretical examples.

The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical
absolutes -- we're not fixing bugs where 1+1=3.  Even if everyone on java-dev and java-user
agrees that behavior A is broken and behavior B is correct, that is still (to me) a subjective
opinion -- 1000 men's trash may be one man's treasure, and there could be users out there who
have come to expect/rely on behavior A.

I tried to use a stemmer as an example because it's the type of class where making behavior
more correct (ie: making the stemming match the semantics of the language more accurately)
doesn't necessarily improve the perceived behavior for all users -- someone could be very
happy with the "sloppy stemming" in the 3.1 version of a (hypothetical) EsperantoStemmer because
it gives him really "loose" matches.  And if you (or anyone else) put in a lot of hard work
making that stemmer "better" by all conceivable metrics in 3.4, then I've got no problem telling
that person "Sorry dude, if you don't want those fixes don't upgrade, or here are some other
suggestions for getting 'loose' matching on that field."

My concern is that there may be people who don't even realize they are depending on behavior
like this.  Without an easy way for users to understand which objects have improved/fixed behavior
between luceneMatchVersion=X and luceneMatchVersion=Y, they won't know the full list of things
they should be considering/testing when they do change luceneMatchVersion.

bq. I'm also not that worried that users won't know what changed - they will just know that
they are in the same boat as those downloading Lucene latest greatest for the first time.

But that's not true:  a person downloading for the first time won't have any preconceived
expectations of how something will behave; that's a very different boat from a person upgrading,
who is going to expect things that were working to keep working -- those things may have actually
been bugs in earlier versions, but if they _seemed_ to be working for their use cases, it's
going to feel like it's broken when the behavior changes.  For a user who is consciously upgrading
I'm OK with that.  But when there is no easy way of knowing what behavior will change as a
result of setting luceneMatchVersion=X, that doesn't feel fair to the user.

Robert mentioned in an earlier comment that StopFilter's position increment behavior changes
depending on the luceneMatchVersion -- what if an existing Solr 1.3 user notices a bug in
some Tokenizer, and adds {{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his
schema.xml to fix it?  Without clear documentation on _everything_ that is affected when doing
that, he may not realize that StopFilter changed at all -- and even though the position increment
behavior may now be more correct, it might drastically change the results he gets when using
dismax with a particular qs or ps value.  Hence my point that this becomes a serious documentation
concern: finding a way to make it clear to users what they need to consider when modifying
luceneMatchVersion.
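
To make the StopFilter point concrete, here is a toy sketch of why position gaps matter.  This is
not Lucene's StopFilter: the class name, the stop word list and the keepGaps flag are all made up
for illustration.  It only shows that the same input yields different token positions depending on
whether a removed stop word still consumes a position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT Lucene's StopFilter, just an illustration of
// how keeping or dropping position gaps changes the distance between tokens.
public class StopGapSketch {

    // Hypothetical stop word list for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "the", "of"));

    // Returns the positions assigned to the tokens that survive stop word removal.
    // keepGaps=true:  a removed stop word still consumes a position (leaves a hole).
    // keepGaps=false: surviving tokens are packed next to each other.
    public static List<Integer> positions(String[] tokens, boolean keepGaps) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (String token : tokens) {
            if (STOP_WORDS.contains(token)) {
                if (keepGaps) {
                    position++; // remember that something used to be here
                }
                continue;
            }
            position++;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"quick", "and", "the", "fox"};
        System.out.println(positions(tokens, false)); // [0, 1]
        System.out.println(positions(tokens, true));  // [0, 3]
    }
}
```

In the packed case "quick" and "fox" look adjacent; with gaps they are two positions further
apart, so a phrase or proximity match that used to succeed with a small slop value may stop
matching once the behavior changes.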

bq. I'm still all for allowing Version per component for experts use. But man, I wouldn't
want to be in the boat, managing all my components as they mimic various bugs/bad behavior
for various components.

But if the example configs only show a global setting that isn't directly "linked" to any
of the individual object configurations, then normal users won't have any idea what could
have/use individual luceneMatchVersion settings anyway (even if they wanted to manage it
piecemeal).

Like I said: I've come around to the idea of having/advocating a global value.  Once I got
past my mistaken thinking of "Version" as controlling "alternate versions" (as Miller very
clearly put it) I started to understand what you are all saying and I agree with you: a single
global value is a good idea.

My concern is just how to document things so that people don't get confused when they do need
to change it.


> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing Tokenstreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass
of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
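
The decoding step described in the issue summary above could look roughly like the sketch below.
It is self-contained and does not use Lucene: MatchVersion is a hypothetical stand-in for
o.a.lucene.util.Version with only three constants, and the accepted string forms ("2.9" as well as
"LUCENE_29") are an assumption, not necessarily what the attached patch does.  The only details
taken from the description are the helper map, the enum valueOf(String) lookup, and the
LUCENE_24 default.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: MatchVersion is a hypothetical stand-in for Lucene's Version.
public class MatchVersionSketch {

    public enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Default mentioned in the issue description for the no-version ctors.
    public static final MatchVersion DEFAULT = MatchVersion.LUCENE_24;

    // Helper map from short version strings to constants (assumed format).
    private static final Map<String, MatchVersion> BY_NUMBER = new HashMap<>();
    static {
        BY_NUMBER.put("2.4", MatchVersion.LUCENE_24);
        BY_NUMBER.put("2.9", MatchVersion.LUCENE_29);
        BY_NUMBER.put("3.0", MatchVersion.LUCENE_30);
    }

    // Decodes a configured value; a missing or blank value falls back to DEFAULT.
    public static MatchVersion parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT;
        }
        String key = value.trim();
        MatchVersion byNumber = BY_NUMBER.get(key);
        if (byNumber != null) {
            return byNumber;
        }
        try {
            // With an enum, the constant name can be resolved directly.
            return MatchVersion.valueOf(key.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(null));        // LUCENE_24
        System.out.println(parse("3.0"));       // LUCENE_30
        System.out.println(parse("lucene_29")); // LUCENE_29
    }
}
```

Failing loudly on an unknown value, rather than silently using the default, is a design choice
in this sketch; it keeps a typo in the config from quietly selecting the oldest behavior.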

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

