lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Re: Potential bug in StandardTokenizerImpl
Date Tue, 27 Nov 2007 07:18:02 GMT
I understand it would change the behavior of existing search solutions;
however, the current behavior is just wrong. An ACRONYM cannot be ABC.DEF. If
you look up "acronym" in Wikipedia, you find only examples like I.B.M. and
U.S.A., or NATO, IBM and USA, but nothing of the form StandardAnalyzer
currently recognizes.

There are a couple of ways to make this change:
1. Create a new analyzer that fixes the problem - that way, applications
that are OK with the current behavior will not have to use it, while those
that want the correct behavior will be able to get it. This is not my
favorite solution, but I think it would be preferable to simply not fixing it.
2. Fix it in the new version (2.3) and specifically mention that in the
release notes. Aren't there releases where applications need to re-build the
index because of fundamental changes?

Am I the only one who thinks that?

BTW, I changed the definition in the jflex file and recompiled using jflex,
and it indeed solved the problem. It now recognizes both "www.abc.com." and
"www.abc.com" as hosts. I can attach the 'patch' files if you'd like to
compare.
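
To illustrate the difference I mean, here is a rough sketch in plain Java regex. It is not the actual jflex grammar and not my patch; the class name, the pattern names and the three patterns themselves are assumptions made up for this example. The "loose" rule accepts multi-letter segments followed by dots (so "www.abc.com." is taken as an acronym), while the "strict" rule accepts only single letters followed by dots, and the host rule is assumed to tolerate a trailing dot.

```java
import java.util.regex.Pattern;

public class TokenTypeSketch {
    // Hypothetical "loose" acronym rule: runs of letters, each followed by a dot.
    // This is the kind of rule that lets "www.abc.com." through as an acronym.
    static final Pattern LOOSE_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Hypothetical "strict" acronym rule: single letters, each followed by a dot.
    static final Pattern STRICT_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    // Hypothetical host rule: alphanumeric segments separated by dots,
    // with an optional trailing dot (an assumption of this sketch).
    static final Pattern HOST = Pattern.compile("[\\p{L}\\p{N}]+(\\.[\\p{L}\\p{N}]+)+\\.?");

    // Classify with the given acronym rule first, then fall back to the host rule.
    static String classify(Pattern acronymRule, String text) {
        if (acronymRule.matcher(text).matches()) {
            return "<ACRONYM>";
        }
        if (HOST.matcher(text).matches()) {
            return "<HOST>";
        }
        return "<OTHER>";
    }

    static String classifyLoose(String text) {
        return classify(LOOSE_ACRONYM, text);
    }

    static String classifyStrict(String text) {
        return classify(STRICT_ACRONYM, text);
    }

    public static void main(String[] args) {
        String[] samples = {"www.abc.com", "www.abc.com.", "I.B.M.", "U.S.A."};
        for (String s : samples) {
            System.out.println(s + " loose=" + classifyLoose(s) + " strict=" + classifyStrict(s));
        }
    }
}
```

With the loose rule, "www.abc.com." comes out as an acronym; with the strict rule it comes out as a host, while "I.B.M." is an acronym under both.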

On Nov 27, 2007 9:07 AM, Chris Hostetter <hossman_lucene@fucit.org> wrote:

>
> : If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
> : (which is correct in my opinion).
> : However, if you pass "www.abc.com." (notice the extra '.' at the end), the
> : output is (wwwabccom,0,12,type=<ACRONYM>).
>
> see also...
>
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
>
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
>
> one hitch with potentially changing this now is that it would break
> some searches in applications that have existing indexes built using
> previous versions.
>
>
>
> -Hoss


-- 
Regards,

Shai Erera
