lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
Date Thu, 04 Sep 2008 05:49:18 GMT
I think we should distinguish between what is a bug and what is an attempt
by the tokenizer to produce a meaningful token. When the tokenizer outputs a
HOST or ACRONYM token type, nothing prevents you from putting a filter after
the tokenizer that uses a UIMA Annotator (for example) to verify that the
output token type is indeed correct.

For example, in the case of java.lang.NullPointerException we all understand
it's not a HOST, but unfortunately our logic hasn't been translated well
into computer instructions, yet :-). How you treat this token now is up
to you:

- If you want to be able to search for the individual parts of the host, but
still find the full host, I'd put a TokenFilter after the tokenizer that
breaks the HOST into its parts and returns the parts along with the full host
name. At query time I'd then remove that filter (i.e. create an Analyzer
w/o that filter) and thus I'd be able to search for either "apache" or
"www.apache.org".

- If you want to actually verify that the output HOST is indeed a host, again,
put a TokenFilter after the tokenizer and either apply your own simple
heuristics (for example, if there's a ".com", ".org" or ".net" it's a HOST,
otherwise it's not; I know these don't cover all HOST types, it's just an
example), or validate it with an external tool, like a UIMA Annotator.

- You can also decide that a two-part HOST is not really a host. That way you
solve the "I like Apache.They rock" problem, but miss a whole class of
hosts like "ibm.com", "apache.org" and "google.com".
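
The three options above can be sketched roughly as follows. This is plain
Java, not Lucene API: the class name HostTokenSketch, its method names and the
suffix list are all hypothetical, and in a real analyzer this logic would sit
inside a TokenFilter placed after the tokenizer and would only be applied to
tokens whose type is HOST.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; none of these names come from Lucene.
class HostTokenSketch {

    // Assumed, deliberately incomplete suffix list (taken from the example in the text).
    private static final List<String> KNOWN_SUFFIXES = Arrays.asList("com", "org", "net");

    // Option 1: at index time, emit the full host followed by its dot-separated parts.
    static List<String> expandHost(String host) {
        List<String> tokens = new ArrayList<>();
        tokens.add(host);
        for (String part : host.split("\\.")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Option 2: accept a HOST token only if it ends in one of the known suffixes.
    static boolean looksLikeHost(String token) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0 || lastDot == token.length() - 1) {
            return false;
        }
        String suffix = token.substring(lastDot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(suffix);
    }

    // Option 3: treat a token as a host only if it has at least three parts.
    static boolean hasAtLeastThreeParts(String token) {
        return token.split("\\.").length >= 3;
    }

    public static void main(String[] args) {
        System.out.println(expandHost("www.apache.org"));
        System.out.println(looksLikeHost("java.lang.NullPointerException"));
        System.out.println(hasAtLeastThreeParts("Apache.They"));
    }
}
```

With these rules, "java.lang.NullPointerException" and "Apache.They" both fail
the suffix check, while the three-part rule rejects "Apache.They" and "ibm.com"
alike but still accepts "java.lang.NullPointerException", which is the
trade-off described in the last bullet.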

Again, IMO, the logic in the tokenizer today for HOSTs and ACRONYMs is a
"best effort" attempt to produce a meaningful token. If we removed those
rules, it would be impossible to detect such tokens later on, because the
tokenizer is set to discard any stand-alone "&", "." or "@".

I'm going to send out another email to the list about a bug or inconsistency
I recently found in the COMPANY rule. I don't want to mix this thread with a
different issue.

On Thu, Sep 4, 2008 at 5:17 AM, Mark Lassau <mlassau@atlassian.com> wrote:

> Grant Ingersoll (JIRA) wrote:
>
>> Of course, it's still a bit weird, b/c in your case the type value is
>> going to be set to ACRONYM, when your example is clearly not one.  This
>> suggests to me that the grammar needs to be revisited, but that can wait
>> until 3.0 I believe.
>>
>>
>>
> Grant, not sure what you mean by "b/c in your case the type value is going
> to be set to ACRONYM, when your example is clearly not one."
> Once we set replaceInvalidAcronym=true, then the type is set to HOST.
>
> However, if you were to revisit the grammar, then I would be interested to
> get in on the discussion on the behaviour of <HOST>.
> For instance, if you have a document like "visit www.apache.org", you
> currently won't get a hit if you search for "apache".
> In an issue tracker like JIRA, we want to be able to search for
> "NullPointerException", and get a hit for the document "Application threw
> java.lang.NullPointerException".
>
> Also note that the current implementation has problems if the document
> doesn't contain expected whitespace.
> e.g. "I like Apache.They rock"
> will get tokenized to the following:
> I            <ALPHANUM>
> like         <ALPHANUM>
> Apache.They  <HOST>
> rock         <ALPHANUM>
>
> I don't think there is a simple one-size-fits-all answer to how this should
> behave. It depends on the context of the app that is using Lucene.
> The best answer may be to make some of the behaviour configurable, or have
> a suite of specific analyzers?
>
> Mark.
>
>>> Most of the contributed Analyzers suffer from invalid recognition of
>>> acronyms.
>>>
>>> ------------------------------------------------------------------------------
>>>
>>>                Key: LUCENE-1373
>>>                URL: https://issues.apache.org/jira/browse/LUCENE-1373
>>>            Project: Lucene - Java
>>>         Issue Type: Bug
>>>         Components: Analysis, contrib/analyzers
>>>   Affects Versions: 2.3.2
>>>           Reporter: Mark Lassau
>>>           Priority: Minor
>>>
>>> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like
>>> "www.apache.org." would be incorrectly tokenized as an acronym (note the
>>> dot at the end).
>>> Unfortunately, keeping the "backward compatibility" of a bug turns out to
>>> harm us.
>>> StandardTokenizer has a couple of ways to indicate "fix this bug", but
>>> unfortunately the default behaviour is still to be buggy.
>>> Most of the non-English analyzers provided in lucene-analyzers utilize
>>> the StandardTokenizer, and in v2.3.2 not one of these provides a way to get
>>> the non-buggy behaviour :(
>>> I refer to:
>>> * BrazilianAnalyzer
>>> * CzechAnalyzer
>>> * DutchAnalyzer
>>> * FrenchAnalyzer
>>> * GermanAnalyzer
>>> * GreekAnalyzer
>>> * ThaiAnalyzer