lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: Is StandardAnalyzer good enough for multi languages...
Date Tue, 08 Jan 2013 23:43:34 GMT
On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantoshi76@gmail.com> wrote:
> Does Lucene StandardAnalyzer work for all the languages for tokenizing before
> indexing (since we are using Java, I think the content is converted to UTF-8
> before tokenizing/indexing)?

No. There are multiple cases where it chooses not to break something
which it should break. Some of these cases even result in undesirable
behaviour for English, so I would be surprised if there were even a
single language which it handles acceptably.

It does follow the Unicode text segmentation rules (UAX #29) for how to
tokenise text, but those rules were written by people who didn't quite
know what they were doing, so it's really just passing the buck. I don't
think Lucene should have chosen to follow that standard in the first
place, because it rarely (if ever) gives acceptable results.

The worst examples for English, at least for us, were that it does not
break on colon (:) or underscore (_).
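
You can see this for yourself by dumping the tokens. A rough sketch
against the Lucene 4.x-era API (constructors vary by version); whether
the colon still glues letters together depends on which Unicode revision
your Lucene release implements, but the underscore behaviour is the same:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            TokenStream ts = analyzer.tokenStream("field",
                    new StringReader("foo_bar baz:qux"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // with that era's Unicode tables: foo_bar, then baz:qux
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }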

The colon was explained as some languages using it like an apostrophe.
Personally I think you should break on an apostrophe as well, so I'm
not really happy with that reasoning, but OK.

The underscore was completely baffling to me, so I asked someone at
Unicode about it. They explained that it is "used by programmers to
separate words in identifiers". That explanation is exactly as stupid
as it sounds, and I hope they will realise their stupidity some day.

> or do we need to use special analyzers for each of the languages?

I do think that StandardTokenizer can at least form a good base for an
analyser. You just have to add a ton of filters to fix each additional
case you find where people don't like its behaviour. For instance, it
returns runs of Katakana as a single token, but if you leave it that way,
people won't find what they are searching for, so you make a filter to
split those runs back out into multiple tokens.
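
To make that concrete, here is a minimal sketch of the shape such an
analyser takes, assuming the Lucene 4.x createComponents(String, Reader)
signature. KatakanaSplitFilter is hypothetical, a stand-in for whatever
custom TokenFilter you end up writing:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public final class PatchedStandardAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName,
                                                         Reader reader) {
            // StandardTokenizer as the base...
            StandardTokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
            TokenStream result = new LowerCaseFilter(Version.LUCENE_40, source);
            // ...then the fix-up filters you accumulate over time, e.g.:
            // result = new KatakanaSplitFilter(result);  // hypothetical custom filter
            return new TokenStreamComponents(source, result);
        }
    }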

It would help if there were a single, core-maintained analyser for
"StandardAnalyzer with all the things people hate fixed"... but I
don't know if anyone is interested in maintaining it.

> In this case, if a document has mixed content (English + Japanese),
> what analyzer should we use, and how can we figure it out dynamically
> before indexing?

Some language detection libraries will give you back the fragments in
the text and tell you which language is used for each fragment, so
that is totally a viable option as well. You'd just make your own
analyser which concatenates the results.
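
As a rough illustration of that idea (not a full custom analyser, just
collecting the terms per fragment), assuming Lucene 4.x plus the Kuromoji
JapaneseAnalyzer module; the fragment list here is hard-coded where a
language detection library would normally supply it:

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class MixedLanguageTokens {
        static class Fragment {
            final String text;
            final String language;
            Fragment(String text, String language) {
                this.text = text;
                this.language = language;
            }
        }

        public static void main(String[] args) throws Exception {
            // Stand-in for whatever your language detector returns.
            List<Fragment> fragments = Arrays.asList(
                    new Fragment("search engine library", "en"),
                    new Fragment("全文検索エンジン", "ja"));

            Analyzer english = new StandardAnalyzer(Version.LUCENE_40);
            Analyzer japanese = new JapaneseAnalyzer(Version.LUCENE_40);

            List<String> terms = new ArrayList<String>();
            for (Fragment f : fragments) {
                // Pick the analyzer for this fragment's language.
                Analyzer analyzer = "ja".equals(f.language) ? japanese : english;
                TokenStream ts = analyzer.tokenStream("body", new StringReader(f.text));
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    terms.add(term.toString());
                }
                ts.end();
                ts.close();
            }
            System.out.println(terms);
        }
    }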

> Also, while searching, if the query text contains both English and
> Japanese, how does this work? Any criteria for choosing the analyzers?

I guess you could either ask the user what language they're searching
in, or look at what characters are in their query, decide which
language(s) it matches, and build the query from there. It might match
multiple...
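
A cheap way to do the "look at what characters are in their query" part
is to check which Unicode scripts appear in it (Java 7+). Just a sketch
of the detection step; which analyzer(s) you then build the query with
is still up to you:

    import java.util.EnumSet;
    import java.util.Set;

    public class QueryScripts {
        public static Set<Character.UnicodeScript> scriptsIn(String query) {
            Set<Character.UnicodeScript> scripts =
                    EnumSet.noneOf(Character.UnicodeScript.class);
            for (int i = 0; i < query.length(); ) {
                int cp = query.codePointAt(i);
                if (Character.isLetter(cp)) {
                    scripts.add(Character.UnicodeScript.of(cp));
                }
                i += Character.charCount(cp);
            }
            return scripts;
        }

        public static void main(String[] args) {
            // e.g. prints [LATIN, KATAKANA] for a mixed English/Japanese query
            System.out.println(scriptsIn("lucene トークン"));
        }
    }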

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

