lucene-solr-user mailing list archives

From TK <kuros...@sonic.net>
Subject Re: Implementing custom analyzer for multi-language stemming
Date Wed, 06 Aug 2014 04:11:48 GMT

On 8/5/14, 8:36 AM, Rich Cariens wrote:
> Of course this is extremely primitive and basic, but I think it would be
> possible to write a CharFilter or TokenFilter that inspects the entire
> TokenStream to guess the language(s), perhaps even noting where languages
> change. Language and position information could be tracked, the TokenStream
> rewound and then Tokens emitted with "LanguageAttributes" for downstream
> Token stemmers to deal with.
>
I'm curious how you plan to handle the LanguageAttribute.
Would each token carry this attribute, marking a span of tokens
as belonging to one language? If so, how would you search for
English documents that include the term "die" while skipping
all the German documents, which will most likely contain "die" too?
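
To make the question concrete, here is a minimal sketch in plain Python,
not Lucene code. It assumes every token already carries a language tag and
that postings are keyed by (language, term). The names build_index and
search and the sample documents are made up for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index keyed by (language, lowercased term).

    docs maps a document id to a list of (language, token) pairs,
    standing in for tokens that carry a language attribute.
    """
    index = defaultdict(set)
    for doc_id, tagged_tokens in docs.items():
        for language, token in tagged_tokens:
            index[(language, token.lower())].add(doc_id)
    return index


def search(index, term, language):
    """Return the sorted ids of documents containing term in the given language."""
    return sorted(index.get((language, term.lower()), set()))


# Hypothetical sample data: one English, one German, one mixed document.
docs = {
    "en-1": [("en", "never"), ("en", "say"), ("en", "die")],
    "de-1": [("de", "die"), ("de", "Katze"), ("de", "schläft")],
    "mixed-1": [("en", "he"), ("en", "said"), ("de", "die"), ("de", "Katze")],
}
index = build_index(docs)

print(search(index, "die", "en"))  # ['en-1']
print(search(index, "die", "de"))  # ['de-1', 'mixed-1']
```

Even in this toy form the query has to state which language it means;
the sketch does not try to guess it.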

Automatic language detection works OK on long passages of
ordinary text, but it does not work well on short
text. What strategy would you use to deal with short text?

-- 
TK

