lucene-java-user mailing list archives

From Robert Muir
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Wed, 03 Jun 2009 16:12:51 GMT
KK, is all of your Latin-script text actually English? Or is there text in
other languages, such as German or French, mixed in?

And for your non-English content (your examples have been in Indian writing
systems), is it generally true that if you have Devanagari, you can assume
it's Hindi? Or is there other content, such as Marathi, mixed in?

The reason I ask is that to invoke the right stemmers, you really need some
language detection, but perhaps in your case you can cheat and detect the
language based on the script...
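
To make the "cheat" concrete, here is a rough sketch of script-based
detection. It is only an illustration, not anything from Lucene itself: the
class name ScriptRouter, the method names, and the "en"/"hi" codes are
made up for this example, and it assumes the answer to both questions above
is yes (Latin means English, Devanagari means Hindi). It relies on
java.lang.Character.UnicodeScript (Java 7+) and Map.merge (Java 8+).

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: guesses a language from the dominant Unicode script.
final class ScriptRouter {

    /** Returns the most frequent script among the letters of text,
     *  or UNKNOWN if the text contains no letters at all. */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Count letters only; digits, punctuation and combining marks are skipped.
            if (Character.isLetter(cp)) {
                counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
            }
        }
        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Assumption for this sketch: Latin is English, Devanagari is Hindi. */
    static String guessLanguage(String token) {
        switch (dominantScript(token)) {
            case LATIN:
                return "en";
            case DEVANAGARI:
                return "hi";
            default:
                return "unknown";
        }
    }
}
```

Calling ScriptRouter.guessLanguage on each token would let you apply
lowercasing and an English stemmer only where it returns "en" and pass
everything else through unchanged. If German or Marathi does turn up in the
content, the script alone cannot tell them apart from English or Hindi, and
real language detection is needed.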


On Wed, Jun 3, 2009 at 10:15 AM, KK wrote:

> Hi All,
> I'm indexing some non-English content, but the pages also contain English
> content. As of now I'm using WhitespaceAnalyzer for all content, and I'm
> storing the full webpage content under a single field. Now we are required
> to support case folding and stemming for the English content intermingled
> with the non-English content. I must mention that we don't have stemming
> and case folding for the non-English content. I'm stuck with this. Could
> someone let me know how to proceed with fixing this issue?
> Thanks,
> KK.

Robert Muir
