lucene-java-user mailing list archives

From "Jack Krupansky"
Subject Re: Language detection
Date Thu, 27 Jun 2013 17:11:15 GMT
You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update 
processor to redirect the text for each detected language to an alternate 
field, and then set the non-English fields to be "ignored". But the 
document would still be indexed, just without the redirected text fields.
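
As a rough sketch of that setup (not taken from the book), the fragment below shows an update chain plus matching schema entries. The chain name "langid" and the field names "text" and "language_s" are assumptions for illustration, as is the use of the stock "text_en" field type from the example schema. With langid.map enabled, the processor renames the input field to text_<language code>, so only text_en is given a real type and every other text_* field falls through to the "ignored" type.

```xml
<!-- solrconfig.xml: illustrative chain; the chain and field names are assumptions -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>                 <!-- field(s) to run detection on -->
    <str name="langid.langField">language_s</str>    <!-- where the detected code is stored -->
    <bool name="langid.map">true</bool>              <!-- map text to text_<lang> -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- schema.xml: keep English, silently drop every other mapped field -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<field name="text_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="text_*" type="ignored"/>
```

The explicit text_en field takes precedence over the text_* dynamic field, which is what lets the English text through while the rest is discarded.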

(Examples of using that update processor are in my book - but not the 
"ignored" step.)

There is also a Tika-specific processor:

If you really want to completely suppress the indexing of documents 
containing non-English text, you'll have to make an explicit check before 
sending the document to Solr. Tika also has language detection, so you 
could call Tika from an external process before sending the document to 
Solr.
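
A minimal sketch of such an explicit check, in plain Java, might look like the following. Everything named here is hypothetical: EnglishOnlyGate, the detector function (which in practice could wrap Tika's language identification) and the indexer callback (which stands in for whatever code posts the document to Solr). The toy detector in main exists only so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Forwards a document to the indexer only when its detected language is English. */
public class EnglishOnlyGate {
    // Hypothetical hook: maps text to a language code such as "en".
    private final Function<String, String> detector;
    // Hypothetical hook: stands in for the code that sends the document to Solr.
    private final Consumer<String> indexer;

    public EnglishOnlyGate(Function<String, String> detector, Consumer<String> indexer) {
        this.detector = detector;
        this.indexer = indexer;
    }

    /** Returns true if the text was forwarded, false if it was skipped. */
    public boolean submit(String text) {
        if (text == null || text.trim().isEmpty()) {
            return false; // nothing to detect, so nothing is indexed
        }
        String language = detector.apply(text);
        if (!"en".equals(language)) {
            return false; // non-English or undetermined: never reaches the indexer
        }
        indexer.accept(text);
        return true;
    }

    public static void main(String[] args) {
        List<String> indexed = new ArrayList<>();
        // Toy detector for demonstration only; replace with a real language detector.
        Function<String, String> toyDetector =
            t -> t.toLowerCase().contains(" the ") ? "en" : "unknown";
        EnglishOnlyGate gate = new EnglishOnlyGate(toyDetector, indexed::add);
        gate.submit("Where is the library?");
        gate.submit("Wo ist die Bibliothek?");
        System.out.println(indexed);
    }
}
```

Unlike the update processor approach, a document rejected here is never sent, so nothing about it ends up in the index.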

-- Jack Krupansky

-----Original Message----- 
From: Hang Mang
Sent: Thursday, June 27, 2013 11:45 AM
Subject: Language detection


Is there some kind of filter or component that I could use to filter out
non-English text? I have a preprocessing step, and I only want to index
English documents.

