If you can extract token stream from those files already, you can simply use
different analyzers to analyze those token stream appropriately. Check out
Lucen-contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
heybluez wrote:
>
> I know how to do english text with POI and PDFBox and so on. Now, I want
> to start indexing non-english language such as french and spanish. Which
> extraction libs are available for me?
>
> I want to do:
>
> Excel
> Word
> PowerPoint
> PDF
> HTML
> RTF
>
> Thanks!
> Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
--
View this message in context: http://www.nabble.com/extracting-non-english-text-from-word%2C-pdf%2C-etc....---tf4198171.html#a11962580
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|