lucene-java-user mailing list archives

From "Michael J. Prichard" <michael_prich...@mac.com>
Subject Re: extracting non-english text from word, pdf, etc....??
Date Thu, 02 Aug 2007 12:59:06 GMT
Yeah, I have seen those.  I guess the question is: what do you all use to 
extract text from Word, Excel, PPT and PDF?  Can I use POI, PDFBox and 
so on?  That is what I use now to extract English text.

Thanks,
Michael

testn wrote:
> If you can already extract a token stream from those files, you can simply use
> different analyzers to analyze those token streams appropriately. Check out
> the Lucene contrib analyzers at
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
>
>
>
> heybluez wrote:
>   
>> I know how to extract English text with POI and PDFBox and so on.  Now, I want
>> to start indexing non-English languages such as French and Spanish.  Which
>> extraction libs are available for me?
>>
>> I want to do:
>>
>> Excel
>> Word
>> PowerPoint
>> PDF
>> HTML
>> RTF
>>
>> Thanks!
>> Michael
>>
>
>   
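A minimal sketch of the point testn makes in the quoted reply: extraction hands you plain Unicode text regardless of language, and the language-specific work happens afterwards, at analysis time. This is not Lucene code. It uses only the JDK (java.text.BreakIterator) to stand in for an analyzer, and the class name, the tokenize method and the tiny stop-word lists are all hypothetical, made up for illustration. In a real index you would presumably pick one of the contrib analyzers instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for "choose an analyzer per language".
class LanguageAwareTokens {

    // Tiny illustrative stop-word lists (assumption: real analyzers ship far longer ones).
    private static final Map<String, Set<String>> STOP_WORDS = Map.of(
            "en", Set.of("the", "is", "a", "of"),
            "fr", Set.of("le", "la", "les", "est", "de"),
            "es", Set.of("el", "la", "los", "es", "de"));

    // Splits already-extracted text into lowercased word tokens and drops
    // the stop words registered for the given language code.
    static List<String> tokenize(String text, String languageCode) {
        Locale locale = Locale.forLanguageTag(languageCode);
        Set<String> stopWords = STOP_WORDS.getOrDefault(languageCode, Set.of());

        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end).toLowerCase(locale);
            // Skip whitespace and punctuation segments, then language-specific stop words.
            if (Character.isLetterOrDigit(candidate.codePointAt(0)) && !stopWords.contains(candidate)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00e9 is "e" with an acute accent, escaped so the source file encoding does not matter.
        System.out.println(tokenize("Le caf\u00e9 est ferm\u00e9", "fr"));
        System.out.println(tokenize("The cafe is closed", "en"));
    }
}
```

The same extracted string can be passed through with a different language code, which is the practical upshot: the extraction step (POI, PDFBox and so on) does not need to change per language as long as it returns correct Unicode text.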

