lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: extracting non-english text from word, pdf, etc....??
Date Thu, 02 Aug 2007 14:18:22 GMT
Hey Michael,

Have you given it a try?  I would think POI and PDFBox would work, but I
haven't actually done it myself.  Set up a small test that reads in a PDF
in French or Spanish and see what comes out.  You might have to worry
about encodings, but the structure of the files should be the
same, i.e. they are still valid Word, PDF, etc. documents.
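
To illustrate the encoding concern, here is a minimal sketch that uses only
the JDK; it makes no POI or PDFBox calls, and the class and method names
(EncodingCheck, decode, looksGarbled) are hypothetical. It assumes the
extracted text passes through a byte stream at some point (a temp file, a
socket, and so on) and shows how accented characters are damaged when those
bytes are decoded with the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Decode raw bytes with an explicit charset instead of the platform default.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // U+FFFD is the replacement character the JDK substitutes for malformed input,
    // so its presence is a cheap hint that the wrong charset was used.
    public static boolean looksGarbled(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        String french = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Correct round trip: encode and decode with the same charset.
        byte[] utf8 = french.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(french));

        // UTF-8 bytes read as ISO-8859-1: the accent turns into two wrong characters.
        String mojibake = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake.equals("caf\u00c3\u00a9"));

        // ISO-8859-1 bytes read as UTF-8: the accent byte is malformed UTF-8.
        byte[] latin1 = french.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksGarbled(decode(latin1, StandardCharsets.UTF_8)));
    }
}
```

A small test along these lines, run against text pulled from a real French
or Spanish document, is one way to confirm accented characters survive
before the text reaches the analyzer.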

-Grant

On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:

> Yeah, I have seen those.  I guess the question is: what do you all
> use to extract text from Word, Excel, PPT and PDF?  Can I use POI,
> PDFBox and so on?  This is what I use now to extract English.
>
> Thanks,
> Michael
>
> testn wrote:
>> If you can already extract a token stream from those files, you can
>> simply use different analyzers to analyze those token streams
>> appropriately.  Check out the Lucene contrib analyzers at
>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
>>
>>
>>
>> heybluez wrote:
>>
>>> I know how to do English text with POI and PDFBox and so on.
>>> Now I want to start indexing non-English languages such as French
>>> and Spanish.  Which extraction libs are available for me?
>>>
>>> I want to do:
>>>
>>> Excel
>>> Word
>>> PowerPoint
>>> PDF
>>> HTML
>>> RTF
>>>
>>> Thanks!
>>> Michael
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




