lucene-java-user mailing list archives

From Ben Litchfield <...@benlitchfield.com>
Subject Re: extracting non-english text from word, pdf, etc....??
Date Thu, 02 Aug 2007 14:46:41 GMT
In terms of PDF documents...

PDFBox should work just fine with any Latin-based language; at this
time, certain PDFs that contain CJK characters can pose some issues.  In
general, English/French/Spanish should be fine.

Some PDFs use custom encodings that make it impossible to extract the
text, which then comes out as gibberish.  As a simple test: if Acrobat
can extract the text, then PDFBox should be able to as well.
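
If you want to catch the gibberish case automatically at indexing time,
here is a rough sketch of a sanity check you could run on whatever text
comes back from the extractor. It uses only the JDK, not PDFBox. The
class name, method names and the 0.85 threshold are all made up for
illustration, and the heuristic only flags output dominated by control
characters, symbols or U+FFFD. It will not notice text that was mapped
to the wrong but otherwise valid letters.

```java
final class ExtractionSanityCheck {

    // Assumed threshold (not from PDFBox or Lucene): tune it on your own documents.
    static final double MIN_PLAUSIBLE_RATIO = 0.85;

    /**
     * Fraction of code points that are letters, digits, whitespace or
     * punctuation. Empty input yields 0.0, so an empty extraction is
     * treated as suspect.
     */
    static double plausibleRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int plausible = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (cp == 0xFFFD) {
                continue; // Unicode replacement character: a decoding failure
            }
            if (Character.isLetterOrDigit(cp)
                    || Character.isWhitespace(cp)
                    || isPunctuation(cp)) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    /** True for the Unicode punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po). */
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Flags text whose share of plausible characters is below the threshold. */
    static boolean looksLikeGibberish(String text) {
        return plausibleRatio(text) < MIN_PLAUSIBLE_RATIO;
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument to try it out.
        String sample = args.length > 0 ? args[0] : "Le caf\u00e9 est ouvert.";
        System.out.println("ratio=" + plausibleRatio(sample)
                + " gibberish=" + looksLikeGibberish(sample));
    }
}
```

Documents that get flagged are the ones worth opening in Acrobat to see
whether the text can be extracted there.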

Ben


Quoting Grant Ingersoll <gsingers@apache.org>:

> Hey Michael,
>
> Have you given it a try?  I would think they would work, but I haven't
> actually done it.  Set up a small test that reads in a PDF in French or
> Spanish and give it a try.  You might have to worry about encodings or
> something, but the structure of the files should be the same, i.e. they
> are valid Word, etc. documents.
>
> -Grant
>
> On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:
>
>> Yeah, I have seen those.  I guess the question is: what do you all
>> use to extract text from Word, Excel, PPT and PDF?  Can I use POI,
>> PDFBox and so on?  This is what I use now to extract English.
>>
>> Thanks,
>> Michael
>>
>> testn wrote:
>>> If you can already extract a token stream from those files, you can
>>> simply use different analyzers to analyze those token streams
>>> appropriately. Check out the Lucene contrib analyzers at
>>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
>>>
>>>
>>>
>>> heybluez wrote:
>>>
>>>> I know how to do English text with POI and PDFBox and so on.  Now, I want
>>>> to start indexing non-English languages such as French and Spanish.  Which
>>>> extraction libs are available for me?
>>>>
>>>> I want to do:
>>>>
>>>> Excel
>>>> Word
>>>> PowerPoint
>>>> PDF
>>>> HTML
>>>> RTF
>>>>
>>>> Thanks!
>>>> Michael
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>





