lucene-java-user mailing list archives

From "Ryan Ackley" <ryanack...@gmail.com>
Subject Re: extracting non-english text from word, pdf, etc....??
Date Fri, 03 Aug 2007 15:54:41 GMT
The textmining library (textmining.org) for Word docs should work fine
with non-English text as well. Let me know if it doesn't.
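
On the encoding worry raised further down the thread, here is a minimal
sketch of the kind of small test suggested there. It uses only the JDK and
does not call POI, PDFBox, or textmining; the class name EncodingCheck and
its two helper methods are made up for illustration. It shows how the same
bytes of French/Spanish text decode cleanly with the right charset and pick
up replacement characters with the wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene, POI, PDFBox, or textmining.
final class EncodingCheck {

    // Decode raw bytes with an explicitly chosen charset rather than
    // relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // True if the text contains U+FFFD, the replacement character the
    // decoder substitutes for byte sequences that are invalid in the
    // chosen charset. This is only a rough hint, not a full detector.
    static boolean looksMisdecoded(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // "francais, espanol" with c-cedilla and n-tilde, written with
        // Unicode escapes so the source file encoding does not matter.
        String original = "fran\u00E7ais, espa\u00F1ol";
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(latin1Bytes, StandardCharsets.ISO_8859_1);
        String wrong = decode(latin1Bytes, StandardCharsets.UTF_8);

        System.out.println("round trip ok: " + right.equals(original));
        System.out.println("wrong charset detected: " + looksMisdecoded(wrong));
    }
}
```

Running the same check on the strings your extractor returns for a known
French or Spanish document is a quick way to tell an encoding problem apart
from an extraction problem before the text reaches the analyzer.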

On 8/2/07, Ben Litchfield <ben@benlitchfield.com> wrote:
> In terms of PDF documents...
>
> PDFBox should work just fine with any Latin-based languages; at this
> time certain PDFs that have CJK characters can pose some issues.  In
> general English/French/Spanish should be fine.
>
> Some PDFs use custom encodings that make it impossible to extract text,
> and it comes out as gibberish.  As a simple test, if Acrobat can
> extract the text then PDFBox should be able to as well.
>
> Ben
>
>
> Quoting Grant Ingersoll <gsingers@apache.org>:
>
> > Hey Michael,
> >
> > Have you given it a try?  I would think they would work, but haven't
> > actually done it.   Set up a small test that reads in a PDF in French or
> > Spanish and give it a try.  You might have to worry about encodings or
> > something, but the structure of the files should be the same, i.e. they
> > are valid Word, etc. documents.
> >
> > -Grant
> >
> > On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:
> >
> >> Yea, I have seen those.  I guess the question is what do you all
> >> use to extract text from Word, Excel, PPT and PDF?  Can I use POI,
> >> PDFBox and so on?  This is what I use now to extract english.
> >>
> >> Thanks,
> >> Michael
> >>
> >> testn wrote:
> >>> If you can already extract a token stream from those files, you can
> >>> simply use different analyzers to analyze those token streams
> >>> appropriately. Check out the Lucene contrib analyzers at
> >>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
> >>>
> >>>
> >>>
> >>> heybluez wrote:
> >>>
> >>>> I know how to do english text with POI and PDFBox and so on.  Now, I want
> >>>> to start indexing non-english language such as french and spanish.  Which
> >>>> extraction libs are available for me?
> >>>>
> >>>> I want to do:
> >>>>
> >>>> Excel
> >>>> Word
> >>>> PowerPoint
> >>>> PDF
> >>>> HTML
> >>>> RTF
> >>>>
> >>>> Thanks!
> >>>> Michael
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
