lucene-dev mailing list archives

From Peter Carlson <>
Subject Re: Parsing PDF documents
Date Mon, 18 Feb 2002 03:56:57 GMT
If you supply your code, I'll add it to the contributions area.
It would be great to have some code that already converts a PDF
directly to a Lucene Document.


On 2/16/02 8:36 PM, "Robert MacMillan" <> wrote:

>   I found that you can use the Etymon PJ classes
> to extract the text from PDF documents with
> very little effort. The advantage with the Etymon classes is there is no
> need for COM objects. It worked extremely well for the majority of
> documents; at the very worst, a document's text would all be extracted but with
> some of it out of order. (That has more to do with the layout of the document
> than anything else.)
>   On that note, I have started working on a more-effective (and efficient)
> set of classes to extract text from PDF docs. The plan was to contribute the
> classes to this community and build on the functionality over time. The
> process seems to be pretty straightforward and I hope to complete the first
> version in the near future.
>   In the interim, if anyone would like my Etymon "implementation" I'll be
> happy to send off the code provided whoever requests it is aware it was
> slapped together quickly for a concept-test and could/should be tightened up
> a LOT. The set of classes I'm currently working on addresses a lot of the
> limitations that are visible in the implementation. (It would probably
> suffice to say it's an example of how to use the PJ classes to extract the
> text from a PDF doc.)
> Cheers
> Robert MacMillan
> On 2/16/02 9:59 PM, "Ivaylo Zlatev" <> wrote:
>> If you want to parse PDF documents, the best approach would be to use
>> the Adobe IFilter for PDF, which is a COM component. You will need to
>> write a Java client, which interacts with that COM component.
>> I believe it is easily doable, but I have never done anything like
>> this.
>> It's a very interesting project, though.
>> Also, you will have to perform the PDF-to-text conversion on a Windows
>> machine.
>> v/ixrefint_9sfm.asp
>> Regards,
>> Ivaylo Zlatev
>> -----Original Message-----
>> From: Otis Gospodnetic []
>> Sent: Saturday, February 16, 2002 7:15 AM
>> To: Lucene Developers List
>> Subject: RE: HTMLParser
>> Hm, I thought this place would have a PDF parser, but it does not.
>> It does seem to have an RTF parser:
>> Perhaps some of these things can be adopted by Lucene, people could
>> contribute Java classes for interacting with specific parsers, and all
>> that could then be included in Lucene to work together with those
>> DocumentHandlers mentioned a few days ago.
>> Otis
> --
> To unsubscribe, e-mail:   <>
> For additional commands, e-mail: <>
