lucene-dev mailing list archives

From Robert MacMillan <macmill...@rogers.com>
Subject Re: Parsing PDF documents
Date Sun, 17 Feb 2002 04:36:30 GMT

    I found that you can use the Etymon PJ classes
(http://www.etymon.com/pj/) and extract the text from PDF documents with
very little effort. The advantage with the Etymon classes is there is no
need for COM objects. It worked extremely well for the majority of
documents; at worst, a few documents had all their text extracted but with
some of it out of order. (That has more to do with the layout of the document
than anything else.)

    On that note, I have started working on a more effective (and efficient)
set of classes to extract text from PDF docs. The plan is to contribute the
classes to this community and build on the functionality over time. The
process seems to be pretty straightforward and I hope to complete the first
version in the near future.

    In the interim, if anyone would like my Etymon "implementation" I'll be
happy to send off the code, provided whoever requests it is aware it was
slapped together quickly as a concept test and could/should be tightened up
a LOT. The set of classes I'm currently working on addresses a lot of the
limitations that are visible in that implementation. (It would probably
suffice to say it's an example of how to use the PJ classes to extract the
text from a PDF doc.)

Cheers

Robert MacMillan

On 2/16/02 9:59 PM, "Ivaylo Zlatev" <IZlatev@entigen.com> wrote:

> 
> If you want to parse PDF documents, the best approach would be to use
> the Adobe IFilter for PDF, which is a COM component. You will need to
> write a Java client, which interacts with that COM component.
> I believe it is easily doable, but I have never done anything like
> this.
> It's a very interesting project, though.
> Also, you will have to perform the pdf-text conversion on a windows
> machine.
> 
> http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276
> 
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsr
> v/ixrefint_9sfm.asp
> 
> 
> Regards,
> Ivaylo Zlatev
> 
> 
> 
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Saturday, February 16, 2002 7:15 AM
> To: Lucene Developers List
> Subject: RE: HTMLParser
> 
> 
> Hm, I thought this place would have a PDF parser, but it does not.
> It does seem to have a RTF parser:
> http://cobase-www.cs.ucla.edu/pub/javacc/
> 
> Perhaps some of these things can be adopted by Lucene, people could
> contribute Java classes for interacting with specific parsers, and all
> that could then be included in Lucene to work together with those
> DocumentHandlers mentioned a few days ago.
> 
> Otis



