lucene-dev mailing list archives

From Peter Carlson <carl...@bookandhammer.com>
Subject Re: Parsing PDF documents
Date Mon, 18 Feb 2002 03:56:57 GMT
Robert,
If you supply your code, I'll add it to the contributions area.
It would be great to have some code that already converts the PDF
directly to a Lucene Document.

--Peter
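
To make the request above a little more concrete, here is a minimal sketch of
the shape such a contribution could take. It is only an illustration under
stated assumptions: every name in it (HandlerSketch, TextExtractor,
HandlerRegistry, toFields) is hypothetical, a plain Map of field names to
values stands in for a Lucene Document, and the placeholder extractor simply
decodes bytes as UTF-8 where a real one would call a PDF library such as the
Etymon PJ classes discussed below. It is not Robert's code and does not use
the Lucene or PJ APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from Lucene or Etymon PJ.
class HandlerSketch {

    /** Turns raw file bytes into plain text; a PDF version would wrap a real parser. */
    interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    /** Maps a lower-case file extension to the extractor that understands it. */
    static final class HandlerRegistry {
        private final Map<String, TextExtractor> byExtension = new HashMap<>();

        void register(String extension, TextExtractor extractor) {
            byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
        }

        /** Returns field name -> value pairs, standing in for a Lucene Document. */
        Map<String, String> toFields(String fileName, byte[] content) throws Exception {
            int dot = fileName.lastIndexOf('.');
            String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            TextExtractor extractor = byExtension.get(ext);
            if (extractor == null) {
                throw new IllegalArgumentException("No handler registered for: " + fileName);
            }
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", fileName);                        // where the text came from
            fields.put("contents", extractor.extract(content));  // the searchable text
            return fields;
        }
    }

    public static void main(String[] args) throws Exception {
        HandlerRegistry registry = new HandlerRegistry();
        // Placeholder extractor: treats the bytes as UTF-8 text.
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.toFields("notes.txt", sample));
    }
}
```

A PDF handler would be registered the same way under "pdf", with the field
map replaced by whatever document type the indexer expects.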


On 2/16/02 8:36 PM, "Robert MacMillan" <macmillanr@rogers.com> wrote:

> 
>   I found that you can use the Etymon PJ classes
> (http://www.etymon.com/pj/) and extract the text from PDF documents with
> very little effort. The advantage of the Etymon classes is that there is no
> need for COM objects. It worked extremely well for the majority of
> documents; at the very worst, some documents would yield all the text with
> some of it out of order. (That has more to do with the layout of the document
> than anything else.)
> 
>   On that note, I have started working on a more effective (and efficient)
> set of classes to extract text from PDF docs. The plan was to contribute the
> classes to this community and build on the functionality over time. The
> process seems to be pretty straightforward and I hope to complete the first
> version in the near future.
> 
>   In the interim, if anyone would like my Etymon "implementation", I'll be
> happy to send off the code, provided whoever requests it is aware it was
> slapped together quickly as a concept test and could/should be tightened up
> a LOT. The set of classes I'm currently working on addresses a lot of the
> limitations that are visible in the implementation. (It would probably
> suffice to say it's an example of how to use the PJ classes to extract the
> text from a PDF doc.)
> 
> Cheers
> 
> Robert MacMillan
> 
> On 2/16/02 9:59 PM, "Ivaylo Zlatev" <IZlatev@entigen.com> wrote:
> 
>> 
>> If you want to parse PDF documents, the best approach would be to use
>> the Adobe IFilter for PDF, which is a COM component. You will need to
>> write a Java client that interacts with that COM component.
>> I believe it is easily doable, but I have never done anything like
>> this.
>> It's a very interesting project, though.
>> Also, you will have to perform the PDF-to-text conversion on a Windows
>> machine.
>> 
>> http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276
>> 
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp
>> 
>> 
>> Regards,
>> Ivaylo Zlatev
>> 
>> 
>> 
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>> Sent: Saturday, February 16, 2002 7:15 AM
>> To: Lucene Developers List
>> Subject: RE: HTMLParser
>> 
>> 
>> Hm, I thought this place would have a PDF parser, but it does not.
>> It does seem to have an RTF parser:
>> http://cobase-www.cs.ucla.edu/pub/javacc/
>> 
>> Perhaps some of these things can be adopted by Lucene, people could
>> contribute Java classes for interacting with specific parsers, and all
>> that could then be included in Lucene to work together with those
>> DocumentHandlers mentioned a few days ago.
>> 
>> Otis
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
> 
> 

