From: puffmail@darksleep.com (Steven J. Owens)
To: Lucene Users List
Date: Tue, 7 May 2002 21:52:05 -0400
Subject: Re: indexing PDF files
Message-ID: <20020507215205.B23779@darksleep.com>
In-Reply-To: <0FF45A9C-5E79-11D6-BFCD-000393760B7E@mac.com>

> On Wednesday, May 1, 2002, at 05:41 PM, Otis Gospodnetic wrote:
> >Wouldn't you want to convert to XML instead and use XSLT to transform
> >the XML representation to any desired format by just applying a style
> >sheet?
> >Sounds like less work with bigger document type coverage.

And then, on Fri, May 03, 2002 at 11:35:10AM +0200, petite_abeille wrote:
> Sounds good... But what does it mean? I'm not that familiar with any of
> the XML, XSLT hype so I don't really understand what you are getting
> at...
> I just want to convert any type of document to text for indexing
> purpose... I'm not planning to do anything else with it... However,
> converting everything to PDF as a first step allow you to provide a
> "preview" of any documents even if you happen not to understand the
> original format (eg MS Office)...

What Otis is getting at is that, while, yes, normalizing all docs to
one format before indexing them is probably a good idea, it may also
be a good idea to choose a target format other than PDF.

XML is probably a good format for two simple reasons. First, it's
becoming the de facto standard for data exchange, including in numerous
document development, delivery and management systems. Second, there
are lots and lots of tools out there for working with XML, particularly
in Java and in open source, with more coming every day.

PDF is a format designed for presentation in general, and particularly
for presenting print documents on screen. The majority of the use I've
seen of PDF in the years since it was introduced is as a portable,
printable file format. No need for PostScript printers or a copy of
Microsoft Word to print the file; just get the small, free, easily
downloaded (and already installed in most browsers) Acrobat Reader.

XML is a format designed for conversion, manipulation and
transformation, and is in general much more heavily supported in the
programming world.

A good example in this case might be the Apache FOP project
(http://xml.apache.org/fop/), which can generate PDF from XML. Going
in that direction is in general a straightforward task: searching
Google for "convert pdf xml" turns up tons of links on how to convert
from XML to PDF, but none on how to convert from PDF to XML.

Steven J. Owens
puff@darksleep.com