Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Date: Tue, 7 May 2002 21:52:05 -0400
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: Re: indexing PDF files
Message-ID: <20020507215205.B23779@darksleep.com>
Reply-To: puff@darksleep.com
References: <20020501154156.36966.qmail@web12703.mail.yahoo.com>
 <0FF45A9C-5E79-11D6-BFCD-000393760B7E@mac.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <0FF45A9C-5E79-11D6-BFCD-000393760B7E@mac.com>
User-Agent: Mutt/1.3.23i
From: puffmail@darksleep.com (Steven J. Owens)

> On Wednesday, May 1, 2002, at 05:41 PM, Otis Gospodnetic wrote:
> >Wouldn't you want to convert to XML instead and use XSLT to transform
> >the XML representation to any desired format by just applying a style
> >sheet?
> >Sounds like less work with bigger document type coverage.

And then, On Fri, May 03, 2002 at 11:35:10AM +0200, petite_abeille wrote:
> Sounds good... But what does it mean? I'm not that familiar with any of 
> the XML, XSLT hype so I don't really understand what you are getting 
> at... I just want to convert any type of document to text for indexing 
> purpose... I'm not planning to do anything else with it... However, 
> converting everything to PDF as a first step allow you to provide a 
> "preview" of any documents even if you happen not to understand the 
> original format (eg MS Office)...

     What Otis is getting at is that while, yes, normalizing all docs
to one format before indexing them is probably a good idea, it may
also be a good idea to choose a target format other than PDF.  XML is
probably a good choice, for two simple reasons:

     it's becoming the de facto standard for data exchange, including
     in numerous document development, delivery and management systems;

     there are lots and lots of tools out there for working with XML,
     particularly in Java and in open source, and more coming every day.

     PDF is a format designed for presentation in general, and
particularly for presenting print documents on screen.  Most of the
use I've seen of PDF in the years since it was introduced is as a
portable printable file format.  No need for a PostScript printer or a
copy of Microsoft Word to print the file; just get the small, free,
easily downloaded (and already installed in most browsers) Acrobat
Reader.  XML is a format designed for conversion, manipulation and
transformation, and is in general much more heavily supported in the
programming world.
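
     To make the "just apply a style sheet" idea concrete, here is a
minimal sketch of pulling indexable plain text out of an XML document
using the standard JAXP API (javax.xml.transform).  The class name,
the method name and the tiny stylesheet are my own inventions for
illustration, not anything from Lucene or from Otis's setup, and a
real document type would want a stylesheet that knows its structure.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: flattens an XML document to plain text for indexing.
public class XmlToText {

    // Illustrative stylesheet (an assumption, not a standard one):
    // emit every text node followed by a space, and nothing else.
    private static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"text()\">"
      + "<xsl:value-of select=\".\"/><xsl:text> </xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    public static String toPlainText(String xml) throws TransformerException {
        // Compile the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));

        // Run the transformation, collecting the text output in memory.
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));

        // Collapse runs of whitespace so the indexer sees clean tokens.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<doc><title>Indexing PDF</title>"
                   + "<body>Convert to text first.</body></doc>";
        System.out.println(toPlainText(xml));
    }
}
```

     The resulting string is what you would hand to the indexer as
the document's text.  Swapping in a different stylesheet would give
you HTML for a preview instead, without touching the Java code.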

     A good example in this case might be the Apache FOP project
(http://xml.apache.org/fop/), which can generate PDF from XML.  Going
in that direction is in general a straightforward task; searching
Google for "convert pdf xml" turns up tons of links on how to convert
from XML to PDF, but none on how to convert from PDF to XML.

Steven J. Owens
puff@darksleep.com
