Mailing-List: contact general-help@xml.apache.org; run by ezmlm
Delivered-To: mailing list general@xml.apache.org
From: Paul.Waugh@wdr.com
Date: Fri, 28 Jan 2000 13:59:32 +0000
In-Reply-To: <38919117.1575@es.co.nz>
Subject: Re: PDF to XML - LOL!
To: general@xml.apache.org
Content-Type: text/plain; charset=us-ascii

It has been really interesting following the threads on this item, and it gives me another perspective on PDF -> XML.

My reason for posting was that this system has legacy docs in PDF, and from an architectural standpoint, if I can get them into XML then I can react to the business a lot quicker.
Really, all I want to do is put together a framework where the PDF docs can be mixed with associated data from other systems and then served to the relevant user service, i.e. WWW, WAP, B2B, PDA, eBook?, other messaging systems, or anything else that comes along. As I see it, whatever I build now should not be a quick fix that mixes PDF in with some other stuff just to deliver to the WWW.

To pick up a question in Dan's note: I think I might be able to get the source of a few documents, but I would like to point out that we are talking about tens of thousands of documents in this particular case. :-( Not good.

thanks
Paul

______________________________ Reply Separator _________________________________
Subject: Re: PDF to XML - LOL!
Author: dman (dman@es.co.nz) at unix,mime
Date: 28/01/00 12:52

Pierpaolo Fumagalli wrote:
> ... You can't "recontextualize"
> information that was extracted from its context...

Indeed. I accept that someone may take it upon themselves to inline a representation of binary or proprietary data (I still think of PDF as proprietary, in comparison to XML anyway). I guess you're welcome to introduce a block or whatever suits.

The thing is, it's a bit beyond XML translators (at the moment) to look at a magazine page and break it up into its constituent bits with meaningful tag names. Heck, even translating from Word->HTML is a mess unless the original has been crafted using style templates 100% of the time. In my experience PDF (with its eye on a completely different ball) tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the document even more.

Honestly, if you really need to proceed in this direction, the best you're going to achieve is a parcel of nodes, similar to a 'save as plain text' function in the DTP packages. OK, possibly you can tune it to recognise titles & bylines, but only for your select group of identically structured source docs. There will be no push-button solution for a while.
Seeing as you're looking into this field, have you ever tried to train HTML-Transit to do its translations? It'd be like that, only worse & less accurate.

Do you have access to the source documents that the PDFs were distilled from? Get hold of them and you _may_ find a better packaged solution available.

- trying to be constructive this time -

...dan.