Message-ID: <393118A1.39D@es.co.nz>
Date: Mon, 29 May 2000 01:01:21 +1200
From: Dan Morrison
Organization: Disorganised
To: general@xml.apache.org
Subject: Re: Cocoon the other way???

Samuel Kock wrote:
>
> Hi
>
> I am quite new to this list, but have been following the goings-on for
> some time. Most things you people talk about are a bit Greek to me, but
> I do have one question:

Well, some of it can be a bit techie (where are your classpaths, etc.) but now and then higher thoughts are discussed here...

> Is it possible to maybe write an extension to Cocoon that would go the
> other way? For example, convert a PDF file to an XML file using (I
> suppose) XSLT? Or an RTF or HTML file, for that matter????

While I can sympathise with this desire, it's just not a happening thing, except in the most basic sense. It is possible to design round-trip stylesheets, so that data coming from XML into $PRESENTATION_FORMAT can be decomposed back into XML, but in the general case the answer is currently no. If you were able to impose the same limitations upon your source input as XMLers do, with strict DTDs and all, sure! But the general case is nothing like that.

The task you're looking at is (I assume) migration of legacy documents into a more versatile medium.
The problem is that you cannot add value to these documents automatically without some deeper understanding of the context and the rules the layout follows.

_If_ you were converting vanilla HTML which used its heading tags in a consistent way, you could extract an XML file that could indicate "here is the beginning of a section, its title is ZZZ". This probably isn't even enough to deduce where your finer-grained tags could go, but it would maintain the structure that was already there. I'd say the only web site I've seen in the last year which tried to do so was the W3C's. Think of trying to make sense of the semantic content of any high-profile site that hasn't been designed with these hooks in there beforehand. (News sites have hooks, as do many database-driven sites.)

Theoretically the same could be done with Word files, /assuming/ that every author had used 'styles' consistently and rigorously. In the real world that almost never happens. A company could design a perfect standard template, but the first bozo to use it will delete the date by accident, then replace it with text that may look exactly the same, but be 'bold italic' instead of style:date. So the automated process that is to slurp this doc has much less chance of popping it into a field. Have a play with HTML Transit on some real-world docs from a period of time, and you'll see that it can be done, but painfully, and with many special cases and 'training' of the algorithms.

PDF is even further removed. When it comes to extracting semantic context out of a DTP doc, you can compare it with trying to convince an OCR scanner to behave. It can be done, but /generally/ on a case-by-case basis. Designers don't say "here is the title, here is the byline"; they position it there, somehow, and leave it to the viewer to deduce from font size and position what part of the document it is. You can try to teach a scanning application that font size 18 indicates a title, but good luck getting it to find the correct context or logical position for a pullquote.

I can't speak for the Cocoon dev guys, but this worthy field of endeavour is probably a different ball game from what XSL publishing is about in the current environment.
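For what the "consistent headings" case looks like in practice, here's a minimal sketch (in modern Python, which obviously postdates this thread; the class name and the <section .../> output shape are my own invention, not anything Cocoon does). It mines H1/H2 tags for structure and emits a flat XML-ish outline -- and it works precisely because it assumes the source used those tags consistently:

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect the text of <h1>/<h2> headings as (level, title) pairs."""

    def __init__(self):
        super().__init__()
        self.in_heading = None  # the heading tag we are currently inside, or None
        self.sections = []      # list of (level, title) tuples

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.in_heading = tag

    def handle_endtag(self, tag):
        if tag == self.in_heading:
            self.in_heading = None

    def handle_data(self, data):
        # Only text that appears inside a heading counts as a title.
        if self.in_heading and data.strip():
            self.sections.append((self.in_heading, data.strip()))

html = "<h1>Report</h1><p>intro</p><h2>Findings</h2><p>body</p>"
p = SectionExtractor()
p.feed(html)
for level, title in p.sections:
    print(f'<section level="{level}" title="{title}"/>')
# prints:
# <section level="h1" title="Report"/>
# <section level="h2" title="Findings"/>
```

Note what it can't do: if the "heading" is really a bold paragraph or a font-size tweak, nothing fires, which is exactly the failure mode described above.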
I /am/ predicting the time when round-trip publishing is a reality (working on it myself part-time), but for legacy documents you're looking at a tricky subset of artificial intelligence to get results. Or something like HTML Transit: effective, but not automatic.

This is my real-world P.O.V., having been there in many guises. Try having 28 departments get faxed (I know, I know) a template page, then having 127 documents come back containing some sort of Word 6 representation of a 5 x 16 field table full of mixed data. /Although/ visually and semantically most of them were equivalent, macroing that into a database was pretty raw. Just did the same thing again on a smaller scale last week, when auditing several hundred IP allocations for the university... "please correct the user details for your department as shown on this document and send it back". Uuurg. The things they did to that plaintext...

There may be some academics out there who can prove this whole thing is perfectly possible in theory, but as usual I'm just speaking from the trenches.

Good luck.

.dan.

:=====================:====================:
: Dan Morrison        : The Web Limited    :
: http://here.is/dan  : http://web.co.nz   :
: dman@es.co.nz       : danm@web.co.nz     :
: 04 384 1472         : 04 495 8250        :
: 025 207 1140        :                    :
:.....................:....................:
: If ignorance is bliss, why aren't more people happy?
:.........................................: