From: Jeremias Maerki
To: general@xmlgraphics.apache.org, tika-dev@incubator.apache.org, sanselan-dev@incubator.apache.org, Ben Litchfield
Subject: Metadata use by Apache Java projects
Date: Mon, 19 Nov 2007 10:26:47 +0100
Message-Id: <20071119093545.2C0D.DEV@jeremias-maerki.ch>

(I realize this is heavy cross-posting but it's probably the best way to reach all the players I want to address.)

As you may know, I've started developing an XMP metadata package inside XML Graphics Commons in order to support XMP metadata (and ultimately PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.

What is XMP? XMP, for those who don't know about it, is based on a subset of RDF to provide a flexible and extensible way of storing/representing document metadata.

Yesterday, I was surprised to discover that Adobe has published an XMP Toolkit with Java support under the BSD license. In contrast to my effort, Adobe's toolkit is quite complete if maybe a bit more complicated to use.

That got me thinking: Every project I'm sending this message to is using document metadata in some form:

- Apache XML Graphics: embeds document metadata in the generated files (just FOP at the moment, but Batik is a similar candidate)
- Tika (in incubation): has as one of its main purposes the extraction of metadata
- Sanselan (in incubation): extracts and embeds metadata from/in bitmap images
- PDFBox (incubation in discussion): extracts and embeds XMP metadata from/in PDF files (see also JempBox)

Every one of these projects has its own means to represent metadata in memory. Wouldn't it make sense to have a common approach?

I've worked with XMP for some time now and I can say it's ideal to work with. It also defines guidelines to embed XMP metadata in various file formats. It's also relatively easy to map metadata between different file formats (Dublin Core, EXIF, PDF Info etc.).

Sanselan and Tika have both chosen a very simple approach but is it versatile enough for the future? While the simple Map in Tika allows for multiple authors, for example, it doesn't support language alternatives for things such as dc:title or dc:description.

I'm seriously thinking about abandoning most of my XMP package work in XML Graphics Commons in favor of Adobe's XMP Toolkit.
What it doesn't support, though:

- Metadata merging functionality (which I need for synchronizing the PDF Info object and the XMP packet for PDF/A)
- Schema-specific adapters (for Dublin Core and many other XMP Schemas) for easier programming (which both Ben and I have written for JempBox and XML Graphics Commons). Adobe's toolkit only allows generic access.

Some links:

Adobe XMP website: http://www.adobe.com/products/xmp/
Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
JempBox: http://sourceforge.net/projects/jempbox
Apache XML Graphics Commons: http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/

My questions:

- Any interest in converging on a unified model/approach?
- If yes, where shall we develop this? As part of Tika (although it's still in incubation)? As a separate project (maybe as an Apache Commons subproject)? If more than XML Graphics uses this, XML Graphics is probably not the right home.
- Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is the JempBox or XML Graphics Commons approach more interesting?
- Where's the best place to discuss this? We can't keep posting to several mailing lists.

At any rate, I would volunteer to spearhead this effort, especially since I have an immediate need for complete XMP functionality. I've almost finished mapping all XMP structures in XG Commons but I haven't committed my latest changes (for structured properties) and I may still not cover all details of XMP.

Thanks for reading this far,
Jeremias Maerki