uima-user mailing list archives

From holmberg2066@comcast.net (g...@holmberg.name)
Subject Document "properties" and SourceDocumentInformation
Date Wed, 28 Feb 2007 06:26:44 GMT
What is the recommended way of storing document properties, such as "author", "date created",
"title", etc?

I also need some data for internal uses, such as the document size and URI.

One other requirement: this is not a closed vertical solution with a known set of annotators
designed to inter-operate.  This is an application platform that will use some known annotators
but allow plugging in arbitrary unknown annotators from other companies (that's why one uses
UIMA, of course!).  Also, some of our annotators may be used in UIMA containers from other
companies with unknown annotators.  So my code can't assume that the UIMA container provides any
data structure containing these properties, or that annotators other than our own know about
one.

I see a few possibilities:

1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.

The documentation recommends not adding features to DocumentAnnotation if you are using JCas
(I am).  I agree--what if both my annotators and someone else's annotator have added features
to DA?  It just wouldn't work, right?

It's the same with SDI, if two annotators both add features to it.  The additions are in conflict,
and they can't be merged.

SDI is useful, however, since it has the document size and URI.  Despite it being in a package
called "examples", in truth it's become a standard.  All the annotators that ship with UIMA
use it.  If you want to use the semantic search (Juru) indexing CAS Consumer, you have to
use SDI.  I'm sure many annotators in the world have used SDI.

I would like my annotators and UIMA container to be compatible with all those annotators.
 Therefore, I think I have to use SDI for size and URI, but not modify it.

Creating my own annotation (or is extending TOP FS better?) seems like the best answer.  My
UIMA container and set of annotators would know about it, and others' annotators wouldn't
be affected.  My annotators would have to gracefully degrade when running in a UIMA container
that doesn't provide this new annotation.
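
To make the "gracefully degrade" part concrete, here is a rough sketch in plain Java with no
UIMA dependency.  PropertySource and TitleAwareAnnotator are hypothetical names invented for
this illustration; they are not UIMA interfaces.  In a real annotator the equivalent check
would presumably be a lookup of the custom type in the CAS type system, followed by a fallback
when the type is not defined.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for what a container may or may not provide; not a UIMA interface. */
interface PropertySource {
    /** Returns the document properties, or null if this container does not supply them. */
    Map<String, String> documentProperties();
}

/** Hypothetical annotator logic that works with or without the extra properties. */
final class TitleAwareAnnotator {
    static final String FALLBACK = "untitled document";

    /** Uses the "title" property when it is available, otherwise falls back to a default. */
    static String describe(PropertySource source) {
        Map<String, String> props = source.documentProperties();
        if (props == null) {
            // The container knows nothing about our properties: degrade instead of failing.
            return FALLBACK;
        }
        String title = props.get("title");
        return title != null ? title : FALLBACK;
    }

    public static void main(String[] args) {
        Map<String, String> known = new HashMap<>();
        known.put("title", "Quarterly Report");

        PropertySource ourContainer = () -> known;   // supplies the properties
        PropertySource otherContainer = () -> null;  // supplies nothing

        System.out.println(describe(ourContainer));
        System.out.println(describe(otherContainer));
    }
}
```

The point of the sketch is only that the missing-data case is an ordinary code path rather
than an error, so the same annotator can run in either kind of container.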

What are people's thoughts?  1, 2 or 3?

================

Longer term, I think we as a community need to define Type Systems that allow inter-operability
of annotators and CAS Consumers.  For example, we could create an official SourceDocumentInformation
that allows arbitrary sets of document properties as simple name-value pairs.  In other words,
add this feature to SourceDocumentInformation:

        properties    uima.cas.FSArray    PropertyFS

    uima.PropertyFS    uima.cas.TOP
        name          uima.cas.String
        value         uima.cas.String
        scheme        uima.cas.String
And define that names, values, and schemes conform to the Dublin Core Metadata Initiative
standards.
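
As a rough illustration of the shape I have in mind, here is the same structure modelled in
plain Java rather than as real CAS types.  The classes below are stand-ins invented for this
sketch (they are not UIMA or JCas classes), the URI and values are made up, and the property
names "title" and "date" and the scheme "W3CDTF" are meant as examples of Dublin Core element
names and encoding schemes.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for the proposed uima.PropertyFS type (not a UIMA class). */
final class PropertyFS {
    final String name;
    final String value;
    final String scheme; // encoding scheme for the value; null when none applies

    PropertyFS(String name, String value, String scheme) {
        this.name = name;
        this.value = value;
        this.scheme = scheme;
    }
}

/** Stand-in for SourceDocumentInformation extended with the proposed "properties" feature. */
final class SourceDocumentInfoSketch {
    final String uri;
    final int documentSize;
    final List<PropertyFS> properties = new ArrayList<>();

    SourceDocumentInfoSketch(String uri, int documentSize) {
        this.uri = uri;
        this.documentSize = documentSize;
    }

    /** Returns the value of the first property with the given name, or null if absent. */
    String firstValue(String name) {
        for (PropertyFS p : properties) {
            if (p.name.equals(name)) {
                return p.value;
            }
        }
        return null;
    }
}

final class PropertyDemo {
    public static void main(String[] args) {
        SourceDocumentInfoSketch sdi =
                new SourceDocumentInfoSketch("file:/tmp/example.txt", 1024);
        sdi.properties.add(new PropertyFS("title", "Example Document", null));
        sdi.properties.add(new PropertyFS("date", "2007-02-28", "W3CDTF"));

        // An annotator that only understands "title" can ignore every other property.
        System.out.println(sdi.firstValue("title"));
        // A property nobody supplied simply comes back as null.
        System.out.println(sdi.firstValue("creator"));
    }
}
```

Because the properties are an open list of name-value pairs, an annotator from another company
could add or read entries without anyone having to change the type definition itself.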


Similarly, I think we need to create Type System standards for representing document structure.
 For example, how could HTML elements and attributes be stored in the CAS such that all annotators
could depend on them being there and therefore make intelligent use of them?


And finally, we need some Type System standards for representing certain common result annotations,
such as lexical markup and named entities.  How can we combine two annotators from different
companies if they don't have a shared definition of the data flowing between them?


And isn't this the whole point of UIMA?  It appears to me that the UIMA dream won't come true
until we create these standards for data exchange or data transformation within the CAS.

In my opinion, the current situation really limits the usefulness of UIMA as a platform for
text processing (unless you control every piece of code in the system, of course).

How do we start such a consortium?

Thanks for listening,


Greg Holmberg
