uima-user mailing list archives

From "Adam Lally" <ala...@alum.rpi.edu>
Subject Re: Document "properties" and SourceDocumentInformation
Date Wed, 28 Feb 2007 14:38:57 GMT
On 2/28/07, Thilo Goetz <twgoetz@gmx.de> wrote:
> greg@holmberg.name wrote:
> > 1. Add features to DocumentAnnotation
> > 2. Add features to SourceDocumentInformation
> > 3. Create my own annotation or TOP FS.
> >
> If you use the JCas, as you say you do, definitely 3.  There is no need
> to use an annotation, extending TOP would be sufficient.

I agree.  Adding features to existing types should generally be avoided
if there's an acceptable alternative solution, especially if you're
using JCas.  BTW in 2.1 we've added the ability to index and retrieve
non-Annotation FeatureStructures without having to define a custom
index in your component descriptor, which should make it much more
convenient to use a document metadata Type that extends TOP.

> > Longer term, I think we as a community need to define Type Systems that
> > allow inter-operability of annotators and CAS Consumers.  For example, we
> > could create an official SourceDocumentInformation that allows arbitrary
> > sets of document properties as simple name-value pairs.  In other words,
> > add this feature to SourceDocumentInformation:
> >
> >         properties    uima.cas.FSArray    PropertyFS
> >
> >     uima.PropertyFS    uima.cas.TOP
> >         name      uima.cas.String
> >         value     uima.cas.String
> >         scheme    uima.cas.String
> I'm personally not a big fan of arbitrary attribute-value schemes like
> this.  You need yet another place (outside the type system) where you
> document what the properties are that you define and expect.

Agreed.  Our hope is that the type system would be used for declaring
these things.  There can be more than one Type declared for holding
different kinds of document metadata (e.g. a DublinCoreMetadata  type,
in addition to other types with different properties).
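
A minimal sketch of what such a declaration could look like in a UIMA
type system descriptor.  The type name org.example.DublinCoreMetadata and
the three features (borrowed from Dublin Core element names) are
illustrative assumptions, not an agreed-upon or official type system:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>ExampleDocumentMetadataTypeSystem</name>
  <description>Hypothetical example of a document metadata type.</description>
  <types>
    <typeDescription>
      <!-- Hypothetical type name; extends TOP rather than Annotation -->
      <name>org.example.DublinCoreMetadata</name>
      <description>Document-level metadata, declared as typed features.</description>
      <supertypeName>uima.cas.TOP</supertypeName>
      <features>
        <featureDescription>
          <name>title</name>
          <description>Dublin Core "title" element.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>creator</name>
          <description>Dublin Core "creator" element.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>date</name>
          <description>Dublin Core "date" element.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
```

Because each property is a declared feature, the expected properties are
documented by the type system itself rather than in a separate place.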

It might be useful if these all extended some base DocumentMetadata
type that did not define any features, just so it would be clear that
they all represent some kind of document metadata.
> > Similarly, I think we need to create Type System standards for
> > representing document structure.  For example, how could HTML elements and
> > attributes be stored in the CAS such that all annotators could depend on
> > them being there and therefore make intelligent use of them?
> >
> >
> > And finally, we need some Type System standards for representing certain
> > common result annotations, such as lexical markup and named entities.  How
> > can we combine two annotators from different companies if they don't have
> > a shared definition of the data flowing between them?
> >
> >
> > And isn't this the whole point of UIMA?  It appears to me that the UIMA
> > dream won't come true until we create these standards for data exchange or
> > data transformation within the CAS.
> >
> > In my opinion, the current situation really limits the usefulness of UIMA
> > as a platform for text processing (unless you control every piece of code
> > in the system, of course).
> >
> > How do we start such a consortium?
> This mailing list is a good start ;-).  I know there are others who work
> on similar things, but I'll let them speak for themselves.
> One issue of course is that it is difficult to agree on any common type
> system.  It's hard enough to even agree on what an annotation is, let
> alone specific types of annotations.  We could try to define a certain
> base set on Apache.  I would hesitate to put more built-in types into
> UIMA itself, though.  I'd rather have a type system repository where we
> modularly define certain kinds of type systems (such as html markup, for
> example), and that people can use, or not.

Right.  I think largely it's the uima users, not the framework
developers, who would have to participate in forging agreements on
common type systems.  So uima-user seems like a good place to have
discussions like that, at least once this list has a larger number of
subscribers, which we hope will happen after the Apache release is out
and people have migrated to it.

Possibly, type systems that have gotten significant support on
uima-user might be included in the Apache UIMA release, initially as
part of the "sandbox".

