uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Default Result Specifications too complicated?
Date Tue, 17 Apr 2007 00:41:21 GMT
Adam Lally wrote:
> I'm interested in getting others' opinions on this.  I was recently
> helping some users who were having a problem where a 3rd-party
> annotator they were using wasn't producing annotations that they
> expected it to.  The annotator was embedded in a nested aggregate.  It
> took me a couple of hours to figure out why it wasn't working (and if
> they hadn't asked a uima developer, they might still be looking for
> it).

This, to me, says we should look for ways to improve this...
>
> The reason was that this particular annotator made use of the
> ResultSpecification (list of types/features that it should produce)
> and was "optimizing" by not producing annotations not listed in the
> Result Spec.  In this case, there was a downstream annotator that
> incorrectly omitted the type in question from its input capabilities.
> This makes the framework conclude that this type is not necessary, so
> it won't be included in the Result Spec. (see below for more details
> on how this works).
>
> I think the main problem here is that most users have ignored the
> Result Specification feature (we even encourage this by suggesting in
> our docs that it's only for optimization), and they get very little
> other feedback about whether they have set their input/output types.
> So they are totally unprepared to debug something like this.
>
> One possible solution is to turn off this Result Spec stuff by
> default, and provide a global switch (in the
> PerformanceTuningProperties) to turn it back on.  That way most users
> can _safely_ ignore Result Specs, and more savvy users who turn them
> on to get the best performance would presumably be more equipped to
> debug the problems that might result.

Sounds reasonable.
>
> Also we could start giving more feedback about incorrect input/output
> capabilities, although it's not totally clear what the best way to do
> that is.  It would not be good for performance to actually enforce
> these during actual processing.

Perhaps during "startup" a scan could be made to determine if any descriptor
declared type XXX as input, but no other descriptor output type XXX.  This
would not be a complete check, of course, but could catch some errors, and
wouldn't have a run time penalty (other than a startup one).
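
A minimal sketch of such a startup scan, in Python for brevity. It assumes
the declared capabilities have already been read out of the descriptors into
a plain dict; the dict layout and the function name are hypothetical and are
not part of the UIMA API.

```python
def find_unproduced_inputs(descriptors):
    """Return {component: sorted input types that no other component outputs}."""
    problems = {}
    for name, caps in descriptors.items():
        # Collect every type declared as output by some *other* descriptor.
        produced_elsewhere = set()
        for other_name, other_caps in descriptors.items():
            if other_name != name:
                produced_elsewhere.update(other_caps["outputs"])
        missing = set(caps["inputs"]) - produced_elsewhere
        if missing:
            problems[name] = sorted(missing)
    return problems


# Hypothetical capabilities: Tagger wants Sentence, but nothing produces it.
descriptors = {
    "Tokenizer": {"inputs": [], "outputs": ["Token"]},
    "Tagger": {"inputs": ["Token", "Sentence"], "outputs": ["PosTag"]},
}
print(find_unproduced_inputs(descriptors))
```

As noted, this is an incomplete check: it catches an input nobody produces,
but not an input that was left out of a descriptor altogether.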
>
> Any thoughts?

A couple of thoughts:

1) "ANT" is another system that implements quite a bit of silent "default"
behavior. It has a run option, -verbose, which causes it to output a message
anytime this defaulting behavior happens.  An example: the script says to set
a property, but the property is already set, so Ant skips it.  Another
example, the script says to load a property file, but the file doesn't exist
(normally no message).

We could have a -verbose mode so that when users find things aren't working
the way they want, they have an "easy" thing to set that is "tuned" to give
them just the kinds of info they might need to figure out what's going on.

2) The result spec thing, if we consider it to be an optimization, might be
done as a two phase thing:
  a) phase one would output information from the run saying which things
      might be skip-able.
  b) phase two would be the actual optimization, and rather than being
      indirect, it could be direct - such as setting something explicitly
      that says in big letters DO NOT OUTPUT the following types because
      they're not used downstream.

This would replace an implicit (and therefore more hidden) behavior with an
explicit one, which is a good thing, I think, for simplicity.
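
To illustrate phase one, here is a rough Python sketch of a report listing
the output types each annotator declares but that fall outside its result
spec. The function name and the two input dicts are hypothetical; how the
framework would actually gather this information is left open.

```python
def report_skippable(declared_outputs, result_specs):
    """Return {annotator: sorted declared output types not in its result spec}."""
    report = {}
    for name, outputs in declared_outputs.items():
        # With no result spec on record, treat nothing as skippable.
        wanted = set(result_specs.get(name, outputs))
        skippable = set(outputs) - wanted
        if skippable:
            report[name] = sorted(skippable)
    return report


# Hypothetical data: nothing downstream asks the Tokenizer for Sentence.
declared_outputs = {"Tokenizer": ["Token", "Sentence"], "Tagger": ["PosTag"]}
result_specs = {"Tokenizer": {"Token"}, "Tagger": {"PosTag"}}
for name, types in report_skippable(declared_outputs, result_specs).items():
    print(f"{name}: might skip {', '.join(types)}")
```

Phase two would then be the user copying such a list into an explicit
setting, rather than the framework applying it silently.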

-Marshall


>
> -Adam
>
>
> P.S. Here are the specific rules for the Result Spec (this is
> documented in the manual more or less in this form):
>
> The default Result Spec is automatically computed from the
> capabilities in the component descriptors, as follows:
>
> 1) The outermost aggregate's result spec is set to the list of its
> declared output types.
> 2) The result spec for each delegate is set to the union of the
> aggregate's result spec with the set of all input types of all other
> delegates in the aggregate.  (This is so that we ask each annotator to
> produce types that may be needed by a subsequent annotator.  This rule
> is applied independent of the order of the flow, so as to be
> completely general in the case of a custom flow controller.)
> 3) For a nested aggregate, apply rule #2 recursively.
>
> I think these rules make logical sense, and I can't think of any
> easier rules to apply other than to forget the whole thing.
>
>
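
The three rules quoted above can be sketched in a few lines of Python. This
is only an illustration under assumptions: components are modeled as plain
dicts with hypothetical keys ("name", "inputs", "outputs", "delegates"), and
none of this is the framework's actual code or API. The example reproduces
the kind of failure described at the top of the thread, where a downstream
annotator leaves a type out of its input capabilities.

```python
def compute_result_specs(component, result_spec=None, specs=None):
    """Map each component name to its default result spec (a set of type names)."""
    if specs is None:
        specs = {}
    if result_spec is None:
        # Rule 1: the outermost aggregate asks for its declared output types.
        result_spec = set(component["outputs"])
    specs[component["name"]] = set(result_spec)

    delegates = component.get("delegates", [])
    for delegate in delegates:
        # Rule 2: union the aggregate's spec with the input types of all
        # *other* delegates, independent of the order of the flow.
        needed_by_others = set()
        for other in delegates:
            if other is not delegate:
                needed_by_others.update(other["inputs"])
        # Rule 3: recurse so a nested aggregate applies rule 2 to its delegates.
        compute_result_specs(delegate, result_spec | needed_by_others, specs)
    return specs


def make_aggregate(finder_inputs):
    """Build a hypothetical two-delegate aggregate that outputs Person."""
    return {
        "name": "Outer", "inputs": [], "outputs": ["Person"],
        "delegates": [
            {"name": "Tokenizer", "inputs": [], "outputs": ["Token"]},
            {"name": "PersonFinder", "inputs": finder_inputs,
             "outputs": ["Person"]},
        ],
    }


broken = compute_result_specs(make_aggregate([]))        # Token input omitted
fixed = compute_result_specs(make_aggregate(["Token"]))  # Token input declared
print(sorted(broken["Tokenizer"]))  # ['Person']
print(sorted(fixed["Tokenizer"]))   # ['Person', 'Token']
```

In the broken case the Tokenizer is never asked for Token, so an annotator
that honors its result spec would silently stop producing it.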

