uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: read/write resource sharing
Date Thu, 30 Aug 2007 14:15:28 GMT
Hi Andrew and everyone reading this thread -

There are some misconceptions, I think, in this thread, caused by some
imprecise language in our Javadocs, perhaps.

The way external resources are shared among annotators is that the XML
descriptors specify an "interface" and an "implementation".  Multiple
external resource dependencies can be specified to share the same
implementation.  Aggregates can override the implementation specified in
their delegates. 

One instance of the implementation class is created by the framework;
this instance is "shared" by all annotators running in the same JVM.

This shared implementation can be user-written, and can do anything it
wants.  For instance, it could keep an in-memory copy of some data and
make it available to all the annotators sharing this; the design of both
the interface and the concrete implementation class is up to you.
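
To make the interface / implementation split concrete, here is a minimal
sketch in plain Java. The names ConfigLookup and InMemoryConfigLookup are
hypothetical and not part of UIMA; in a real deployment the implementation
class would additionally implement the framework's SharedResourceObject
interface so that the framework can create and load it. That part is left
out here so the sketch runs with only the JDK.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface that the annotators would program against.
interface ConfigLookup {
    String get(String key);
}

// Hypothetical implementation holding an in-memory copy of some data.
// Assumption: a real UIMA resource would also implement SharedResourceObject.
final class InMemoryConfigLookup implements ConfigLookup {
    private final Map<String, String> entries;

    InMemoryConfigLookup(Map<String, String> initial) {
        // Defensive copy that is never modified afterwards, so it is safe
        // to read from several annotators at once.
        this.entries = Collections.unmodifiableMap(new HashMap<>(initial));
    }

    @Override
    public String get(String key) {
        return entries.get(key);
    }
}

class ConfigLookupDemo {
    public static void main(String[] args) {
        Map<String, String> data = new HashMap<>();
        data.put("language", "en");
        ConfigLookup shared = new InMemoryConfigLookup(data);

        // Two "annotators" holding the same instance see the same data.
        ConfigLookup annotatorA = shared;
        ConfigLookup annotatorB = shared;
        System.out.println(annotatorA.get("language").equals(annotatorB.get("language")));
    }
}
```

This version is read-only on purpose; a resource that annotators also write
to needs the extra care discussed further down.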

That being said, the framework supplies some example interfaces /
implementations for this, one of which is "DataResource".  If you look
at the implementation of DataResource - you can see that the thought
behind the Javadoc comment "if you directly access the resource, the
benefits of the ResourceManager  (caching and sharing) are lost" is
perhaps misleading.  The "caching" being contemplated here was to read a
remote (assume slow to access) file and write it out in the local file
system, to be accessed more quickly.  However ***this is not
implemented***.  You can see the impl code on-line if you don't want to
download the source; for release 2.2 it is here:
http://svn.apache.org/viewvc/incubator/uima/uimaj/tags/uimaj-2.2.0/uimaj-2.2.0-incubating/uimaj-core/src/main/java/org/apache/uima/resource/impl/DataResource_impl.java?view=markup
Its class comment says:
/**
 * A simple {@link DataResource} implementation that can read data from a file via a URL. There is
 * an attribute for specifying the location of a local cache for a remote file, but this is not
 * currently being used.
 */

When different annotators sharing the same instance call its
getInputStream() method, each call returns a new, unshared input
stream.
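
The effect can be illustrated with a small stand-in class. This is only a
sketch under my own assumptions, not the actual DataResource_impl code:
UrlBackedResource is a hypothetical name, and the temp file stands in for
whatever the resource URL points at.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a URL-backed resource shared by annotators.
final class UrlBackedResource {
    private final URL location;

    UrlBackedResource(URL location) {
        this.location = location;
    }

    // Each call opens a brand-new stream; nothing is cached or shared.
    InputStream getInputStream() throws IOException {
        return location.openStream();
    }
}

class FreshStreamDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shared", ".cfg");
        Files.write(file, "mode=fast".getBytes(StandardCharsets.UTF_8));
        UrlBackedResource resource = new UrlBackedResource(file.toUri().toURL());
        try (InputStream first = resource.getInputStream();
             InputStream second = resource.getInputStream()) {
            first.read(); // advances only the first stream
            // The second stream is independent and still at the first byte.
            System.out.println((char) second.read());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Sharing the resource object therefore shares only the location, not any
state read from it.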

So - I think this resource is probably not what you want.

You might want to implement your own resource, to do exactly the kind of
sharing you want.  If you do, please keep in mind the different possible
deployment alternatives that others using your components may set up. 
For instance, if they deploy things with some scale-out where multiple
instances are running concurrently in the same JVM, then you will need
to ensure that your implementation is thread-safe, and follows the rules
for the Java memory model.  This will involve using "synchronized" or
"volatile" keywords, for instance, in appropriate spots.

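As a rough illustration of where those keywords might go, here is a
minimal sketch of a read/write resource. SharedSettings is a hypothetical
class of my own, not something the framework supplies, and a real resource
may well want a different locking strategy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical read/write resource shared by annotators on several threads.
final class SharedSettings {
    // HashMap is not thread-safe, so every access goes through the lock.
    private final Map<String, String> values = new HashMap<>();
    // volatile: an update made by one thread is visible to later lock-free
    // reads by other threads.
    private volatile long version = 0;

    synchronized void put(String key, String value) {
        values.put(key, value);
        version++; // only ever modified while holding the lock
    }

    synchronized String get(String key) {
        return values.get(key);
    }

    // Cheap check for "has anything changed since I last looked?"
    long version() {
        return version;
    }
}

class SharedSettingsDemo {
    public static void main(String[] args) throws InterruptedException {
        SharedSettings settings = new SharedSettings();
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                settings.put("k" + i, "v");
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();
        // Both writers ran to completion without losing an update.
        System.out.println(settings.version());
    }
}
```

Without "synchronized" on put(), the two writers could corrupt the map or
lose version increments.
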
Furthermore, if your components could be deployed in some arrangement
where some of them are running on different JVMs (perhaps scaled out
across multiple hosts, for instance), then to actually share data,
you'll need to use the same techniques used in web servers that do this
- for instance, putting shared data into a database, and having all the
parts access that database.

I hope this is helpful, but please let me know if I've misunderstood the
questions...

-Marshall

Andrew Shirk wrote:
> Hi Michael,
>
> Yes, that's the approach I started with, but the DataResource javadoc
> indicates that if you directly access the resource, the benefits of
> the ResourceManager  (caching and sharing) are lost. Furthermore, if
> in my SharedResourceObject implementation I make modifications to the
> resource, then it will be out of sync with the ResourceManager's
> cache. The next annotator very well may get the stale version of the
> resource.
>
> Thilo, I'm afraid that's the approach I may end up having to use, but
> it's really a kludge.
>
> Is there no global variable space, outside of the CAS, for the entire
> aggregate? If there were, that would be the best solution I think...
>
> Thanks for the suggestions.
>
> Andrew
>
> At 11:27 AM 8/29/2007, you wrote:
>> Another possibility is external resources. When defining external
>> resources one or more annotators can share the same resource.
>> The UIMA framework takes care of the resource's life cycle.
>> You will find some documentation about external resources in the UIMA
>> reference guide at 2.4.1.10. External Resource Dependencies.
>> You can also check the UIMA examples - tutorial ex6 uses external
>> resources. (apache-uima/examples/descriptors/tutorial/ex6)
>>
>> -- Michael
>>
>> Thilo Goetz wrote:
>>> If this happens often, one idea might be just to
>>> stick the information in the CAS.  That way you
>>> can even run several instances of this pipeline
>>> and it will still work ;-)  Of course you're not
>>> persisting the info that way, not sure if this is
>>> a requirement or not.
>>>
>>> --Thilo
>>>
>>> Andrew Shirk wrote:
>>>
>>>> What is the best practice for sharing read/write resources amongst
>>>> analysis engines in an aggregate? For example, say you have an
>>>> annotator early in a flow that reads a configuration file off disk
>>>> in order to determine its behavior. Then, the next annotator does
>>>> something, and needs to write changes to the configuration file so
>>>> that another annotator downstream, whose behavior is also determined
>>>> by the contents of the configuration file, can read in the resource
>>>> that contains the changes.
>>>>
>>>> Does this make sense?
>>>>
>>>> Any help or ideas would be appreciated. I can think of some ugly
>>>> hacks, but it would be nice to know if I'm missing some portion of
>>>> the API that supports this type of scenario.
>>>>
>>>> Thanks, Andrew
>>>>
>
>
>

