uima-user mailing list archives

From "rohan rai" <hiroha...@gmail.com>
Subject Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed
Date Mon, 30 Jun 2008 10:36:24 GMT
Sorry for misleading you guys by keeping a few facts to myself.
Let me elaborate and describe the actual problem and the solution I found.

Actually I am running my UIMA app over Hadoop.
There I encountered a big problem, which I had asked about on this forum.
I then found a solution, which later got posted over here.
This solved one set of problems, but it started to give performance issues:
instead of speeding up and scaling, I started facing two sets of problems
because of the solution mentioned in the wiki entry.

Problem 1) Out of memory error

The wiki solution talks about using

XMLInputSource in = new XMLInputSource(...)

to load the XML descriptors, and using a ResourceManager to do so.

But if this activity is carried out for each record inside the Map/Reduce class,
then eventually one gets an out of memory error in spite of increasing the heap
size considerably.

The solution is to initialize the analysis engine etc. in the
configure(JobConf) method of the Mapper/Reducer class, so that a single
instance of it is created per Hadoop task. One can also reuse the CAS across
records via its cas.reset() method.

This way the out of memory problem was solved.
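As a rough sketch of the pattern described above (old Hadoop MapReduce API): the
descriptor path "desc/MyAE.xml" and the key/value types are placeholders I have
assumed for illustration, not taken from the original post.

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine ae;   // one engine per Hadoop task, not per record
  private CAS cas;             // one CAS, reset between records

  @Override
  public void configure(JobConf job) {
    try {
      // Parse the descriptor and build the engine ONCE per task.
      XMLInputSource in = new XMLInputSource(new File("desc/MyAE.xml"));
      ResourceSpecifier spec =
          UIMAFramework.getXMLParser().parseResourceSpecifier(in);
      ae = UIMAFramework.produceAnalysisEngine(spec);
      cas = ae.newCAS();
    } catch (Exception e) {
      throw new RuntimeException("UIMA setup failed", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    try {
      cas.reset();                       // reuse the CAS instead of recreating it
      cas.setDocumentText(value.toString());
      ae.process(cas);
      // ... collect annotations from the CAS into `out` ...
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
```

The point is only where the initialization lives: configure() runs once per task
JVM, while map() runs once per record, so the expensive descriptor parsing and
engine creation happen exactly once.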

Then I started facing another problem, regarding performance.
Its source was the usage of the ResourceManager mentioned in the wiki
to solve another problem.

It was caused because each class mentioned in the descriptor was brought from
the job temp directory to the task temp directory.

So the problem now became how to achieve what the wiki entry was written to
solve, but without using the ResourceManager.

The solution is to fake imports (yes indeed, ironic that faking proved to be
useful :)). What we can do is, in the class file where the Map/Reduce task has
been implemented, import all the classes required by the descriptor initialized
in that class.

This ensures the presence of these classes at each individual task, and thus
gives a considerable increase in performance.
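The "fake imports" trick above might look like this. The annotator class names
(MyTokenizer, MyEntityAnnotator) are hypothetical stand-ins for whatever the
descriptor actually references; note that holding Class references, rather than
bare import statements alone, is what reliably forces the classes onto the
task's classpath when the job jar is assembled.

```java
// Hypothetical annotator classes the descriptor depends on -- placeholders,
// not names from the original post.
import com.example.annotators.MyTokenizer;
import com.example.annotators.MyEntityAnnotator;

public class UimaMapper /* extends MapReduceBase implements Mapper<...> */ {

  // Referencing the classes here makes the compiler and jar/packaging
  // machinery treat them as real dependencies of the Mapper, so they are
  // shipped to every task without any ResourceManager-based loading.
  @SuppressWarnings("unused")
  private static final Class<?>[] DESCRIPTOR_CLASSES = {
      MyTokenizer.class, MyEntityAnnotator.class
  };
}
```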

Keeping these points in mind, I was now able to use the beauty of UIMA and
Hadoop together to my own benefit.


On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz <twgoetz@gmx.de> wrote:

> rohan rai wrote:
>> @Pascal: As I have already said the timing does not scale linearly
>>              Secondly it the approx times which I have specified
>> @Frank:
>>     I was talking about the actual adding of annotations to the CAS.
>>    A record refers to, let's say, content in tags like these: <a>.....</a>,
>>    and the document consists of such records.
>>    Annotation is done via this method:
>>                               MyType annotation = new MyType(jCas);
>>                               annotation.setBegin(start);
>>                               annotation.setEnd(end);
>>                               annotation.addToIndexes();
>>   This takes a lot of time, which is not acceptable.
> I don't know what you mean by a lot of time, but
> you can create hundreds of thousands of annotations
> like this per second on a standard windows machine.
> You can easily verify this by running this code in
> isolation (with mock data).
>
> You're more likely seeing per document overhead.
> For example, resetting the CAS after processing
> a document is not so cheap.  However, I still don't
> know why things are so slow for you.  For example,
> I ran the following experiment.  I installed the
> Whitespace Tokenizer pear file into c:\tmp and ran
> it 10000 times on its own descriptor.  That creates
> approx 10Mio annotations.  On my 18 months old Xeon
> this ran in about 4 seconds.  Code and output is
> below, for you to recreate.  So I'm not sure you have
> correctly identified your bottleneck.
>  public static void main(String[] args) {
>    try {
>      System.out.println("Starting setup.");
>      XMLParser parser = UIMAFramework.getXMLParser();
>      ResourceSpecifier spec = parser.parseResourceSpecifier(
>          new XMLInputSource(new File(
>              "c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>      AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
>      String text = FileUtils.file2String(new File(
>          "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>      CAS cas = ae.newCAS();
>      System.out.println("Setup done, starting processing.");
>      final int max = 10000;
>      long time = System.currentTimeMillis();
>      for (int i = 0; i < max; i++) {
>        cas.reset();
>        cas.setDocumentText(text);
>        ae.process(cas);
>        if (cas.getAnnotationIndex().size() != 1080) {
>          // There are 1080 annotations created for each run
>          System.out.println("Processing error.");
>        }
>      }
>      time = System.currentTimeMillis() - time;
>      System.out.println("Time for processing " + max + " documents, "
>          + max * 1080 + " annotations: " + new TimeSpan(time));
>    } catch (Exception e) {
>      e.printStackTrace();
>    }
>  }
> Output on my machine:
> Starting setup.
> Setup done, starting processing.
> Time for processing 10000 documents, 10800000 annotations: 4.078 sec
> --Thilo
>> Regards
>> Rohan
>> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
>> Frank.LeHouillier@gd-ais.com> wrote:
>>  Just to clarify, what do you mean by "annotation"?  Is there a specific
>>> Analysis Engine that you are using? What is a "record"? Is this a
>>> document?  It would actually be surprising for many applications if
>>> annotation were not the bottleneck, given that some annotation processes
>>> are quite expensive, but this doesn't seem like what you mean here. I
>>> can't tell from your question whether it is the process that determines
>>> the annotations that is a burden or the actual adding of the annotations
>>> to the cas.
>>> -----Original Message-----
>>> From: rohan rai [mailto:hirohanin@gmail.com]
>>> Sent: Thursday, June 26, 2008 7:36 AM
>>> To: uima-user@incubator.apache.org
>>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>> When I profile a UIMA application, what I see is that annotation takes
>>> a lot of time. Profiling shows that annotating 1 record takes around
>>> 0.06 seconds. Now you may say that's good, but then scale up: it does
>>> not scale up linearly. Here is a rough estimate from experiments done:
>>> 6000 records take 6 min to annotate, and 800000 records take around
>>> 10 hrs to annotate. Which is bad.
>>> One thing is that I am treating each record individually as a CAS. Even
>>> if I treat all the records as a single CAS it takes around 6-7 hrs,
>>> which is still not good in terms of speed.
>>> Is there a way out?
>>> Can I improve performance by any means?
>>> Regards
>>> Rohan
