uima-user mailing list archives

From "Julien Nioche" <lists.digitalpeb...@gmail.com>
Subject Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed
Date Tue, 19 Aug 2008 14:56:47 GMT
Rohan,

I was not asking about scalability at all, but about the way you built the
job file. I have found the answer to my problem in the meantime: the
procedure you described on the Wiki page is valid in distributed mode only
(pseudo or real); I was trying it in standalone mode. I will update the
Wiki page.
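
For the record, a rough sketch of the job file layout I am using (the names
are illustrative, and this is only my reading of the wiki procedure): a
plain jar submitted via "hadoop jar", with dependency jars under lib/ and
the UIMA descriptors on the classpath. In standalone mode this jar is never
unpacked for the task, which as far as I can tell is why the ClassLoader
lookups fail there.

myapp.job
|-- com/mycompany/...        Map/Reduce classes and UIMA annotators
|-- lib/                     dependency jars (uima-core.jar, ...)
`-- desc/MyDescriptor.xml    UIMA descriptor, loaded via the ClassLoader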

J.

2008/8/19 rohan rai <hirohanin@gmail.com>

> Hey Julien
>
>  There are two aspects of making UIMA work with Hadoop:
>
> First, to make it run: somehow, on a small data set, as a proof of
> concept...
>
> And then to worry about scalability.
>
> Have you gone through the link
> http://cwiki.apache.org/confluence/display/UIMA/Running+UIMA+Apps+on+Hadoop
> or
> http://rohanrai.blogspot.com/2008/06/uima-hadoop.html
> Once you have understood what is going on there, you should look at
> this thread, which specifically talks about scalability issues.
>
> Feel free to ask more if you are still unable to make progress.
> Regards
> Rohan
>
>
> On Tue, Aug 19, 2008 at 3:49 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> Hi Rohan,
>>
>> I saw that thread on the uima list and am in a similar situation. Would
>> you mind telling me how you built the job file? I have one which contains
>> all my libs and XML configuration files, but it does not get automatically
>> extracted, and I can't access my files using the ClassLoader.
>>
>> Do you use conf.setJar() at all?
>>
>> Thanks
>>
>> Julien
>>
>>
>> 2008/6/30 rohan rai <hirohanin@gmail.com>
>>
>> Sorry for misleading you guys by keeping a few facts to myself.
>>> Let me elaborate and explain the actual problem and the solution I found.
>>>
>>> Actually, I am running my UIMA app over Hadoop. There I encountered a
>>> big problem, which I had asked about in this forum before. Then I found
>>> the solution, which later got posted here:
>>> http://cwiki.apache.org/UIMA/running-uima-apps-on-hadoop.html
>>> This solved one set of problems, but it introduced performance issues.
>>> Instead of speeding up and scaling up, I started facing two problems
>>> because of the solution mentioned in the wiki.
>>>
>>> Problem 1) Out of memory error
>>>
>>> The wiki's solution talks about using
>>>
>>> XMLInputSource in = new XMLInputSource(
>>>     ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
>>>
>>> to load the XML descriptors, and using a resource manager to do so.
>>>
>>> But if this activity is carried out per record in the Map/Reduce class,
>>> then eventually one gets an out of memory error, in spite of increasing
>>> the heap size considerably.
>>>
>>> The solution is to initialize the analysis engine etc. in the
>>> configure(JobConf) method of the Mapper/Reducer class, so as to create
>>> a single instance of it per Hadoop task. One can even reuse the CAS
>>> across records via its cas.reset() method.
>>>
>>> This way, the out of memory problem was solved.
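>>>
>>> A minimal sketch of what I mean (assuming the old
>>> org.apache.hadoop.mapred API; the class name, descriptor name and
>>> output types are illustrative, not something from the wiki):
>>>
>>> import java.io.IOException;
>>>
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.JobConf;
>>> import org.apache.hadoop.mapred.MapReduceBase;
>>> import org.apache.hadoop.mapred.Mapper;
>>> import org.apache.hadoop.mapred.OutputCollector;
>>> import org.apache.hadoop.mapred.Reporter;
>>> import org.apache.uima.UIMAFramework;
>>> import org.apache.uima.analysis_engine.AnalysisEngine;
>>> import org.apache.uima.cas.CAS;
>>> import org.apache.uima.resource.ResourceSpecifier;
>>> import org.apache.uima.util.XMLInputSource;
>>>
>>> public class UimaMapper extends MapReduceBase
>>>     implements Mapper<LongWritable, Text, Text, Text> {
>>>
>>>   private AnalysisEngine ae;
>>>   private CAS cas;
>>>
>>>   // Runs once per Hadoop task: build the engine and the CAS a single time.
>>>   public void configure(JobConf job) {
>>>     try {
>>>       XMLInputSource in = new XMLInputSource(
>>>           ClassLoader.getSystemResourceAsStream("MyDescriptor.xml"), null);
>>>       ResourceSpecifier spec =
>>>           UIMAFramework.getXMLParser().parseResourceSpecifier(in);
>>>       ae = UIMAFramework.produceAnalysisEngine(spec);
>>>       cas = ae.newCAS();
>>>     } catch (Exception e) {
>>>       throw new RuntimeException("UIMA setup failed", e);
>>>     }
>>>   }
>>>
>>>   // Runs once per record: reset and reuse the same CAS instead of
>>>   // creating a new one each time.
>>>   public void map(LongWritable key, Text value,
>>>       OutputCollector<Text, Text> out, Reporter reporter)
>>>       throws IOException {
>>>     try {
>>>       cas.reset();
>>>       cas.setDocumentText(value.toString());
>>>       ae.process(cas);
>>>       // ... read annotations from the CAS and collect them here ...
>>>     } catch (Exception e) {
>>>       throw new IOException(e.getMessage());
>>>     }
>>>   }
>>> }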
>>>
>>> Then I started facing another problem, this time with performance, the
>>> source of which was the usage of the resource manager mentioned in the
>>> wiki to solve another problem.
>>>
>>> It was caused by each class mentioned in the descriptor being brought
>>> from the job temp directory to the task temp directory.
>>>
>>> So the problem became: how to achieve what the wiki entry was written
>>> for, without using the resource manager.
>>>
>>> The solution is to fake imports (yes indeed, ironic that faking proved
>>> to be useful :)). In the class file where the Map/Reduce task is
>>> implemented, we import all the classes required by the descriptor that
>>> is initialized in that class.
>>>
>>> This ensures the presence of these classes in each individual task, and
>>> thus gives a considerable increase in performance. A rough sketch of the
>>> idea is below.
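>>>
>>> Something like this (the annotator names are hypothetical; the unused
>>> references exist only to pull the classes into the task):
>>>
>>> // In the same source file as the Mapper/Reducer implementation.
>>> // These imports and references are never "used"; they only ensure the
>>> // annotator classes travel on the task classpath instead of being
>>> // copied around through the resource manager at run time.
>>> import com.example.annotators.MyTokenizerAnnotator;
>>> import com.example.annotators.MyEntityAnnotator;
>>>
>>> public class UimaMapper /* ... as in the earlier sketch ... */ {
>>>
>>>   @SuppressWarnings("unused")
>>>   private static final Class<?>[] DESCRIPTOR_CLASSES = {
>>>       MyTokenizerAnnotator.class,
>>>       MyEntityAnnotator.class
>>>   };
>>>
>>>   // ... configure() and map() as before ...
>>> }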
>>>
>>> Keeping the points mentioned above in mind, I was now able to use the
>>> beauty of UIMA and Hadoop together to my own benefit.
>>>
>>> Regards
>>> Rohan
>>>
>>>
>>>
>>>
>>> On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz <twgoetz@gmx.de> wrote:
>>>
>>> > rohan rai wrote:
>>> >
>>> >> @Pascal: As I have already said, the timing does not scale linearly.
>>> >>    Secondly, those are the approximate times which I have specified.
>>> >> @Frank:
>>> >>    I was talking about the actual adding of annotations to the CAS.
>>> >>    A record refers to, let's say, content in tags like <a>.....</a>,
>>> >>    and the document consists of such records.
>>> >>    Annotation is done via this code:
>>> >>        MyType annotation = new MyType(jCas);
>>> >>        annotation.setBegin(start);
>>> >>        annotation.setEnd(end);
>>> >>        annotation.addToIndexes();
>>> >>    This takes a lot of time, which is not acceptable.
>>> >>
>>> >
>>> > I don't know what you mean by a lot of time, but
>>> > you can create hundreds of thousands of annotations
>>> > like this per second on a standard windows machine.
>>> > You can easily verify this by running this code in
>>> > isolation (with mock data).
>>> >
>>> > You're more likely seeing per-document overhead.
>>> > For example, resetting the CAS after processing
>>> > a document is not so cheap.  However, I still don't
>>> > know why things are so slow for you.  For example,
>>> > I ran the following experiment.  I installed the
>>> > Whitespace Tokenizer pear file into c:\tmp and ran
>>> > it 10000 times on its own descriptor.  That creates
>>> > approx. 10 million annotations.  On my 18-month-old Xeon
>>> > this ran in about 4 seconds.  Code and output are
>>> > below, for you to recreate.  So I'm not sure you have
>>> > correctly identified your bottleneck.
>>> >
>>> >  // Needs: java.io.File, org.apache.uima.UIMAFramework,
>>> >  // org.apache.uima.analysis_engine.AnalysisEngine, org.apache.uima.cas.CAS,
>>> >  // org.apache.uima.resource.ResourceSpecifier, plus XMLParser,
>>> >  // XMLInputSource, FileUtils and TimeSpan from org.apache.uima.util.
>>> >  public static void main(String[] args) {
>>> >    try {
>>> >      System.out.println("Starting setup.");
>>> >      XMLParser parser = UIMAFramework.getXMLParser();
>>> >      ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(
>>> >          new File("c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>>> >      AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
>>> >      String text = FileUtils.file2String(
>>> >          new File("c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>>> >      CAS cas = ae.newCAS();
>>> >      System.out.println("Setup done, starting processing.");
>>> >      final int max = 10000;
>>> >      long time = System.currentTimeMillis();
>>> >      for (int i = 0; i < max; i++) {
>>> >        cas.reset();
>>> >        cas.setDocumentText(text);
>>> >        ae.process(cas);
>>> >        if (cas.getAnnotationIndex().size() != 1080) {
>>> >          // There are 1080 annotations created for each run
>>> >          System.out.println("Processing error.");
>>> >        }
>>> >      }
>>> >      time = System.currentTimeMillis() - time;
>>> >      System.out.println("Time for processing " + max + " documents, "
>>> >          + max * 1080 + " annotations: " + new TimeSpan(time));
>>> >    } catch (Exception e) {
>>> >      e.printStackTrace();
>>> >    }
>>> >  }
>>> >
>>> > Output on my machine:
>>> >
>>> > Starting setup.
>>> > Setup done, starting processing.
>>> > Time for processing 10000 documents, 10800000 annotations: 4.078 sec
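>>> >
>>> > That works out to roughly 10,800,000 annotations / 4.078 s, i.e.
>>> > about 2.6 million annotations per second.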
>>> >
>>> > --Thilo
>>> >
>>> >
>>> >
>>> >
>>> >> Regards
>>> >> Rohan
>>> >>
>>> >>
>>> >> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
>>> >> Frank.LeHouillier@gd-ais.com> wrote:
>>> >>
>>> >>> Just to clarify, what do you mean by "annotation"?  Is there a
>>> >>> specific Analysis Engine that you are using?  What is a "record"?
>>> >>> Is this a document?  It would actually be surprising for many
>>> >>> applications if annotation were not the bottleneck, given that some
>>> >>> annotation processes are quite expensive, but this doesn't seem like
>>> >>> what you mean here.  I can't tell from your question whether it is
>>> >>> the process that determines the annotations that is a burden, or the
>>> >>> actual adding of the annotations to the CAS.
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: rohan rai [mailto:hirohanin@gmail.com]
>>> >>> Sent: Thursday, June 26, 2008 7:36 AM
>>> >>> To: uima-user@incubator.apache.org
>>> >>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>> >>>
>>> >>> When I profile a UIMA application, what I see is that annotation
>>> >>> takes a lot of time.  Profiling shows that annotating 1 record takes
>>> >>> around 0.06 seconds.  Now you may say that's good; now scale up.  It
>>> >>> does not scale up linearly, but here is a rough estimate from the
>>> >>> experiments done: 6000 records take 6 min to annotate; 800000 records
>>> >>> take around 10 hrs to annotate, which is bad.
>>> >>> One thing is that I am treating each record individually as a CAS.
>>> >>> Even if I treat all the records as a single CAS, it takes around
>>> >>> 6-7 hrs, which is still not good in terms of speed.
>>> >>>
>>> >>> Is there a way out?
>>> >>> Can I improve performance by any means??
>>> >>>
>>> >>> Regards
>>> >>> Rohan
>>> >>>
>>> >>>
>>> >>
>>>
>>
>>
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com
