Rohan, I was not asking about scalability at all, but about the way you built
the job file. I have found the answer to my problem in the meantime: the
procedure you described on the Wiki page is valid in distributed mode only
(pseudo or real); I was trying it in standalone mode. I will update the Wiki
page.

J.

2008/8/19 rohan rai

> Hey Julien
>
> There are two aspects to making UIMA work with Hadoop.
>
> First, make it run... somehow run it on a small amount of data as a proof
> of concept...
>
> And then worry about the scalability.
>
> Have you gone through the link
> http://cwiki.apache.org/confluence/display/UIMA/Running+UIMA+Apps+on+Hadoop
> or
> http://rohanrai.blogspot.com/2008/06/uima-hadoop.html
> When you have understood what is going on over there, you should look at
> this thread, which specifically talks about the scalability issues.
>
> Feel free to query more if you are still unable to make progress.
>
> Regards
> Rohan
>
> On Tue, Aug 19, 2008 at 3:49 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> Hi Rohan,
>>
>> I saw that thread on the uima list and am in a similar situation. Would
>> you mind telling me how you built the job file? I have one which contains
>> all my libs and xml configuration files, but it does not get automatically
>> extracted and I can't access my files using the ClassLoader.
>>
>> Do you use conf.setJar() at all?
>>
>> Thanks
>>
>> Julien
>>
>> 2008/6/30 rohan rai
>>
>>> Sorry for misleading you guys by keeping a few facts to myself.
>>> Let me elaborate and tell you the actual problem and the solution I
>>> found.
>>>
>>> Actually, I am running my UIMA app over Hadoop. There I encountered a
>>> big problem, which I had asked about on this forum before. Then I found
>>> the solution, which later got posted over here:
>>> http://cwiki.apache.org/UIMA/running-uima-apps-on-hadoop.html
>>> This solved one set of problems, but it started to give performance
>>> issues. Instead of speeding up and scaling up, I started facing two
>>> sets of problems because of the solution mentioned in the wiki.
>>>
>>> Problem 1) Out of memory error
>>>
>>> The solution talks about using
>>>
>>> XMLInputSource in = new
>>> XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
>>>
>>> to load the XML descriptors, and about using a resource manager to do so.
>>>
>>> But if this activity is carried out inside the Map/Reduce class for
>>> every call, then eventually one gets an out of memory error in spite of
>>> increasing the heap size considerably.
>>>
>>> The solution is to initialize the Analysis Engine etc. in the
>>> configure(JobConf) method of the Mapper/Reducer class, so as to create a
>>> single instance of it in each Hadoop task. One can even reuse the CAS by
>>> calling its cas.reset() method.
>>>
>>> This way, the problem of out of memory was solved.
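>>>
>>> A minimal sketch of that setup, assuming the old org.apache.hadoop.mapred
>>> API (JobConf-based); the descriptor name "MyAE.xml" and the key/value
>>> types are only placeholders:
>>>
>>> import java.io.IOException;
>>>
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.JobConf;
>>> import org.apache.hadoop.mapred.MapReduceBase;
>>> import org.apache.hadoop.mapred.Mapper;
>>> import org.apache.hadoop.mapred.OutputCollector;
>>> import org.apache.hadoop.mapred.Reporter;
>>> import org.apache.uima.UIMAFramework;
>>> import org.apache.uima.analysis_engine.AnalysisEngine;
>>> import org.apache.uima.cas.CAS;
>>> import org.apache.uima.resource.ResourceSpecifier;
>>> import org.apache.uima.util.XMLInputSource;
>>>
>>> public class UimaMapper extends MapReduceBase
>>>     implements Mapper<LongWritable, Text, Text, Text> {
>>>
>>>   private AnalysisEngine ae;
>>>   private CAS cas;
>>>
>>>   // Called once per task: build the engine and a single reusable CAS
>>>   // here, not in map().
>>>   public void configure(JobConf job) {
>>>     try {
>>>       XMLInputSource in = new XMLInputSource(
>>>           ClassLoader.getSystemResourceAsStream("MyAE.xml"), null);
>>>       ResourceSpecifier spec =
>>>           UIMAFramework.getXMLParser().parseResourceSpecifier(in);
>>>       ae = UIMAFramework.produceAnalysisEngine(spec);
>>>       cas = ae.newCAS();
>>>     } catch (Exception e) {
>>>       throw new RuntimeException(e);
>>>     }
>>>   }
>>>
>>>   public void map(LongWritable key, Text value,
>>>       OutputCollector<Text, Text> output, Reporter reporter)
>>>       throws IOException {
>>>     try {
>>>       cas.reset();                      // reuse the same CAS for every record
>>>       cas.setDocumentText(value.toString());
>>>       ae.process(cas);
>>>       // ... read the annotations you need from the CAS and output.collect(...) them
>>>     } catch (Exception e) {
>>>       throw new IOException(e.toString());
>>>     }
>>>   }
>>>
>>>   public void close() throws IOException {
>>>     if (ae != null) {
>>>       ae.destroy();                     // release the engine when the task ends
>>>     }
>>>   }
>>> }
>>>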
>>> Now I started facing another problem, this time regarding performance.
>>> The source of it was the usage of the Resource Manager mentioned in the
>>> wiki to solve the other problem.
>>>
>>> It was caused by each class mentioned in the descriptor being brought
>>> from the job temp directory to the task temp directory.
>>>
>>> So the problem became how to achieve what the wiki entry was written for
>>> without using the Resource Manager.
>>>
>>> The solution is to fake imports (yeah, it is ironic that faking proved
>>> to be useful :)). What we can do is, in the class file where the
>>> Map/Reduce task is implemented, import all the classes required by the
>>> descriptor that is initialized in that class.
>>>
>>> This ensures the presence of these classes in each individual task, and
>>> thus gives a considerable increase in performance.
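>>>
>>> For example, the only change to the UimaMapper sketched above would be a
>>> few extra imports at the top of its source file (the class names here
>>> are purely hypothetical; list whatever annotators and JCas types your
>>> own descriptor actually declares):
>>>
>>> // "Fake" imports, as described above: the mapper never calls these
>>> // classes directly; they are imported only so that every class the
>>> // descriptor needs is present alongside the task code.
>>> import com.example.uima.MySentenceAnnotator;   // hypothetical annotator
>>> import com.example.uima.MyTokenAnnotator;      // hypothetical annotator
>>> import com.example.uima.types.MyRecord;        // hypothetical JCas type
>>>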
>>> Keeping the points mentioned above in mind, I was now able to use the
>>> beauty of UIMA and Hadoop together to my own benefit.
>>>
>>> Regards
>>> Rohan
>>>
>>> On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz wrote:
>>>
>>> > rohan rai wrote:
>>> >
>>> >> @Pascal: As I have already said, the timing does not scale linearly.
>>> >> Secondly, those are approximate times which I have specified.
>>> >> @Frank:
>>> >> I was talking about the actual adding of annotations to the CAS.
>>> >> A record refers to, let's say, content in tags like these .....
>>> >> and the document consists of such records.
>>> >> Annotation is done via this method:
>>> >> MyType annotation = new MyType(jCas);
>>> >> annotation.setBegin(start);
>>> >> annotation.setEnd(end);
>>> >> annotation.addToIndexes();
>>> >> This takes a lot of time, which is not likeable.
>>> >>
>>> >
>>> > I don't know what you mean by a lot of time, but
>>> > you can create hundreds of thousands of annotations
>>> > like this per second on a standard windows machine.
>>> > You can easily verify this by running this code in
>>> > isolation (with mock data).
>>> >
>>> > You're more likely seeing per document overhead.
>>> > For example, resetting the CAS after processing
>>> > a document is not so cheap. However, I still don't
>>> > know why things are so slow for you. For example,
>>> > I ran the following experiment. I installed the
>>> > Whitespace Tokenizer pear file into c:\tmp and ran
>>> > it 10000 times on its own descriptor. That creates
>>> > approx 10 million annotations. On my 18 months old Xeon
>>> > this ran in about 4 seconds. Code and output are
>>> > below, for you to recreate. So I'm not sure you have
>>> > correctly identified your bottleneck.
>>> >
>>> > public static void main(String[] args) {
>>> >   try {
>>> >     System.out.println("Starting setup.");
>>> >     XMLParser parser = UIMAFramework.getXMLParser();
>>> >     ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(new File(
>>> >         "c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>>> >     AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
>>> >     String text = FileUtils.file2String(new File(
>>> >         "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>>> >     CAS cas = ae.newCAS();
>>> >     System.out.println("Setup done, starting processing.");
>>> >     final int max = 10000;
>>> >     long time = System.currentTimeMillis();
>>> >     for (int i = 0; i < max; i++) {
>>> >       cas.reset();
>>> >       cas.setDocumentText(text);
>>> >       ae.process(cas);
>>> >       if (cas.getAnnotationIndex().size() != 1080) {
>>> >         // There are 1080 annotations created for each run
>>> >         System.out.println("Processing error.");
>>> >       }
>>> >     }
>>> >     time = System.currentTimeMillis() - time;
>>> >     System.out.println("Time for processing " + max + " documents, " + max * 1080
>>> >         + " annotations: " + new TimeSpan(time));
>>> >   } catch (Exception e) {
>>> >     e.printStackTrace();
>>> >   }
>>> > }
>>> >
>>> > Output on my machine:
>>> >
>>> > Starting setup.
>>> > Setup done, starting processing.
>>> > Time for processing 10000 documents, 10800000 annotations: 4.078 sec
>>> >
>>> > --Thilo
>>> >
>>> >
>>> >> Regards
>>> >> Rohan
>>> >>
>>> >> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
>>> >> Frank.LeHouillier@gd-ais.com> wrote:
>>> >>
>>> >>> Just to clarify, what do you mean by "annotation"? Is there a specific
>>> >>> Analysis Engine that you are using? What is a "record"? Is this a
>>> >>> document? It would actually be surprising for many applications if
>>> >>> annotation were not the bottleneck, given that some annotation processes
>>> >>> are quite expensive, but this doesn't seem like what you mean here. I
>>> >>> can't tell from your question whether it is the process that determines
>>> >>> the annotations that is a burden, or the actual adding of the annotations
>>> >>> to the CAS.
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: rohan rai [mailto:hirohanin@gmail.com]
>>> >>> Sent: Thursday, June 26, 2008 7:36 AM
>>> >>> To: uima-user@incubator.apache.org
>>> >>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>> >>>
>>> >>> When I profile a UIMA application, what I see is that annotation takes
>>> >>> a lot of time. Profiling shows that annotating 1 record takes around
>>> >>> 0.06 seconds. Now you may say that's good. Now scale up. Although it
>>> >>> does not scale up linearly, here is a rough estimate from experiments
>>> >>> done: 6000 records take 6 min to annotate; 800000 records take around
>>> >>> 10 hrs to annotate. Which is bad.
>>> >>> One thing is that I am treating each record individually as a CAS.
>>> >>> Even if I treat all the records as a single CAS it takes around
>>> >>> 6-7 hrs, which is still not good in terms of speed.
>>> >>>
>>> >>> Is there a way out?
>>> >>> Can I improve performance by any means?
>>> >>>
>>> >>> Regards
>>> >>> Rohan
>>> >>>
>>> >>
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>

--
DigitalPebble Ltd
http://www.digitalpebble.com