From: "Debbie Zhang"
Subject: RE: Read file name in an annotator
Date: Mon, 14 Jul 2014 19:03:08 +1000

Hi Ravi,

Thank you very much for the sample code. However, in my case, the PEAR file will be deployed to a different system. Therefore, I have no access to "file.getAbsoluteFile().toURL().toString()".

I searched the uima-user mailing list archives and found an old post which was sent by Marshall Schor last year:
http://mail-archives.apache.org/mod_mbox/uima-user/201205.mbox/%3C4FA095E1.8070102@schor.com%3E

Within this post, cTAKES was suggested. I downloaded cTAKES and tried to use org.apache.ctakes.typesystem.type.structured.DocumentID, which is defined by cTAKES. However, I can't get it working.

typeSystemDescriptor.xml:

    [XML markup lost in the archive; element values only]
    typeSystemDescriptor
    1.0
    uima.TestThirdPartyLib
    uima.tcas.Annotation

TestThirdPartyLib.xml (my annotator, which uses org.apache.ctakes.typesystem.type.structured.DocumentID as input):

    [XML markup lost in the archive; element values only]
    org.apache.uima.java
    true
    annotators.TestThirdPartyLibDescriptor
    TestThirdPartyLibDescriptor
    1.0
    uima.TestThirdPartyLib
    uima.tcas.Annotation
    org.apache.ctakes.typesystem.type.structured.DocumentID
    uima.TestThirdPartyLib
    true
    true
    false

TestThirdPartyLibDescriptor.java:

    package annotators;

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import uima.TestThirdPartyLib;

    import org.apache.ctakes.typesystem.type.structured.DocumentID;
    import org.apache.uima.UimaContext;
    import org.apache.uima.analysis_component.AnalysisComponent;
    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.JFSIndexRepository;
    import org.apache.uima.jcas.cas.TOP;
    import org.apache.uima.resource.ResourceInitializationException;

    /**
     * Test annotation
     */
    public class TestThirdPartyLibDescriptor extends JCasAnnotator_ImplBase {

      /**
       * @see AnalysisComponent#initialize(UimaContext)
       */
      public void initialize(UimaContext aContext) throws ResourceInitializationException {
        super.initialize(aContext);
      }

      /**
       * @see JCasAnnotator_ImplBase#process(JCas)
       */
      public void process(JCas aJCas) {
        String docText = aJCas.getDocumentText();
        test(aJCas);

        System.out.println("Say something");
      }

      private void test(JCas aJCas) {
        //System.out.println("Full text:*"+aJCas.getDocumentText()+"*");

        JFSIndexRepository indexes = aJCas.getJFSIndexRepository();
        FSIterator documentIDIterator = indexes.getAllIndexedFS(DocumentID.type);
        while (documentIDIterator.isValid()) {
          DocumentID documentIDAnnotation = (DocumentID) documentIDIterator.next();
          String documentID = documentIDAnnotation.getDocumentID();
          System.out.println("DocumentID: "+documentID);
        }

        //create an annotation
        TestThirdPartyLib annotation = new TestThirdPartyLib(aJCas);
        //annotation.setBegin(la.begin());
        //annotation.setEnd(la.end());
        annotation.addToIndexes();
      }
    }

TestMain.java:

    import uima.*;

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.cas.FSIndex;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    import java.io.File;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.regex.Pattern;

    public class TestMain {

      static String readFile(File infile) throws Exception {
        //read file
        BufferedReader reader = new BufferedReader(new FileReader(infile));
        StringBuffer fileData = new StringBuffer();
        char[] buf = new char[1024];
        int numRead=0;
        while((numRead=reader.read(buf)) != -1){
          String readData = String.valueOf(buf, 0, numRead);
          fileData.append(readData);
        }
        reader.close();

        return fileData.toString();
      }

      public static void main(String[] args) throws Exception {
        try {
          System.out.println("Say something");
          File aeFile = new File("desc/TestThirdPartyLibDescriptor.xml");
          XMLInputSource in = new XMLInputSource(aeFile);
          ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
          AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
          JCas jcas = ae.newJCas();

          File inputFileFolder = new File("data");
          int count = 0;
          for (final File fileEntry : inputFileFolder.listFiles()) {
            if (fileEntry.isDirectory()) {
              continue;
            } else
            {
              //if (fileEntry.getName().indexOf(filename)!=-1)
              {
                //System.out.println(count+": "+fileEntry.getName());
                String filecontent = TestMain.readFile(fileEntry);

                //analyze a document
                jcas.setDocumentText(filecontent);
                ae.process(jcas);

                jcas.reset();

                count += 1;
                //break;
              }
            }
          }
        } catch(Exception e) {
          e.printStackTrace();
        }
      }
    }

It seems silly to use cTAKES just to get the file name. However, I really need the file name, as it is the only way to identify a file. Can anyone tell me what I did wrong that keeps org.apache.ctakes.typesystem.type.structured.DocumentID from working?

Any help or suggestions will be greatly appreciated! Thank you!

Regards,

Debbie Zhang

> -----Original Message-----
> From: Ravindra [mailto:ravindra.bajpai@gmail.com]
> Sent: Thursday, 10 July 2014 9:39 PM
> To: user@uima.apache.org
> Cc: thomas.ginter@utah.edu
> Subject: Re: Read file name in an annotator
>
> May this help -
>
>     // Also store location of source document in CAS. This information is critical
>     // if CAS Consumers will need to know where the original document contents are located.
>     // For example, the Semantic Search CAS Indexer writes this information into the
>     // search index that it creates, which allows applications that use the search index to
>     // locate the documents that satisfy their semantic queries.
>     SourceDocumentInformation srcDocInfo = new SourceDocumentInformation(jcas);
>     srcDocInfo.setUri(file.getAbsoluteFile().toURL().toString());
>     srcDocInfo.setOffsetInSource(0);
>     srcDocInfo.setDocumentSize((int) file.length());
>     srcDocInfo.setLastSegment(mCurrentIndex == mFiles.size());
>     srcDocInfo.addToIndexes();
>
> followed by
>
>     // retrieve the filename of the input file from the CAS
>     FSIterator it = jcas.getAnnotationIndex(SourceDocumentInformation.type).iterator();
>     File outFile = null;
>     if (it.hasNext()) {
>       SourceDocumentInformation fileLoc = (SourceDocumentInformation) it.next();
>       File inFile;
>       try {
>         inFile = new File(new URL(fileLoc.getUri()).getPath());
>         String outFileName = inFile.getName();
>         if (fileLoc.getOffsetInSource() > 0) {
>           outFileName += ("_" + fileLoc.getOffsetInSource());
>         }
>         outFileName += ".xmi";
>         outFile = new File(mOutputDir, outFileName);
>         modelFileName = mOutputDir.getAbsolutePath() + "/" + inFile.getName() + ".ecore";
>       } catch (MalformedURLException e1) {
>         // invalid URL, use default processing below
>       }
>     }
>
> look for SourceDocumentInformation in the examples
>
> --
> Ravi.
> *''We do not inherit the earth from our ancestors, we borrow it from our children.'' PROTECT IT !*
>
> On Thu, Jul 10, 2014 at 4:49 PM, Debbie Zhang wrote:
>
> > Thanks Thomas. May I ask if there is any sample code of UIMA readers
> > that can provide file name information for developing annotation? I
> > was looking on the internet today, but couldn't find one. Thanks again
> > for your help - much appreciated!
> >
> > Regards,
> >
> > Debbie Zhang
> >
> > > -----Original Message-----
> > > From: Thomas Ginter [mailto:thomas.ginter@utah.edu]
> > > Sent: Thursday, 10 July 2014 5:00 AM
> > > To: user@uima.apache.org
> > > Subject: Re: Read file name in an annotator
> > >
> > > Hi Debbie,
> > >
> > > The file name is not provided by default in UIMA although I believe
> > > the UIMA FileReader does populate a SourceDocumentInformation
> > > annotation with this information. Our group has a set of readers
> > > that populate our own annotation type to provide location data and
> > > other meta-information for each record (CAS) being processed. In
> > > short you will be better off writing your reader to provide that
> > > information for you.
> > >
> > > Thanks,
> > >
> > > Thomas Ginter
> > > 801-448-7676
> > > thomas.ginter@utah.edu
> > >
> > > On Jul 9, 2014, at 5:41, Debbie Zhang wrote:
> > >
> > > > Hi,
> > > >
> > > > Can anyone tell me how to read the file name in an annotator using
> > > > the JCas? It seems the DocumentAnnotation doesn't contain file name.
> > > > Thank you!
> > > >
> > > > Best regards,
> > > >
> > > > Debbie Zhang
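
In Ravi's quoted snippet, the consumer reads the URI back with fileLoc.getUri() and derives a name from it, so the reading side appears to need only that string and not the original File object. The following is a minimal sketch of that name-derivation step in plain Java with no UIMA dependency. The class SourceFileNames, its two methods, and the sample URI are hypothetical names chosen for illustration; the "_offset" and ".xmi" rule is copied from the quoted code, and java.net.URI is swapped in for the java.net.URL call used there.

```java
import java.net.URI;

/** Hypothetical helper, not part of UIMA: derives file names from a source URI string. */
final class SourceFileNames {

    private SourceFileNames() {
    }

    /** Returns the last path segment of the URI, for example the file name of a file: URI. */
    static String fileNameFromUri(String uri) {
        URI parsed = URI.create(uri);
        // getPath() is null for opaque URIs such as "file:report.txt";
        // in that case fall back to the scheme-specific part.
        String path = parsed.getPath() != null ? parsed.getPath() : parsed.getSchemeSpecificPart();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Mirrors the quoted naming rule: name, then "_offset" when offset > 0, then ".xmi". */
    static String outputFileName(String uri, int offsetInSource) {
        StringBuilder name = new StringBuilder(fileNameFromUri(uri));
        if (offsetInSource > 0) {
            name.append('_').append(offsetInSource);
        }
        return name.append(".xmi").toString();
    }

    public static void main(String[] args) {
        String uri = "file:/data/in/report.txt"; // example value only
        System.out.println(fileNameFromUri(uri));
        System.out.println(outputFileName(uri, 0));
        System.out.println(outputFileName(uri, 128));
    }
}
```

In an annotator, the string passed to fileNameFromUri would presumably be whatever the reader or driver stored through setUri, so this only helps if some upstream component actually adds that information to the CAS.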