Subject: Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?
From: Ted Pedersen
Reply-To: tpederse@d.umn.edu
To: user@uima.apache.org
Date: Mon, 27 Dec 2010 10:10:28 -0600

BTW, one potential consideration here is that in addition to providing a dictionary of terms (as Dictionary Annotator and Concept Mapper seem to support), I'm also interested in providing regular expressions that can be matched in my text. So I will have entities I want to identify that might occur in a dictionary, or might be defined by a regular expression. I guess this must be pretty common, but I'm wondering whether either Dictionary Annotator or Concept Mapper integrates better with the Regular Expression Annotator? In case I'm not being clear about what I'm referring to:

Regular Expression Annotator
http://uima.apache.org/downloads/sandbox/RegexAnnotatorUserGuide/RegexAnnotatorUserGuide.html#sandbox.regexAnnotator.conceptsFile.concepts

Dictionary Annotator
http://uima.apache.org/downloads/sandbox/DictionaryAnnotatorUserGuide/DictionaryAnnotatorUserGuide.html

Concept Mapper
http://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html

Anyway, assuming that I specify entities using both regular expressions and dictionary entries, is there a preferred way to use and/or combine the above (or anything else)? The goal at this point is simply to identify those entities in the text for later downstream processing.
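
To make that a bit more concrete, here is roughly what I have in mind, sketched against the plain UIMA Java API. It's completely untested, the descriptor paths are just placeholders for wherever the sandbox components are installed, and I may well be holding the API wrong:

import java.util.Arrays;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.CasCreationUtils;
import org.apache.uima.util.XMLInputSource;

public class RegexPlusDictionarySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder descriptor paths: point these at the installed
        // RegexAnnotator and ConceptMapper (or DictionaryAnnotator) descriptors.
        AnalysisEngine regexAe = UIMAFramework.produceAnalysisEngine(
                UIMAFramework.getXMLParser().parseResourceSpecifier(
                        new XMLInputSource("desc/MyRegexAnnotator.xml")));
        AnalysisEngine dictAe = UIMAFramework.produceAnalysisEngine(
                UIMAFramework.getXMLParser().parseResourceSpecifier(
                        new XMLInputSource("desc/MyConceptMapper.xml")));

        // Create one CAS whose type system merges what both components declare.
        CAS cas = CasCreationUtils.createCas(Arrays.asList(
                regexAe.getAnalysisEngineMetaData(),
                dictAe.getAnalysisEngineMetaData()));

        cas.setDocumentText("Chapter 1: The Life of the Spleen, Wrap-Up");
        regexAe.process(cas);  // entities defined by regular expressions
        dictAe.process(cas);   // entities defined by dictionary entries
    }
}

I realize an aggregate descriptor with a fixed flow is probably the more idiomatic way to chain the two; the point is just that both annotators would mark up the same CAS, so downstream components would see the union of their annotations.
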
Thanks!
Ted

On Mon, Dec 27, 2010 at 9:59 AM, Ted Pedersen wrote:
> Thanks to Tommaso for a very interesting posting, and to Darren for the question that generated it.
>
> As a kind of follow-on question to one of the suggestions made by Tommaso...
>
> I'm particularly interested in the functionality provided by Concept Mapper, or maybe Dictionary Annotator (that is, the ability to create a dictionary and then recognize when a dictionary term occurs in my text). From reading over the documentation, it seems like Concept Mapper and Dictionary Annotator are fairly similar. To be honest I don't know much about UIMA yet, but I am trying to learn, so there may be subtleties here I don't see that would make one preferable to the other.
>
> Is there a short summary of the differences between Concept Mapper and Dictionary Annotator, and does anyone have strong feelings about when you should use one over the other?
>
> Cordially,
> Ted
>
> On Mon, Dec 27, 2010 at 2:45 AM, Tommaso Teofili wrote:
>> Hi Darren,
>>
>> 2010/12/23 Darren Cruse:
>>
>>> Hi guys, I apologize for a newbie question, but I'm quite new to UIMA and to the whole area of information extraction/entity extraction, and I'm hoping someone can tell me whether UIMA is the right tool for a project I've been working on (with other tools) and having trouble with.
>>>
>>> Basically the task is to extract metadata from HTML in the form of RDF, where the HTML represents books/articles/papers/etc. that typically have an "outline" or "table of contents", and part of the task involves extracting the entities "behind" (so to speak) the table of contents.
>>
>> This is perfectly aligned with UIMA's scope, as it deals with discovering hidden knowledge.
>>
>>> So e.g. if the "corpus" of HTML pages is from a book, and the book has Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6 sections, Section 1 has three parts, etc., then my resulting RDF has to model these things (entities/classes/whatever you'd call them) and understand the "hierarchy" of what contains what.
>>>
>>> The really challenging part is that it's a pretty large volume of material with many different books/articles/papers/etc., and there is a lot of variability, as each was authored by different people not following any particular template.
>>
>> On the "large volume of material" topic, I think UIMA-AS [1] can help you as you need to scale.
>>
>>> For example, what I called a "table of contents" is rarely a single page; more often it is spread across multiple "outline" pages, where e.g. a high-level table of contents page goes down to the level of chapter links. And then each chapter may have its own "outline" breaking down the sections within that chapter. Or it might not; different books can differ. For example, the pages making up a chapter may just have headings giving the titles/names of the sections, without being organized into a chapter "outline" at all. Yet I'm still responsible for identifying what the sections are.
>>>
>>> Somewhat helpful is that headings often indicate the kind of thing they are, e.g. "Section 3: The Life of the Spleen, Wrap-Up". Not always, though: I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section 3:" on the front).
>>>
>>> Or I may get both forms in different places in the book, where ideally I should relate the two references as being the same thing.
>>>
>>> And different places can refer to the same thing with other differences too. Possibly the case of the letters differs, or in this example there could be one heading with "Wrap-Up" and another with "Wrap Up" (one with the hyphen, the other without).
>>>
>>> As far as understanding the relationships between things, i.e. that Chapter 3 contains Sections 1 through 3 and Section 1 contains two "Parts": where the things do appear in a "table of contents" or "outline" page, it seems like the arrangement/formatting of those pages gives the clue as to "what contains what". That is, things "contained" typically follow what they're contained by, and are often indented (but not necessarily; it can just be that the "parent" is bolded, and they might not be indented beneath their "parent").
>>>
>>> Apologies for the long-winded description, but hopefully it will help to clarify my questions, since I'm new to UIMA:
>>>
>>> a. Does it sound like a "UIMA kind of problem"? :)
>>
>> I recently worked on a similar use case, and yes, I think this sounds like a UIMA kind of problem. My very abstract advice is to use a bottom-up approach, that is, recognize words, then sentences, then sections at first; after that you can "play" with sections and understand their relationships with chapters and so on.
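
If I'm following the bottom-up suggestion correctly, I imagine the lowest layer might be a simple annotator that just marks heading-like lines. Here's an untested sketch; "org.example.Heading" and its "label" feature are names I'd have to declare in my own type system, and the pattern is far too naive for real tables of contents:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class HeadingAnnotator extends CasAnnotator_ImplBase {
    // Naive heading pattern; real data would need many more rules or a dictionary.
    private static final Pattern HEADING =
            Pattern.compile("^(Volume|Chapter|Section|Part)\\s+\\d+.*$", Pattern.MULTILINE);

    @Override
    public void process(CAS cas) throws AnalysisEngineProcessException {
        Type headingType = cas.getTypeSystem().getType("org.example.Heading");
        Feature labelFeature = headingType.getFeatureByBaseName("label");

        Matcher m = HEADING.matcher(cas.getDocumentText());
        while (m.find()) {
            AnnotationFS heading = cas.createAnnotation(headingType, m.start(), m.end());
            heading.setStringValue(labelFeature, m.group(1));  // e.g. "Chapter" or "Section"
            cas.addFsToIndexes(heading);
        }
    }
}

Later annotators could then work purely on these Heading annotations rather than on the raw text. Is that the kind of layering you mean?
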
>>> i.e. These "things" I'm trying to understand, like Volume/Chapter/Section/etc. - would you call those "entities" in the way I've heard the term "entity extraction"?
>>>
>>> b. And I gave so much detail so I could also ask: does this sound like a straightforward use for UIMA, or does it sound like a *difficult* use for UIMA?
>>
>> It sounds to me like a straightforward use of UIMA, but that doesn't mean it'll be easy :)
>>
>>> c. Regarding b, I can imagine giving UIMA regular expressions to look for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time, like the chapters I know the book has (this is the idea of a "Gazetteer", yes?), but I'm unclear: does UIMA also address this thing where I'm trying to understand "what *contains* what"?
>>
>> I'd treat regular expressions as the last thing to rely on, as they are not so easy to maintain over time and also not so efficient; however, they can really help sometimes. I'd go through simple NLP phases such as tokenizing and POS tagging, along with "Gazetteers" (see DictionaryAnnotator [2] and ConceptMapper [3]), and maybe introduce the OpenNLP [4] tools to use chunkers.
>>
>>> d. i.e. Does UIMA support the need to look at the relationship between things, e.g. "does this heading follow another heading, and was that other heading identified as a 'Section', and is this heading indented further to the right than that one, so I guess this must be a 'Part' within that 'Section'"? Does UIMA support that kind of thing? If so, does it have a name I can search on? :)
>>
>> What you have to do to support that in UIMA is define an annotator that recognizes headings, creating, for example, HeadingAnnotations, and then use, for example, the ConfigurableFeatureExtractor [5] to see what follows what and that kind of thing.
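
Just to check my understanding of that last point (and ignoring the ConfigurableFeatureExtractor for the moment), I picture a later step walking the heading annotations in document order and guessing the nesting from something like indentation. Another untested sketch, reusing the made-up "org.example.Heading" type from my earlier sketch; please tell me if this hand-rolled approach is exactly what ConfigurableFeatureExtractor is meant to replace:

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class HeadingNestingSketch {
    // Print a rough parent/sibling guess for each heading based on how far it
    // is indented relative to the heading that precedes it.
    public static void printNesting(CAS cas) {
        Type headingType = cas.getTypeSystem().getType("org.example.Heading");
        String text = cas.getDocumentText();

        AnnotationFS previous = null;
        int previousIndent = -1;
        FSIterator<AnnotationFS> it = cas.getAnnotationIndex(headingType).iterator();
        while (it.hasNext()) {
            AnnotationFS heading = it.next();
            int lineStart = text.lastIndexOf('\n', heading.getBegin()) + 1;
            int indent = heading.getBegin() - lineStart;  // column where the heading starts
            if (previous != null) {
                // Deeper indentation than the previous heading -> probably contained by it.
                String relation = indent > previousIndent ? "child of" : "sibling of";
                System.out.println(heading.getCoveredText() + "  (" + relation + ")  "
                        + previous.getCoveredText());
            }
            previous = heading;
            previousIndent = indent;
        }
    }
}

Does that roughly match what you have in mind?
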
>>> e. When I mentioned the slight inconsistencies in how things are referenced (the case being different, a hyphen being omitted, etc.), I think I've heard the phrase "fuzzy matching". I'm guessing that's part of what UIMA provides?
>>
>> "Fuzzy matching" is more likely to be part of IR systems (such as Lucene/Solr); however, you can plug in your own tokenizer to parse the text as you need. In UIMA you can take the simple tokenizer and also place the stemmer block (SnowballAnnotator [6]) in the pipeline, so that "matches" are made only on the root of a word.
>>
>>> Thanks for any tips, and I apologize for such a long question. I'd been looking at the UIMA docs, but I was new enough that I decided I needed to appeal to those of you with greater experience. :)
>>
>> Finally, regarding RDF: there is no RDF CAS consumer in UIMA, but one can easily be built using the Apache Clerezza UIMA utils module [7]; I'll write a separate email about this as soon as possible.
>>
>> Thanks to you, and I hope my small hints can help.
>> Cheers,
>> Tommaso
>>
>> [1] : http://uima.apache.org/doc-uimaas-what.html
>> [2] : http://uima.apache.org/sandbox.html#dict.annotator
>> [3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
>> [4] : http://incubator.apache.org/opennlp/
>> [5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
>> [6] : http://uima.apache.org/sandbox.html#snowball.annotator
>> [7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/
>>
>>> (Is there any kind of "Text Extraction for Dummies" introduction anybody would recommend for a newbie, btw?)
>>>
>>> Thanks again,
>>>
>>> Darren
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse

--
Ted Pedersen
http://www.d.umn.edu/~tpederse