Subject: Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?
From: Ted Pedersen
Reply-To: tpederse@d.umn.edu
To: user@uima.apache.org
Date: Mon, 27 Dec 2010 10:10:28 -0600

BTW, one potential consideration here is that in addition to providing a dictionary of terms (as Dictionary Annotator and Concept Mapper seem to support), I'm also interested in providing regular expressions that can be matched in my text. So I will have entities I want to identify that might occur in a dictionary, or might be defined by a regular expression. I guess this must be pretty common, but I'm wondering whether either Dictionary Annotator or Concept Mapper integrates better with the Regular Expression Annotator? In case I'm not being clear about what I'm referring to:

Regular Expression Annotator
http://uima.apache.org/downloads/sandbox/RegexAnnotatorUserGuide/RegexAnnotatorUserGuide.html#sandbox.regexAnnotator.conceptsFile.concepts

Dictionary Annotator
http://uima.apache.org/downloads/sandbox/DictionaryAnnotatorUserGuide/DictionaryAnnotatorUserGuide.html

Concept Mapper
http://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html

Anyway, assuming that I specify entities using both regular expressions and dictionary entries, is there a preferred way to use and/or combine the above (or anything else)? The goal at this point is simply to identify those entities in the text for later downstream processing.
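
To make that a bit more concrete, here is roughly what I have in mind, sketched against the plain UIMA Java API. It's completely untested, the descriptor paths are just placeholders for wherever the sandbox components are installed, and I may well be holding the API wrong:

import java.util.Arrays;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.CasCreationUtils;
import org.apache.uima.util.XMLInputSource;

public class RegexPlusDictionarySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder descriptor paths: point these at the installed
        // RegexAnnotator and ConceptMapper (or DictionaryAnnotator) descriptors.
        AnalysisEngine regexAe = UIMAFramework.produceAnalysisEngine(
                UIMAFramework.getXMLParser().parseResourceSpecifier(
                        new XMLInputSource("desc/MyRegexAnnotator.xml")));
        AnalysisEngine dictAe = UIMAFramework.produceAnalysisEngine(
                UIMAFramework.getXMLParser().parseResourceSpecifier(
                        new XMLInputSource("desc/MyConceptMapper.xml")));

        // Create one CAS whose type system merges what both components declare.
        CAS cas = CasCreationUtils.createCas(Arrays.asList(
                regexAe.getAnalysisEngineMetaData(),
                dictAe.getAnalysisEngineMetaData()));

        cas.setDocumentText("Chapter 1: The Life of the Spleen, Wrap-Up");
        regexAe.process(cas);  // entities defined by regular expressions
        dictAe.process(cas);   // entities defined by dictionary entries
    }
}

I realize an aggregate descriptor with a fixed flow is probably the more idiomatic way to chain the two; the point is just that both annotators would mark up the same CAS, so downstream components would see the union of their annotations.
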
Thanks!
Ted

On Mon, Dec 27, 2010 at 9:59 AM, Ted Pedersen wrote:
> Thanks to Tommaso for a very interesting posting, and to Darren for the question that generated it.
>
> As a kind of follow-on question to one of the suggestions made by Tommaso...
>
> I'm particularly interested in the functionality provided by Concept Mapper, or maybe Dictionary Annotator (that is, the ability to create a dictionary and then recognize when a dictionary term occurs in my text). From reading over the documentation, it seems like Concept Mapper and Dictionary Annotator are fairly similar. To be honest I don't know much about UIMA yet, but I am trying to learn, so there may be subtleties here I don't see that would make one preferable to the other.
>
> Is there a short summary of the differences between Concept Mapper and Dictionary Annotator, and does anyone have strong feelings about when you should use one over the other?
>
> Cordially,
> Ted
>
> On Mon, Dec 27, 2010 at 2:45 AM, Tommaso Teofili wrote:
>> Hi Darren,
>>
>> 2010/12/23 Darren Cruse:
>>
>>> Hi guys, I apologize for a newbie question, but I'm quite new to UIMA and to the whole area of information extraction/entity extraction, and I'm hoping someone can tell me whether UIMA is the right tool for a project I've been working on (with other tools) and having trouble with.
>>>
>>> Basically the task is to extract metadata from HTML in the form of RDF, where the HTML represents books/articles/papers/etc. that typically have an "outline" or "table of contents", and part of the task involves extracting the entities "behind" (so to speak) the table of contents.
>>
>> This is perfectly aligned with UIMA's scope, as it deals with discovering hidden knowledge.
>>
>>> So e.g. if the "corpus" of HTML pages is from a book, and the book has Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6 sections, Section 1 has three parts, etc., then my resulting RDF has to model these things (entities/classes/whatever you'd call them) and understand the "hierarchy" of what contains what.
>>>
>>> The really challenging part is that it's a pretty large volume of material with many different books/articles/papers/etc., and there is a lot of variability, as each was authored by different people not following any particular template.
>>
>> On the "large volume of material" topic, I think UIMA-AS [1] can help you as you need to scale.
>>
>>> For example, what I called a "table of contents" is rarely a single page; more often it is spread across multiple "outline" pages, where e.g. a high-level table of contents page goes down to the level of chapter links. And then each chapter may have its own "outline" breaking down the sections within that chapter. Or it might not; different books can differ. For example, the pages making up a chapter may just have headings giving the titles/names of the sections, without being organized into a chapter "outline" at all. Yet I'm still responsible for identifying what the sections are.
>>>
>>> Somewhat helpful is that headings often indicate the kind of thing they are, e.g. "Section 3: The Life of the Spleen, Wrap-Up". Not always, though: I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section 3:" on the front).
>>>
>>> Or I may get both forms in different places in the book, where ideally I should relate the two references as being the same thing.
>>>
>>> And different places can refer to the same thing with other differences too. Possibly the case of the letters differs, or in this example there could be one heading with "Wrap-Up" and another with "Wrap Up" (one with the hyphen, the other without).
>>>
>>> As far as understanding the relationships between things, i.e. that Chapter 3 contains Sections 1 through 3 and Section 1 contains two "Parts": where the things do appear in a "table of contents" or "outline" page, it seems like the arrangement/formatting of those pages gives the clue as to "what contains what". That is, things "contained" typically follow what they're contained by, and are often indented (but not necessarily; it can just be that the "parent" is bolded, and they might not be indented beneath their "parent").
>>>
>>> Apologies for the long-winded description, but hopefully it will help to clarify my questions, since I'm new to UIMA:
>>>
>>> a. Does it sound like a "UIMA kind of problem"? :)
>>
>> I recently worked on a similar use case, and yes, I think this sounds like a UIMA kind of problem. My very abstract advice is to use a bottom-up approach, that is, recognize words, then sentences, then sections at first; after that you can "play" with sections and understand their relationships with chapters and so on.
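
If I'm following the bottom-up suggestion correctly, I imagine the lowest layer might be a simple annotator that just marks heading-like lines. Here's an untested sketch; "org.example.Heading" and its "label" feature are names I'd have to declare in my own type system, and the pattern is far too naive for real tables of contents:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class HeadingAnnotator extends CasAnnotator_ImplBase {
    // Naive heading pattern; real data would need many more rules or a dictionary.
    private static final Pattern HEADING =
            Pattern.compile("^(Volume|Chapter|Section|Part)\\s+\\d+.*$", Pattern.MULTILINE);

    @Override
    public void process(CAS cas) throws AnalysisEngineProcessException {
        Type headingType = cas.getTypeSystem().getType("org.example.Heading");
        Feature labelFeature = headingType.getFeatureByBaseName("label");

        Matcher m = HEADING.matcher(cas.getDocumentText());
        while (m.find()) {
            AnnotationFS heading = cas.createAnnotation(headingType, m.start(), m.end());
            heading.setStringValue(labelFeature, m.group(1));  // e.g. "Chapter" or "Section"
            cas.addFsToIndexes(heading);
        }
    }
}

Later annotators could then work purely on these Heading annotations rather than on the raw text. Is that the kind of layering you mean?
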
>>> i.e. These "things" I'm trying to understand, like Volume/Chapter/Section/etc. - would you call those "entities" in the way I've heard the term "entity extraction"?
>>>
>>> b. And I gave so much detail so I could also ask: does this sound like a straightforward use for UIMA, or does it sound like a *difficult* use for UIMA?
>>
>> It sounds to me like a straightforward use of UIMA, but that doesn't mean it'll be easy :)
>>
>>> c. Regarding b, I can imagine giving UIMA regular expressions to look for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time, like the chapters I know the book has (this is the idea of a "Gazetteer", yes?), but I'm unclear: does UIMA also address this thing where I'm trying to understand "what *contains* what"?
>>
>> I'd treat regular expressions as the last thing to rely on, as they are not so easy to maintain over time and also not so efficient; however, they can really help sometimes. I'd go through simple NLP phases such as tokenizing and POS tagging, along with "Gazetteers" (see DictionaryAnnotator [2] and ConceptMapper [3]), and maybe introduce the OpenNLP [4] tools to use chunkers.
>>
>>> d. i.e. Does UIMA support the need to look at the relationship between things, e.g. "does this heading follow another heading, and was that other heading identified as a 'Section', and is this heading indented further to the right than that one, so I guess this must be a 'Part' within that 'Section'"? Does UIMA support that kind of thing? If so, does it have a name I can search on? :)
>>
>> What you have to do to support that in UIMA is define an annotator that recognizes headings, creating, for example, HeadingAnnotations, and then use, for example, the ConfigurableFeatureExtractor [5] to see what follows what and that kind of thing.
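
Just to check my understanding of that last point (and ignoring the ConfigurableFeatureExtractor for the moment), I picture a later step walking the heading annotations in document order and guessing the nesting from something like indentation. Another untested sketch, reusing the made-up "org.example.Heading" type from my earlier sketch; please tell me if this hand-rolled approach is exactly what ConfigurableFeatureExtractor is meant to replace:

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class HeadingNestingSketch {
    // Print a rough parent/sibling guess for each heading based on how far it
    // is indented relative to the heading that precedes it.
    public static void printNesting(CAS cas) {
        Type headingType = cas.getTypeSystem().getType("org.example.Heading");
        String text = cas.getDocumentText();

        AnnotationFS previous = null;
        int previousIndent = -1;
        FSIterator<AnnotationFS> it = cas.getAnnotationIndex(headingType).iterator();
        while (it.hasNext()) {
            AnnotationFS heading = it.next();
            int lineStart = text.lastIndexOf('\n', heading.getBegin()) + 1;
            int indent = heading.getBegin() - lineStart;  // column where the heading starts
            if (previous != null) {
                // Deeper indentation than the previous heading -> probably contained by it.
                String relation = indent > previousIndent ? "child of" : "sibling of";
                System.out.println(heading.getCoveredText() + "  (" + relation + ")  "
                        + previous.getCoveredText());
            }
            previous = heading;
            previousIndent = indent;
        }
    }
}

Does that roughly match what you have in mind?
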
>>> e. When I mentioned the slight inconsistencies in how things are referenced (the case being different, a hyphen being omitted, etc.), I think I've heard the phrase "fuzzy matching". I'm guessing that's part of what UIMA provides?
>>
>> "Fuzzy matching" is more likely to be part of IR systems (such as Lucene/Solr); however, you can plug in your own tokenizer to parse the text as you need. In UIMA you can take the simple tokenizer and also place the stemmer block (SnowballAnnotator [6]) in the pipeline, so that "matches" are made only on the root of a word.
>>
>>> Thanks for any tips, and I apologize for such a long question. I'd been looking at the UIMA docs, but I was new enough that I decided I needed to appeal to those of you with greater experience. :)
>>
>> Finally, regarding RDF: there is no RDF CAS consumer in UIMA, but one can easily be built using the Apache Clerezza UIMA utils module [7]; I'll write a separate email about this as soon as possible.
>>
>> Thanks to you, and I hope my small hints can help.
>> Cheers,
>> Tommaso
>>
>> [1] : http://uima.apache.org/doc-uimaas-what.html
>> [2] : http://uima.apache.org/sandbox.html#dict.annotator
>> [3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
>> [4] : http://incubator.apache.org/opennlp/
>> [5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
>> [6] : http://uima.apache.org/sandbox.html#snowball.annotator
>> [7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/
>>
>>> (Is there any kind of "Text Extraction for Dummies" introduction anybody would recommend for a newbie, btw?)
>>>
>>> Thanks again,
>>>
>>> Darren
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse

--
Ted Pedersen
http://www.d.umn.edu/~tpederse