uima-user mailing list archives

From Ted Pedersen <tpede...@d.umn.edu>
Subject Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?
Date Mon, 27 Dec 2010 16:10:28 GMT
BTW, one potential consideration in this is that in addition to
providing a dictionary of terms (as Dictionary Annotator and Concept
Mapper seem to provide), I'm also interested in providing regular
expressions that can be matched in my text. So I will have entities
that I want to identify that might occur in a dictionary, or might be
defined by a regular expression. I guess this must be pretty common,
but I'm wondering whether either Dictionary Annotator or Concept Mapper
integrates better with Regular Expression Annotator?

In case I'm not being clear about what I'm referring to...

Regular Expression Annotator
http://uima.apache.org/downloads/sandbox/RegexAnnotatorUserGuide/RegexAnnotatorUserGuide.html#sandbox.regexAnnotator.conceptsFile.concepts

Dictionary Annotator
http://uima.apache.org/downloads/sandbox/DictionaryAnnotatorUserGuide/DictionaryAnnotatorUserGuide.html

Concept Mapper
http://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html

Anyway, assuming that I specify entities using both regular
expressions and dictionary entries, is there a preferred way to use
and/or combine the above (or anything else)? The goal at this point is
simply to identify those entities in text for later downstream
processing.
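
To make the goal concrete, here is a minimal sketch in plain Python of
the behavior I have in mind. It is not UIMA code and does not use any of
the annotators above; find_entities, Span, the sample dictionary and the
"SectionRef" pattern are all hypothetical names made up for illustration.
It collects dictionary hits and regular expression hits into one list
and, as one possible overlap policy, keeps the leftmost and then longest
match.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int
    label: str
    text: str


def find_entities(text, dictionary, patterns):
    """Collect dictionary and regex matches, then resolve overlaps."""
    spans = []
    # Dictionary lookup: whole-word, case-insensitive match for each term.
    for term, label in dictionary.items():
        term_re = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(term_re, text, re.IGNORECASE):
            spans.append(Span(m.start(), m.end(), label, m.group()))
    # Regex lookup: each pattern carries its own label.
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append(Span(m.start(), m.end(), label, m.group()))
    # Prefer earlier, then longer spans; drop anything overlapping a kept span.
    spans.sort(key=lambda s: (s.begin, -(s.end - s.begin)))
    kept, last_end = [], -1
    for s in spans:
        if s.begin >= last_end:
            kept.append(s)
            last_end = s.end
    return kept


# Hypothetical sample data, for illustration only.
dictionary = {"spleen": "Organ", "liver": "Organ"}
patterns = {"SectionRef": r"Section \d+"}
text = "Section 3 covers the spleen and the liver."
entities = find_entities(text, dictionary, patterns)
for e in entities:
    print(e.label, repr(e.text))
```

Both sources end up as the same kind of labeled span, which is all I
need for downstream processing; what I don't know is which of the
annotators above makes this kind of merging easiest.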

Thanks!
Ted

On Mon, Dec 27, 2010 at 9:59 AM, Ted Pedersen <tpederse@d.umn.edu> wrote:
> Thanks to Tommaso for a very interesting posting, and to Darren for
> the question that generated it.
>
> As a kind of follow-on question to one of the suggestions made by Tommaso....
>
> I'm particularly interested in the functionality provided by Concept
> Mapper, or maybe Dictionary Annotator (that is, having the ability to
> create a dictionary and then recognize when a dictionary
> term occurs in my text). From reading over the documentation it seems
> like Concept Mapper and Dictionary Annotator are fairly similar. To be
> honest I don't know much about UIMA, but am trying to learn, so there
> might be some subtleties here I don't see (that would make one want to
> prefer one of these over the other).
>
> Is there a short summary of the differences between Concept Mapper and
> Dictionary Annotator, and does anyone have any strong feelings about
> when you should use one over the other?
>
> Cordially,
> Ted
>
> On Mon, Dec 27, 2010 at 2:45 AM, Tommaso Teofili
> <tommaso.teofili@gmail.com> wrote:
>> Hi Darren,
>>
>> 2010/12/23 Darren Cruse <darren.cruse@gmail.com>
>>
>>> Hi guys I apologize for a newbie question but I'm quite new to UIMA and the
>>> whole area of information extraction/entity extraction.  And I'm hoping
>>> someone can tell me if UIMA is a proper tool for a project that I've been
>>> working on (with other tools) that I've been having trouble with.
>>>
>>>
>>> Basically the task is to extract meta data from html in the form of RDF.
>>>  Where the html represents books/articles/papers/etc. that typically have
>>> an
>>> "outline" or "table of contents", and part of the task involves extracting
>>> the entities "behind" (so to speak) the table of contents.
>>>
>>
>> This is perfectly aligned with UIMA's scope, as it deals with discovering
>> hidden knowledge.
>>
>>
>>>
>>>
>>> So e.g. if the "corpus" of html pages are from a book, and the book has
>>> Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6
>>> Sections,
>>> Section 1 has three Parts, etc.  Then my resulting RDF has to model these
>>> things (entities/classes/whatever you'd call them) and understand the
>>> "hierarchy" of what contains what.
>>>
>>>
>>> The real challenging part is that it's a pretty large volume of material
>>> with many different books/articles/papers/etc.  And there is a lot of
>>> variability, as each were authored by different people not following any
>>> particular template.
>>>
>>
>> On the "large volume of material" topic I think that UIMA-AS [1] can help
>> you as you need to scale.
>>
>>
>>>
>>>
>>> For example what I called a "table of contents" is rarely a single page but
>>> more often it's exploded across multiple "outline" pages where e.g. a high
>>> level table of contents page goes to the level of chapter links.  And then
>>> each chapter may have its own "outline" breaking down the sections within
>>> that chapter.  Or it might not, different books can differ.  For example
>>> the
>>> pages making up the chapter may just have headings referring to the
>>> titles/names of the sections without being organized into a chapter
>>> "outline" at all.  Yet I'm still responsible for identifying what the
>>> sections are.
>>>
>>>
>>> Somewhat helpful is that headings often indicate the kind of thing they
>>> are,
>>> e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always though,
>>> e.g.
>>> I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section
>>> 3:" on the front).
>>>
>>>
>>> Or I may get both forms in different places in the book, where ideally I
>>> should relate the two references as being the same thing.
>>>
>>>
>>> And where different places can refer to the same thing with other
>>> differences too.  Possibly the case of the letters differ, or in this
>>> example there could be one heading with "Wrap-Up" and another with  "Wrap
>>> Up" (one with the dash the other without the dash).
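
The matching problem described above (an optional "Section 3:" prefix,
differing case, "Wrap-Up" versus "Wrap Up") can be sketched as a
normalization step that reduces each heading to a comparison key. This is
plain Python, not a UIMA component; normalize_heading and the list of
prefix keywords are assumptions made for illustration.

```python
import re


def normalize_heading(heading):
    """Reduce a heading to a comparison key (hypothetical helper)."""
    # Drop a leading "Section 3:"-style prefix if one is present.
    key = re.sub(
        r"^\s*(volume|chapter|section|part)\s+\w+\s*:\s*",
        "",
        heading,
        flags=re.IGNORECASE,
    )
    # Treat hyphens like spaces, drop other punctuation, ignore case.
    key = key.replace("-", " ")
    key = re.sub(r"[^\w\s]", "", key)
    return " ".join(key.lower().split())


a = normalize_heading("Section 3:  The Life of the Spleen, Wrap-Up")
b = normalize_heading("the life of the spleen, wrap up")
print(a == b)
```

Two headings that normalize to the same key can then be treated as
references to the same thing. This only covers exact matches after
normalization; genuinely fuzzy matching (typos, reworded titles) would
need something more.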
>>>
>>>
>>> As far as understanding the relationships between things i.e. that Chapter
>>> 3
>>> contains Sections 1 through 3 and Section 1 contains two "Parts", where the
>>> things do appear in a "table of contents" or "outline" page, it seems like
>>> the arrangement/formatting of those pages give the clue as to "what
>>> contains
>>> what".  i.e. Things "contained" typically follow what they're contained by,
>>> and are often indented (but not necessarily, it can just be that the
>>> "parent" is bolded, yet they might not be indented beneath their "parent").
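
For the indented case described above, "what contains what" can be
sketched with a stack of open ancestors. This is a plain Python
illustration, not a UIMA feature; build_hierarchy and the sample outline
are hypothetical, titles are assumed to be unique, and the bolded-parent
case (no indentation) is not handled.

```python
def build_hierarchy(lines):
    """Map each outline entry to its parent, using indentation depth only."""
    parents = {}
    stack = []  # (indent, title) for the chain of currently open ancestors
    for line in lines:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        title = line.strip()
        # Close every entry that is not strictly shallower than this one.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents[title] = stack[-1][1] if stack else None
        stack.append((indent, title))
    return parents


# Hypothetical outline, for illustration only.
outline = [
    "Chapter 3",
    "  Section 1",
    "    Part 1",
    "    Part 2",
    "  Section 2",
]
parents = build_hierarchy(outline)
for title, parent in parents.items():
    print(title, "<-", parent)
```

An entry's parent is the nearest preceding entry with a smaller indent,
which matches the "contained things follow their container" observation.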
>>>
>>>
>>>
>>> I apologize for the long-winded description, but hopefully it will help to
>>> clarify my question since I'm new to UIMA:
>>>
>>>
>>> a.  Does it sound like a "UIMA kind of problem"? :)
>>>
>>
>> I recently worked on a similar use case and yes, I think this sounds like
>> a UIMA kind of problem.
>> My very abstract advice is to use a bottom-up approach: first recognize
>> words, then sentences, then sections; after that you can "play"
>> with sections and understand relationships with chapters and so on.
>>
>>
>>>
>>> i.e. These "things" I'm trying to understand like
>>> Volume/Chapter/Section/etc. - would you call those "entities" in the way
>>> I've heard the term "entity extraction"?
>>>
>>>
>>> b.  And I gave so much detail so I could also ask:  Does this sound like a
>>> straightforward use for UIMA, or does it sound like a *difficult* use for
>>> UIMA?
>>>
>>
>> It sounds to me like a straightforward use of UIMA, but this doesn't mean
>> it'll be that easy :)
>>
>>
>>>
>>>
>>> c.  Regarding b, I can imagine me giving UIMA regular expressions to look
>>> for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time,
>>> such as the chapters I know the book has (this is the idea of a "Gazetteer",
>>> yes?), but I'm unclear:  does UIMA also address this thing where I'm trying
>>> to understand "what *contains* what"?
>>>
>>
>> I'd recommend regular expressions as the last thing to rely on, as they
>> are not so easy to maintain over time and also not so efficient; however,
>> they can really help sometimes.
>> I'd go through simple NLP phases such as tokenizing and POS tagging, along
>> with "Gazetteers" (see DictionaryAnnotator[2] and ConceptMapper[3]), and
>> maybe introduce OpenNLP[4] tools to use chunkers.
>>
>>
>>>
>>>
>>> d.  i.e. Does UIMA support the need to look at the relationship between
>>> things e.g. "does this heading follow another heading, and was that other
>>> heading identified as a "Section", and is this heading indented further to
>>> the right than that one, so I guess this must be a "Part" within that
>>> "Section".  Does UIMA support that kind of thing?  If so does that have a
>>> name I can search on? :)
>>>
>>
>> What you have to do to support that in UIMA is define an annotator that
>> recognizes headings, creating, for example, HeadingAnnotations, and then
>> use, for example, the ConfigurableFeatureExtractor[5] to see what follows
>> what and that kind of thing.
>>
>>
>>
>>>
>>>
>>> e.  When I mentioned the slight inconsistencies in how things are
>>> referenced
>>> (the case being different, a dash being omitted, etc). I think I've heard
>>> the phrase "fuzzy matching".  I'm guessing that's part of what UIMA
>>> provides?
>>>
>>
>> "Fuzzy matching" is more likely to be part of IR systems (such as
>> Lucene/Solr); however, you can plug in your own tokenizer to parse text as
>> you need. In UIMA you can take the simple tokenizer and also place the
>> stemmer block (SnowballAnnotator[6]) in the pipeline to get "matches" only
>> on the stem of a word.
>>
>>
>>>
>>>
>>> Thanks for any tips. I apologize for such a long question. I'd been
>>> looking at the UIMA docs, but I was new enough that I decided I needed to
>>> appeal to those of you with greater experience. :)
>>>
>>
>> Finally, regarding RDF: there is no RDF CAS consumer in UIMA, but one can
>> easily be built using the Apache Clerezza UIMA Utils module[7]; I'll write
>> a separate email about this as soon as possible.
>>
>> Thanks to you, hope my small hints can help you.
>> Cheers,
>> Tommaso
>>
>> [1] : http://uima.apache.org/doc-uimaas-what.html
>> [2] : http://uima.apache.org/sandbox.html#dict.annotator
>> [3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
>> [4] : http://incubator.apache.org/opennlp/
>> [5] :
>> http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
>> [6] : http://uima.apache.org/sandbox.html#snowball.annotator
>> [7] :
>> http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/
>>
>>
>>
>>
>>
>>>
>>>
>>> (is there any kind of "Text Extraction for Dummies" kind of introduction
>>> anybody would recommend for a newbie btw?)
>>>
>>>
>>> Thanks again,
>>>
>>>
>>> Darren
>>>
>>
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
