uima-user mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?
Date Mon, 27 Dec 2010 16:47:16 GMT
Hi Ted,
thanks for your comments!
Regarding the differences between DictionaryAnnotator and ConceptMapper, there
is a previous thread that should help in understanding how they compare [1].

2010/12/27 Ted Pedersen <tpederse@d.umn.edu>

> Anyway, assuming that I specify entities using both Regular
> Expressions and Dictionary entries, is there a preferred way to use
> and/or combine the above (or anything else?) The goal at this point is
> simply to identify those entities in text for later downstream
> processing.
>

You probably have to put the "dictionary" analysis engine (either
DictionaryAnnotator or ConceptMapper) in the pipeline along with the
Regular Expression Annotator, and then combine the generated annotations
inside a third, custom annotator or via the Configurable Feature Extractor.
Note that you can also build named entity recognition blocks using OpenNLP
(see, for example, [2]), either with existing models or by creating your own.
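
To make the combining step a bit more concrete, here is a rough sketch in
plain Python rather than the UIMA API. Everything in it (the Span type, the
function names, the sample pattern and dictionary entry) is made up for
illustration, and "keep the longest span when two overlap" is just one
possible policy, not something the annotators above do for you.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int
    kind: str    # entity type, e.g. "Chapter" (hypothetical labels)
    source: str  # "regex" or "dictionary"


def regex_spans(text, patterns):
    """Yield one Span per regular-expression match."""
    for kind, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            yield Span(m.start(), m.end(), kind, "regex")


def dictionary_spans(text, entries):
    """Yield one Span per case-insensitive dictionary hit."""
    for term, kind in entries.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            yield Span(m.start(), m.end(), kind, "dictionary")


def combine(spans):
    """Merge candidate spans, keeping the longest one wherever they overlap."""
    kept = []
    # Visit longest spans first; ties are broken by start offset.
    for span in sorted(spans, key=lambda s: (-(s.end - s.begin), s.begin)):
        if all(span.end <= k.begin or span.begin >= k.end for k in kept):
            kept.append(span)
    return sorted(kept)  # back into document order


text = "Chapter 3: The Life of the Spleen. See also the life of the spleen."
patterns = {"Chapter": r"Chapter \d+: [^.]+"}
entries = {"The Life of the Spleen": "Title"}

entities = combine(
    list(regex_spans(text, patterns)) + list(dictionary_spans(text, entries))
)
for e in entities:
    print(e.kind, e.source, repr(text[e.begin:e.end]))
```

In a real pipeline the two generators would be replaced by the annotations
already stored in the CAS by the dictionary and regex annotators, and the
custom annotator would only need the combine step.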
Hope this helps.
Cheers,
Tommaso

[1] : http://markmail.org/thread/oyhct2lh4uj2ow2h
[2] : http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Name_Finder



>
> Thanks!
> Ted
>
> On Mon, Dec 27, 2010 at 9:59 AM, Ted Pedersen <tpederse@d.umn.edu> wrote:
> > Thanks to Tommaso for a very interesting posting, and to Darren for
> > the question that generated it.
> >
> > As a kind of follow-on question to one of the suggestions made by
> > Tommaso....
> >
> > I'm particularly interested in the functionality provided by Concept
> > Mapper, or maybe Dictionary Annotator (that is having the ability to
> > create a dictionary and then be able to recognize when a dictionary
> > term occurs in my text). From reading over the documentation it seems
> > like Concept Mapper and Dictionary Annotator are fairly similar. To be
> > honest I don't know much about UIMA, but am trying to learn, so there
> > might be some subtleties here I don't see (that would make one want to
> > prefer one of these over the other).
> >
> > Is there a short summary of the differences between Concept Mapper and
> > Dictionary Annotator, and does anyone have any strong feelings about
> > when you should use one over the other?
> >
> > Cordially,
> > Ted
> >
> > On Mon, Dec 27, 2010 at 2:45 AM, Tommaso Teofili
> > <tommaso.teofili@gmail.com> wrote:
> >> Hi Darren,
> >>
> >> 2010/12/23 Darren Cruse <darren.cruse@gmail.com>
> >>
> >>> Hi guys I apologize for a newbie question but I'm quite new to UIMA and
> >>> the whole area of information extraction/entity extraction.  And I'm
> >>> hoping someone can tell me if UIMA is a proper tool for a project that
> >>> I've been working on (with other tools) that I've been having trouble
> >>> with.
> >>>
> >>>
> >>> Basically the task is to extract meta data from html in the form of
> >>> RDF.  Where the html represents books/articles/papers/etc. that
> >>> typically have an "outline" or "table of contents", and part of the
> >>> task involves extracting the entities "behind" (so to speak) the table
> >>> of contents.
> >>>
> >>
> >> This is perfectly aligned with UIMA's scope, as it deals with discovering
> >> hidden knowledge.
> >>
> >>
> >>>
> >>>
> >>> So e.g. if the "corpus" of html pages are from a book, and the book has
> >>> Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6
> >>> Sections, Section 1 has three Parts, etc.  Then my resulting RDF has to
> >>> model these things (entities/classes/whatever you'd call them) and
> >>> understand the "hierarchy" of what contains what.
> >>>
> >>>
> >>> The real challenging part is that it's a pretty large volume of
> >>> material with many different books/articles/papers/etc.  And there is a
> >>> lot of variability, as each were authored by different people not
> >>> following any particular template.
> >>>
> >>
> >> On the "large volume of material" topic, I think that UIMA-AS [1] can
> >> help you, as you need to scale.
> >>
> >>
> >>>
> >>>
> >>> For example what I called a "table of contents" is rarely a single page
> >>> but more often it's exploded across multiple "outline" pages where e.g.
> >>> a high level table of contents page goes to the level of chapter links.
> >>> And then each chapter may have its own "outline" breaking down the
> >>> sections within that chapter.  Or it might not, different books can
> >>> differ.  For example the pages making up the chapter may just have
> >>> headings referring to the titles/names of the sections without being
> >>> organized into a chapter "outline" at all.  Yet I'm still responsible
> >>> for identifying what the sections are.
> >>>
> >>>
> >>> Somewhat helpful is that headings often indicate the kind of thing they
> >>> are, e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always
> >>> though, e.g. I may only get the "The Life of the Spleen, Wrap-Up" part
> >>> (without "Section 3:" on the front).
> >>>
> >>>
> >>> Or I may get both forms in different places in the book, where ideally
> >>> I should relate the two references as being the same thing.
> >>>
> >>>
> >>> And where different places can refer to the same thing with other
> >>> differences too.  Possibly the case of the letters differ, or in this
> >>> example there could be one heading with "Wrap-Up" and another with
> >>> "Wrap Up" (one with the dash the other without the dash).
> >>>
> >>>
> >>> As far as understanding the relationships between things i.e. that
> >>> Chapter 3 contains Sections 1 through 3 and Section 1 contains two
> >>> "Parts", where the things do appear in a "table of contents" or
> >>> "outline" page, it seems like the arrangement/formatting of those pages
> >>> give the clue as to "what contains what".  i.e. Things "contained"
> >>> typically follow what they're contained by, and are often indented (but
> >>> not necessarily, it can just be that the "parent" is bolded, yet they
> >>> might not be indented beneath their "parent").
> >>>
> >>>
> >>>
> >>> Apologize for the long winded description but hopefully it will help to
> >>> clarify my question since I'm new to UIMA:
> >>>
> >>>
> >>> a.  Does it sound like a "UIMA kind of problem"? :)
> >>>
> >>
> >> I recently worked on a similar use case and yes, I think this sounds
> >> like a UIMA kind of problem.
> >> My very abstract advice is to use a bottom-up approach: first recognize
> >> words, then sentences, then sections; after that you can "play" with
> >> sections and work out their relationships with chapters and so on.
> >>
> >>
> >>>
> >>> i.e. These "things" I'm trying to understand like
> >>> Volume/Chapter/Section/etc. - would you call those "entities" in the
> >>> way I've heard the term "entity extraction"?
> >>>
> >>>
> >>> b.  And I gave so much detail so I could also ask:  Does this sound
> >>> like a straightforward use for UIMA, or does it sound like a
> >>> *difficult* use for UIMA?
> >>>
> >>
> >> It sounds to me like a straightforward use of UIMA, but this doesn't
> >> mean it'll be that easy :)
> >>
> >>
> >>>
> >>>
> >>> c.  Regarding b, I can imagine me giving UIMA regular expressions to
> >>> look for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead
> >>> of time like of the chapters I know the book has (this is the idea of a
> >>> "Gazetteer" yes?), but I'm unclear:  does UIMA also address this thing
> >>> where I'm trying to understand "what *contains* what"?
> >>>
> >>
> >> I'd recommend regular expressions as the last thing to rely on, as they
> >> are not so easy to maintain over time and also not so efficient;
> >> however, they can really help sometimes.
> >> I'd go through simple NLP phases such as tokenizing and POS tagging,
> >> along with "Gazetteers" (see DictionaryAnnotator[2] and
> >> ConceptMapper[3]), and maybe introduce OpenNLP[4] tools to use chunkers.
> >>
> >>
> >>>
> >>>
> >>> d.  i.e. Does UIMA support the need to look at the relationship between
> >>> things e.g. "does this heading follow another heading, and was that
> >>> other heading identified as a "Section", and is this heading indented
> >>> further to the right than that one, so I guess this must be a "Part"
> >>> within that "Section".  Does UIMA support that kind of thing?  If so
> >>> does that have a name I can search on? :)
> >>>
> >>
> >> To support that in UIMA you have to define an annotator that recognizes
> >> headings, creating, for example, HeadingAnnotations, and then use, for
> >> example, the ConfigurableFeatureExtractor[5] to see what follows what
> >> and that kind of thing.
> >>
> >>
> >>
> >>>
> >>>
> >>> e.  When I mentioned the slight inconsistencies in how things are
> >>> referenced (the case being different, a dash being omitted, etc). I
> >>> think I've heard the phrase "fuzzy matching".  I'm guessing that's part
> >>> of what UIMA provides?
> >>>
> >>
> >> "Fuzzy matching" is more likely to be part of IR systems (such as
> >> Lucene/Solr); however, you can plug in your own tokenizer to parse text
> >> as you need. In UIMA you can take the simple tokenizer and also place
> >> the stemmer block (SnowballAnnotator[6]) in the pipeline to get
> >> "matches" on the stem of a word only.
> >>
> >>
> >>>
> >>>
> >>> Thanks for any tips I apologize for such a long question I'd been
> >>> looking at the UIMA docs but I was new enough I decided I needed to
> >>> appeal to those of you with greater experience. :)
> >>>
> >>
> >> Finally, regarding RDF, there is no RDF CAS consumer in UIMA, but one
> >> can be built simply using the Apache Clerezza UIMA Utils module[7];
> >> I'll write a separate email about this as soon as possible.
> >>
> >> Thanks to you, hope my small hints can help you.
> >> Cheers,
> >> Tommaso
> >>
> >> [1] : http://uima.apache.org/doc-uimaas-what.html
> >> [2] : http://uima.apache.org/sandbox.html#dict.annotator
> >> [3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
> >> [4] : http://incubator.apache.org/opennlp/
> >> [5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
> >> [6] : http://uima.apache.org/sandbox.html#snowball.annotator
> >> [7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/
> >>
> >>
> >>
> >>
> >>
> >>>
> >>>
> >>> (is there any kind of "Text Extraction for Dummies" kind of
> >>> introduction anybody would recommend for a newbie btw?)
> >>>
> >>>
> >>> Thanks again,
> >>>
> >>>
> >>> Darren
> >>>
> >>
> >
> >
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
> >
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
