From Darren Cruse <darren.cr...@gmail.com>
Subject UIMA for extracting book "entities" from tables of contents, etc. as RDF?
Date Thu, 23 Dec 2010 17:22:25 GMT
Hi guys, I apologize for the newbie question, but I'm quite new to UIMA and
the whole area of information extraction/entity extraction. I'm hoping
someone can tell me whether UIMA is the right tool for a project that I've
been working on (with other tools) and having trouble with.

Basically the task is to extract metadata from HTML in the form of RDF. The
HTML represents books/articles/papers/etc. that typically have an "outline"
or "table of contents", and part of the task involves extracting the
entities "behind" (so to speak) the table of contents.

So, for example, if the "corpus" of HTML pages is from a book, and the book
has Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6
Sections, Section 1 has three Parts, etc., then my resulting RDF has to
model these things (entities/classes/whatever you'd call them) and capture
the "hierarchy" of what contains what.

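To make the target output concrete, here is a minimal sketch in plain Python
(not UIMA code) that turns a small containment tree into RDF-style triples.
The "ex:" prefix, the Node class, and the identifier scheme are hypothetical
names chosen for illustration; dcterms:hasPart is the Dublin Core containment
property, used here as one possible choice of predicate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                     # e.g. "Volume", "Chapter", "Section"
    number: int
    children: List["Node"] = field(default_factory=list)


def to_triples(node: Node, parent_id: Optional[str] = None):
    """Yield (subject, predicate, object) strings for a node and its descendants."""
    local = f"{node.kind.lower()}{node.number}"
    node_id = f"{parent_id}_{local}" if parent_id else local
    # Type triple: what kind of thing this is (hypothetical ex: vocabulary).
    yield (f"ex:{node_id}", "rdf:type", f"ex:{node.kind}")
    # Containment triple: the parent has this node as a part.
    if parent_id:
        yield (f"ex:{parent_id}", "dcterms:hasPart", f"ex:{node_id}")
    for child in node.children:
        yield from to_triples(child, node_id)


volume1 = Node("Volume", 1, [Node("Chapter", 1, [Node("Section", 1)])])
triples = list(to_triples(volume1))
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Building identifiers from the path (volume1_chapter1_section1) keeps "Section
1" of different chapters distinct, which matters when the same label recurs
across a book.
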
The really challenging part is that it's a pretty large volume of material
with many different books/articles/papers/etc., and there is a lot of
variability, as each was authored by different people not following any
particular template.

For example, what I called a "table of contents" is rarely a single page.
More often it's exploded across multiple "outline" pages, where e.g. a
high-level table of contents page goes down to the level of chapter links,
and then each chapter may have its own "outline" breaking down the sections
within that chapter. Or it might not; different books differ. For example,
the pages making up the chapter may just have headings giving the
titles/names of the sections without being organized into a chapter
"outline" at all. Yet I'm still responsible for identifying what the
sections are.

Somewhat helpful is that headings often indicate the kind of thing they are,
e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always though, e.g.
I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section
3:" on the front).

Or I may get both forms in different places in the book, where ideally I
should relate the two references as being the same thing.

Different places can also refer to the same thing with other differences.
The case of the letters may differ, or in this example there could be one
heading with "Wrap-Up" and another with "Wrap Up" (one with the dash, the
other without).
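
One way to relate such variants, sketched below in plain Python rather than
with any UIMA API, is to reduce each heading to a normalized key (drop a
leading "Section 3:" style label, fold case, treat punctuation as spaces) and
then compare keys, falling back to a similarity ratio for near misses. The
function names and the 0.9 threshold are assumptions for illustration only.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Optional leading label such as "Section 3:" or "Chapter IV." (assumed forms).
LABEL_PREFIX = re.compile(
    r"^\s*(volume|chapter|section|part)\s+\w+\s*[:.\-]\s*", re.IGNORECASE
)


def heading_key(heading: str) -> str:
    """Reduce a heading to a comparison key that ignores label, case, and punctuation."""
    text = unicodedata.normalize("NFKC", heading)
    text = LABEL_PREFIX.sub("", text)
    text = text.casefold()
    text = re.sub(r"[^\w\s]", " ", text)   # dashes, commas, etc. become spaces
    return " ".join(text.split())          # collapse runs of whitespace


def same_heading(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two headings as the same if their keys are equal or nearly equal."""
    key_a, key_b = heading_key(a), heading_key(b)
    if key_a == key_b:
        return True
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold


print(heading_key("Section 3:  The Life of the Spleen, Wrap-Up"))
print(same_heading("Section 3:  The Life of the Spleen, Wrap-Up",
                   "the life of the spleen, wrap up"))
```

Exact comparison of normalized keys handles the case and dash differences
described above; the ratio fallback is only a rough stand-in for real fuzzy
matching and would need tuning against actual data.
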

As for understanding the relationships between things, i.e. that Chapter 3
contains Sections 1 through 3 and Section 1 contains two "Parts": where the
things do appear in a "table of contents" or "outline" page, it seems like
the arrangement/formatting of those pages gives the clue as to "what
contains what", i.e. things "contained" typically follow what they're
contained by, and are often indented (but not necessarily; it can just be
that the "parent" is bolded, with the children not indented beneath it).

I apologize for the long-winded description, but hopefully it helps to
clarify my questions, since I'm new to UIMA:

a.  Does it sound like a "UIMA kind of problem"? :)

i.e. these "things" I'm trying to understand, like
Volume/Chapter/Section/etc. - would you call those "entities" in the sense
of the term "entity extraction"?

b.  And I gave so much detail so I could also ask:  Does this sound like a
straightforward use for UIMA, or does it sound like a *difficult* use for
UIMA?

c.  Regarding b, I can imagine giving UIMA regular expressions to look for
"Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time, such
as the chapters I know the book has (this is the idea of a "Gazetteer",
yes?), but I'm unclear:  does UIMA also address this thing where I'm trying
to understand "what *contains* what"?
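
For reference, the regular-expression idea in (c) might look like the sketch
below. It is plain Python, not a UIMA annotator, and the set of label words,
the parse_heading name, and the returned dictionary shape are assumptions
made for illustration. Headings without a recognizable label fall through
with kind set to None, which is the case where only the title is given.

```python
import re

# Label word, then a number (or word such as "IV"), a colon, and the title.
HEADING = re.compile(
    r"^\s*(?P<kind>Volume|Chapter|Section|Part)\s+(?P<number>\w+)\s*:\s*(?P<title>.+?)\s*$",
    re.IGNORECASE,
)


def parse_heading(text: str) -> dict:
    """Split a heading into kind, number, and title; kind is None if unlabeled."""
    match = HEADING.match(text)
    if match is None:
        return {"kind": None, "number": None, "title": text.strip()}
    return {
        "kind": match.group("kind").capitalize(),
        "number": match.group("number"),
        "title": match.group("title"),
    }


labeled = parse_heading("Section 3:  The Life of the Spleen, Wrap-Up")
unlabeled = parse_heading("The Life of the Spleen, Wrap-Up")
print(labeled)
print(unlabeled)
```
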

d.  i.e. does UIMA support the need to look at the relationship between
things, e.g. "does this heading follow another heading, and was that other
heading identified as a "Section", and is this heading indented further to
the right than that one? If so, I guess this must be a "Part" within that
"Section"." Does UIMA support that kind of thing?  If so, does that have a
name I can search on? :)
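
The "follows and is indented further" reasoning in (d) can be sketched as a
stack walk over headings in document order. This is plain Python, not a UIMA
feature, and it assumes each heading already carries a numeric depth (derived
from indentation, bolding, or the kind of heading); how to obtain that depth
reliably is the hard part and is not shown. The build_outline name and the
(depth, title) input shape are hypothetical.

```python
from typing import List, Optional, Tuple


def build_outline(
    entries: List[Tuple[int, str]]
) -> List[Tuple[str, Optional[str]]]:
    """Map each (depth, title) entry, in document order, to (title, parent_title)."""
    stack: List[Tuple[int, str]] = []   # chain of currently open ancestors
    result = []
    for depth, title in entries:
        # Close every open heading that is at the same depth or deeper.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((title, parent))
        stack.append((depth, title))
    return result


entries = [
    (0, "Chapter 3"),
    (1, "Section 1"),
    (2, "Part 1"),
    (2, "Part 2"),
    (1, "Section 2"),
    (1, "Section 3"),
]
outline = build_outline(entries)
for title, parent in outline:
    print(f"{title} <- {parent}")
```

Each heading's parent is the nearest preceding heading with a smaller depth,
which matches the observation that contained things follow what contains
them.
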

e.  When I mentioned the slight inconsistencies in how things are referenced
(the case being different, a dash being omitted, etc). I think I've heard
the phrase "fuzzy matching".  I'm guessing that's part of what UIMA

Thanks for any tips. I apologize for such a long question. I'd been looking
at the UIMA docs, but I'm new enough that I decided I needed to appeal to
those of you with greater experience. :)

(Is there any "Text Extraction for Dummies" kind of introduction anybody
would recommend for a newbie, btw?)

Thanks again,


