uima-user mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?
Date Mon, 27 Dec 2010 08:45:36 GMT
Hi Darren,

2010/12/23 Darren Cruse <darren.cruse@gmail.com>

> Hi guys I apologize for a newbie question but I'm quite new to UIMA and the
> whole area of information extraction/entity extraction.  And I'm hoping
> someone can tell me if UIMA is a proper tool for a project that I've been
> working on (with other tools) that I've been having trouble with.
>
>
> Basically the task is to extract meta data from html in the form of RDF.
> Where the html represents books/articles/papers/etc. that typically have
> an "outline" or "table of contents", and part of the task involves
> extracting the entities "behind" (so to speak) the table of contents.
>

This fits UIMA's scope well, as UIMA deals with discovering hidden
knowledge.


>
>
> So e.g. if the "corpus" of html pages are from a book, and the book has
> Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6
> Sections, Section 1 has three Parts, etc.  Then my resulting RDF has to
> model these things (entities/classes/whatever you'd call them) and
> understand the "hierarchy" of what contains what.
>
>
> The real challenging part is that it's a pretty large volume of material
> with many different books/articles/papers/etc.  And there is a lot of
> variability, as each were authored by different people not following any
> particular template.
>

On the "large volume of material" topic, I think UIMA-AS [1] can help
when you need to scale.


>
>
> For example what I called a "table of contents" is rarely a single page but
> more often it's exploded across multiple "outline" pages where e.g. a high
> level table of contents page goes to the level of chapter links.  And then
> each chapter may have its own "outline" breaking down the sections within
> that chapter.  Or it might not, different books can differ.  For example
> the pages making up the chapter may just have headings referring to the
> titles/names of the sections without being organized into a chapter
> "outline" at all.  Yet I'm still responsible for identifying what the
> sections are.
>
>
> Somewhat helpful is that headings often indicate the kind of thing they
> are, e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always
> though, e.g. I may only get the "The Life of the Spleen, Wrap-Up" part
> (without "Section 3:" on the front).
>
>
> Or I may get both forms in different places in the book, where ideally I
> should relate the two references as being the same thing.
>
>
> And where different places can refer to the same thing with other
> differences too.  Possibly the case of the letters differ, or in this
> example there could be one heading with "Wrap-Up" and another with  "Wrap
> Up" (one with the dash the other without the dash).
>
>
> As far as understanding the relationships between things i.e. that Chapter
> 3 contains Sections 1 through 3 and Section 1 contains two "Parts", where
> the things do appear in a "table of contents" or "outline" page, it seems
> like the arrangement/formatting of those pages give the clue as to "what
> contains what".  i.e. Things "contained" typically follow what they're
> contained by, and are often indented (but not necessarily, it can just be
> that the "parent" is bolded, yet they might not be indented beneath their
> "parent").
>
>
>
> I apologize for the long-winded description but hopefully it will help to
> clarify my question since I'm new to UIMA:
>
>
> a.  Does it sound like a "UIMA kind of problem"? :)
>

I recently worked on a similar use case, and yes, I think this sounds like
a UIMA kind of problem.
My very abstract advice is to use a bottom-up approach: first recognize
words, then sentences, then sections; after that you can "play" with the
sections and work out their relationships with chapters and so on.
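
A minimal sketch of that bottom-up idea, in plain Python rather than the
UIMA API. The Annotation class, the three layer functions and the rules
they use (whitespace tokens, sentence-final punctuation, blank lines
between sections) are assumptions made for illustration only; each layer
is built from the layer below it and stored as offsets into the text, the
way standoff annotations are.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    kind: str   # "Token", "Sentence" or "Section"
    begin: int  # start offset into the document text
    end: int    # end offset (exclusive)


def tokens(text):
    # Lowest layer: every run of non-whitespace characters is a token.
    return [Annotation("Token", m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def sentences(text, toks):
    # Built from tokens: a sentence closes at a token ending in . ! or ?
    out, start = [], None
    for t in toks:
        if start is None:
            start = t.begin
        if text[t.end - 1] in ".!?":
            out.append(Annotation("Sentence", start, t.end))
            start = None
    if start is not None:  # trailing text without final punctuation
        out.append(Annotation("Sentence", start, toks[-1].end))
    return out


def sections(text, sents):
    # Built from sentences: a blank line between two sentences starts
    # a new section.
    out, group = [], []
    for s in sents:
        if group and "\n\n" in text[group[-1].end:s.begin]:
            out.append(Annotation("Section", group[0].begin, group[-1].end))
            group = []
        group.append(s)
    if group:
        out.append(Annotation("Section", group[0].begin, group[-1].end))
    return out


text = "The spleen is small. It works hard.\n\nIt also rests."
toks = tokens(text)
sents = sentences(text, toks)
secs = sections(text, sents)
```

In a real pipeline each layer would be its own annotator, so a better
tokenizer or sentence splitter can be swapped in without touching the
layers above it.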


>
> i.e. These "things" I'm trying to understand like
> Volume/Chapter/Section/etc. - would you call those "entities" in the way
> I've heard the term "entity extraction"?
>
>
> b.  And I gave so much detail so I could also ask:  Does this sound like a
> straightforward use for UIMA, or does it sound like a *difficult* use for
> UIMA?
>

It sounds to me like a straightforward use of UIMA, but this doesn't mean
it'll be that easy :)


>
>
> c.  Regarding b, I can imagine me giving UIMA regular expressions to look
> for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time
> like of the chapters I know the book has (this is the idea of a "Gazetteer"
> yes?), but I'm unclear:  does UIMA also address this thing where I'm trying
> to understand "what *contains* what"?
>

I'd recommend regular expressions as the last thing to rely on, as they are
not so easy to maintain over time and also not so efficient; however, they
can really help sometimes.
I'd go through simple NLP phases such as tokenizing and POS tagging, along
with gazetteers (see DictionaryAnnotator[2] and ConceptMapper[3]), and maybe
introduce OpenNLP[4] tools to use chunkers.
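
To illustrate the gazetteer idea, here is a small sketch in plain Python.
It does not use DictionaryAnnotator or ConceptMapper; the GAZETTEER
entries and the classify_heading function are hypothetical. The point is
only that a dictionary lookup on the first token can replace a pattern
like "Chapter (.*): (.*)", and that new surface forms are added as data
rather than as new expressions.

```python
# Hypothetical gazetteer: surface forms mapped to the kind of structural
# unit they name.
GAZETTEER = {
    "volume": "Volume", "vol.": "Volume",
    "chapter": "Chapter", "ch.": "Chapter",
    "section": "Section", "sec.": "Section",
    "part": "Part",
}


def classify_heading(heading):
    """Return (kind, label, title).

    kind and label are None when the first token is not in the gazetteer.
    """
    tokens = heading.split()
    if len(tokens) >= 2:
        kind = GAZETTEER.get(tokens[0].lower())
        if kind is not None:
            label = tokens[1].rstrip(":.")   # "3:" becomes "3"
            title = " ".join(tokens[2:])
            return kind, label, title
    return None, None, heading.strip()


with_label = classify_heading("Section 3:  The Life of the Spleen, Wrap-Up")
without_label = classify_heading("The Life of the Spleen, Wrap-Up")
```

This sketch trusts the first token completely, so a heading such as
"Part of the Problem" would be misread as a Part labelled "of"; POS tags
or a check on the label would be needed to rule that out.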


>
>
> d.  i.e. Does UIMA support the need to look at the relationship between
> things e.g. "does this heading follow another heading, and was that other
> heading identified as a "Section", and is this heading indented further to
> the right than that one, so I guess this must be a "Part" within that
> "Section".  Does UIMA support that kind of thing?  If so does that have a
> name I can search on? :)
>

What you have to do to support that in UIMA is define an annotator that
recognizes headings, creating, for example, HeadingAnnotations, and then
use, for example, the ConfigurableFeatureExtractor[5] to see what follows
what and that kind of thing.
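
Once headings have been annotated, the "what contains what" step can be
sketched as follows. This is plain Python, not the
ConfigurableFeatureExtractor; the Heading class and build_tree function
are hypothetical, and it assumes each heading already carries a numeric
level (derived from its indentation, or from its kind, e.g. Chapter = 0,
Section = 1, Part = 2). A heading is attached to the nearest preceding
heading with a smaller level.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    title: str
    level: int
    children: list = field(default_factory=list)


def build_tree(headings):
    """headings: (level, title) pairs in document order, level >= 0.

    A smaller level means higher up in the hierarchy.
    """
    root = Heading("ROOT", -1)
    stack = [root]  # path from the root down to the most recent heading
    for level, title in headings:
        node = Heading(title, level)
        # Pop headings that are siblings of, or deeper than, the new one.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


headings = [
    (0, "Chapter 3"),
    (1, "Section 1"),
    (2, "Part 1"),
    (2, "Part 2"),
    (1, "Section 2"),
    (1, "Section 3"),
]
tree = build_tree(headings)
chapter = tree.children[0]
```

Here chapter.children holds the three Sections, and the first Section
holds the two Parts. When a page gives no indentation at all, the level
has to come from another clue such as bold text or the heading's kind.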



>
>
> e.  When I mentioned the slight inconsistencies in how things are
> referenced (the case being different, a dash being omitted, etc). I think
> I've heard the phrase "fuzzy matching".  I'm guessing that's part of what
> UIMA provides?
>

"fuzzy matching" is more likely to be part of IR systems (as Lucene/Solr)
however, you can plug in your own tokenizer to parse text as you need. In
UIMA you can take the simple tokenizer and also place the stemmer block
(SnowballAnnotator[6]) in the pipeline to get "matches" only on the stem
of a word.
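
A rough sketch of matching two references to the same heading, in plain
Python. Note that this is simple normalization, not stemming, and it does
not use the SnowballAnnotator: the hypothetical match_key function folds
case, treats dashes as spaces, drops punctuation, and strips a leading
"Section 3:"-style label (the label pattern is an assumption for this
example). Two headings are taken to refer to the same thing when their
keys are equal.

```python
import re

# Assumed label pattern: a structural word, one label token, then a colon.
LABEL = re.compile(r"^\s*(volume|chapter|section|part)\s+\w+\s*:\s*",
                   re.IGNORECASE)


def match_key(heading):
    text = LABEL.sub("", heading)                 # drop "Section 3:" prefix
    text = text.casefold()                        # ignore letter case
    text = re.sub(r"[-\u2013\u2014]", " ", text)  # hyphen/dash -> space
    text = re.sub(r"[^\w\s]", "", text)           # drop other punctuation
    return " ".join(text.split())                 # collapse whitespace


a = match_key("Section 3:  The Life of the Spleen, Wrap-Up")
b = match_key("the life of the Spleen, Wrap Up")
```

Both calls give the same key, so the two references can be linked. A
stemmer would go one step further and also match inflected forms of the
same word.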


>
>
> Thanks for any tips I apologize for such a long question I'd been looking
> at the UIMA docs but I was new enough I decided I needed to appeal to those
> of you with greater experience. :)
>

Finally, regarding RDF, there is no RDF CAS consumer in UIMA, but one can
easily be built using the Apache Clerezza UIMA Utils module[7]; I'll write
a separate email about this as soon as possible.
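
Until then, a small sketch of what such a consumer would have to emit,
written in plain Python and not using Clerezza at all. The
http://example.org/book/ namespace, the nested-tuple input and the
function names are hypothetical; dcterms:title and dcterms:hasPart are
Dublin Core Terms properties, chosen here only as one plausible way to
model titles and containment. Output is a list of N-Triples lines.

```python
from urllib.parse import quote

BASE = "http://example.org/book/"              # hypothetical namespace
TITLE = "http://purl.org/dc/terms/title"       # Dublin Core Terms
HAS_PART = "http://purl.org/dc/terms/hasPart"  # Dublin Core Terms


def uri(path):
    # One URI per entity, built from the titles on the way down to it.
    return "<" + BASE + "/".join(quote(p, safe="") for p in path) + ">"


def literal(s):
    # Escape backslash, double quote and newline for an N-Triples literal.
    s = s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + s + '"'


def to_ntriples(tree, path=()):
    """tree: (title, [child trees]). Returns a list of N-Triples lines."""
    title, children = tree
    here = path + (title,)
    lines = [f"{uri(here)} <{TITLE}> {literal(title)} ."]
    for child in children:
        lines.append(f"{uri(here)} <{HAS_PART}> {uri(here + (child[0],))} .")
        lines.extend(to_ntriples(child, here))
    return lines


book = ("Volume 1", [("Chapter 1", [("Section 1", [])])])
lines = to_ntriples(book)
```

Each entity gets a title triple, and each parent gets one hasPart triple
per direct child, which is enough to recover the hierarchy from the RDF.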

Thank you, and I hope my small hints help.
Cheers,
Tommaso

[1] : http://uima.apache.org/doc-uimaas-what.html
[2] : http://uima.apache.org/sandbox.html#dict.annotator
[3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
[4] : http://incubator.apache.org/opennlp/
[5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
[6] : http://uima.apache.org/sandbox.html#snowball.annotator
[7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/





>
>
> (is there any kind of "Text Extraction for Dummies" kind of introduction
> anybody would recommend for a newbie btw?)
>
>
> Thanks again,
>
>
> Darren
>
