uima-user mailing list archives

From "Greg Holmberg" <holmberg2...@comcast.net>
Subject Re: Stripping HTML but maintaining annotations for tags
Date Tue, 19 Jun 2012 03:49:36 GMT
Hi Dave--

The Tika MarkupAnnotator does this.


Greg Holmberg

> Hi there,
> I would like to create a pipeline that starts with HTML markup. I need
> to strip this to plain text, so it can be processed by different
> annotators, like POS, chunking, entity detection, etc. However I would
> also like to keep track of which regions correspond to the original
> html tags, like links, paragraphs, em, etc. Basically I would like a
> final annotator that takes advantage of structural annotations (from
> html) and semantic annotations (from the other components), all at
> once.
> So, I can imagine starting off with a component that strips the html
> markup and adds annotations to keep track of the tags I am interested
> in. Does such a component exist already? It seems like something a lot
> of people would want.
> If I do need to create it from scratch, what kind of component is it?
> It's not just a straight annotator, because it needs to change the
> SOFA: it needs to replace the markup with plain text.
> Or should I have it create a new view of the document, so we maintain
> a markup view and a plain text view of the document? This seems weird,
> considering I will never care about the markup view again. Also, how
> would I make sure the other annotators (which I won't be coding
> myself) operate on the plain text view of the document rather than the
> markup view?
> Thanks, Dave
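
For anyone who wants to see the idea in the question spelled out, here is a minimal sketch in plain Java of stripping markup while recording where each tag's content ends up in the stripped text. It is not the Tika MarkupAnnotator's implementation and it uses no UIMA or Tika classes; HtmlSpans, Span, Result and strip are hypothetical names made up for this example. It assumes reasonably well-formed input and ignores character entities, script/style content, and comments that contain '>'.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of UIMA or Tika.
class HtmlSpans {

    /** A tag name plus the [begin, end) offsets it covers in the stripped text. */
    static final class Span {
        final String tag;
        final int begin;
        final int end;
        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    /** The stripped text together with the spans found in it. */
    static final class Result {
        final String text;
        final List<Span> spans;
        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    // Elements that never have content, so they are not tracked as spans.
    private static final Set<String> VOID_TAGS =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    static Result strip(String html) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<Span> open = new ArrayDeque<>(); // stack of open tags; 'end' is unused here
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) { // ordinary character, or a '<' with no matching '>'
                text.append(c);
                i++;
                continue;
            }
            String body = html.substring(i + 1, close).trim();
            i = close + 1;
            boolean closing = body.startsWith("/");
            boolean selfClosing = body.endsWith("/");
            if (closing) {
                body = body.substring(1).trim();
            }
            String name = body.split("[\\s/]", 2)[0].toLowerCase(Locale.ROOT);
            if (name.isEmpty() || name.startsWith("!") || name.startsWith("?")) {
                continue; // comment, doctype, or processing instruction: drop it
            }
            if (!closing) {
                if (!selfClosing && !VOID_TAGS.contains(name)) {
                    open.push(new Span(name, text.length(), -1));
                }
                continue;
            }
            // Closing tag: only act if a matching open tag is on the stack.
            boolean known = false;
            for (Span s : open) {
                if (s.tag.equals(name)) {
                    known = true;
                    break;
                }
            }
            // Pop up to and including the match; unclosed inner tags end here too.
            while (known && !open.isEmpty()) {
                Span s = open.pop();
                spans.add(new Span(s.tag, s.begin, text.length()));
                if (s.tag.equals(name)) {
                    break;
                }
            }
        }
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = strip("<p>Hello <em>world</em>, see <a href=\"x\">this</a>.<br/></p>");
        System.out.println(r.text);
        for (Span s : r.spans) {
            System.out.println(s.tag + " [" + s.begin + ", " + s.end + ")");
        }
    }
}
```

In a UIMA pipeline, one plausible way to use something like this would be to set the stripped text as the document text of a new plain-text view and add one annotation per span to that view, so that downstream annotators mapped to that view see offsets that line up with the text they process.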
