Hi Dave--
The Tika MarkupAnnotator does this.
http://uima.apache.org/sandbox.html#tika.annotator
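
If you do end up rolling your own, the core idea is small. Below is a rough sketch in plain Java of the "strip the markup but remember where the tags were" step. It is not the Tika annotator's code or API; the names MarkupStripper, Span, Result and strip are made up for illustration, and the parsing is deliberately naive (no entity decoding, no handling of comments or scripts that contain '>').

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of UIMA or Tika).
final class MarkupStripper {

    /** A region of the plain text that was enclosed by a tag in the markup. */
    public static final class Span {
        public final String tag;
        public final int begin; // inclusive offset into the plain text
        public final int end;   // exclusive offset into the plain text

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    /** The stripped text plus the tag regions expressed as plain-text offsets. */
    public static final class Result {
        public final String text;
        public final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    public static Result strip(String html) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<Span> open = new ArrayDeque<>(); // open tags, most recent first
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character (or a stray '<' with no '>'): keep it.
                text.append(c);
                i++;
                continue;
            }
            String body = html.substring(i + 1, close).trim();
            i = close + 1;
            // Skip empty tags, self-closing tags, comments and doctypes.
            if (body.isEmpty() || body.endsWith("/") || body.startsWith("!")) {
                continue;
            }
            boolean closing = body.startsWith("/");
            String name = (closing ? body.substring(1) : body)
                    .trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
            if (!closing) {
                // Remember where this element starts in the plain text.
                open.push(new Span(name, text.length(), -1));
                continue;
            }
            // Closing tag: pair it with the nearest open tag of the same name.
            for (Iterator<Span> it = open.iterator(); it.hasNext();) {
                Span s = it.next();
                if (s.tag.equals(name)) {
                    it.remove();
                    spans.add(new Span(name, s.begin, text.length()));
                    break;
                }
            }
        }
        return new Result(text.toString(), spans);
    }
}
```

Calling MarkupStripper.strip on a page gives you the plain text plus a list of (tag, begin, end) regions whose offsets index into that plain text. In UIMA terms, the text would presumably become the document text of a separate plain-text view and each region an annotation in that view; pointing the third-party annotators at that view is normally done with sofa mappings in the aggregate descriptor rather than in their code.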
Greg Holmberg
> Hi there,
>
> I would like to create a pipeline that starts with HTML markup. I need
> to strip this to plain text, so it can be processed by different
> annotators, like POS, chunking, entity detection, etc. However I would
> also like to keep track of which regions correspond to the original
> html tags, like links, paragraphs, em, etc. Basically I would like a
> final annotator that takes advantage of structural annotations (from
> html) and semantic annotations (from the other components), all at
> once.
>
> So, I can imagine starting off with a component that strips the html
> markup and adds annotations to keep track of the tags I am interested
> in. Does such a component exist already? It seems like something a lot
> of people would want.
>
> If I do need to create it from scratch, what kind of component is it?
> It's not just a straight annotator, because it needs to change the
> SOFA: it needs to replace the markup with plain text.
>
> Or should I have it create a new view of the document, so we maintain
> a markup view and a plain text view of the document? This seems weird,
> considering I will never care about the markup view again. Also, how
> would I make sure the other annotators (which I won't be coding
> myself) operate on the plain text view of the document rather than the
> markup view?
>
> Thanks, Dave