uima-user mailing list archives

From Nicolas Hernandez <nicolas.hernan...@gmail.com>
Subject Re: Stripping HTML but maintaining annotations for tags
Date Wed, 20 Jun 2012 09:44:11 GMT
Hi David

The XML2CAS component of the uima-connectors project [1,2] does that
too, in a similar way to the Tika MarkupAnnotator. You can also specify
the input and output views.
The major differences are:
  * XML2CAS works only with XML, but it lets you specify which XML
tags you want to turn into annotations in your CAS. The created
annotations also have a finer type structure (for example, annotations
are created for both XML elements and attributes, all being
  * MarkupAnnotator can handle HTML if you add the TagSoup parser jar
[3] to the classpath.
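
To make the general idea concrete, here is a minimal sketch in plain
Python. It is not UIMA code and not how XML2CAS or MarkupAnnotator are
implemented; it only illustrates stripping the markup while recording,
for each tag of interest, the span it covers in the resulting plain text.
The names TagStripper, strip_markup and keep_tags are made up for this
example, and tags that are never closed are simply dropped.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects plain text and records (tag, begin, end) spans over that text."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)  # tags to turn into annotations
        self.text_parts = []             # plain-text fragments seen so far
        self.length = 0                  # current offset in the plain text
        self.open_tags = []              # stack of (tag, begin offset)
        self.annotations = []            # finished (tag, begin, end) spans

    def handle_starttag(self, tag, attrs):
        # Remember where a tag of interest starts in the plain text.
        if tag in self.keep_tags:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened tag with the same name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, begin = self.open_tags.pop(i)
                self.annotations.append((tag, begin, self.length))
                break

    def handle_data(self, data):
        # Only character data contributes to the plain text and its offsets.
        self.text_parts.append(data)
        self.length += len(data)


def strip_markup(html, keep_tags=("a", "p", "em")):
    """Return (plain_text, annotations) for the given HTML string."""
    parser = TagStripper(keep_tags)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Sort by begin offset, with enclosing spans before the spans they contain.
    spans = sorted(parser.annotations, key=lambda a: (a[1], -a[2]))
    return "".join(parser.text_parts), spans


if __name__ == "__main__":
    text, spans = strip_markup("<p>Hello <em>world</em>, see <a href='x'>this</a>.</p>")
    print(text)
    for tag, begin, end in spans:
        print(tag, begin, end, repr(text[begin:end]))
```

In a UIMA pipeline, the plain text would presumably become the document
text of a separate view, and each recorded span would become an
annotation in that view, so that downstream annotators only see the
plain text.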


[1] http://code.google.com/p/uima-common/downloads/detail?name=uima-common-v120111.jar
[2] http://code.google.com/p/uima-connectors/downloads/detail?name=uima-connectors-v111205.jar
[3] http://ccil.org/~cowan/XML/tagsoup/

On Tue, Jun 19, 2012 at 5:49 AM, Greg Holmberg <holmberg2066@comcast.net> wrote:
> Hi Dave--
> The Tika MarkupAnnotator does this.
> http://uima.apache.org/sandbox.html#tika.annotator
> Greg Holmberg
>> Hi there,
>> I would like to create a pipeline that starts with HTML markup. I need
>> to strip this to plain text, so it can be processed by different
>> annotators, like POS, chunking, entity detection, etc. However I would
>> also like to keep track of which regions correspond to the original
>> html tags, like links, paragraphs, em, etc. Basically I would like a
>> final annotator that takes advantage of structural annotations (from
>> html) and semantic annotations (from the other components), all at
>> once.
>> So, I can imagine starting off with a component that strips the html
>> markup and adds annotations to keep track of the tags I am interested
>> in. Does such a component exist already? It seems like something a lot
>> of people would want.
>> If I do need to create it from scratch, what kind of component is it?
>> It's not just a straight annotator, because it needs to change the
>> SOFA: it needs to replace the markup with plain text.
>> Or should I have it create a new view of the document, so we maintain
>> a markup view and a plain text view of the document? This seems weird,
>> considering I will never care about the markup view again. Also, how
>> would I make sure the other annotators (which I won't be coding
>> myself) operate on the plain text view of the document rather than the
>> markup view?
>> Thanks, Dave

Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
