Reply-To: nicolas.hernandez@univ-nantes.fr
In-Reply-To: <op.wf4swyxy303kzn@oreo.holmberg>
References: 
 <CAMPsBOwxR_NMHKZM4-QdizCcrhM5=cu2=EyJ0B0S3ZN5yZRv9Q@mail.gmail.com>
 <op.wf4swyxy303kzn@oreo.holmberg>
From: Nicolas Hernandez <nicolas.hernandez@gmail.com>
Date: Wed, 20 Jun 2012 11:44:11 +0200
Message-ID: 
 <CAB7Now45sWCwtOLLisy4sZP-j327FbKpzOfiKssg83iT8gGaBw@mail.gmail.com>
Subject: Re: Stripping HTML but maintaining annotations for tags
To: user@uima.apache.org
Cc: David Milne <d.n.milne@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi David

The XML2CAS component of the uima-connectors project [1,2] does this
too, in a way similar to the Tika MarkupAnnotator. You can also specify
the input and output views.
The major differences are:
  * XML2CAS works only with XML, but it lets you specify which XML
tags you want to turn into annotations in your CAS. The created
annotations also have a finer-grained type structure (for example,
annotations are created for both XML elements and attributes, and
they are all linked to one another).
  * MarkupAnnotator can handle HTML if you add the TagSoup parser jar
[3] to the classpath.
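
To make the general idea concrete, here is a minimal sketch of what such a
component does: strip the markup, keep the plain text, and record which
character offsets of the plain text each tag of interest covered. It is only
an illustration written in Python with the standard html.parser module, not
the actual code of XML2CAS or the MarkupAnnotator, and the names TagSpan,
MarkupStripper, strip_markup and keep_tags are made up for the example. In a
UIMA pipeline the text would become the document text of the output view and
each span would become an annotation in that view.

```python
from html.parser import HTMLParser
from typing import NamedTuple


class TagSpan(NamedTuple):
    """Hypothetical stand-in for a structural annotation."""
    tag: str     # element name, e.g. "a" or "em"
    begin: int   # offset of the first covered character in the plain text
    end: int     # offset just past the last covered character
    attrs: dict  # attributes of the opening tag


class MarkupStripper(HTMLParser):
    """Collects plain text and the spans covered by selected tags."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.chunks = []      # pieces of plain text, in document order
        self.length = 0       # length of the plain text seen so far
        self.open_tags = []   # stack of (tag, begin, attrs) not yet closed
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.open_tags.append((tag, self.length, dict(attrs)))

    def handle_endtag(self, tag):
        # Close the most recently opened element with this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                name, begin, attrs = self.open_tags.pop(i)
                self.spans.append(TagSpan(name, begin, self.length, attrs))
                break

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def strip_markup(html, keep_tags):
    """Return (plain_text, spans); elements that are never closed are dropped."""
    parser = MarkupStripper(keep_tags)
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    text = "".join(parser.chunks)
    spans = sorted(parser.spans, key=lambda s: (s.begin, -s.end))
    return text, spans


if __name__ == "__main__":
    sample = '<p>See <a href="http://uima.apache.org">UIMA</a> for <em>details</em>.</p>'
    text, spans = strip_markup(sample, keep_tags={"p", "a", "em"})
    print(text)
    for span in spans:
        print(span.tag, span.begin, span.end, repr(text[span.begin:span.end]))
```

Downstream annotators (POS, chunking, entities) would then run on the plain
text only, and a final step can compare their offsets with the recorded spans.
The sketch keeps the content of script and style elements as text, which a
real component would probably want to filter out.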

Best

[1] http://code.google.com/p/uima-common/downloads/detail?name=uima-common-v120111.jar
[2] http://code.google.com/p/uima-connectors/downloads/detail?name=uima-connectors-v111205.jar
[3] http://ccil.org/~cowan/XML/tagsoup/

On Tue, Jun 19, 2012 at 5:49 AM, Greg Holmberg <holmberg2066@comcast.net> wrote:
> Hi Dave--
>
> The Tika MarkupAnnotator does this.
>
> http://uima.apache.org/sandbox.html#tika.annotator
>
> Greg Holmberg
>
>
>> Hi there,
>>
>> I would like to create a pipeline that starts with HTML markup. I need
>> to strip this to plain text, so it can be processed by different
>> annotators, like POS, chunking, entity detection, etc. However I would
>> also like to keep track of which regions correspond to the original
>> html tags, like links, paragraphs, em, etc. Basically I would like a
>> final annotator that takes advantage of structural annotations (from
>> html) and semantic annotations (from the other components), all at
>> once.
>>
>> So, I can imagine starting off with a component that strips the html
>> markup and adds annotations to keep track of the tags I am interested
>> in. Does such a component exist already? It seems like something a lot
>> of people would want.
>>
>> If I do need to create it from scratch, what kind of component is it?
>> It's not just a straight annotator, because it needs to change the
>> SOFA: it needs to replace the markup with plain text.
>>
>> Or should I have it create a new view of the document, so we maintain
>> a markup view and a plain text view of the document? This seems weird,
>> considering I will never care about the markup view again. Also, how
>> would I make sure the other annotators (which I won't be coding
>> myself) operate on the plain text view of the document rather than the
>> markup view?
>>
>> Thanks, Dave


-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67