From: Nicolas Hernandez
Reply-To: nicolas.hernandez@univ-nantes.fr
Date: Wed, 20 Jun 2012 11:44:11 +0200
Subject: Re: Stripping HTML but maintaining annotations for tags
To: user@uima.apache.org
Cc: David Milne

Hi David,

The XML2CAS components of the uima-connectors project [1,2] do that too, in a
way similar to the Tika MarkupAnnotator. You can also specify the input and
the output views.

The major differences are:

* XML2CAS works only with XML, but it lets you specify which XML tags you want
  to turn into annotations in your CAS. The annotations it creates also have a
  finer type structure (for example, annotations are created for both XML
  elements and attributes, all interconnected).
* MarkupAnnotator can handle HTML if you add the TagSoup parser jar [3] to the
  classpath.

Best

[1] http://code.google.com/p/uima-common/downloads/detail?name=uima-common-v120111.jar
[2] http://code.google.com/p/uima-connectors/downloads/detail?name=uima-connectors-v111205.jar
[3] http://ccil.org/~cowan/XML/tagsoup/

On Tue, Jun 19, 2012 at 5:49 AM, Greg Holmberg wrote:
> Hi Dave--
>
> The Tika MarkupAnnotator does this.
>
> http://uima.apache.org/sandbox.html#tika.annotator
>
> Greg Holmberg
>
>
>> Hi there,
>>
>> I would like to create a pipeline that starts with HTML markup. I need
>> to strip this to plain text, so it can be processed by different
>> annotators, like POS, chunking, entity detection, etc. However I would
>> also like to keep track of which regions correspond to the original
>> html tags, like links, paragraphs, em, etc.
>> Basically I would like a
>> final annotator that takes advantage of structural annotations (from
>> html) and semantic annotations (from the other components), all at
>> once.
>>
>> So, I can imagine starting off with a component that strips the html
>> markup and adds annotations to keep track of the tags I am interested
>> in. Does such a component exist already? It seems like something a lot
>> of people would want.
>>
>> If I do need to create it from scratch, what kind of component is it?
>> It's not just a straight annotator, because it needs to change the
>> SOFA: it needs to replace the markup with plain text.
>>
>> Or should I have it create a new view of the document, so we maintain
>> a markup view and a plain text view of the document? This seems weird,
>> considering I will never care about the markup view again. Also, how
>> would I make sure the other annotators (which I won't be coding
>> myself) operate on the plain text view of the document rather than the
>> markup view?
>>
>> Thanks, Dave

--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
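
For readers of the archive, the idea discussed in this thread (strip the
markup, but keep offset annotations for the tags) can be illustrated without
any UIMA dependency. The following is only a minimal sketch in plain Java; it
is not how the Tika MarkupAnnotator or XML2CAS are implemented. The class
TagStripper and its Span and Result types are hypothetical names invented for
this illustration. It assumes well-formed, balanced markup (no unclosed
elements such as a bare <br>) and does not decode entities.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: strips tags and records where each element lands in the plain text. */
public class TagStripper {

    /** A tag name plus the [begin, end) span it covers in the stripped text. */
    public static final class Span {
        public final String tag;
        public final int begin;
        public final int end;

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return tag + "[" + begin + "," + end + ")";
        }
    }

    /** The stripped text together with the recorded spans, in closing order. */
    public static final class Result {
        public final String text;
        public final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    public static Result strip(String markup) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<String> openTags = new ArrayDeque<>();      // names of currently open elements
        Deque<Integer> openOffsets = new ArrayDeque<>();  // plain-text offset where each one started
        int i = 0;
        while (i < markup.length()) {
            char c = markup.charAt(i);
            int close = (c == '<') ? markup.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character (or a stray '<' with no closing '>'): copy it.
                text.append(c);
                i++;
                continue;
            }
            String inner = markup.substring(i + 1, close).trim();
            i = close + 1;
            if (inner.startsWith("/")) {
                // End tag: close the innermost open element if the names match.
                String name = inner.substring(1).trim().toLowerCase();
                if (!openTags.isEmpty() && openTags.peek().equals(name)) {
                    openTags.pop();
                    spans.add(new Span(name, openOffsets.pop(), text.length()));
                }
            } else if (!inner.isEmpty() && !inner.endsWith("/")
                    && Character.isLetter(inner.charAt(0))) {
                // Start tag: remember its name and the current plain-text offset.
                // Self-closing tags, comments and declarations are simply dropped.
                openTags.push(inner.split("\\s+")[0].toLowerCase());
                openOffsets.push(text.length());
            }
        }
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = strip("<p>See <a href=\"x\">this <em>link</em></a>.</p>");
        System.out.println(r.text);
        for (Span s : r.spans) {
            System.out.println(s + " -> " + r.text.substring(s.begin, s.end));
        }
    }
}
```

In a UIMA pipeline the stripped text would presumably go into a separate
plain-text view and each span would become an annotation in that view, which
is the two-view arrangement raised in the original question; that wiring is
left out here.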