uima-user mailing list archives

From "Pablo Duboue" <pablo.dub...@gmail.com>
Subject Re: A question about HTML reader component
Date Fri, 24 Aug 2007 16:50:18 GMT
Hi Chengmin,

The blank lines you refer to are there by design, and they are easy to
remove. The detagger keeps a list of "non-paragraph separating tags";
any other tag is treated as delimiting chunks of text, hence the added
blank lines. There is no reason that behavior can't be parameterized,
though.
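
To illustrate the idea, here is a minimal sketch in Python. It is not the
HTMLCommon detagger's actual code; the names INLINE_TAGS, Detagger, and
detag, and the contents of the tag list, are all assumptions made up for
this example. It only shows how a configurable list of non-separating tags
and a configurable separator could work together.

```python
from html.parser import HTMLParser

# Hypothetical default list of "non-paragraph separating tags".
# The real detagger's list is not given in this thread.
INLINE_TAGS = frozenset({"a", "b", "i", "em", "strong", "span", "font", "u"})


class Detagger(HTMLParser):
    """Collect text into chunks; any tag not in inline_tags starts a new chunk."""

    def __init__(self, inline_tags=INLINE_TAGS):
        super().__init__()
        self.inline_tags = frozenset(inline_tags)
        self.chunks = [[]]  # each chunk is a list of raw text pieces

    def _maybe_break(self, tag):
        # Tag names arrive lowercased from HTMLParser.
        if tag not in self.inline_tags:
            self.chunks.append([])

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.chunks[-1].append(data)


def detag(html, inline_tags=INLINE_TAGS, separator="\n\n"):
    """Strip tags from html and join the non-empty text chunks with separator."""
    parser = Detagger(inline_tags)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each chunk, then drop empty chunks.
    texts = (" ".join("".join(pieces).split()) for pieces in parser.chunks)
    return separator.join(t for t in texts if t)


table = "<table><tr><td>Name</td><td><b>Acme</b> Corp</td></tr></table>"
print(repr(detag(table)))                  # cells separated by a blank line
print(repr(detag(table, separator="\n")))  # cells separated by one newline
```

With the default separator each table cell ends up in its own
blank-line-delimited chunk, which is the behavior described above. Passing
separator="\n", or a different inline_tags set, is the kind of
parameterization that would avoid the extra blank lines. Note that this
sketch does not skip the contents of script or style elements.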

If you want to join the (IBM internal) project, please stop by the
Community Source w3 site.

Best regards,

Pablo

On 8/24/07, Chengmin Ding <chengmin.ding@gmail.com> wrote:
> Hi, Folks,
>
> We have been using UIMA to mine data points from some documents in plain
> text format and our AE worked fine. But recently those documents are
> delivered in HTML format (i.e. with a bunch of HTML tags mixed in) and our
> AEs can no longer mine the data correctly. Our question is whether there
> is any HTML Collection Reader component or library already available so we
> do not need to reinvent the wheel?
>
> We tried an HTMLCommon collection reader, but it looks like it cannot parse
> a table correctly. It often adds many blank lines between table cells/rows,
> which confuses our AE.
>
> Any help is highly appreciated.
>
> Thanks
>
> -Chengmin
>
