uima-user mailing list archives

From "Chengmin Ding" <chengmin.d...@gmail.com>
Subject Re: A question about HTML reader component
Date Fri, 24 Aug 2007 18:00:03 GMT
Thank you Pablo for the prompt reply. I will check out the w3
community project and possibly participate in it. I think this HTML
detagging function is so useful that it deserves more participation.

-Chengmin

On 8/24/07, Pablo Duboue <pablo.duboue@gmail.com> wrote:
>
> Hi Chengmin,
>
> The blank lines you refer to are easy to remove and are there by
> design. The detagger has a list of "non-paragraph separating tags";
> any other tag is supposed to delimit chunks of text, thus the added
> blank lines. But there is no reason that behavior can't be
> parameterized.
>
> If you want to join the (IBM internal) project, please stop by the
> Community Source w3 site.
>
> Best regards,
>
> Pablo
>
> On 8/24/07, Chengmin Ding <chengmin.ding@gmail.com> wrote:
> > Hi, Folks,
> >
> > We have been using UIMA to mine data points from some documents in plain
> > text format, and our AE worked fine. But recently those documents have been
> > delivered in HTML format (i.e. with a bunch of HTML tags mixed in) and our
> > AEs can no longer mine the data correctly. Our question is whether there
> > is any HTML Collection Reader component or library already available so we
> > do not need to reinvent the wheel.
> >
> > We tried an HTMLCommon collection reader, but it looks like it cannot parse a
> > table correctly. It often adds many blank lines between table cells/rows,
> > which confuses our AE.
> >
> > Any help is highly appreciated.
> >
> > Thanks
> >
> > -Chengmin
> >
>
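
For readers following the thread, here is a minimal sketch of the behavior Pablo describes: tags on a "non-paragraph separating" list are dropped silently, every other tag is treated as a chunk boundary, and a flag decides whether that boundary becomes a blank line or a single space. This is not the HTMLCommon code. The class name `SimpleDetagger`, the method `detag`, the flag `blankLineAtBlockTags`, and the contents of the tag list are all assumptions made for illustration, and the regex-based tag matching is a simplification that ignores comments, entities, and scripts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the HTMLCommon detagger.
final class SimpleDetagger {

    // Assumed list of inline tags that do NOT separate paragraphs.
    private static final Set<String> NON_SEPARATING = new HashSet<>(
            Arrays.asList("a", "b", "i", "u", "em", "strong", "span", "font"));

    // Matches an opening or closing tag and captures its name.
    private static final Pattern TAG =
            Pattern.compile("<\\s*/?\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // blankLineAtBlockTags = true mimics the "blank lines by design" behavior;
    // false replaces each separating tag with a single space instead.
    static String detag(String html, boolean blankLineAtBlockTags) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());          // text before the tag
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (!NON_SEPARATING.contains(name)) {
                out.append(blankLineAtBlockTags ? "\n\n" : " ");
            }
            last = m.end();
        }
        out.append(html.substring(last));               // trailing text

        // Normalize whitespace: collapse runs of spaces/tabs, strip spaces
        // around newlines, and cap consecutive newlines at one blank line.
        String text = out.toString()
                .replaceAll("[ \\t]+", " ")
                .replaceAll(" ?\\n ?", "\n")
                .replaceAll("\\n{3,}", "\n\n");
        return text.trim();
    }
}
```

With the flag off, `SimpleDetagger.detag("<table><tr><td>A</td><td>B</td></tr></table>", false)` keeps both cells on one line, which is the kind of output an AE written for plain text is more likely to cope with. A production reader would use a real HTML parser rather than a regular expression.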
