uima-user mailing list archives

From "Chengmin Ding" <chengmin.d...@gmail.com>
Subject A question about HTML reader component
Date Fri, 24 Aug 2007 16:34:34 GMT
Hi, Folks,

We have been using UIMA to mine data points from plain-text documents, and
our AEs worked fine. Recently, however, those documents have been delivered
in HTML format (i.e. with a bunch of HTML tags mixed in), and our AEs can no
longer mine the data correctly. Is there an HTML Collection Reader component
or library already available, so that we do not need to reinvent the wheel?

We tried an HTMLCommon collection reader, but it looks like it cannot parse
tables correctly. It often adds many blank lines between table cells/rows,
which confuses our AEs.
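
To illustrate the kind of output we are after, here is a minimal sketch of
flattening an HTML table into one text line per row, with no blank lines in
between. The class name HtmlTableFlattener and the method flatten are
hypothetical; they are not part of UIMA or of the HTMLCommon reader. The
sketch uses only java.util.regex and is a crude tag stripper rather than a
real HTML parser: it assumes well-formed, non-nested tables and does not
decode entities such as &nbsp;.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not a UIMA class).
final class HtmlTableFlattener {
    // End of a table cell becomes a tab.
    private static final Pattern CELL_END = Pattern.compile("(?i)</t[dh]\\s*>");
    // End of a row or paragraph, or a <br>, becomes a line break.
    private static final Pattern LINE_BREAK =
            Pattern.compile("(?i)</tr\\s*>|</p\\s*>|<br\\s*/?>");
    // Every remaining tag is dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String flatten(String html) {
        // Whitespace in the HTML source is layout only, so collapse it first.
        String s = html.replaceAll("\\s+", " ");
        s = CELL_END.matcher(s).replaceAll("\t");
        s = LINE_BREAK.matcher(s).replaceAll("\n");
        s = ANY_TAG.matcher(s).replaceAll("");

        // Keep only non-empty lines, so no blank lines appear between rows.
        List<String> lines = new ArrayList<>();
        for (String line : s.split("\n")) {
            String cleaned = trimCells(line);
            if (!cleaned.isEmpty()) {
                lines.add(cleaned);
            }
        }
        return String.join("\n", lines);
    }

    // Trim each tab-separated cell and drop trailing tabs.
    private static String trimCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.split("\t")) {
            cells.add(cell.trim());
        }
        return String.join("\t", cells).replaceAll("\t+$", "");
    }
}
```

With something like this, a collection reader could call
HtmlTableFlattener.flatten(html) and put the result into the CAS as the
document text, so that each table row reaches the AEs as a single
tab-separated line.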

Any help would be highly appreciated.


