manifoldcf-dev mailing list archives

From Olivier Tavard <olivier.tav...@francelabs.com>
Subject MCF transformation connector contribution
Date Thu, 15 Mar 2018 10:35:16 GMT
Hello MCF community,

I developed a transformation connector based on Jsoup. The goal of this code is simply to
choose an encompassing tag in an HTML document for text extraction. Inside this tag, the
connector lets you remove the subparts that you do not want: for example, all tags of
declared types, or tags matching specific attribute names.
I would like to know whether this could interest you. The code is under the Apache License 2.0, and I
integrated it into our enterprise search solution (Datafari). This morning I integrated the code into a
fork of the MCF project on GitHub. Obviously it needs some work, including code refactoring, renaming
classes, and unit tests, which I can do if you are interested in the code.
The code is here: https://github.com/otavard/manifoldcf/tree/htmlextractorconnector <https://github.com/otavard/manifoldcf/commits/htmlextractorconnector>
The documentation is here: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector
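
To illustrate the extraction approach described above, here is a minimal Jsoup sketch. It is not the connector's actual code: the class name HtmlExtractSketch, the method extractText, and the CSS selectors are hypothetical and chosen only for this example. It assumes Jsoup 1.11.1 or later (for selectFirst) and Java 9 or later (for List.of).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

public class HtmlExtractSketch {

    // Hypothetical helper: keep only the text found inside the first element
    // matching includeSelector, after dropping every descendant that matches
    // one of the excludeSelectors.
    static String extractText(String html, String includeSelector, List<String> excludeSelectors) {
        Document doc = Jsoup.parse(html);

        // The encompassing tag; selectFirst returns null when nothing matches.
        Element root = doc.selectFirst(includeSelector);
        if (root == null) {
            return "";
        }

        // Remove unwanted subparts (by tag type, class, attribute, ...).
        for (String selector : excludeSelectors) {
            root.select(selector).remove();
        }

        // Visible text of what remains, with whitespace normalized by Jsoup.
        return root.text();
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=\"header\">Menu</div>"
            + "<div id=\"content\"><p>Main text.</p>"
            + "<script>var x = 1;</script>"
            + "<div class=\"ads\">Buy now</div></div></body></html>";

        // Keep the #content block, drop scripts and the ads block inside it.
        System.out.println(extractText(html, "#content", List.of("script", "div.ads")));
    }
}
```

In the connector itself, the encompassing tag and the exclusion rules would come from the job configuration rather than from hard-coded selectors.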

Best regards,

Olivier TAVARD


