manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re:
Date Thu, 21 Feb 2019 02:25:41 GMT
Hi Kaya,

You should be able to use the existing Solr connector to index documents
into Solr.
You will probably need to write a Repository connector to access the REST
API you describe.
If the kind of scraping you need to do can be covered by the html-extractor
transformer in its current form, then you can insert it into the pipeline
between the other two connections and you should be all set.

Karl
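
The ordering described in the reply above (a repository connection feeding a transformation step, which feeds an output connection) can be illustrated with a minimal, self-contained sketch. This is not the ManifoldCF connector API: the function names, the hard-coded payload, and the in-memory index are all hypothetical stand-ins, and only the Python standard library is used. It shows only how the three stages hand documents to one another.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fetch_documents():
    """Stage 1 (repository side): hypothetical stand-in for REST API calls."""
    # Hard-coded payload for illustration; a real connector would query the API.
    return [
        {
            "id": "doc-1",
            "html": "<html><body><h1>Title</h1><p>Body text</p></body></html>",
        }
    ]


def extract_text(document):
    """Stage 2 (transformation): reduce the HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(document["html"])
    parser.close()
    return {"id": document["id"], "text": " ".join(parser.chunks)}


def index_documents(documents, index):
    """Stage 3 (output side): hypothetical stand-in for sending to Solr."""
    for doc in documents:
        index[doc["id"]] = doc["text"]
    return index


def run_pipeline():
    """Run the three stages in order: fetch, transform, index."""
    extracted = [extract_text(d) for d in fetch_documents()]
    return index_documents(extracted, {})
```

In ManifoldCF itself, the equivalent wiring is done by defining the connections and ordering them in a job's pipeline rather than by writing glue code like this.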


On Wed, Feb 20, 2019 at 9:17 PM Kayak28 <kaya.ota.oss@gmail.com> wrote:

> Hello, folks:
>
> I have a question about crawling and scraping in ManifoldCF.
> I want to perform the following sequence of tasks using MCF.
>
> 1. crawl data from a RESTful API
> 2. scrape the data
> 3. insert the data into Apache Solr
>
> In this case, is this how I need to set up ManifoldCF?
> 1. define an output connector to access the RESTful API (by using the Web
> crawler connector or the Generic connector?)
> 2. define a transformation connector to scrape the HTML (by using the
> html-extractor transformation connector...?)
> 3. define the output connector to be Solr
>
>
> Or do I have to use other software, such as Apache NiFi, to control the
> sequence of these tasks?
>
> I would appreciate any comments and replies.
>
> Sincerely,
> Kaya
>
>
>
