manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re:
Date Thu, 21 Feb 2019 08:05:31 GMT
Yes, I would separate the work of transforming documents from the work of
fetching them.

Karl
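
The separation described above can be illustrated with a minimal Java sketch. It deliberately does not use the ManifoldCF connector API: `PipelineSketch`, `Fetcher`, `Sink`, `run`, and `stripTags` are hypothetical names invented for this illustration, and the regex-based tag stripping is a toy stand-in for real HTML extraction. The point is only that the fetch stage returns raw content and never parses it, so the transform step can be replaced without touching the fetching code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class PipelineSketch {
    /** Hypothetical fetch stage: returns raw document bodies and does no parsing. */
    interface Fetcher {
        List<String> fetch();
    }

    /** Hypothetical sink standing in for the output (for example Solr) stage. */
    interface Sink {
        void index(String document);
    }

    /** Wires the three stages together; each stage knows nothing about the others. */
    static void run(Fetcher fetcher, Function<String, String> transformer, Sink sink) {
        for (String raw : fetcher.fetch()) {
            sink.index(transformer.apply(raw));
        }
    }

    /** Toy transformer: replaces tags with spaces and collapses whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        List<String> indexed = new ArrayList<>();
        // A canned fetcher stands in for a connector that calls a REST API.
        Fetcher fake = () -> Arrays.asList("<p>Hello <b>world</b></p>");
        run(fake, PipelineSketch::stripTags, indexed::add);
        System.out.println(indexed);
    }
}
```

Because `run` takes the transformer as a parameter, a stronger scraping step can be substituted later while the fetcher and sink stay unchanged, which is the same division of labour as repository connection, transformation connection, and output connection in a job pipeline.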


On Wed, Feb 20, 2019 at 9:46 PM Kayak28 <kaya.ota.oss@gmail.com> wrote:

> Hello, Mr. Karl Wright:
>
> Thank you for the quick response.
> As you suggested, I am writing my own Repository Connector to access
> the REST API I want to use.
>
> If I need to do more scraping than the provided html-extractor supports,
> then I should write a transformer connector that does what I want.
> Is that right?  And it is not a good idea to do the scraping in my
> Repository Connector, is it?
>
> Again, I appreciate your replies to these basic questions.
>
> Sincerely,
> Kaya
>
>
> On Thu, Feb 21, 2019 at 11:26 Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Kaya,
>>
>> You should be able to use the existing Solr connector to index documents
>> into Solr.
>> You will probably need to write a Repository connector to access the REST
>> API you describe.
>> If the kind of scraping you need to do can be covered by the
>> html-extractor transformer in its current form, then you can insert it into
>> the pipeline between the other two connections and you should be all set.
>>
>> Karl
>>
>>
>> On Wed, Feb 20, 2019 at 9:17 PM Kayak28 <kaya.ota.oss@gmail.com> wrote:
>>
>>> Hello, folks:
>>>
>>> I have a question about crawling and scraping in ManifoldCF.
>>> I want to perform the following sequence of tasks using MCF:
>>>
>>> 1. crawl data from a RESTful API
>>> 2. scrape the data
>>> 3. insert the data into Apache Solr
>>>
>>> In this case, is this how I need to set up ManifoldCF?
>>> 1. define an output connector to access the RESTful API (using the Web
>>> crawler connector or the Generic connector?)
>>>
>>> 2. define a transformer connector to scrape the HTML (using the
>>> html-extractor transformer connector?)
>>> 3. define the output connector to be Solr
>>>
>>>
>>> Or do I have to use other software, such as Apache NiFi, to control the
>>> sequence of these tasks?
>>>
>>> I appreciate any comments and replies.
>>>
>>> Sincerely,
>>> Kaya
>>>
>>>
>>>
