manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-916) Amazon CloudSearch output connector
Date Tue, 20 May 2014 14:22:38 GMT


Karl Wright commented on CONNECTORS-916:

bq. But why do we need to keep the entire document? I thought that once a job has sent some
documents successfully, MCF does not need to keep them any more (so MCF deletes the document
data from disk at the end of notifyOfJobCompletion()).

As you pointed out, errors can occur when uploading documents to Amazon in batches.  If
the connector accumulates documents while telling ManifoldCF that each one was accepted,
there is no way to force ManifoldCF to resend any document to the connector if the upload
to Amazon fails later.

But if the connector keeps a local file-based image of what should be sent to Amazon, and
tries to update Amazon at the end of each job run, then the upload can be retried many times
without any loss of data.  The rule is that the connector must keep around *all* of the data
in the chunk that was refused by Amazon, and allow that data to be partially replaced in the
next crawl.  It is also really important that any Amazon errors be reported well enough that
someone can figure out which document caused the upload to Amazon to fail, and why, so that
the problem can be fixed.
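
To make the rule above concrete, here is a minimal sketch of the idea, not the actual connector
code.  All names in it (PendingChunkStore, BatchUploader, put, flush) are hypothetical and are
not part of the ManifoldCF connector API or the Amazon SDK; the uploader is just a stand-in for
whatever call performs the batch upload.  It assumes one file per document in a local
directory, with the file name derived from the document identifier.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical file-based image of the documents still waiting to be uploaded. */
class PendingChunkStore {

  /** Stand-in for the batch upload; throws IOException when the chunk is refused. */
  interface BatchUploader {
    void upload(List<Path> documentFiles) throws IOException;
  }

  private final Path dir;

  PendingChunkStore(Path dir) throws IOException {
    this.dir = Files.createDirectories(dir);
  }

  /** Stores a document, replacing any pending copy left over from an earlier crawl. */
  void put(String documentId, String body) throws IOException {
    Files.write(fileFor(documentId), body.getBytes(StandardCharsets.UTF_8));
  }

  /** Number of documents currently waiting to be uploaded. */
  int pendingCount() throws IOException {
    return pendingFiles().size();
  }

  /**
   * Intended to run at the end of a job.  Files are deleted only after the
   * uploader returns normally; if the chunk is refused, every file is kept
   * so that the same chunk can be retried after the next job run.
   */
  boolean flush(BatchUploader uploader) throws IOException {
    List<Path> files = pendingFiles();
    if (files.isEmpty()) {
      return true;
    }
    try {
      uploader.upload(files);
    } catch (IOException e) {
      // Report the failure; the uploader's message should say which document failed and why.
      System.err.println("Upload of " + files.size() + " document(s) refused: " + e.getMessage());
      return false;
    }
    for (Path file : files) {
      Files.delete(file);
    }
    return true;
  }

  private List<Path> pendingFiles() throws IOException {
    try (Stream<Path> stream = Files.list(dir)) {
      return stream.sorted().collect(Collectors.toList());
    }
  }

  /** Hex-encodes the identifier so that any URL maps to a safe, unique file name. */
  private Path fileFor(String documentId) {
    StringBuilder name = new StringBuilder();
    for (byte b : documentId.getBytes(StandardCharsets.UTF_8)) {
      name.append(String.format("%02x", b));
    }
    return dir.resolve(name + ".doc");
  }
}
```

With something like this, the connector can tell ManifoldCF that a document was accepted as
soon as put() returns, because the data stays on disk until a flush() actually succeeds.  A
real implementation would also have to deal with deletions, very long identifiers, and
concurrent access, none of which this sketch attempts.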

> Amazon CloudSearch output connector
> -----------------------------------
>                 Key: CONNECTORS-916
>                 URL:
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Takumi Yoshida
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>         Attachments: 0507.diff, 0520.diff, 0520_2.diff, 1.patch, 2.diff, 3.diff,,, exception_handling.diff, exception_handling_2.diff, licenselist.txt
> I wrote some code snippets for an output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl a web site and feed HTML pages to Amazon,
> but it is not complete, for the following reasons:
> - no code has been written yet for the configuration page.
> - the only supported file type is HTML.
> Thank you for your time,
>  Takumi Yoshida

This message was sent by Atlassian JIRA
