manifoldcf-dev mailing list archives

From Rafa Haro <rh...@apache.org>
Subject Re: Rejecting Documents at Repository Connectors
Date Fri, 08 Aug 2014 14:19:19 GMT
Hi Karl,

Thanks a lot for your response. Everything is clear now. I had a hunch
that the activities object was the way to go, but honestly I had not gone
through the documentation. My fault. I will take it into account now.

Cheers,
Rafa

On Friday, August 8, 2014, Karl Wright <daddywri@gmail.com> wrote:

> Hi Rafa,
>
> The processDocuments() method decides the disposition of each document it
> is handed.  Your connector is expected to call one of several different
> IProcessActivity methods depending on what the decision is.  See the 1.7
> Javadoc for IProcessActivity:
>
> * The processing flow for a document is expected to go something like this:
> * (1) The connector's processDocuments() method is called with a set of documents to be processed.
> * (2) The connector computes a version string for each document in the set as part of determining
> *    whether the document indeed needs to be refetched.
> * (3) For each document processed, there can be one of several dispositions:
> *   (a) There is no such document (anymore): deleteDocument() called for the document.
> *   (b) The document is (re)indexed: ingestDocumentWithException() is called for the document.
> *   (c) The document is determined to be unchanged and no updates are needed: nothing needs to be called
> *     for the document.
> *   (d) The document is determined to be unchanged BUT the version string needs to be updated: recordDocument()
> *     is called for the document.
> *   (e) The document is determined to be unindexable BUT it still exists in the repository: noDocument()
> *    is called for the document.
> *   (f) There was a service interruption: ServiceInterruption is thrown.
> * (4) In order to determine whether a document needs to be reindexed, the method checkDocumentNeedsReindexing()
> *    is available to return an opinion on that matter.
>
> This is not quite complete, because there is also a removeDocument() method
> available that is not described, but you get the idea.  So it doesn't make
> much sense for processDocuments() to also return results; essentially the
> processDocuments() method has to report each disposition already.
>
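
The disposition flow above can be sketched in a few lines of Java. This is
only an illustration: the Activities interface below is a cut-down,
hypothetical stand-in for the framework's activities object. The method
names come from the quoted Javadoc, but the parameter lists, the Doc class,
the "text/" rejection rule, and RecordingActivities are assumptions made so
the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: Activities is a cut-down, hypothetical stand-in for the
// framework's activities object. Method names follow the quoted Javadoc;
// the parameter lists are assumptions.
class DispositionSketch {

    interface Activities {
        boolean checkDocumentNeedsReindexing(String id, String version);
        void deleteDocument(String id);
        void noDocument(String id, String version);
        void ingestDocumentWithException(String id, String version, String content);
    }

    // Hypothetical repository entry: version string, MIME type, body.
    static final class Doc {
        final String version;
        final String mimeType;
        final String content;

        Doc(String version, String mimeType, String content) {
            this.version = version;
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    // Chooses at most one disposition call per document identifier.
    static void processDocuments(List<String> ids, Map<String, Doc> repository,
                                 Activities activities) {
        for (String id : ids) {
            Doc doc = repository.get(id);
            if (doc == null) {
                activities.deleteDocument(id);            // (a) no such document
            } else if (!activities.checkDocumentNeedsReindexing(id, doc.version)) {
                // (c) unchanged: nothing needs to be called
            } else if (!doc.mimeType.startsWith("text/")) {
                // (e) exists but unindexable (assumed rule: only text/* is indexable)
                activities.noDocument(id, doc.version);
            } else {
                // (b) new or changed: (re)index it
                activities.ingestDocumentWithException(id, doc.version, doc.content);
            }
        }
    }

    // Test double that records which disposition was chosen for each document.
    static final class RecordingActivities implements Activities {
        final List<String> calls = new ArrayList<>();
        private final Map<String, String> indexedVersions;

        RecordingActivities(Map<String, String> indexedVersions) {
            this.indexedVersions = indexedVersions;
        }

        public boolean checkDocumentNeedsReindexing(String id, String version) {
            // Reindex when no version is on record or the version changed.
            return !version.equals(indexedVersions.get(id));
        }

        public void deleteDocument(String id) {
            calls.add("delete " + id);
        }

        public void noDocument(String id, String version) {
            calls.add("noDocument " + id);
        }

        public void ingestDocumentWithException(String id, String version, String content) {
            calls.add("ingest " + id + " " + version);
        }
    }
}
```

The point of the sketch is that the connector reports a rejection by calling
noDocument() on the activities object, so processDocuments() itself has no
need for a return code.
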
> As for this question:
> >>>>>>
> Would it be reasonable to also generally extend the Transformation Connector
> and Output Connector interfaces to allow returning not only a
> rejection/acceptance code but also a Reason String Message?
> <<<<<<
>
> Well, the idea right now behind accept/reject is that it informs the
> framework whether to remove the document from the queue or not.  There's no
> place to record why it was removed from the queue, since it's no longer in
> the queue at all.  Instead, your repository, transformation, or output
> connector can record the basic reason for rejection in the history for the
> crawl.  For example, if it calls checkMimeTypeIndexable() and gets back a
> false result, it can record that the document was rejected because the mime
> type of XXX was not accepted by the downstream pipeline.  This will tell
> you what happened to the document, and roughly why.  Later, we could
> consider having the check methods return a status object rather than a
> boolean, so a more detailed message could be provided for history logging
> or connector log output.  If you open a ticket for this, it would probably
> need to wait until 2.0 though.
>
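
A minimal sketch of the approach described above: check the MIME type, and
on rejection write the reason to the crawl history before marking the
document as not indexable. checkMimeTypeIndexable() and noDocument() are
named in this thread, but their parameter lists here are assumptions, and
recordHistory() is a hypothetical stand-in for whatever history call the
framework actually provides. InMemoryActivities exists only so the example
runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a hypothetical, simplified activities interface.
class RejectionHistorySketch {

    interface Activities {
        boolean checkMimeTypeIndexable(String mimeType);
        void noDocument(String id, String version);
        // Hypothetical history call; the real framework method may differ.
        void recordHistory(String id, String resultCode, String description);
    }

    // Returns true when the document may proceed to indexing. On rejection,
    // records the reason in the history and reports the document as unindexable.
    static boolean acceptOrReject(String id, String version, String mimeType,
                                  Activities activities) {
        if (activities.checkMimeTypeIndexable(mimeType)) {
            return true;
        }
        activities.recordHistory(id, "REJECTED",
            "Mime type " + mimeType + " was not accepted by the downstream pipeline");
        activities.noDocument(id, version);
        return false;
    }

    // In-memory stand-in: accepts only text/* and keeps the history in a list.
    static final class InMemoryActivities implements Activities {
        final List<String> history = new ArrayList<>();
        final List<String> unindexable = new ArrayList<>();

        public boolean checkMimeTypeIndexable(String mimeType) {
            return mimeType.startsWith("text/");
        }

        public void noDocument(String id, String version) {
            unindexable.add(id);
        }

        public void recordHistory(String id, String resultCode, String description) {
            history.add(id + " " + resultCode + ": " + description);
        }
    }
}
```

With this shape the reason for rejection lives in the history record, not
in a return value, which matches the point that a removed document has no
queue entry left to carry a message.
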
> Hope this helps.
> Karl
>
>
>
> On Fri, Aug 8, 2014 at 8:59 AM, Rafa Haro <rharo@apache.org>
> wrote:
>
> > Hi devs,
> >
> > I have a quick question, more out of curiosity than anything else. In
> > Transformation Connectors and Output Connectors, we can return a code
> > such as DOCUMENTSTATUS_REJECTED from the main addDocument method to
> > indicate that the document has been rejected. I suppose that this code
> > is recorded by Manifold and that the user can later check for the
> > rejected documents. I am now facing a situation in a Repository
> > Connector I am extending where I have enough information about the
> > document to decide whether to reject it. But I have not found any way
> > within the framework to signal this rejection from a Repository
> > Connector.
> >
> > A couple of questions:
> >
> > - Would it be reasonable to extend the current Repository Connector
> > interface to allow returning a rejection or acceptance code from the
> > processDocuments method?
> >
> > - Would it be reasonable to also generally extend the Transformation
> > Connector and Output Connector interfaces to allow returning not only a
> > rejection/acceptance code but also a Reason String Message?
> >
> > Thanks all!!
> >
> >
>
