manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Document components
Date Wed, 26 Nov 2014 00:02:36 GMT
See CONNECTORS-1115.  I looked into this; it looked relatively easy to add a
method to IProcessActivity that does what you request.  Please give it a
try and let me know how it works for you.

Thanks,
Karl


On Tue, Nov 25, 2014 at 6:22 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Markus,
>
> >>>>>>
> noDocument() removes the document or the specified component from the
> output but keeps track of the version in the status queue. The decision of
> not indexing the document/component is considered persistent as long as the
> version string does not change.
>
> deleteDocument() removes the document and all its components from output
> and the status queue. The decision of not indexing the document will have
> to be made again when the document is processed the next time (version
> string is irrelevant)
>
> removeDocument() removes the primary document from the output and from the
> status queue but keeps components in the output. The decision of not
> indexing the document will have to be made again when the document is
> processed the next time (version string is irrelevant)
>
> Is this correct?
> <<<<<<
>
> Yes.
>
> >>>>>>
> The scenario is indexing documents with embedded documents. The embedded
> documents are ingested as components.
>
> We assume a document with multiple components was ingested. For the next
> processing the version does not change.
> So the whole document should not be refetched.
> But how can I prevent the deletion of the components when the document is
> not re-fetched?
> I saw the method "retainDocument", which seems to be the way to go, but the
> problem is that without fetching the document
> I have no knowledge of the available components.
> Is there any other way to retain all components without knowing them?
> <<<<<<
>
> Not at present; the assumption for components is that the processing of a
> primary document will allow your connector to determine disposition of all
> components of the primary document every time that processDocuments() is
> called for it.  Effectively that means that the assumption is that
> determining what components are in a document is a relatively inexpensive
> operation.  It's necessary to make that assumption, because that's the only
> way the bookkeeping can work - MCF needs to know what happens with the
> components, when all it has is a processDocuments() call.  I'll look into
> how hard it would be to add the functionality you are looking for though,
> and get back to you.
>
> >>>>>>
> About a patch for a Test Connector:
> I think I could contribute something.
> Do you have general requirements/guidelines for test connectors?
> Are there examples of a similar test connector?
> <<<<<<
>
> Look at
> framework/pull-agent/src/test/java/org/apache/manifoldcf/crawler/tests.
> There are a number of test connectors there, and tests that use them.
>
> Thanks,
> Karl
>
>
> On Tue, Nov 25, 2014 at 5:53 PM, Markus Schuch <markus_schuch@web.de>
> wrote:
>
>> Hi Karl,
>>
>> thanks for the clarification about primary document disposition.
>>
>> I'm still not 100% sure I understand the differences, so I'll try to
>> explain them in my own words:
>>
>> noDocument() removes the document or the specified component from the
>> output but keeps track of the version in the status queue. The decision of
>> not indexing the document/component is considered persistent as long as the
>> version string does not change.
>>
>> deleteDocument() removes the document and all its components from output
>> and the status queue. The decision of not indexing the document will have
>> to be made again when the document is processed the next time (version
>> string is irrelevant)
>>
>> removeDocument() removes the primary document from the output and from
>> the status queue but keeps components in the output. The decision of not
>> indexing the document will have to be made again when the document is
>> processed the next time (version string is irrelevant)
>>
>> Is this correct?
>>
>> -----------------------------
>>
>> A new question I have:
>>
>> The scenario is indexing documents with embedded documents. The embedded
>> documents are ingested as components.
>>
>> We assume a document with multiple components was ingested. For the next
>> processing the version does not change.
>> So the whole document should not be refetched.
>> But how can I prevent the deletion of the components when the document is
>> not re-fetched?
>> I saw the method "retainDocument", which seems to be the way to go, but
>> the problem is that without fetching the document
>> I have no knowledge of the available components.
>> Is there any other way to retain all components without knowing them?
>>
>> ----------------------------
>>
>> About a patch for a Test Connector:
>> I think I could contribute something.
>> Do you have general requirements/guidelines for test connectors?
>> Are there examples of a similar test connector?
>>
>> Regards,
>> Markus
>>
>
>
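The three dispositions confirmed in the thread above can be sketched as a
small in-memory model. This is only an illustration of the described
bookkeeping, not ManifoldCF code: the class DispositionModel, its method
signatures, the two maps, and the "docId#component" key scheme are all
assumptions made for this sketch. The real calls live on IProcessActivity
and take different arguments.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not the ManifoldCF API) of the three dispositions.
class DispositionModel {
    // Key is "docId" for the primary document, "docId#component" for a component.
    private final Map<String, String> output = new HashMap<>();      // key -> indexed content
    private final Map<String, String> statusQueue = new HashMap<>(); // key -> version string

    private static String key(String docId, String component) {
        return component == null ? docId : docId + "#" + component;
    }

    // Index a document or component and record its version.
    void ingest(String docId, String component, String version, String content) {
        output.put(key(docId, component), content);
        statusQueue.put(key(docId, component), version);
    }

    // Drop the document or component from the output, but keep tracking its
    // version, so the "do not index" decision persists until the version changes.
    void noDocument(String docId, String component, String version) {
        output.remove(key(docId, component));
        statusQueue.put(key(docId, component), version);
    }

    // Drop the primary document and all of its components from both the
    // output and the status queue.
    void deleteDocument(String docId) {
        String prefix = docId + "#";
        output.keySet().removeIf(k -> k.equals(docId) || k.startsWith(prefix));
        statusQueue.keySet().removeIf(k -> k.equals(docId) || k.startsWith(prefix));
    }

    // Drop only the primary document from the output and the status queue;
    // components stay in the output.
    void removeDocument(String docId) {
        output.remove(docId);
        statusQueue.remove(docId);
    }

    boolean isIndexed(String docId, String component) {
        return output.containsKey(key(docId, component));
    }

    // Returns the tracked version string, or null if nothing is tracked.
    String trackedVersion(String docId, String component) {
        return statusQueue.get(key(docId, component));
    }
}
```

In this model, calling noDocument() for a component leaves a tracked version
behind, removeDocument() on the primary leaves its components indexed, and
deleteDocument() clears everything recorded for that document identifier.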
