Hi Markus,

>>>>>>
noDocument() removes the document or the specified component from the output but keeps track of the version in the status queue. The decision of not indexing the document/component is considered persistent as long as the version string does not change.

deleteDocument() removes the document and all its components from output and the status queue. The decision of not indexing the document will have to be made again when the document is processed the next time (version string is irrelevant)

removeDocument() removes the primary document from the output and from the status queue but keeps components in the output. The decision of not indexing the document will have to be made again when the document is processed the next time (version string is irrelevant)

Is this correct?
<<<<<<

Yes.

>>>>>>
The scenario is indexing documents with embedded documents. The embedded documents are ingested as components.

We assume a document with multiple components was ingested. For the next processing the version does not change.
So the whole document should not be refetched.
But how can I prevent the deletion of the components when the document is not re-fetched?
I saw the method "retainDocument", which seems to be the way to go, but the problem is that without fetching the document
I have no knowledge of the available components.
Is there any other way to retain all components without knowing them?
<<<<<<

Not at present. The assumption for components is that processing a primary document allows your connector to determine the disposition of all of its components every time processDocuments() is called for it. In effect, that means determining which components a document has is assumed to be a relatively inexpensive operation. The assumption is necessary because it is the only way the bookkeeping can work: MCF needs to know what happens with the components, and all it has is a processDocuments() call. I'll look into how hard it would be to add the functionality you are looking for, though, and get back to you.
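
To make the bookkeeping point a bit more concrete, here is a rough sketch in plain Java. It is only a toy model, not ManifoldCF code: the class ComponentBookkeepingSketch and its ingest(), retain() and finishPass() methods are hypothetical names made up for illustration, and the real framework tracks considerably more state. It only shows the rule being assumed here, namely that a component the connector does not mention during a processing pass is treated as gone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of per-pass component bookkeeping. NOT the ManifoldCF API; all names are hypothetical. */
class ComponentBookkeepingSketch {
    // Components currently in the simulated output index, keyed by primary document id.
    private final Map<String, Set<String>> indexed = new HashMap<>();
    // Components the connector has mentioned during the current pass.
    private final Map<String, Set<String>> touched = new HashMap<>();

    /** The connector (re)indexes a component during the current pass. */
    void ingest(String docId, String componentId) {
        indexed.computeIfAbsent(docId, k -> new HashSet<>()).add(componentId);
        touched.computeIfAbsent(docId, k -> new HashSet<>()).add(componentId);
    }

    /** The connector asks to keep an already indexed component as it is. */
    void retain(String docId, String componentId) {
        touched.computeIfAbsent(docId, k -> new HashSet<>()).add(componentId);
    }

    /** End of the pass for one document: every component that was not mentioned is dropped. */
    Set<String> finishPass(String docId) {
        Set<String> kept = indexed.getOrDefault(docId, new HashSet<>());
        Set<String> seen = touched.remove(docId);
        kept.retainAll(seen == null ? new HashSet<String>() : seen);
        return new TreeSet<>(kept); // sorted copy of what is still indexed
    }

    public static void main(String[] args) {
        ComponentBookkeepingSketch model = new ComponentBookkeepingSketch();

        // First pass: the document is fetched and both embedded documents are indexed.
        model.ingest("doc1", "attachment-a");
        model.ingest("doc1", "attachment-b");
        System.out.println("after pass 1: " + model.finishPass("doc1"));

        // Second pass: version unchanged, but only one component is explicitly retained.
        model.retain("doc1", "attachment-a");
        System.out.println("after pass 2: " + model.finishPass("doc1"));

        // Third pass: nothing is mentioned at all, so the remaining component is dropped.
        System.out.println("after pass 3: " + model.finishPass("doc1"));
    }
}
```

In this model the third pass is exactly the situation you describe: the connector skips the fetch, so it cannot name the components, and nothing is retained.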

>>>>>>
About a patch for a Test Connector:
I think I could contribute something.
Do you have general requirements or guidelines for test connectors?
Are there examples of a similar test connector?
<<<<<<

Look at framework/pull-agent/src/test/java/org/apache/manifoldcf/crawler/tests.  There are a number of test connectors there, and tests that use them.

Thanks,
Karl


On Tue, Nov 25, 2014 at 5:53 PM, Markus Schuch <markus_schuch@web.de> wrote:
Hi Karl,
 
thanks for the clarification about primary document disposition.

I'm still not 100% sure whether I understand the differences, so I'll try to explain them in my own words:

noDocument() removes the document or the specified component from the output but keeps track of the version in the status queue. The decision of not indexing the document/component is considered persistent as long as the version string does not change.

deleteDocument() removes the document and all its components from output and the status queue. The decision of not indexing the document will have to be made again when the document is processed the next time (version string is irrelevant)

removeDocument() removes the primary document from the output and from the status queue but keeps components in the output. The decision of not indexing the document will have to be made again when the document is processed the next time (version string is irrelevant)

Is this correct?

-----------------------------

A new question I have:

The scenario is indexing documents with embedded documents. The embedded documents are ingested as components.

We assume a document with multiple components was ingested. For the next processing the version does not change.
So the whole document should not be refetched.
But how can I prevent the deletion of the components when the document is not re-fetched?
I saw the method "retainDocument", which seems to be the way to go, but the problem is that without fetching the document
I have no knowledge of the available components.
Is there any other way to retain all components without knowing them?

----------------------------

About a patch for a Test Connector:
I think I could contribute something.
Do you have general requirements or guidelines for test connectors?
Are there examples of a similar test connector?

Regards,
Markus