manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Document components
Date Tue, 25 Nov 2014 23:22:09 GMT
Hi Markus,

>>>>>>
noDocument() removes the document or the specified component from the
output but keeps track of the version in the status queue. The decision of
not indexing the document/component is considered persistent as long as the
version string does not change.

deleteDocument() removes the document and all its components from output
and the status queue. The decision of not indexing the document will have
to be made again when the document is processed the next time (version
string is irrelevant)

removeDocument() removes the primary document from the output and from the
status queue but keeps components in the output. The decision of not
indexing the document will have to be made again when the document is
processed the next time (version string is irrelevant)

Is this correct?
<<<<<<

Yes.
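
As a rough illustration of the three dispositions confirmed above, here is a toy Java model. It is not the ManifoldCF API: the class, its fields, the "docId#componentId" key scheme, and the method signatures are all hypothetical and only mirror the behaviour described in this thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only; NOT the ManifoldCF API. All names here are hypothetical. */
public class DispositionSketch {
    // Keys are "docId" for a primary document, "docId#componentId" for a component.
    final Set<String> output = new HashSet<>();
    // Stands in for the version tracking kept in the status queue.
    final Map<String, String> trackedVersions = new HashMap<>();

    static String key(String docId, String componentId) {
        return componentId == null ? docId : docId + "#" + componentId;
    }

    /** Indexes one item and records its version. */
    void ingest(String docId, String componentId, String version) {
        String k = key(docId, componentId);
        output.add(k);
        trackedVersions.put(k, version);
    }

    /** Drops one item from the output but remembers its version. */
    void noDocument(String docId, String componentId, String version) {
        String k = key(docId, componentId);
        output.remove(k);
        trackedVersions.put(k, version);
    }

    /** Drops the primary document and every component, and forgets all versions. */
    void deleteDocument(String docId) {
        String prefix = docId + "#";
        output.removeIf(k -> k.equals(docId) || k.startsWith(prefix));
        trackedVersions.keySet().removeIf(k -> k.equals(docId) || k.startsWith(prefix));
    }

    /** Drops only the primary document and forgets its version; components stay. */
    void removeDocument(String docId) {
        output.remove(docId);
        trackedVersions.remove(docId);
    }

    /** A "do not index" decision persists only while the tracked version matches. */
    boolean decisionStillValid(String docId, String componentId, String version) {
        return version.equals(trackedVersions.get(key(docId, componentId)));
    }

    public static void main(String[] args) {
        DispositionSketch s = new DispositionSketch();
        s.ingest("doc1", null, "v1");
        s.ingest("doc1", "attachment", "v1");

        s.noDocument("doc1", "attachment", "v1");
        System.out.println("after noDocument: " + s.output);

        s.removeDocument("doc1");
        System.out.println("primary version still tracked: "
            + s.decisionStillValid("doc1", null, "v1"));

        s.deleteDocument("doc1");
        System.out.println("after deleteDocument: " + s.output);
    }
}
```

The point of the model is the difference in what is remembered: only noDocument() leaves a version behind, so only its decision survives until the version string changes.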

>>>>>>
The scenario is indexing documents with embedded documents. The embedded
documents are ingested as components.

We assume a document with multiple components was ingested. For the next
processing the version does not change.
So the whole document should not be refetched.
But how can I prevent the deletion of the components when the document is
not re-fetched?
I saw the method "retainDocument", which seems to be the way to go, but the
problem is that without fetching the document
I have no knowledge of the available components.
Is there any other way to retain all components without knowing them?
<<<<<<

Not at present; the assumption for components is that the processing of a
primary document will allow your connector to determine disposition of all
components of the primary document every time that processDocuments() is
called for it.  Effectively that means that the assumption is that
determining what components are in a document is a relatively inexpensive
operation.  It's necessary to make that assumption, because that's the only
way the bookkeeping can work - MCF needs to know what happens with the
components, when all it has is a processDocuments() call.  I'll look into
how hard it would be to add the functionality you are looking for though,
and get back to you.
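
To make the bookkeeping assumption concrete, here is a minimal Java sketch of the idea as it can be inferred from this thread: any component that was indexed earlier but is not mentioned during a processing pass is treated as gone. This is an illustration, not ManifoldCF's actual implementation; the class name, the method name, and the component names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration only; NOT ManifoldCF code. All names are hypothetical. */
public class ComponentBookkeepingSketch {
    /**
     * Returns the components that were indexed on an earlier pass but were
     * neither ingested nor retained on this one, and so would be cleaned up.
     */
    static Set<String> componentsToDelete(Set<String> previouslyIndexed,
                                          Set<String> mentionedThisPass) {
        Set<String> stale = new TreeSet<>(previouslyIndexed);
        stale.removeAll(mentionedThisPass);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of("a.pdf", "b.png");

        // The connector skipped the fetch, so it could not name any component.
        System.out.println(componentsToDelete(indexed, Set.of()));
        // prints [a.pdf, b.png]

        // The connector listed the components and retained each one.
        System.out.println(componentsToDelete(indexed, Set.of("a.pdf", "b.png")));
        // prints []
    }
}
```

The first call is the situation in the question: without fetching the document the connector cannot name its components, so under this model all of them look stale.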

>>>>>>
About a patch for a Test Connector:
I think I could contribute something.
Do you have general requirements/guidelines for test connectors?
Are there examples of a similar test connector?
<<<<<<

Look at
framework/pull-agent/src/test/java/org/apache/manifoldcf/crawler/tests.
There are a number of test connectors there, and tests that use them.

Thanks,
Karl


On Tue, Nov 25, 2014 at 5:53 PM, Markus Schuch <markus_schuch@web.de> wrote:

> Hi Karl,
>
> thanks for the clarification about primary document disposition.
>
> I'm still not 100% sure if I understand the differences... I'll try to
> explain it in my own words:
>
> noDocument() removes the document or the specified component from the
> output but keeps track of the version in the status queue. The decision of
> not indexing the document/component is considered persistent as long as the
> version string does not change.
>
> deleteDocument() removes the document and all its components from output
> and the status queue. The decision of not indexing the document will have
> to be made again when the document is processed the next time (version
> string is irrelevant)
>
> removeDocument() removes the primary document from the output and from the
> status queue but keeps components in the output. The decision of not
> indexing the document will have to be made again when the document is
> processed the next time (version string is irrelevant)
>
> Is this correct?
>
> -----------------------------
>
> A new question I have:
>
> The scenario is indexing documents with embedded documents. The embedded
> documents are ingested as components.
>
> We assume a document with multiple components was ingested. For the next
> processing the version does not change.
> So the whole document should not be refetched.
> But how can I prevent the deletion of the components when the document is
> not re-fetched?
> I saw the method "retainDocument", which seems to be the way to go, but the
> problem is that without fetching the document
> I have no knowledge of the available components.
> Is there any other way to retain all components without knowing them?
>
> ----------------------------
>
> About a patch for a Test Connector:
> I think I could contribute something.
> Do you have general requirements/guidelines for test connectors?
> Are there examples of a similar test connector?
>
> Regards,
> Markus
>
