manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions
Date Thu, 15 Nov 2012 10:36:13 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497916#comment-13497916 ]

Karl Wright commented on CONNECTORS-567:
----------------------------------------

There are a number of connectors that need to do version checks across many threads, not just
one, which is why I originally designed the connector interface the way I did.

I could imagine supporting both models, however.  The IxxxActivity interfaces were invented
to allow the crawling model to be extended without breaking existing connectors.  All you
would have to do (in theory) to support something like what you are talking about would be
to add a new ISeedingActivity method that would record not only a document's discovery, but
also its version information.
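As a rough illustration of that idea, the sketch below uses hypothetical stand-in interfaces (SeedingActivity, VersionedSeedingActivity, RecordingSeedingActivity), not the real ManifoldCF ones. It only assumes that seeding reports a document identifier and that the new method would also carry a version string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the existing seeding callback (not the real ManifoldCF interface).
interface SeedingActivity {
    // Record that a document was discovered during seeding.
    void addSeedDocument(String documentIdentifier);
}

// Hypothetical extension: old connectors keep calling the one-argument method,
// new connectors may also report the version they already have in hand.
interface VersionedSeedingActivity extends SeedingActivity {
    void addSeedDocument(String documentIdentifier, String versionString);
}

// Toy in-memory implementation, for illustration only.
class RecordingSeedingActivity implements VersionedSeedingActivity {
    // A null value means "discovered, but version unknown".
    private final Map<String, String> seeded = new LinkedHashMap<>();

    @Override
    public void addSeedDocument(String documentIdentifier) {
        // Do not overwrite a version recorded earlier for the same document.
        if (!seeded.containsKey(documentIdentifier)) {
            seeded.put(documentIdentifier, null);
        }
    }

    @Override
    public void addSeedDocument(String documentIdentifier, String versionString) {
        seeded.put(documentIdentifier, versionString);
    }

    // True when no version was supplied at seeding time, which includes
    // older documents that the seeding pass did not return at all.
    boolean needsVersionLookup(String documentIdentifier) {
        return seeded.get(documentIdentifier) == null;
    }
}
```

Existing connectors would be unaffected in this sketch because the one-argument method keeps its meaning; only documents seeded with a version could skip the later lookup.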

However, this is not a trivial change internally, because the flow at the moment involves
obtaining the version information in the same worker thread that would process the document
if the version indicated that processing was needed.  So dispatch to the worker thread will have
already taken place either way, and the only real difference would be that somehow we'd decide it
was unnecessary to call getDocumentVersions() for certain documents.  But you'd still need
to support getDocumentVersions() for older documents, as you point out, so I'm having a bit
of a hard time figuring out exactly when a document would be "old enough" to need a getDocumentVersions() call.

A much easier model would be to support an all-in-one approach, which might be appropriate
for something like JDBC.  In that model the seeding query returns everything, and getDocumentVersions()
and processDocuments() do nothing.
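A minimal sketch of that all-in-one shape follows. Every name in it (SeedRow, AllInOneSeedingSink, AllInOneConnector) is hypothetical and the method signatures are simplified; it is not the actual connector API, only an illustration of seeding doing all the work while the other two calls are no-ops.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row produced by a single seeding query: identifier, version, and content together.
final class SeedRow {
    final String id;
    final String version;
    final String body;

    SeedRow(String id, String version, String body) {
        this.id = id;
        this.version = version;
        this.body = body;
    }
}

// Hypothetical callback that accepts a fully described document at seeding time.
interface AllInOneSeedingSink {
    void ingest(String id, String version, String body);
}

// Toy connector in which seeding does all the work.
class AllInOneConnector {
    private final List<SeedRow> rows;

    AllInOneConnector(List<SeedRow> rows) {
        this.rows = new ArrayList<>(rows);
    }

    // The seeding pass hands over everything it already fetched.
    void addSeedDocuments(AllInOneSeedingSink sink) {
        for (SeedRow row : rows) {
            sink.ingest(row.id, row.version, row.body);
        }
    }

    // Nothing to look up: versions were supplied during seeding, so every slot stays null.
    String[] getDocumentVersions(String[] ids) {
        return new String[ids.length];
    }

    // Nothing to fetch: bodies were supplied during seeding.
    void processDocuments(String[] ids, String[] versions) {
        // intentionally empty
    }
}
```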

It may be worth reading ManifoldCF in Action, especially the parts about crawling models,
since that may help inform your thoughts a bit.

                
> Extended seeding interface which provides document versions
> -----------------------------------------------------------
>
>                 Key: CONNECTORS-567
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-567
>             Project: ManifoldCF
>          Issue Type: Wish
>            Reporter: Maciej Lizewski
>
> There are some cases where the seeding function can provide the document version from data it
> already has.
> The current data flow needs one call to addSeedDocuments, then a call to getDocumentVersions,
> which essentially must fetch the same data, and after that one more call to processDocuments. The
> last one probably needs a separate call because it needs to fetch the document body; however, seeding
> and getting versions in many cases work on the very same data (and probably duplicate requests
> to the repository).
> Now, reducing the number of requests to the repository by eliminating the getDocumentVersions
> call for documents whose version was returned by addSeedDocuments could significantly reduce
> load.
> getDocumentVersions would still be called for older documents (not returned by addSeedDocuments)
> to check if they were modified or deleted.
> This is only a proposal...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
