manifoldcf-dev mailing list archives

From "Maciej Lizewski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-567) Extended seeding interface which provides document versions
Date Thu, 15 Nov 2012 11:40:13 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497956#comment-13497956
] 

Maciej Lizewski commented on CONNECTORS-567:
--------------------------------------------

I would also go with two scenarios to maintain compatibility with the current model.

My point is that there are plenty of cases where listing documents also gives you information about
their versions: a directory listing gives you the file modification time, an SQL query can return the document
ID and its version, and web interfaces (REST, WebService) often follow this scenario: a getObjectsList call
that gives you document IDs and almost always some document information such as modification
time, version, owner, etc., and a separate method for fetching the whole document.
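
As a rough illustration of the directory-listing case, the sketch below collects a version (here simply the modification time) for every file while listing a directory, without reading any file content. The class and method names (ListingWithVersions, listWithVersions) are made up for this example and are not part of ManifoldCF.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class ListingWithVersions {
    // Returns file name -> modification time in milliseconds, using only
    // the directory listing and file attributes (no file body is read).
    static Map<String, Long> listWithVersions(Path dir) throws IOException {
        Map<String, Long> versions = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    versions.put(entry.getFileName().toString(),
                                 Files.getLastModifiedTime(entry).toMillis());
                }
            }
        }
        return versions;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with one file and list it.
        Path dir = Files.createTempDirectory("seed-demo");
        Files.write(dir.resolve("a.txt"), "hello".getBytes());
        System.out.println(listWithVersions(dir));
    }
}
```

The same idea applies to an SQL query or a REST list call: the identifier and the version arrive together in one request.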

Your proposal to have all-in-one is not as good because, as I said earlier, common interfaces
have separate methods for fetching lists and single documents, so you would first have to
fetch the list and then fetch the content of every document. Another reason is that in the real
world documents do not change very often, and fetching their content every time is mostly
unnecessary overhead.

And last but not least - what I mean by "old enough": when you call addSeedDocuments there
are several scenarios, but in most cases this method can provide new documents, updated documents
and often all other documents that still exist. There are still some documents that were
deleted, and addSeedDocuments mostly will not return them. They are injected into the reindexing
process from the database of previously indexed documents, and when getDocumentVersions returns
null they are removed. That is clear, and this is what I mainly meant: getDocumentVersions
could be used to fetch versions for documents that are already in our database but that addSeedDocuments
did not return (either because they were deleted, or because they were not modified and
addSeedDocuments returns only new and modified documents).

So I was thinking of the following (re)indexing process:
1. mark all already indexed documents for re-indexing
2. call addSeedDocuments, which may or may not provide versions for documents
3. call getDocumentVersions for all documents that were not added by addSeedDocuments with
a version (this means it should also be called for documents added by addSeedDocuments
without a version - this is the backward compatibility)
4. call processDocuments as usual.

Now, if addSeedDocuments does not provide versions at all, this process is pretty much the same as
how it works now. If addSeedDocuments provides versions for some (or all) documents, those are
excluded from the calls to getDocumentVersions.
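
The selection in step 3 can be sketched roughly as follows. This is only an illustration of the idea, not framework code: ReindexSketch and idsNeedingVersionLookup are hypothetical names, and the map of seeded documents (with a null value meaning "seeded without a version") is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed flow; not actual ManifoldCF code.
class ReindexSketch {
    // indexed: ids already in the database (step 1 marks all of them).
    // seeded:  id -> version reported during seeding; null means no version.
    // Returns the ids for which getDocumentVersions would still be called.
    static List<String> idsNeedingVersionLookup(Set<String> indexed,
                                                Map<String, String> seeded) {
        // Step 1: every previously indexed document is pending re-index.
        Set<String> pending = new LinkedHashSet<>(indexed);
        // Step 2: seeding contributes new and updated ids.
        pending.addAll(seeded.keySet());
        // Step 3: only ids without a seeded version need a version lookup.
        List<String> lookup = new ArrayList<>();
        for (String id : pending) {
            if (seeded.get(id) == null) {
                lookup.add(id);
            }
        }
        return lookup;
    }

    public static void main(String[] args) {
        Set<String> indexed = new LinkedHashSet<>(Arrays.asList("a", "b", "c"));
        Map<String, String> seeded = new LinkedHashMap<>();
        seeded.put("b", "v2");   // updated, version known from seeding
        seeded.put("d", null);   // new, seeded without a version
        seeded.put("e", "v1");   // new, version known from seeding
        System.out.println(idsNeedingVersionLookup(indexed, seeded)); // [a, c, d]
    }
}
```

If the seeding step never supplies a version, every pending id ends up in the lookup list, which matches the current behaviour.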

From the connector side the difference could be just calling an overloaded ISeedingActivity::addSeedDocument
method with a second argument:
addSeedDocument(idValue) or addSeedDocument(idValue, version)
Of course I understand this means much more hidden work on the other side of this interface
:)
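
A minimal sketch of what the overload might look like from the connector side. SeedingActivitySketch and RecordingActivity are hypothetical stand-ins written for this example; they are not the real ISeedingActivity interface, and the two-argument method is only the proposal discussed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the seeding activity; NOT the real interface.
interface SeedingActivitySketch {
    // Existing style: the connector only knows the identifier.
    void addSeedDocument(String idValue);

    // Proposed overload: the connector already knows the version.
    void addSeedDocument(String idValue, String version);
}

// Records what was seeded; a null version means "look it up later".
class RecordingActivity implements SeedingActivitySketch {
    final Map<String, String> seeds = new LinkedHashMap<>();

    @Override
    public void addSeedDocument(String idValue) {
        seeds.put(idValue, null);
    }

    @Override
    public void addSeedDocument(String idValue, String version) {
        seeds.put(idValue, version);
    }
}

class SeedingDemo {
    static RecordingActivity seed() {
        RecordingActivity activity = new RecordingActivity();
        // Old-style call: the version has to be fetched separately.
        activity.addSeedDocument("doc-1");
        // New-style call: the listing already supplied a version string.
        activity.addSeedDocument("doc-2", "2012-11-15T11:40:13Z");
        return activity;
    }

    public static void main(String[] args) {
        System.out.println(seed().seeds);
    }
}
```

A connector that never calls the two-argument form behaves exactly as before, which is where the backward compatibility comes from.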

What do you think about it?
                
> Extended seeding interface which provides document versions
> -----------------------------------------------------------
>
>                 Key: CONNECTORS-567
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-567
>             Project: ManifoldCF
>          Issue Type: Wish
>            Reporter: Maciej Lizewski
>
> There are some cases when the seeding function can provide the document version with data it
> already has.
> The current data flow needs one call to addSeedDocuments, then a call to getDocumentVersions,
> which essentially must fetch the same data, and after that one more call to processDocuments. The
> last one probably needs a separate call because it needs to fetch the document body; however, seeding
> and getting versions in many cases work on the very same data (and probably duplicate requests
> to the repository).
> Now - reducing the number of requests to the repository by eliminating the getDocumentVersions
> call for documents whose version was returned by addSeedDocuments could significantly reduce
> load.
> getDocumentVersions would still be called for older documents (not returned by addSeedDocuments)
> to check whether they were modified or deleted.
> This is only a proposal...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
