manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-989) Support virtual child document model
Date Wed, 09 Jul 2014 12:20:05 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056162#comment-14056162 ]

Karl Wright commented on CONNECTORS-989:
----------------------------------------

I made some trunk modifications that should make implementation of this feature more straightforward,
and created a branch: branches/CONNECTORS-989-2.

> Support virtual child document model
> ------------------------------------
>
>                 Key: CONNECTORS-989
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-989
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>
> In some cases, documents that are indexed may be virtual children of those that are queued.
> A good example of this is RSS feeds where the data being indexed all comes from the feed.
> In order to implement this, the following changes would be required:
> (1) IProcessActivity.ingestDocument() has a variant which allows you to include a virtual
>     child document identifier in addition to the main document identifier.
> (2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys -- one for
>     the main (queued) document identifier, one for the child virtual document identifier.
> (3) IIncrementalIngester has two new methods: beginDocument() and endDocument(), both
>     of which take a main (queued) document identifier as an argument.
> (4) The ingeststatus table has two additional columns: a state, and a child key.
> (5) The flow is: at beginDocument() time, put all records relating to a document into
>     a "processing" state.  Documents that are seen have their state changed.  Documents
>     never encountered are deleted at the end.
> (6) Incremental decisions not to update an output record STILL will require that the
>     record be touched and its state set.
> (7) DocumentIngest records for the entire set of children will be fetched when the
>     document is queued.
> (8) The getDocumentVersions() method must be modified to allow return of version strings
>     for all children, although there can be "shortcuts" as well (where a single version
>     string applies to all children).
> (9) The decision about whether to refetch a document is based on the returned version
>     strings and on those fetched by the stuffer thread.
> (10) Similarly, processDocuments() receives version strings for all virtual children.
> (11) There is no need to actively reset the state of document records on restart; the
>     current logic should be robust enough to be able to generate the required deletions.
> (12) Deleting a document deletes ALL child virtual documents.  This happens within the
>     incremental ingester.
> (13) The requeuing interval must be computed across all children, taking the minimum,
>     since there's no requirement that an ingeststatus record exist for the parent.
> (14) All other logic, including making sure only one agent operates on a url at a time,
>     is the same.
> (15) Interrupting the delete phase is safe because next time the doc is processed the
>     records will be removed.
> Analysis:
> - The critical thing is making the non-virtual case no worse.
> - For a virtual child document, instead of one db access, there are two.
> - For document records that are not changed, there are two additional writes that were
>   not needed before.
> - There's an additional index (or the document key index has another subfield).
> - If the queries can be written so that the standard (no child document) case is treated
>   specially, we may be able to avoid much of the impact: only two index queries per
>   document, each returning zero rows.
> - If we handle the standard case using the same mechanism, the WorkerThread logic dealing
>   with deletions can go away.
> Summary:
> - Additional database overhead in the non-virtual indexing case consists of one additional
>   write and one additional zero-row query, OR two additional zero-row queries.
> - Additional database overhead in the non-virtual skip case consists of two additional
>   writes, OR two additional zero-row queries.
> - The overhead per document is small but not negligible, and it will impact overall
>   framework performance.
> - The up-sides are as follows: (a) handling an important but infrequent case better;
>   (b) less connector involvement in indexing (e.g., IProcessActivity.deleteDocument() would
>   do nothing under this model, and could be deprecated).
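
The begin/mark/end flow in items (3), (5), (6) and (12) of the description can be sketched roughly as below. This is only an in-memory illustration, not ManifoldCF code: the class VirtualChildSketch, the State enum, the touchUnchanged() and deleteDocument() helpers, the method signatures and the map standing in for the ingeststatus table are all assumptions made for the example.

```java
import java.util.*;

// Hypothetical in-memory model of the proposed flow; not the real IIncrementalIngester.
public class VirtualChildSketch {
    enum State { PROCESSING, SEEN }

    // Stand-in for the ingeststatus table: parent (queued) key -> child key -> state.
    private final Map<String, Map<String, State>> ingestStatus = new HashMap<>();

    // Item (5): at begin time, put every record of the parent into the "processing" state.
    public void beginDocument(String parentId) {
        Map<String, State> children = ingestStatus.get(parentId);
        if (children != null) {
            children.replaceAll((childId, state) -> State.PROCESSING);
        }
    }

    // Item (2): two keys, parent and virtual child. A real ingester would also send
    // the document to the output here; the sketch only records the state change.
    public void addOrReplaceDocument(String parentId, String childId) {
        ingestStatus.computeIfAbsent(parentId, k -> new HashMap<>()).put(childId, State.SEEN);
    }

    // Item (6): an unchanged child is not re-indexed, but its record is still touched.
    public void touchUnchanged(String parentId, String childId) {
        ingestStatus.computeIfAbsent(parentId, k -> new HashMap<>()).put(childId, State.SEEN);
    }

    // Item (5): children never encountered since beginDocument() are deleted at the end.
    public List<String> endDocument(String parentId) {
        Map<String, State> children = ingestStatus.get(parentId);
        List<String> deleted = new ArrayList<>();
        if (children == null) {
            return deleted;
        }
        Iterator<Map.Entry<String, State>> it = children.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, State> entry = it.next();
            if (entry.getValue() == State.PROCESSING) {
                deleted.add(entry.getKey());
                it.remove();
            }
        }
        if (children.isEmpty()) {
            ingestStatus.remove(parentId);
        }
        Collections.sort(deleted);
        return deleted;
    }

    // Item (12): deleting the parent removes ALL of its child records.
    public int deleteDocument(String parentId) {
        Map<String, State> removed = ingestStatus.remove(parentId);
        return removed == null ? 0 : removed.size();
    }

    public static void main(String[] args) {
        VirtualChildSketch ingester = new VirtualChildSketch();

        // First crawl of a feed: three items are indexed, nothing is stale.
        ingester.beginDocument("feed");
        ingester.addOrReplaceDocument("feed", "item1");
        ingester.addOrReplaceDocument("feed", "item2");
        ingester.addOrReplaceDocument("feed", "item3");
        System.out.println(ingester.endDocument("feed")); // prints []

        // Second crawl: item1 changed, item2 is unchanged, item3 left the feed.
        ingester.beginDocument("feed");
        ingester.addOrReplaceDocument("feed", "item1");
        ingester.touchUnchanged("feed", "item2");
        System.out.println(ingester.endDocument("feed")); // prints [item3]
    }
}
```

Because stale children are identified only by the state they were left in, an interrupted run leaves records that the next begin/end pass over the same parent cleans up, which is the property items (11) and (15) rely on.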



--
This message was sent by Atlassian JIRA
(v6.2#6252)
