manifoldcf-dev mailing list archives

From "Aeham Abushwashi (JIRA)" <>
Subject [jira] [Updated] (CONNECTORS-1364) Better bin naming in the Shared Drive Connector
Date Thu, 12 Jan 2017 14:27:52 GMT


Aeham Abushwashi updated CONNECTORS-1364:
    Attachment: CONNECTORS-1364.git.patch

Patch attached. 
In addition to configurable bin names in the jcifs connection, I’ve made the number of docs
requested by the priority thread configurable. This was previously hard-coded at 1000.

> Better bin naming in the Shared Drive Connector
> -----------------------------------------------
>                 Key: CONNECTORS-1364
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: JCIFS connector
>    Affects Versions: ManifoldCF 1.9
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.7
>         Attachments: CONNECTORS-1364.git.patch
> Hello and happy new year!
> Bin naming in the Shared Drive Connector makes assumptions that are not always valid.

> As I understand it, Manifold uses bins to prevent overloading data sources. In the SDC, the
> server name is designated as the bin name. All jobs created against a particular server will be
> treated as one unit when documents are prioritised, which can severely disadvantage some jobs
> (e.g. late starters).
> Moreover, this is incompatible with some common enterprise server topologies. In Windows
> DFS, which is widely used in large enterprises, what the SDC thinks of as a server name isn’t
> actually a physical resource. It’s a namespace that can span many servers and shares. In
> this case, it doesn’t make sense to throttle simply on the root ‘server’ name. In other
> environments, a powerful storage server can be more than capable of handling a high crawl load;
> overzealous throttling can end up limiting Manifold’s performance there.
> I’m struggling to find a single solution that fits all, so I’m leaning towards passing
> some sort of server topology flag or throttling depth flag in to the repo connection config,
> as a hint that SharedDriveConnector#getBinNames can use to decide whether the bin name should
> be server, server+share or server+share+root_folder. Share and root_folder would need to be
> explicitly passed in the repo config too, or extracted from the documentIdentifier arg in
> getBinNames (assuming it's reliable).
> Thoughts?
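
To make the depth-hint idea from the issue description concrete, here is a minimal sketch of
how a bin name could be derived at the three proposed depths. It is not the attached patch and
not ManifoldCF code: the class BinNameSketch, the enum BinDepth and the method binNameFor are
hypothetical names, and it assumes the documentIdentifier looks like
smb://server/share/folder/file.txt, with directories ending in a trailing slash.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; none of these names exist in ManifoldCF. */
class BinNameSketch {

    /** Hypothetical throttling depth hint taken from the repo connection config. */
    enum BinDepth { SERVER, SHARE, ROOT_FOLDER }

    /**
     * Derives one bin name from a document identifier that is ASSUMED to have
     * the form smb://server/share/folder/.../file (directories end with "/").
     * If the identifier is shallower than the requested depth, the deepest
     * available prefix is returned.
     */
    static String binNameFor(String documentIdentifier, BinDepth depth) {
        String prefix = "smb://";
        String path = documentIdentifier.startsWith(prefix)
            ? documentIdentifier.substring(prefix.length())
            : documentIdentifier;

        // Collect the non-empty path segments: [server, share, folder, ...].
        List<String> segments = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        if (segments.isEmpty()) {
            return "";
        }

        // No trailing slash means the last segment is assumed to be a file
        // name, which must not be mistaken for a root folder.
        if (!documentIdentifier.endsWith("/") && segments.size() >= 3) {
            segments.remove(segments.size() - 1);
        }

        // SERVER keeps 1 segment, SHARE keeps 2, ROOT_FOLDER keeps 3.
        int wanted = depth.ordinal() + 1;
        int available = Math.min(wanted, segments.size());
        return String.join("/", segments.subList(0, available));
    }
}
```

Under these assumptions, smb://fileserver/projects/alpha/report.doc would fall into the bin
"fileserver" at SERVER depth, "fileserver/projects" at SHARE depth and
"fileserver/projects/alpha" at ROOT_FOLDER depth. Whether the identifier is reliable enough to
parse this way is exactly the open question raised in the description.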

This message was sent by Atlassian JIRA
