manifoldcf-dev mailing list archives

From "Aeham Abushwashi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CONNECTORS-1364) Better bin naming in the Shared Drive Connector
Date Fri, 06 Jan 2017 12:04:58 GMT

     [ https://issues.apache.org/jira/browse/CONNECTORS-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aeham Abushwashi updated CONNECTORS-1364:
-----------------------------------------
    Description: 
Hello and happy new year!

Bin naming in the Shared Drive Connector makes assumptions that are not always valid. 

As I understand it, Manifold uses bins to prevent overloading data sources. In the SDC, the
server name is used as the bin name. All jobs created against a particular server are therefore
treated as one unit when documents are prioritised, which can severely disadvantage some jobs
(e.g. late starters).
Moreover, this is incompatible with some common enterprise server topologies. In Windows DFS,
which is widely used in large enterprises, what the SDC thinks of as a server name isn’t
actually a physical resource. It’s a namespace that can span many servers and shares. In
this case, it doesn’t make sense to throttle simply on the root ‘server’ name. In other
environments, a powerful storage server can be more than capable of handling a high crawl load;
overzealous throttling can end up limiting Manifold’s performance there.

I’m struggling to find a single solution that fits all cases, so I’m leaning towards adding
to the repo connection config some sort of server topology flag or throttling depth flag
as a hint that SharedDriveConnector#getBinNames can use to decide whether the bin name should
be server, server+share or server+share+root_folder. Share and root_folder would either need
to be passed in explicitly in the repo config too, or be extracted from the documentIdentifier
arg in getBinNames (assuming it's reliable).
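
To make the proposal concrete, here is a minimal sketch of how a throttling depth hint could
drive bin naming. Everything in it is hypothetical and not part of ManifoldCF: the BinNaming
class, the Depth enum and the binNames helper are illustrative names only. It assumes that
document identifiers look like smb://server/share/folder/file and that directory identifiers
end with a slash; neither assumption is confirmed here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class BinNaming {

    /** Hypothetical "throttling depth" hint taken from the repo connection config. */
    enum Depth { SERVER, SERVER_SHARE, SERVER_SHARE_ROOT_FOLDER }

    private BinNaming() {}

    /**
     * Derives a single bin name from a document identifier.
     * Assumption: identifiers look like smb://server/share/folder/file,
     * and identifiers that name a directory end with "/".
     */
    static String[] binNames(String documentIdentifier, Depth depth) {
        String path = documentIdentifier;
        int scheme = path.indexOf("://");
        if (scheme >= 0) {
            path = path.substring(scheme + 3); // drop "smb://"
        }

        // Collect the non-empty path segments: server, share, folders, file.
        List<String> parts = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                parts.add(segment);
            }
        }
        if (parts.isEmpty()) {
            return new String[] { "" }; // nothing to go on; fall back to a blank bin
        }

        // A trailing segment without a closing "/" is treated as a file name,
        // so it must never become part of the bin name.
        if (!documentIdentifier.endsWith("/") && parts.size() > 1) {
            parts.remove(parts.size() - 1);
        }

        // SERVER keeps 1 segment, SERVER_SHARE 2, SERVER_SHARE_ROOT_FOLDER 3,
        // capped by how many segments the identifier actually has.
        int wanted = Math.min(depth.ordinal() + 1, parts.size());
        return new String[] { String.join("/", parts.subList(0, wanted)) };
    }
}
```

With this sketch, smb://fs01/projects/alpha/report.doc would map to the bin fs01, fs01/projects
or fs01/projects/alpha depending on the depth, so jobs against different shares or root folders
would no longer share a single bin. A document sitting directly under the share falls back to
the server+share bin even at the deepest setting.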

Thoughts?

  was:
Hello and happy new year!

Bin naming in the Shared Drive Connector makes assumptions that are not always valid. 

As I understand it, Manifold uses bins to prevent overloading data sources. In the SDC, server
name is designated as bin name. All jobs created against a particular server will be treated
as one unit when documents are prioritised, which can severely disadvantage some jobs (e.g.
late starters). 
Moreover, this is incompatible with some common enterprise server topologies. In Windows DFS,
which is widely used in large enterprises, what the SDC thinks of as a server name, isn’t
actually a physical resource. It’s a namespace that can span many servers and shares. In
this case, it doesn’t make sense to throttle simply on the root ‘server’ name. In other
environments, a powerful storage server can be more than capable of handling high crawl load;
overzealous throttling can end up limiting/hurting Manifold’s performance there.

I’m struggling to find a single solution that fits all so I’m leaning towards passing
in to the repo connection config some sort of server topology flag or throttling depth flag
as a hint that ShareDriveConnector#getBinNames can use to decide whether the bin name should
be server, server+share or server+share+root_folder. Share and root_folder would need to be
explicitly passed in the repo config too.

Thoughts?


> Better bin naming in the Shared Drive Connector
> -----------------------------------------------
>
>                 Key: CONNECTORS-1364
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1364
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: JCIFS connector
>    Affects Versions: ManifoldCF 1.9
>            Reporter: Aeham Abushwashi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
