manifoldcf-user mailing list archives

From hank williams <hank...@gmail.com>
Subject Re: A hopfully a few simple question about ManifoldCF and SharePoint
Date Mon, 23 Mar 2015 13:43:57 GMT
Karl,

At this point it seems like perhaps ManifoldCF may not be the right tool.

I think the best solution is to have our server log into SharePoint using
Kerberos or OAuth, and to provide our engine with links to the content
available to the logged-in user. This is, in essence, a single-user crawl of
a SharePoint site, I guess (we are not interested in other data sources).
From what I gather based on your responses, ManifoldCF wouldn't help much
here, but this does not seem like an extraordinarily complicated task (at
least from the perspective of someone who's never played with any of this
stuff!).

So my question is: is my assumption that it's not "an extraordinarily
complicated task" correct? And if not, are there folks in the ManifoldCF
community (or other communities) that you know of who might be available as
consultants to create that module?

Best,
Hank

On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <daddywri@gmail.com> wrote:

> "If output connectors have access to the access tokens then I am
> presuming a custom output connector could look and say, "oh this document
> is accessible to these specific people", but is that a reasonable
> assumption?"
>
> The problem is that you don't know what is in those access tokens.  If you
> knew beyond question that the only thing you'd ever index was stuff that
> (for instance) came from SharePoint, maybe you could make it work.  But if
> you add other connection types, then you'd have to modify your output
> connector for each one.
>
> The other thing you should think about is that usually access tokens
> correspond to *groups* of users rather than individual users.  There is no
> obvious mapping then that you can use to turn that into a list of
> corresponding users.  I believe that when the SharePoint connector is
> configured for "Active Directory" authorization, it maps to individual
> SIDs, but as you might expect the list of SIDs for a given document can be
> quite large, which is why we went to the SharePoint/Native authorization
> model as our default.
>
> Karl
>
>
> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <hank777@gmail.com> wrote:
>
>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>
>> Regarding #2, since our database is proprietary, there would be no
>> existing output connection type so in any case we would need to create our
>> own.
>>
>> But #1 is clearly an issue. My first thought is that the answer would be
>> to just read everything (not limited by permissions) and then to use a
>> custom output connector to "place" copies in the right accounts. If output
>> connectors have access to the access tokens then I am presuming a custom
>> output connector could look and say, "oh this document is accessible to
>> these specific people", but is that a reasonable assumption?
>>
>>
>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> "So my question is, notwithstanding that this is not the "typical" way
>>> ManifoldCF works, can we use it in the way that I am describing. Is it
>>> malleable enough to work or is it designed to do something so different
>>> from what we need that it would be useless. I guess the key question is
>>> really, can we tell ManifoldCF to limit results to those visible to a
>>> specific user and would there be any performance or other unexpected
>>> downsides to doing that."
>>>
>>> Hi Hank,
>>>
>>> There is nothing specific about the ManifoldCF *framework* that prevents
>>> you from doing what you suggest.  But there are problems, as follows:
>>>
>>> (1) Most out-of-the-box repository connection types, including the
>>> SharePoint type, do not give you any ability to limit crawls to a specific
>>> user.  Instead, because they are intended to support a very different
>>> security model, they fetch a document's access tokens, which are described
>>> by the book chapter I pointed you to.
>>> (2) If you modified the SharePoint repository connection type in the
>>> manner you suggest, you would still need to create a custom output
>>> connection type to drop the content into your per-user database instances.
>>> The alternative would be to use an appropriate out-of-the-box output
>>> connection type, if there is one, and have N jobs for N users.
>>>
>>> Hope that answers your question.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <hank777@gmail.com>
>>> wrote:
>>>
>>>> Thanks Karl.
>>>>
>>>> I will most certainly be reading the document you linked to in great
>>>> detail. It looks like stuff I need to know.
>>>>
>>>> That said, we have a given technology that we have developed and that
>>>> we will be using. It creates a separate index for each user. The technology
>>>> has vastly greater utility than just for SharePoint, and it's been in
>>>> development for about six years. (In fact, this SharePoint thing is a
>>>> recent add-on request.)
>>>>
>>>> So my question is, notwithstanding that this is not the "typical" way
>>>> ManifoldCF works, can we use it in the way that I am describing. Is it
>>>> malleable enough to work or is it designed to do something so different
>>>> from what we need that it would be useless. I guess the key question is
>>>> really, can we tell ManifoldCF to limit results to those visible to a
>>>> specific user and would there be any performance or other unexpected
>>>> downsides to doing that.
>>>>
>>>> Hank
>>>>
>>>>
>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Hank,
>>>>>
>>>>> "Our project involves a database that has a private secure user space
>>>>> for each user. Our database is built on Lucene and indexes every object
>>>>> in the database. Each user presumably has some number of SharePoint
>>>>> sites that they have access to. We want to index each SharePoint object
>>>>> (file or SharePoint page) as we find it, for each user. The user then
>>>>> ends up with an index of just the objects that they have permissions
>>>>> for. But to do that we need to, for each user, crawl all of the
>>>>> SharePoint sites that they have access to. Permissions to each
>>>>> SharePoint site are managed by Kerberos."
>>>>>
>>>>> This is not the typical ManifoldCF model.  In the typical case, there
>>>>> is ONE Lucene search engine (not N), and any searches that take place
>>>>> apply security restrictions internally based on the user's security
>>>>> information, as obtained from the ManifoldCF authority service, which is
>>>>> in turn querying SharePoint.
>>>>>
>>>>> You can read more about the standard authorization setup here:
>>>>>
>>>>>
>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <hank777@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>>> appropriate tool. I am a total noob, having just discovered this
>>>>>> project, and have a few questions that I am hoping someone can answer
>>>>>> so that I can begin to gain some confidence about the way things work.
>>>>>> Basically, I am trying to make sure I understand, at a top level, how
>>>>>> ManifoldCF works.
>>>>>>
>>>>>> Our project involves a database that has a private secure user space
>>>>>> for each user. Our database is built on Lucene and indexes every
>>>>>> object in the database. Each user presumably has some number of
>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>> SharePoint object (file or SharePoint page) as we find it, for each
>>>>>> user. The user then ends up with an index of just the objects that
>>>>>> they have permissions for. But to do that we need to, for each user,
>>>>>> crawl all of the SharePoint sites that they have access to.
>>>>>> Permissions to each SharePoint site are managed by Kerberos.
>>>>>>
>>>>>> So the questions are:
>>>>>>
>>>>>> a. Can I, with ManifoldCF, take a list of SharePoint sites and a list
>>>>>> of users and relevant Kerberos-appropriate authentication tokens or
>>>>>> keys (just learning about Kerberos), and get back a list of indexable
>>>>>> objects/URIs (HTML, .docx, .pptx, etc.)?
>>>>>>
>>>>>> b. Is this the right way to think about it?
>>>>>>
>>>>>> c. If so, is there any example code or documentation that would
>>>>>> explain how I do this?
>>>>>>
>>>>>> d. Does ManifoldCF provide any information to help indicate whether
>>>>>> the given object has changed, or is that something we need to figure
>>>>>> out by manually comparing the old and new documents in our code?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
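Karl's two points in the thread (that the typical model filters one shared index by the user's access tokens at search time, and that tokens usually name groups rather than individual users) can be illustrated with a small sketch. This is not ManifoldCF code and does not use any ManifoldCF API. The token strings, the `group_members` map, and both function names are hypothetical, and the rule that a deny token overrides an allow token is an assumption made for the example. The sketch mainly shows why a per-user fan-out in a custom output connector would need a group-membership source that the document's tokens alone do not provide.

```python
from typing import Dict, Iterable, Set


def can_see(user_tokens: Iterable[str],
            allow_tokens: Iterable[str],
            deny_tokens: Iterable[str]) -> bool:
    """Search-time check for the single shared index model.

    Assumption for this sketch: a document is visible when the user holds
    at least one of its allow tokens and none of its deny tokens.
    """
    held = set(user_tokens)
    return bool(held & set(allow_tokens)) and not (held & set(deny_tokens))


def users_for_document(allow_tokens: Iterable[str],
                       deny_tokens: Iterable[str],
                       group_members: Dict[str, Set[str]]) -> Set[str]:
    """Expand group-level tokens into individual users (per-user fan-out).

    `group_members` is a hypothetical map from token to user names. It has
    to come from somewhere else, such as a directory lookup, because the
    tokens attached to a document do not list the users behind them.
    """
    allowed: Set[str] = set()
    for token in allow_tokens:
        allowed |= group_members.get(token, set())
    denied: Set[str] = set()
    for token in deny_tokens:
        denied |= group_members.get(token, set())
    return allowed - denied


if __name__ == "__main__":
    # Hypothetical tokens and memberships, for illustration only.
    members = {
        "G:engineering": {"alice", "bob"},
        "G:contractors": {"bob"},
    }
    print(can_see({"G:engineering"}, ["G:engineering"], ["G:contractors"]))
    print(sorted(users_for_document(["G:engineering"], ["G:contractors"], members)))
```

In the shared-index model only `can_see` is needed, and it runs per query against tokens supplied by an authority service. The per-user model needs `users_for_document` at indexing time, so every group membership change would also require re-fanning documents out to the affected users' indexes.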
