manifoldcf-user mailing list archives

From hank williams <hank...@gmail.com>
Subject Re: A hopfully a few simple question about ManifoldCF and SharePoint
Date Mon, 23 Mar 2015 14:59:22 GMT
Thanks Karl.

We're not indexing into Solr. It's our own technology. It sounds to me like
what we are really looking for is "SharePoint-from-Java" experience and
experience writing web apps that talk to SharePoint.

Well, I'll just keep looking.

Best,
Hank

On Mon, Mar 23, 2015 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Hank,
>
> I can't really recommend any consulting firms specifically skilled with
> using bits and pieces of ManifoldCF to build a whole new solution.  If you
> are indexing into Solr, maybe you can contact a Solr consulting firm, e.g.
> LucidImagination etc.  You *could* try a firm like Zaizi (based in London),
> but I can't be sure they'd find the job amenable either.
>
> Karl
>
> On Mon, Mar 23, 2015 at 9:43 AM, hank williams <hank777@gmail.com> wrote:
>
>> Karl,
>>
>> At this point it seems like perhaps ManifoldCF may not be the right tool.
>>
>> I think the best solution is to have our server log into SharePoint using
>> Kerberos or OAuth, and to provide our engine links to the content available
>> to the logged in user. This is, in essence, a single user crawl of a
>> sharepoint site I guess (we are not interested in other data sources). From
>> what I gather based on your responses, ManifoldCF wouldn't help much here,
>> but this does not seem like an extraordinarily complicated task (at least
>> from the perspective of someone who's never played with any of this stuff!).
>>
>> So my question is, is my assumption that it's not "an extraordinarily
>> complicated task" correct, and if not, are there folks in the ManifoldCF
>> community (or other communities) that you know of who might be available
>> as consultants to create that module?
>>
>> Best,
>> Hank
>>
>> On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> "If output connectors have access to the access tokens then I am
>>> presuming a custom output connector could look and say, "oh this document
>>> is accessible to these specific people", but is that a reasonable
>>> assumption?"
>>>
>>> The problem is that you don't know what is in those access tokens.  If
>>> you knew beyond question that the only thing you'd ever index was stuff
>>> that (for instance) came from SharePoint, maybe you could make it work.
>>> But if you add other connection types, then you'd have to modify your
>>> output connector for each one.
>>>
>>> The other thing you should think about is that usually access tokens
>>> correspond to *groups* of users rather than individual users.  There is no
>>> obvious mapping then that you can use to turn that into a list of
>>> corresponding users.  I believe that when the SharePoint connector is
>>> configured for "Active Directory" authorization, it maps to individual
>>> SIDs, but as you might expect the list of SIDs for a given document can be
>>> quite large, which is why we went to the SharePoint/Native authorization
>>> model as our default.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <hank777@gmail.com>
>>> wrote:
>>>
>>>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>>>
>>>> Regarding #2, since our database is proprietary, there would be no
>>>> existing output connection type so in any case we would need to create our
>>>> own.
>>>>
>>>> But #1 is clearly an issue. My first thought is that the answer would
>>>> be to just read everything (not limited by permissions) and then to use a
>>>> custom output connector to "place" copies in the right accounts. If output
>>>> connectors have access to the access tokens then I am presuming a custom
>>>> output connector could look and say, "oh this document is accessible to
>>>> these specific people", but is that a reasonable assumption?
>>>>
>>>>
>>>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> "So my question is, notwithstanding that this is not the "typical"
>>>>> way ManifoldCF works, can we use it in the way that I am describing?
>>>>> Is it malleable enough to work or is it designed to do something so
>>>>> different from what we need that it would be useless? I guess the key
>>>>> question is really, can we tell ManifoldCF to limit results to those
>>>>> visible to a specific user and would there be any performance or other
>>>>> unexpected downsides to doing that?"
>>>>>
>>>>> Hi Hank,
>>>>>
>>>>> There is nothing specific about the ManifoldCF *framework* that
>>>>> prevents you from doing what you suggest.  But there are problems, as
>>>>> follows:
>>>>>
>>>>> (1) Most out-of-the-box repository connection types, including the
>>>>> SharePoint type, do not give you any ability to limit crawls to a specific
>>>>> user.  Instead, because they are intended to support a very different
>>>>> security model, they fetch a document's access tokens, which are described
>>>>> by the book chapter I pointed you to.
>>>>> (2) If you modified the SharePoint repository connection type in the
>>>>> manner you suggest, you would still need to create a custom output
>>>>> connection type to drop the content into your per-user database instances.
>>>>> The alternative would be to use an appropriate out-of-the-box output
>>>>> connection type, if there is one, and have N jobs for N users.
>>>>>
>>>>> Hope that answers your question.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <hank777@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Karl.
>>>>>>
>>>>>> I will most certainly be reading the document you linked to in great
>>>>>> detail. It looks like stuff I need to know.
>>>>>>
>>>>>> That said, we have a given technology that we have developed and that
>>>>>> we will be using. It creates a separate index for each user. The
>>>>>> technology has vastly greater utility than just for SharePoint, and
>>>>>> it has been in development for about six years (in fact, this
>>>>>> SharePoint thing is a recent add-on request).
>>>>>>
>>>>>> So my question is, notwithstanding that this is not the "typical" way
>>>>>> ManifoldCF works, can we use it in the way that I am describing? Is it
>>>>>> malleable enough to work or is it designed to do something so
>>>>>> different from what we need that it would be useless? I guess the key
>>>>>> question is really, can we tell ManifoldCF to limit results to those
>>>>>> visible to a specific user and would there be any performance or other
>>>>>> unexpected downsides to doing that?
>>>>>>
>>>>>> Hank
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Hank,
>>>>>>>
>>>>>>> "Our project involves a database that has a private secure user
>>>>>>> space for each user. Our database is built on Lucene and indexes
>>>>>>> every object in the database. Each user presumably has some number of
>>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>>> SharePoint object (file or SharePoint page) as we find it, for each
>>>>>>> user. The user then ends up with an index of just the objects that
>>>>>>> they have permissions for. But to do that we need to, for each user,
>>>>>>> crawl all of the SharePoint sites that they have access to.
>>>>>>> Permissions to each SharePoint site are managed by Kerberos."
>>>>>>>
>>>>>>> This is not the typical ManifoldCF model.  In the typical case,
>>>>>>> there is ONE Lucene search engine (not N), and any searches that take
>>>>>>> place apply security restrictions internally based on the user's
>>>>>>> security information, as obtained from the ManifoldCF authority
>>>>>>> service, which is in turn querying SharePoint.
>>>>>>>
>>>>>>> You can read more about the standard authorization setup here:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <hank777@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>>>>> appropriate tool. I am a total noob, having just discovered this
>>>>>>>> project, and have a few questions that I am hoping someone can
>>>>>>>> answer so that I can begin to gain some confidence about the way
>>>>>>>> things work. Basically I am trying to make sure I understand, at a
>>>>>>>> top level, how ManifoldCF works.
>>>>>>>>
>>>>>>>> Our project involves a database that has a private secure user
>>>>>>>> space for each user. Our database is built on Lucene and indexes
>>>>>>>> every object in the database. Each user presumably has some number
>>>>>>>> of SharePoint sites that they have access to. We want to index each
>>>>>>>> SharePoint object (file or SharePoint page) as we find it, for each
>>>>>>>> user. The user then ends up with an index of just the objects that
>>>>>>>> they have permissions for. But to do that we need to, for each user,
>>>>>>>> crawl all of the SharePoint sites that they have access to.
>>>>>>>> Permissions to each SharePoint site are managed by Kerberos.
>>>>>>>>
>>>>>>>> So the questions are:
>>>>>>>>
>>>>>>>> a. Can I, with ManifoldCF, take a list of SharePoint sites and a
>>>>>>>> list of users and the relevant Kerberos authentication tokens or
>>>>>>>> keys (just learning about Kerberos), and get back a list of
>>>>>>>> indexable objects/URIs (HTML, .docx, .pptx, etc.)?
>>>>>>>>
>>>>>>>> b. Is this the right way to think about it?
>>>>>>>>
>>>>>>>> c. If so, is there any example code or documentation that would
>>>>>>>> explain how I do this?
>>>>>>>>
>>>>>>>> d. Does ManifoldCF provide any information to help indicate whether
>>>>>>>> the given object has changed, or is that something we need to figure
>>>>>>>> out by manually comparing the old and new documents in our code?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
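
The two security models contrasted in this thread can be illustrated with a
small Java sketch. It is not ManifoldCF code and uses no ManifoldCF API: the
class AccessTokenSketch, the Doc type, the token strings, the URLs, and the
group membership map are all hypothetical, and deny tokens are ignored for
brevity. visibleTo() mimics the single-index model, where each document
carries access tokens and results are filtered at query time. fanOut() mimics
the per-user model, and shows why it needs a token-to-user mapping that, as
noted in the thread, is not generally available when tokens stand for groups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class AccessTokenSketch {

    /** Hypothetical crawled document: a URI plus the allow tokens recorded at crawl time. */
    static final class Doc {
        final String uri;
        final Set<String> allowTokens;

        Doc(String uri, Set<String> allowTokens) {
            this.uri = uri;
            this.allowTokens = allowTokens;
        }
    }

    /** Single-index model: a document is visible if the user holds at least one of its tokens. */
    static List<String> visibleTo(Set<String> userTokens, List<Doc> index) {
        List<String> uris = new ArrayList<>();
        for (Doc doc : index) {
            if (!Collections.disjoint(doc.allowTokens, userTokens)) {
                uris.add(doc.uri);
            }
        }
        return uris;
    }

    /**
     * Per-user model: expand every token into users and build one URI list per user.
     * The membersByToken map is an assumption; tokens with no known members are skipped.
     */
    static Map<String, List<String>> fanOut(List<Doc> index,
                                            Map<String, Set<String>> membersByToken) {
        Map<String, List<String>> perUser = new TreeMap<>();
        for (Doc doc : index) {
            for (String token : doc.allowTokens) {
                for (String user : membersByToken.getOrDefault(token, Collections.emptySet())) {
                    List<String> uris = perUser.computeIfAbsent(user, u -> new ArrayList<>());
                    if (!uris.contains(doc.uri)) {
                        uris.add(doc.uri); // avoid duplicates when a user matches several tokens
                    }
                }
            }
        }
        return perUser;
    }

    public static void main(String[] args) {
        List<Doc> index = List.of(
            new Doc("http://sharepoint.example/sites/hr/policy.docx", Set.of("group:hr")),
            new Doc("http://sharepoint.example/sites/eng/design.pptx",
                    Set.of("group:eng", "group:hr")));
        Map<String, Set<String>> members = Map.of(
            "group:hr", Set.of("alice"),
            "group:eng", Set.of("alice", "bob"));

        System.out.println(visibleTo(Set.of("group:eng"), index));
        System.out.println(fanOut(index, members));
    }
}
```

In the first model the index is built once and the per-user work happens at
search time. In the second, every change to group membership would in
principle require recomputing the per-user lists, which is one reason the
fan-out approach gets expensive as the number of users grows.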
