manifoldcf-commits mailing list archives

Subject [CONF] Apache Connectors Framework > FAQ
Date Thu, 07 Oct 2010 18:22:00 GMT
Space: Apache Connectors Framework (
Page: FAQ (

Edited by Karl Wright:
h1. Frequently asked questions

h3. Security model

*Q. What exactly are the ACCESS_TOKEN and DENY_TOKEN values that are sent to an output connector,
and presumably stored in the index?*

*A.* The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent
a contract between an ACF authority connection and the ACF repository connection that picks
up the documents (from wherever).  These tokens thus have no real meaning outside of ACF.
You must regard them as opaque.

The contract, however, states that if you use the ACF authority service to obtain tokens for
an authenticated user, you will get back a set that is CONSISTENT with the tokens that were
attached to the documents ACF sent to Solr for indexing in the first place.  So, you don't
have to worry about it, and that's kind of the idea.  Imagine the following flow:

1. Use ACF to fetch documents and send them to Solr
2. When searching, use the ACF authority service to get the desired user's access tokens
3. Either filter the results, or modify the query, to be sure the access tokens all match
up properly
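
For the "modify the query" variant of step 3, here is a minimal sketch of how a search front end might turn a user's tokens into a Lucene-syntax filter clause.  The helper names are hypothetical, the field names are assumptions modeled on the index attribute naming used elsewhere in this FAQ (the DENY field name in particular is a guess), and the tokens are treated as opaque strings, as they should be.

```python
def quote_term(token: str) -> str:
    """Wrap an opaque token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_security_filter(user_tokens,
                          allow_field="__ALLOW_TOKEN__document",  # assumed field name
                          deny_field="__DENY_TOKEN__document"):   # assumed field name
    """Build a filter requiring at least one ALLOW match and no DENY match."""
    if not user_tokens:
        # No tokens means no ALLOW attribute can match; this filter is
        # intended to exclude every document.
        return "-*:*"
    allow = " OR ".join(f"{allow_field}:{quote_term(t)}" for t in user_tokens)
    deny = " OR ".join(f"{deny_field}:{quote_term(t)}" for t in user_tokens)
    return f"+({allow}) -({deny})"
```

The resulting string would be supplied alongside the user's own query (for example as a filter query), so that the user's search terms and the security restriction stay separate.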

For the AD authority, the ACF access tokens consist, in part, of the user's SIDs.  For other
authorities, the access tokens are wildly different.  You really don't want to know what's
in them, since determining that is the job of the ACF authority.

ACF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises
in the world use some form of AD single sign-on for their web applications, and even if they're
using some repository with its own idea of security, there's a mapping between the AD users
and the repository's users.  Doing that mapping is also the job of the ACF authority for that
repository.

*Q. What is the relationship between stored data (documents) and authority access/deny attributes?
 Do you have any examples of what an access_token value might contain?*

*A.* Documents have access/deny attributes; authorities simply provide the list of tokens
that belong to an authenticated user.  Thus, there's no access/deny for an authority; that's
attached to the document (as it is in real-world repositories).

Let's run a quick example, using Active Directory and a Windows file system.  Suppose that
you have a directory with documents in it, call it DirectoryA, and the directory allows read
access to the following SIDs:

* S-123-456-76890
* S-23-64-12345

These SIDs correspond to Active Directory groups, let's call them Group1 and Group2, respectively.
DirectoryB also has documents in it, and those documents have just the SID S-123-456-76890
attached, because only Group1 can read its contents.

Now, pretend that someone has created an ACF Active Directory authority connection (in the
ACF UI), which is called "myAD", and this connection is set up to talk to the governing AD
domain controller for this Windows file system.  We now know enough to describe the document
indexing process:
* Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside
Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
* Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside
Solr: "myAD:S-123-456-76890"

Now, suppose that a user (let's call him "Peter") is authenticated with the AD domain controller.
Peter belongs to Group2, so his SIDs are (say):

* S-1-1-0 (the 'everyone' SID)
* S-323-999-12345 (his own personal user SID)
* S-23-64-12345 (the SID he gets because he belongs to Group2)

We want to look up the documents in the search index that he can see.  So, we ask the ACF
authority service what his tokens are, and we get back:

"myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"

The documents we should return in his search are the ones that match his search criteria,
have at least one ALLOW token in common with his tokens, and have no DENY token in common
with his tokens (there aren't any DENY tokens involved in this example).  So only files that
have one of his three tokens as an ALLOW attribute would be returned: here, the files in
DirectoryA, which carry "myAD:S-23-64-12345", but not the files in DirectoryB.

Note that what we are attempting to do in this case is enforce AD's security with the search
results we present.  There is no need to define a whole new security mechanism, because AD
already has one that people use.
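
The example above can be replayed as a few lines of set arithmetic.  This is only an illustrative sketch: the file names and the in-memory "index" are made up, and a real deployment would apply the same rule inside the search engine rather than in Python.

```python
AUTHORITY = "myAD"  # name of the authority connection from the example


def qualify(sids, authority=AUTHORITY):
    """Prefix each SID with the authority connection name to form a token."""
    return {f"{authority}:{sid}" for sid in sids}


# Hypothetical index contents: one file per directory, no DENY tokens.
index = {
    "DirectoryA/file1": {"allow": qualify(["S-123-456-76890", "S-23-64-12345"]),
                         "deny": set()},
    "DirectoryB/file1": {"allow": qualify(["S-123-456-76890"]),
                         "deny": set()},
}

# Tokens the authority service returns for Peter.
peter_tokens = qualify(["S-1-1-0", "S-323-999-12345", "S-23-64-12345"])


def is_visible(doc, user_tokens):
    """Visible if some ALLOW token matches and no DENY token matches."""
    return bool(doc["allow"] & user_tokens) and not (doc["deny"] & user_tokens)


def visible_documents(idx, user_tokens):
    """Return the sorted names of all documents the user may see."""
    return sorted(name for name, doc in idx.items() if is_visible(doc, user_tokens))
```

With these inputs, `visible_documents(index, peter_tokens)` yields only the DirectoryA file, because Peter's Group2 token is the one token he shares with any document.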

*Q. Do the ACF authority connections authenticate users?*

*A.* The authority connectors don't perform authentication at this time.  In fact, ACF has
nothing to do with authentication at all - just authorization.  It is almost never the case
that somebody wants to provide multiple credentials in order to be able to see their results.
Most enterprises that have multiple repositories authenticate against AD and then map AD user
names to repository user names in order to access those repositories.  For a pure-Java authentication
solution, we are currently recommending JAAS plus sun's kerb5 login module (
for handling the "authenticate against AD" case, which covers some 95%+ of the real world
authentication needed out there.  We may have more complete recommendations in the future.

*Q. I have a question regarding how multiple identifiers for a given user are handled in the
authority service.  Let's say that I want to get the access tokens for the user John Smith against
all the authority connectors defined in ACF.  Let's say that John is known as john.smith in
AD, known as j.smith in Documentum, and so on.  If I'm not wrong, the only parameter used to
identify a user in the authority service is "username".  I'm wondering how user ID reconciliation
is performed inside the authority service in that case.  Is there something done about that,
or is it work that should be performed externally?*

*A.* The user name mapping is the job of the individual authority.  So, for example, the Documentum
authority would be responsible for any user name mapping that would need to be done prior
to looking up the tokens for that user within Documentum, and the LiveLink authority needs
to do something similar for mapping to LiveLink user names.
It turns out that most enterprises that have coexisting repositories of disparate kinds make
an effort to keep their user name spaces consistent across these repositories.  Otherwise,
enterprise-wide single sign-on would be impossible.  In the cases where the convention for
mapping is ad hoc (e.g. LiveLink), the authority connectors included with ACF were built with
a simple regular-expression-based mapping feature, which you get to configure right in the
crawler UI as part of defining the authority connection.
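
To give a feel for what a regular-expression-based user name mapping does, here is a minimal sketch.  It is not the connector's actual configuration format or code; the function name, the example patterns, and the choice to pass unmatched names through unchanged are all assumptions made for illustration.

```python
import re


def map_user_name(user_name, pattern, replacement):
    """Map an incoming user name to a repository user name.

    The whole name must match `pattern`; `replacement` may refer to the
    captured groups (\\1, \\2, ...).  Names that do not match are returned
    unchanged (an assumption, not documented behavior).
    """
    match = re.fullmatch(pattern, user_name)
    if match is None:
        return user_name
    return match.expand(replacement)


# Strip a domain suffix: "john.smith@example.com" becomes "john.smith".
strip_domain = (r"([^@]+)@.*", r"\1")

# Shorten the first name to an initial: "john.smith" becomes "j.smith".
initial_dot_surname = (r"(\w)\w*\.(\w+)", r"\1.\2")
```

Calling `map_user_name("john.smith", *initial_dot_surname)` would produce the `j.smith` form mentioned in the question above.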
Many repository vendors have also added AD synchronization features as their products have
matured.  Documentum is one such repository: its software provides a feature
for operating with AD.  For those repositories, we did not add a mapping function, because
it would typically be unnecessary if the repository integrator followed the recommended best
practices for deploying that repository.

*Q. I don't like the idea of storing document access tokens in an index.  What happens if/when
you want to add explicit user access to some [group of] documents? (i.e. not via a group)*

*A.* In ACF, you would change the permissions on the appropriate resource, and then you run
your ACF job again to update those permissions.  Since ACF is an incremental crawler, it is
smart enough to only re-index those documents whose permissions have changed, which makes
it a fairly fast operation on most repositories.  Also, in my experience, this is a relatively
infrequent kind of situation, and most enterprises can tolerate a reasonable delay in getting
document permissions updated in an index.
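
The general idea of re-indexing only what changed can be sketched as follows.  This is not a description of ACF's internals; it simply assumes that a crawler remembers a fingerprint of each document's access/deny tokens from the previous run and compares it with the current one.  All names here are hypothetical.

```python
import hashlib


def acl_fingerprint(allow_tokens, deny_tokens):
    """Hash the sorted ALLOW and DENY tokens into a stable fingerprint."""
    canonical = ("\n".join(sorted(allow_tokens))
                 + "\n--\n"
                 + "\n".join(sorted(deny_tokens)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def documents_to_reindex(previous, current):
    """Return the sorted ids of documents whose permissions are new or changed.

    previous: dict mapping document id -> fingerprint from the last run
    current:  dict mapping document id -> (allow_tokens, deny_tokens) now
    """
    return sorted(
        doc_id
        for doc_id, (allow, deny) in current.items()
        if previous.get(doc_id) != acl_fingerprint(allow, deny)
    )
```

Documents whose fingerprint is unchanged are skipped, which is why a second job run after a small permission change touches only a few documents.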

However, if this is still a concern, remember that your main alternative is to go directly
to the repository on every document as you filter a resultset.  That's slow for most situations.
 Performance might be improved with caching, but only if you knew that the same results would
be returned for multiple queries.  So no solution is perfect.

*Q. I don't like the idea of storing document access tokens in an index.  What happens if
you need to revoke a user's rights, or change a user's group affinity?*

*A.* The access tokens for a user are obtained from the authorities in real time, so there
is no delay.  Only access tokens attached to documents require a job run to be changed.

*Q. I don't like the idea of storing Active Directory SIDs in an index.  They might be changed.*

*A.* Once again, this is a very infrequent occurrence, and when it does happen, ACF is well
equipped to handle the re-indexing in the most efficient way possible.

h3. Performance

*Q. How well has ManifoldCF (the example configuration) performed, and on what kind of hardware?*

*A.* The example, running on Derby, has not had performance tests run against it.  The example
running with PostgreSQL 8.3 on a Dell laptop with disk encryption is capable of doing a file
system crawl at 35 documents/second.  A real server will, of course, run significantly faster.
At MetaCarta, we discovered that the repository being crawled was almost always the bottleneck.
The only exceptions are RSS and web crawls.
