incubator-connectors-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Connectors Framework > FAQ
Date Mon, 04 Apr 2011 13:35:00 GMT
Space: Apache Connectors Framework (https://cwiki.apache.org/confluence/display/CONNECTORS)
Page: FAQ (https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ)


Edited by Erlend Garåsen:
---------------------------------------------------------------------
h1. Frequently asked questions

h3. Security model

*Q. What exactly are the ACCESS_TOKEN and DENY_TOKEN values that are sent to an output connector,
and presumably stored in the index?*

*A.* The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent
a contract between a ManifoldCF authority connection and the ManifoldCF repository connection
that picks up the documents (from wherever).  These tokens thus have no real meaning outside
of ManifoldCF.  You must regard them as opaque.

The contract, however, states that if you use the ManifoldCF authority service to obtain tokens
for an authenticated user, you will get back a set that is consistent with the tokens that
were attached to the documents ManifoldCF sent to Solr for indexing in the first place.  So
you don't have to worry about what the tokens contain, and that's the idea.  Imagine the
following flow:

1. Use ManifoldCF to fetch documents and send them to Solr
2. When searching, use the ManifoldCF authority service to get the desired user's access tokens
3. Either filter the results, or modify the query, to be sure the access tokens all match
up properly
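
As a rough illustration of the "modify the query" option in step 3, here is a minimal Python sketch that turns a user's tokens into a Lucene-syntax filter clause.  The helper names are hypothetical, and the deny field name "__DENY_TOKEN__document" is an assumption made by analogy with the "__ALLOW_TOKEN__document" attribute used elsewhere in this FAQ; use whatever field names your index really has.  It also does not address documents that carry no tokens at all.

```python
def quote_token(token):
    """Wrap a token in double quotes, escaping backslashes and embedded quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_filter_query(user_tokens,
                       allow_field="__ALLOW_TOKEN__document",
                       deny_field="__DENY_TOKEN__document"):  # deny name is assumed
    """Build a filter clause: require an ALLOW match, prohibit any DENY match."""
    if not user_tokens:
        # Without tokens there is nothing to match against; let the caller decide.
        raise ValueError("user has no access tokens")
    tokens = " OR ".join(quote_token(t) for t in user_tokens)
    return f"+{allow_field}:({tokens}) -{deny_field}:({tokens})"


if __name__ == "__main__":
    print(build_filter_query(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

The tokens are treated as opaque strings throughout; the sketch only quotes them so that characters such as ":" are not interpreted by the query parser.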

For the AD authority, the ManifoldCF access tokens consist, in part, of the user's SIDs. 
For other authorities, the access tokens are wildly different.  You really don't want to know
what's in them, since that's the job of the ManifoldCF authority to determine.

ManifoldCF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises
in the world use some form of AD single sign-on for their web applications, and even if they're
using some repository with its own idea of security, there's a mapping between the AD users
and the repository's users.  Doing that mapping is also the job of the ManifoldCF authority
for that repository.

*Q. What is the relationship between stored data (documents) and authority access/deny attributes?
 Do you have any examples of what an access_token value might contain?*

*A.* Documents have access/deny attributes; authorities simply provide the list of tokens
that belong to an authenticated user.  Thus, there's no access/deny for an authority; that's
attached to the document (as it is in real-world repositories).
 
Let's run a quick example, using Active Directory and a Windows file system.  Suppose that
you have a directory with documents in it, call it DirectoryA, and the directory allows read
access to the following SIDs:
 
S-123-456-76890
S-23-64-12345
 
These SIDs correspond to Active Directory groups; let's call them Group1 and Group2, respectively.
 
DirectoryB also has documents in it, and those documents have just the SID S-123-456-76890
attached, because only Group1 can read its contents.
 
Now, pretend that someone has created a ManifoldCF Active Directory authority connection
(in the ManifoldCF UI), which is called "myAD", and this connection is set up to talk to the
governing AD domain controller for this Windows file system.  We now know enough to describe
the document indexing process:
 
* Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside
Solr: "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
* Each file in DirectoryB will have the following __ALLOW_TOKEN__document attribute inside
Solr: "myAD:S-123-456-76890".
 
Now, suppose that a user (let's call him "Peter") is authenticated with the AD domain controller.
 Peter belongs to Group2, so his SIDs are (say):
 
S-1-1-0 (the 'everyone' SID)
S-323-999-12345 (his own personal user SID)
S-23-64-12345 (the SID he gets because he belongs to Group2)
 
We want to look up the documents in the search index that he can see.  So, we ask the ManifoldCF
authority service what his tokens are, and we get back:
 
"myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
 
The documents we should return in his search are the ones that match his search criteria AND
have at least one ALLOW token in common with his tokens, EXCLUDING any document that has a
DENY token in common with his tokens (there aren't any DENY tokens involved in this example).
So only files that have one of his three tokens as an ALLOW attribute would be returned: the
files in DirectoryA, which carry "myAD:S-23-64-12345", but not the files in DirectoryB.
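
The rule can be sketched in a few lines of Python, using the tokens from this example.  This is only an illustration of the set logic: the function name, the dictionary layout, and the file names are hypothetical, and a real deployment would normally apply the rule inside the search engine rather than in application code.

```python
def visible_documents(documents, user_tokens):
    """Return ids of documents with an ALLOW match and no DENY match."""
    user = set(user_tokens)
    visible = []
    for doc in documents:
        allowed = user & set(doc.get("allow", ()))
        denied = user & set(doc.get("deny", ()))
        if allowed and not denied:
            visible.append(doc["id"])
    return visible


# Hypothetical file names; the tokens are the ones from the example above.
documents = [
    {"id": "DirectoryA/file1",
     "allow": ["myAD:S-123-456-76890", "myAD:S-23-64-12345"],
     "deny": []},
    {"id": "DirectoryB/file1",
     "allow": ["myAD:S-123-456-76890"],
     "deny": []},
]
peter_tokens = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

if __name__ == "__main__":
    print(visible_documents(documents, peter_tokens))  # ['DirectoryA/file1']
```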
 
Note that what we are attempting to do in this case is enforce AD's security with the search
results we present.  There is no need to define a whole new security mechanism, because AD
already has one that people use.


*Q. Do the ManifoldCF authority connections authenticate users?*

*A.* The authority connectors don't perform authentication at this time.  In fact, ManifoldCF
has nothing to do with authentication at all - just authorization.  It is almost never the
case that somebody wants to provide multiple credentials in order to be able to see their results.
 Most enterprises that have multiple repositories authenticate against AD and then map AD user
names to repository user names in order to access those repositories.  For a pure-Java authentication
solution, we currently recommend JAAS plus Sun's Krb5 login module (com.sun.security.auth.module.Krb5LoginModule)
for handling the "authenticate against AD" case, which covers some 95%+ of the real-world
authentication needs out there.  We may have more complete recommendations in the future.

*Q. I have a question regarding how multiple identifiers for a given user are handled in the
authority service.  Let's say that I want to get the access tokens for the user John Smith against
all the authority connectors defined in ManifoldCF.  Let's say that John is known as john.smith
in AD, known as j.smith in Documentum, and so on.  If I'm not wrong, the only parameter used
to identify a user in the authority service is "username".  I'm wondering how user id reconciliation
is performed inside the authority service in that case.  Is there something done about that,
or is it work that should be performed externally?*

*A.* The user name mapping is the job of the individual authority.  So, for example, the Documentum
authority would be responsible for any user name mapping that would need to be done prior
to looking up the tokens for that user within Documentum, and the LiveLink authority needs
to do something similar for mapping to LiveLink user names.
 
It turns out that most enterprises that have coexisting repositories of disparate kinds make
an effort to keep their user name spaces consistent across these repositories.  Otherwise,
enterprise-wide single sign-on would be impossible.  In the cases where the convention for
mapping is ad hoc (e.g. LiveLink), the authority connectors included with ManifoldCF were
built with a simple regular-expression-based mapping feature, which you get to configure right
in the crawler UI as part of defining the authority connection.
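
To give a feel for what a regular-expression-based mapping does, here is a small Python sketch that maps the AD-style name "john.smith" to the repository-style name "j.smith" from the question.  It is not the connector's actual configuration syntax; the function name, the pattern, and the replacement template are all assumptions chosen for illustration.

```python
import re


def map_user_name(user_name, pattern, replacement):
    """Map a user name with a regex; names that do not match pass through unchanged."""
    match = re.fullmatch(pattern, user_name)
    if match is None:
        return user_name
    # expand() substitutes group references such as \1 into the template.
    return match.expand(replacement)


# Hypothetical convention: first initial, a dot, then the last name.
PATTERN = r"([a-z])[a-z]*\.([a-z]+)"
REPLACEMENT = r"\1.\2"

if __name__ == "__main__":
    print(map_user_name("john.smith", PATTERN, REPLACEMENT))  # j.smith
```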
 
Many repository companies also have added AD synchronization features as their products have
matured.  Documentum is one such repository, where the repository software provides a feature
for operating with AD.  For those repositories, we did not add a mapping function, because
it would typically be unnecessary if the repository integrator followed the recommended best
practices for deploying that repository.

*Q. I don't like the idea of storing document access tokens in an index.  What happens if/when
you want to add explicit user access to some [group of] documents? (i.e. not via a group)*

*A.* In ManifoldCF, you would change the permissions on the appropriate resource, and then
you run your ManifoldCF job again to update those permissions.  Since ManifoldCF is an incremental
crawler, it is smart enough to only re-index those documents whose permissions have changed,
which makes it a fairly fast operation on most repositories.  Also, in my experience, this
is a relatively infrequent kind of situation, and most enterprises are pretty resilient against
there being a reasonable delay in getting document permissions updated in an index.

However, if this is still a concern, remember that your main alternative is to go directly
to the repository on every document as you filter a result set.  That's slow for most situations.
 Performance might be improved with caching, but only if you knew that the same results would
be returned for multiple queries.  So no solution is perfect.

*Q. I don't like the idea of storing document access tokens in an index.  What happens if
you need to revoke a user's rights, or change a user's group affinity?*

*A.* The access tokens for a user are obtained from the authorities in real time, so there
is no delay.  Only access tokens attached to documents require a job run to be changed.

*Q. I don't like the idea of storing Active Directory SIDs in an index.  They might be changed.*

*A.* Once again, this is a very infrequent occurrence, and when it does happen, ManifoldCF
is well equipped to handle the re-indexing in the most efficient way possible.

*Q. How has ManifoldCF (the example configuration) performed, and on what kind of hardware?*

*A.* The example, running on Derby, has not had performance tests run against it.  The example
running with PostgreSQL 8.3 on a Dell laptop with disk encryption is capable of doing a file
system crawl at 35 documents/second.  A real server will, of course, run significantly faster.
 At MetaCarta, we discovered that the repository being crawled was almost always the bottleneck.
 The only exceptions are RSS and Web crawls.

On a crawl that is executing optimally, the system will be CPU-bound.  If you are seeing low
rates of CPU utilization, it may mean you have inadequate disk performance.  There are also
known bugs with Derby that result in the Derby database deadlocking and recovering, which
also leads to very poor system utilization.

*Q. How do I use PostgreSQL with the quick-start example?*

*A.* First, install PostgreSQL, and remember the database superuser name and password (usually
"postgres" is the name).  Then, change the properties.xml file in the following way:
Change:
<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceDerby"/>
to:
<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
Add:
<property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
<property name="org.apache.manifoldcf.dbsuperuserpassword" value="*******"/>

Then, start the quick-start example normally, and everything should initialize properly.
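
If you would rather script this edit than make it by hand, the following Python sketch applies the same three changes with the standard library's ElementTree.  It assumes that properties.xml consists of <property name="..." value="..."/> elements directly under a single root element (called "configuration" in the sample below, which is an assumption); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PREFIX = "org.apache.manifoldcf."


def set_property(root, name, value):
    """Update the named <property> if present, otherwise append a new one."""
    for prop in root.findall("property"):
        if prop.get("name") == name:
            prop.set("value", value)
            return
    ET.SubElement(root, "property", {"name": name, "value": value})


def switch_to_postgresql(xml_text, superuser, password):
    """Return properties.xml text with the PostgreSQL settings applied."""
    root = ET.fromstring(xml_text)
    set_property(root, PREFIX + "databaseimplementationclass",
                 PREFIX + "core.database.DBInterfacePostgreSQL")
    set_property(root, PREFIX + "dbsuperusername", superuser)
    set_property(root, PREFIX + "dbsuperuserpassword", password)
    return ET.tostring(root, encoding="unicode")


# Assumed minimal file layout, for illustration only.
SAMPLE = (
    '<configuration>'
    '<property name="org.apache.manifoldcf.databaseimplementationclass" '
    'value="org.apache.manifoldcf.core.database.DBInterfaceDerby"/>'
    '</configuration>'
)

if __name__ == "__main__":
    print(switch_to_postgresql(SAMPLE, "postgres", "secret"))
```

Note that parsing with ElementTree this way drops XML comments, so keep a backup of the original file if it contains any.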

*Q. How do you configure Eclipse to build the ManifoldCF project?*

*A.* Here are the steps using Eclipse 3.4:
# Install Subclipse for Eclipse 3.x, follow the steps from http://subclipse.tigris.org/servlets/ProjectProcess?pageID=p4wYuA
# In Eclipse, switch to the "SVN Repository Exploring" perspective
# Add a new SVN repository using the URL http://svn.apache.org/repos/asf/incubator/lcf/trunk
# Right click on the svn repo and select "Check Out"
  #* You may want to change the default name from trunk to ManifoldCF; if you don't change the
name, Eclipse will ask for a project type, so pick General/Project.
# Wait for the source to extract
# Switch to Java Perspective and right click on the project that was added (referred to as
MCF in the rest of the steps) and select "Properties"
# Select "Builders" and click New
# Select "Ant Builder" and click Ok
# Give your builder a name, like ManifoldCF Ant Builder
# In the "Buildfile" section, press the "Browse Workspace" button
# Select the MCF project, drill down to "modules" subfolder and select "build.xml" file then
press Ok
# In the "Base Directory" section, press the "Browse Workspace" button
# Expand the MCF project and select "modules" then press Ok
# Note, you can further configure the different targets if you wish for a clean, regular,
and auto build
# Press Ok in the "Edit launch configuration properties" to complete the Eclipse configuration
# Make sure the JAVA_HOME system variable points to your JDK; you also need the
JDK bin directory listed in your path so that javadoc works
# Now you can issue "Project/Build Project" and watch the console for the ant output


The build will also run the JUnit tests, which increases the build time.  For those
who like to do incremental builds as they code, you may want to configure a "build" target
without the final unit test step ("run-tests" in the "all" target), which reduces build time
from about 5 minutes to about 1 minute.

*Q. What is the proper setting for number of worker threads?*

*A.* The number of worker threads, number of delete threads, number of expiration threads,
database pool size, and maximum number of database handles (in PostgreSQL) are related as
follows for the Quick Start:

(num_worker + num_delete + num_expiration + 10) < database_pool_size <= maximum_database_handles - 2

The formula is somewhat different if you have multiple ManifoldCF processes, e.g. you are
running the crawler separately from the web applications.  In that case you need to add up
ALL the processes, because each of them will have their own pool of the designated size:

database_pool_size * num_processes <= maximum_database_handles - 2

The overall idea is that you don't run out of database handles in the pool (which can even
cause ManifoldCF to deadlock), and you don't run out of real database handles either (which
will cause a database error that stops your jobs).  The adjustment of 2 is simply there
so you can get into the database while ManifoldCF is running, using tools like psql, and
do things like vacuuming.
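
A quick way to sanity-check a planned configuration is to code the two inequalities directly.  The Python sketch below does that; the function and parameter names are hypothetical and simply mirror the names used in the formulas.  In the multi-process case it assumes the thread inequality applies to the pool of the crawler process.

```python
def check_handles(num_worker, num_delete, num_expiration,
                  database_pool_size, maximum_database_handles,
                  num_processes=1):
    """Return a list of problems; an empty list means both inequalities hold."""
    problems = []
    needed = num_worker + num_delete + num_expiration + 10
    if not needed < database_pool_size:
        problems.append(
            f"pool size {database_pool_size} must exceed {needed}")
    total = database_pool_size * num_processes
    if not total <= maximum_database_handles - 2:
        problems.append(
            f"{total} pooled handles exceed the limit of "
            f"{maximum_database_handles - 2}")
    return problems


if __name__ == "__main__":
    # 30 worker + 10 delete + 10 expiration + 10 = 60 < 61 <= 100 - 2
    print(check_handles(30, 10, 10, 61, 100))  # []
    # Two processes with a pool of 61 each need 122 handles.
    print(check_handles(30, 10, 10, 61, 100, num_processes=2))
```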

The first four values are all properties you can (and should!) set in properties.xml.  They
are described in the "how to build and deploy" document on the site.  The last property requires
you to configure the database (probably PostgreSQL).  There are also general instructions
for doing that in "how to build and deploy".

The relationship between worker threads and all of the other kinds depends on your usage.
 Generally, though, 10 expiration threads and 10 deletion threads are fine, since they do
less of the overall work involved.

*Q. How can I use the Quick Start example with PostgreSQL?*

*A.* All you have to do is edit the quick start's properties.xml file as follows:

# Change the property "org.apache.manifoldcf.databaseimplementationclass" to have a value
of "org.apache.manifoldcf.core.database.DBInterfacePostgreSQL".
# Add a property "org.apache.manifoldcf.dbsuperusername" that has a value that is the name
of your PostgreSQL super user.
# Add a property "org.apache.manifoldcf.dbsuperuserpassword" that has a value that is the
password for your PostgreSQL super user.
# Change the property "org.apache.manifoldcf.crawler.threads" to have a value consistent with
your PostgreSQL configuration.
# Change the property "org.apache.manifoldcf.database.maxhandles" to have a value consistent
with your PostgreSQL configuration.

Then, just run the Quick Start normally, and it will create the database instance within PostgreSQL
instead of within Derby.

*Q. How can I connect to the Derby instance when the Quick Start is running?*

*A.* Sometimes it is very useful to be able to look into the Derby database while the ManifoldCF
Quick Start is active.  All you have to do to set this up is as follows:

# Start Quick Start using "java -Dderby.drda.startNetworkServer=true -jar start.jar".
# Start the Derby ij tool from the same directory, using "java -cp lib\derbyclient.jar;lib\derbytools.jar
org.apache.derby.tools.ij", or the Unix equivalent.
# In ij, connect to the database using the command "connect 'jdbc:derby://localhost:1527/dbname';".

The Derby ij command will then let you perform whatever query you like.


h3. Supported Documentation Platforms

*Q. Is there support planned for the Atlassian suite? (Confluence, JIRA, Crucible, Bamboo)*

*A.* This is one of a class of questions, namely "are you currently planning to add a connector
for X".  Open source software is like pot-luck; the more you bring to it, the more you'll
get out of it.  ManifoldCF is designed to make it straightforward to write new connectors,
and contributions of all sorts are strongly encouraged.  Even if you aren't sure you can develop
a full connector on your own, folks involved with the project are happy to help you.  There
is also a book being written, ManifoldCF in Action, which has as one of its goals getting
people to the point of being able to write their own connectors.  Parts of it are available
already - you can check it out here: [http://www.manning.com/wright]

To answer the specific question, connectors have been requested for the following:

# Atlassian
# CMIS
# Oracle with OLS
# The generic Content Management Java API spec, JSR whatever-it-is
# Enhanced SharePoint 2010, with site discovery

The only one I'm aware of that is being worked on right now is SharePoint 2010 with site discovery.
 No current plans exist to implement connectors of other stripes, because none of the committers
have access to such systems at this time.  If you need such a connector, and you have access
to such a system, you are strongly encouraged to join the connectors-user list, post your
query there, and then maybe we can work out a development plan.

h3. Solr integration

*Q. How do I extract and index the contents of documents such as MS Word using Solr 1.4.1?
I'm getting a lazy loading error.*

*A.* There are a couple of bugs related to Solr 1.4.1 that make it difficult to parse such
documents. Before you try the workarounds below, you should read more about the [ExtractingRequestHandler|http://wiki.apache.org/solr/ExtractingRequestHandler]
in order to understand the underlying technology. This handler uses Apache Tika to extract
contents from a broad variety of document formats.

Solr 1.4.1 ships with version 0.4 of Tika which will only allow you to extract the document's
metadata, not the content itself. Therefore you need to upgrade Tika to version 0.8. 

Generally, you have two options. You may get the latest version of Solr from trunk which ships
with Tika 0.8 at the time of writing, but this is not recommended if you plan to use Solr
in a production environment. Alternatively, you may download the following branch instead
which also includes version 0.8 of Tika:
[http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/]

In either case, it is recommended to download the latest version from trunk anyway, since you
need some updated libraries even if you choose the branch version (download both versions
in that case).

If you prefer to use the latest version from trunk, you're done. Otherwise you need to
complete a few more steps. First, step into the contrib/extraction directory and type ant
in order to build the Solr Cell jar file. Then copy it to a dedicated directory intended for
external libraries which Solr requires, for example to <solr_home>/lib. Remember to
specify this folder in your solrconfig.xml file so Solr knows where to look for external libraries.
You will find sufficient information about this inside the configuration file.

Finally you need updated Tika dependencies such as PDFBox. The latest version from trunk should
include sufficient updated libraries, so just copy all the jar files located in the contrib/extraction/lib
folder and place them into your external library folder.

If you also need to specify different date formats as described in the ExtractingRequestHandler
documentation, you must install the following patch as well:
[https://issues.apache.org/jira/secure/attachment/12434831/SOLR-1756.patch]
