From: karl.wright@nokia.com
Reply-To: dev@lucene.apache.org
Date: Thu, 29 Apr 2010 16:45:10 +0200
Subject: RE: FW: Solr and LCF security at query time

If we aren't talking about a repository of some kind, then we aren't talking about using LCF.  If your design point is about applying security to NFS via an acl-xml file, your uploaded contribution will do that just fine (although I think you might want to use Filters in some places you are currently using Queries, according to what I've learned over the past day or two).
 
If a repository with security is involved, there's no benefit I can see to building yet another security mechanism above and beyond the one that the repository would provide.  It's double the administration, and in that light only makes sense at all if there's no native security mechanism present in whatever your data source is.  There are certainly a number of "repositories" with this characteristic, though - the web, RSS feeds, file systems, etc.
 
Karl


From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 9:56 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time

By repository, do you mean, for example, NTFS? You certainly wouldn't want, or need, to do that at all, particularly for environments where the repository isn't available. That's kind of the point of having the acl decoupled.

- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

The performance of the filter queries would be no worse (or better) than any other of similar length/complexity. Essentially, the filter queries between the two models are just using a different set of attributes (acl-specific vs. intrinsic to the document). If someone felt they needed to build lots of super-long complex filter queries to define a set of allowed/denied documents, their general search performance is probably not going to be great anyway, and would be remedied by organizing the data more efficiently (which is a good idea in any case).
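
To make the comparison concrete, here is a minimal sketch in Python, used purely for illustration. The field names `allow_token` and `shop_id` and all three helper functions are hypothetical and are not part of Solr or LCF; the only real Solr concept assumed is that the resulting string would be passed as an `fq` filter query. Under these assumptions both models produce a filter of the same shape, and only the field being matched differs.

```python
def or_clause(field, values):
    """Build a filter clause of the form field:("a" OR "b")."""
    if not values:
        # Refuse to build an empty clause rather than risk matching everything.
        raise ValueError("no permitted values for field " + field)
    # Escape backslashes and double quotes so each value stays one phrase.
    escaped = [v.replace("\\", "\\\\").replace('"', '\\"') for v in sorted(values)]
    return field + ":(" + " OR ".join('"' + v + '"' for v in escaped) + ")"


def index_time_filter(user_tokens):
    """Model A: access tokens were written onto each document at index time."""
    return or_clause("allow_token", user_tokens)


def decoupled_filter(user, acl):
    """Model B: the acl lives outside the index and maps a user to values
    of a field that is intrinsic to the documents (here, a shop id)."""
    return or_clause("shop_id", acl[user])
```

In this sketch a change to a user's shop access only changes the `acl` mapping, whereas under model A the documents carrying the affected tokens would need their tokens updated in the index.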


Thanks,
Peter


On Thu, Apr 29, 2010 at 1:10 PM, <karl.wright@nokia.com> wrote:
Putting access control lookup at search-result time has the following benefits:

- It sees changes right away, when the underlying repository changes

Here are the drawbacks, as far as I can see:

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time
- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

Having only one general solution means that you have to pick one or the other of the two models.  We opted for the model we did because the drawbacks of search-result-time lookup were potentially severe, especially under conditions of high demand.  The repository load question is not a trivial one, because it scales as the number of results returned, which is a potentially gigantic number.
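
The scaling concern can be illustrated with a rough sketch (Python, for illustration only). The function name and the `can_see` callback are hypothetical stand-ins for a real-time repository check; nothing here is LCF or Solr code. It counts how many per-document checks are needed to fill one page of results when hits are checked one at a time at search-result time.

```python
def visible_page_post_filter(matches, can_see, page_size):
    """Check hits in rank order until one page of visible results is full.

    Returns the page and the number of repository checks that were made.
    """
    page = []
    checks = 0
    for doc in matches:
        checks += 1  # one real-time repository call per candidate hit
        if can_see(doc):
            page.append(doc)
            if len(page) == page_size:
                break
    return page, checks


# Hypothetical workload: 100,000 matching documents, of which the
# user is allowed to see one in every thousand.
page, checks = visible_page_post_filter(
    range(100_000), lambda doc: doc % 1000 == 999, page_size=10
)
```

With this workload, filling a ten-result page takes 10,000 checks, which is the "lots of matches, few visible" case described above. The count grows with the number of hidden matches, not with the page size alone.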
 
However, I am perfectly fine with supporting both models.  Your suggested solution will work for some classes of problem.  It seems to me that in order to support it you will need a parallel infrastructure.  We could develop that infrastructure within LCF, but it's a bit of work to do:

(1) Output an "internal repository document security identifier" into the index, in addition to tokens.  This id is not the same at all as the document's URI, which is what literal.id is currently set to, so a new Solr schema field would need to be made for this.  All output connectors would need to be modified to do this, and all repository connectors as well.
(2) Since the security identifier would be valid within the context of a given repository connection, the "authority service" code that tries to verify visibility of a document given the authenticated user name and security identifier would need to look up the correct repository connection and call a method within it - which currently doesn't exist.  So we'd need to write such a method for all connectors that have security.
(3) Since this service would have a high load, and only be used under one particular model, I'd suggest actually defining a whole new webapp for it, so it can be distributed/controlled independently.
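
The lookup described in step (2) could be sketched roughly as follows (Python, for illustration only). Every name here is an assumption: `RepositoryConnector`, `is_visible`, `InMemoryConnector` and `AuthorityService` are hypothetical, and as noted above no such connector method exists today. The sketch only shows the dispatch: find the repository connection a security identifier belongs to, then ask that connection whether the user may see the document.

```python
from typing import Dict, Protocol, Set


class RepositoryConnector(Protocol):
    """Hypothetical per-connector method that step (2) says would be needed."""

    def is_visible(self, user: str, security_id: str) -> bool: ...


class InMemoryConnector:
    """Toy connector backed by a dict, standing in for a real repository."""

    def __init__(self, grants: Dict[str, Set[str]]):
        self._grants = grants  # security_id -> users allowed to see it

    def is_visible(self, user: str, security_id: str) -> bool:
        return user in self._grants.get(security_id, set())


class AuthorityService:
    """Resolves the connection by name, then delegates the visibility check."""

    def __init__(self, connections: Dict[str, RepositoryConnector]):
        self._connections = connections

    def check(self, user: str, connection_name: str, security_id: str) -> bool:
        connector = self._connections.get(connection_name)
        if connector is None:
            return False  # unknown connection: deny rather than guess
        return connector.is_visible(user, security_id)


service = AuthorityService({"repo-a": InMemoryConnector({"doc-42": {"karl"}})})
```

Because the security identifier is only meaningful within one repository connection, the index would have to carry the connection name alongside it, which is why the sketch takes both as arguments.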
 
Karl

 

From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 5:35 AM
To: connectors-user@incubator.apache.org
Cc: dev@lucene.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I guess it comes down to this: any solution is ultimately going to place access control on a search and not on data, so there isn't much to be gained by binding the access control to the data. Whatever attributes exist at index time to build an acl will still be there at query time, so by making the acl search-bound, the acl is decoupled from the data, allowing it to be used in any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data is required:

One customer has a 'shop-search' requirement where logged-in users' access to various shops changes daily, sometimes 4 or 5 times a day. There are several hundred such shops and tens of millions of documents, and the indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD domains. They have one domain for the UK, but a completely separate domain for Gibraltar. When data is replicated to remote servers accessed by Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due to departmental history, they have no fewer than 85 AD domains! This of course is a nightmare in itself, but trying to tie access information to data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works equally well for the requirements where you do have acl information at index time. I guess I'm not understanding the advantages to making schema changes and binding acl to data, when there's really no need. I particularly like your idea of using LCF as the facilitator of storing/retrieving such decoupled data (as opposed to just an xml file). It sounds like there's even a user interface for 'non-technical' staff to make acl configuration changes. That's really cool, and ultimately an elegant solution that will fit present and future needs.


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <karl.wright@nokia.com> wrote:
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite - to highlight requirements that have been found to be necessary for customers/users, and to hopefully get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that, it was certainly not my intention.

Peter


