lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Using payloads and user provided data in score
Date Thu, 23 Jul 2015 16:30:25 GMT
bq: Your "ugly problem" is my situation I think ;)

No, your problem is much worse ;(

The _contents_ of fields are restricted, which is
horrible.

OK, here's another idea out of waaaaaaay left field: Payloads.

It hinges on there being an OK number of possible combinations
which seems to be the case here. "OK" here means < 1B say. It
also hinges on being able to pre-calculate the access rights for
each term as you index it.

Then you attach a payload to each term which is, in effect, the
authorization token for that term that expresses your possibilities,
A, B, A&B, A|B, whatever. A payload is just a few bytes (typically an
encoded float) that get carried along with the term and are accessible
at scoring time.
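
A minimal sketch of the index-time half, under stated assumptions: the
class AuthPayloadTokens and its methods are hypothetical names, not part
of Solr or Lucene. It gives each distinct authorization expression a
small integer id and renders "term|id" text, the shape a delimited
payload filter with a float encoder and "|" delimiter would expect. The
ids are kept at or below 2^24 because a float only represents integers
exactly up to that point.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Solr/Lucene class: maps authorization
// expressions ("A", "A&B", "A|B", ...) to small numeric tokens.
final class AuthPayloadTokens {
    // Largest integer a float can hold without losing precision.
    private static final int MAX_EXACT_FLOAT_INT = 1 << 24;

    private final Map<String, Integer> ids = new LinkedHashMap<>();

    /** Returns the id for an expression, assigning the next free id on first use. */
    int idFor(String authExpression) {
        Integer id = ids.get(authExpression);
        if (id == null) {
            id = ids.size() + 1;
            if (id > MAX_EXACT_FLOAT_INT) {
                throw new IllegalStateException("too many authorization combinations for a float token");
            }
            ids.put(authExpression, id);
        }
        return id;
    }

    /** Renders a term with its token as delimited-payload text, e.g. "foo|1.0". */
    String withPayload(String term, String authExpression) {
        return term + "|" + (float) idFor(authExpression);
    }
}
```

With a fresh instance, withPayload("foo", "A&B") yields "foo|1.0", and
any later term marked A&B reuses token 1.0.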

Now at scoring time, you "drop out" any terms that have "bad"
auth tokens. WARNING: this is totally off the top of my head,
so I'm sure there are gotchas in here. Like does returning 0
from the scoring negate the search.....
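
A rough sketch of that "drop out" check, again under assumptions: the
class AuthPayloadFilter and the requirements table are hypothetical, and
nothing here touches the real Similarity API. It decodes a 4-byte float
payload back to a token id and returns a factor of 1 or 0 depending on
whether the user's authorizations satisfy the token. Whether a zero
factor also removes the match from the results is exactly the open
question above; this only shows the decision itself.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Solr/Lucene class.
final class AuthPayloadFilter {
    // Token id -> alternative authorization sets; satisfying any one grants access.
    // "A&B" is the single set {A, B}; "A|B" is the two sets {A} and {B}.
    private final Map<Integer, Set<Set<String>>> requirements;

    AuthPayloadFilter(Map<Integer, Set<Set<String>>> requirements) {
        this.requirements = requirements;
    }

    /** Stores a float token as 4 bytes (ByteBuffer's default big-endian order). */
    static byte[] encode(float token) {
        return ByteBuffer.allocate(4).putFloat(token).array();
    }

    /** Reads the float token back from a 4-byte payload. */
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /** Returns 1 if the user may see the term, 0 otherwise; unknown tokens are denied. */
    float scoreFactor(byte[] payload, Set<String> userAuths) {
        int id = (int) decode(payload);
        Set<Set<String>> alternatives = requirements.get(id);
        if (alternatives == null) {
            return 0f;
        }
        for (Set<String> needed : alternatives) {
            if (userAuths.containsAll(needed)) {
                return 1f;
            }
        }
        return 0f;
    }
}
```

The user's authorizations still have to reach this code from the
request somehow, which is the part the original question is stuck on.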

No clue whether this can work for you, but here's some sample
code that could give you an idea of how it all works:
https://lucidworks.com/blog/end-to-end-payload-example-in-solr/

Good Luck. You're going places Solr wasn't designed to deal
with, so whatever you do will be "exciting". And you're right,
creating huge clauses will be a performance issue; the payloads
thing may help you tame that.

Best,
Erick

On Thu, Jul 23, 2015 at 7:30 AM, Jamie Johnson <jej2003@gmail.com> wrote:
> Sorry for being vague, I'll try to explain more.  In my use case a
> particular field does not have a security control, it's the data in the
> field.  So for instance if I had a schema with a field called name, there
> could be data that should be secured at A, B, A&B, A|B, etc within that
> field.  So again it's not the field that has this control it's the data in
> the field.  My thought based on your suggestion was to dynamically generate
> the fields based on the authorizations, this way the user would only see
> name, but it would get translated to the fields in the index that they can
> see.  So at index time if a field was added to the solr document that said
> name:foo with authorizations A&B I would need to translate that to
> name_A&B_txt:foo.  Then subsequently on search I would check what fields in
> the index the user should be able to see and rewrite queries that said
> name:foo to name_A&B_txt:foo (assuming the user can see A&B).
>
> We do not explicitly control the fields the user or calling application has
> access to because I don't want to expose the name_A&B_txt:foo fields to
> calling applications, they know that a field "name" exists, based on that I
> need to translate a name:foo query into the appropriately controlled
> version.  Does that make sense?
>
> My biggest concern with this (beyond the query rewrite) is how it will
> impact scoring (especially in the case information is available with
> multiple markings, i.e. name_A_txt has a value of foo and name_B_txt has a
> value of foo and the user has authorizations A and B) and possibly bumping
> up against the maximum clause limit as we expand the query.
>
> These reasons were why I thought it best to use payloads to make terms with
> authorizations a user can't see not impact the score and then resolve the
> actual object the user can see using a store that already supports this
> type of access pattern (specifically Accumulo in this case).
>
> Your "ugly problem" is my situation I think ;)
>
> On Thu, Jul 23, 2015 at 12:06 AM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> I'm not quite getting it here. I'm guessing that you do not
>> allow fielded queries or you strictly control the fields a user
>> sees to pick from. Otherwise your security stuff goes out the
>> window, say you have a drop-down list of fields to choose from
>> or something.
>>
>> Assuming you do NOT have such a thing, the user is just typing
>> words in a box, then you have to figure out, once at the
>> app layer, what fields they have access to and just append a
>> qf=field_secure1+field_secure2.....
>> parameter to the query.
>>
>> That's it. You do not have to rewrite the user query at all, the q
>> parameter is just passed through as is.
>>
>> bq:  I guess in a search component I could look up all of the fields
>> that are in the index and only run queries against fields they should be
>> able to see once I know what is in the index (this is what you're
>> suggesting right?).
>>
>> Kind of, except not in a search component. You have to have modeled
>> the access rights somewhere, so I'm not getting why you can't just use
>> that model to generate the list of restricted fields the user has access
>> to. You haven't explained that model other than to say it's "complex".
>> So I have no clue whether you're talking about not _knowing_ what fields
>> are in the docs in the first place (quite possible with dynamic fields)
>> or whether you do know the complete field list but calculating the
>> user's access rights to which fields is complex.
>>
>> But I should emphasize again that my assumption is that once calculated,
>> this list is invariant so it does not need to be done for every request.
>> Indeed, what I'm envisioning is not writing any Solr code at all, all
>> done in the app layer.
>>
>> As far as extra work, there isn't any as far as Solr is concerned.
>> It's exactly as though you were specifying this in, say, the request
>> handler. So I don't get your concern about lots and lots of fields.
>> Now, I'm assuming a simple document model with some number
>> of fields. The access rights to which of those fields a user can
>> see may be a complex calculation, but again you only need to do it
>> once. For that matter, you could pre-calculate that set of fields
>> or otherwise cache it.
>>
>> Now, this breaks down if the document model isn't that simple,
>> say the same field in doc1 can be seen by userX, but userX
>> can't see the _same_ field in doc2. That's an ugly problem...
>>
>> And let's further say there are a number of fields that _everyone_
>> can see. They can be placed in an <appends> section of the request
>> handler so you don't have to specify them for each request.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 22, 2015 at 4:12 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>> > Looks like this may be what I'm looking for
>> >
>> > *SolrRequestInfo*
>> >
>> > I have not tried this yet but looks promising.
>> >
>> > Assuming this works, thinking about your suggestion I would need to
>> > rewrite the user's query with the appropriate fields, are there any
>> > utilities for doing this?  I'd be looking to rewrite a fielded query
>> > like +field:value possibly to something like
>> > +(field.secure:value field.secure2:value)
>> >
>> > Again thanks for suggestions
>> > On Jul 22, 2015 5:20 PM, "Jamie Johnson" <jej2003@gmail.com> wrote:
>> >
>> >> I answered my own question, looks like the field infos are always read
>> >> within the IndexSearcher so that cost is already being paid.
>> >>
>> >> I would potentially have to duplicate information in multiple fields if
>> >> it was present at multiple authorization levels, is there a limit to
>> >> the number of fields within a document?  I'm also concerned this might
>> >> skew my search results as terms that had more authorizations would
>> >> appear in more fields and would result in more matches on query.  I'll
>> >> play with this a little but I am still wondering about my original
>> >> question.
>> >>
>> >> On Wed, Jul 22, 2015 at 4:45 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>> >>
>> >>> I had thought about this in the past, but thought it might be too
>> >>> expensive.  I guess in a search component I could look up all of the
>> >>> fields that are in the index and only run queries against fields they
>> >>> should be able to see once I know what is in the index (this is what
>> >>> you're suggesting right?).
>> >>>
>> >>> My concern would be that the number of fields per document would grow
>> >>> too large to support this.  Our controls aren't simple like user or
>> >>> admin, they are complex combinations of authorizations so I would think
>> >>> there might be a large number of fields that are generated using this
>> >>> approach.  Would retrieving all field infos from Solr be expensive on
>> >>> each request to see what they should be able to query?
>> >>>
>> >>> On Wed, Jul 22, 2015 at 4:19 PM, Erick Erickson <erickerickson@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Why don't you handle it all at the app level? Here's what I mean:
>> >>>>
>> >>>> I'm assuming that you're using edismax here, but the same principle
>> >>>> applies if not.
>> >>>>
>> >>>> Your handler (say the "/select" handler) has a "qf" parameter which
>> >>>> defines the fields that are searched over in the absence of a field
>> >>>> qualifier, e.g.
>> >>>> q=whatever&qf=title+description
>> >>>>
>> >>>> causes the search term to be looked for in the two fields "title" and
>> >>>> "description".
>> >>>> You can also set up the qf fields in the "/select" handler as one of
>> >>>> the items in the <defaults> section....
>> >>>>
>> >>>> But, the qf param in the <defaults> section is just that... a default.
>> >>>> So individual queries can override it. What I have in mind is that
>> >>>> you'd look up the user's field-access list and append that list as
>> >>>> necessary to the query and just pass it on through.
>> >>>>
>> >>>> Things to watch out for:
>> >>>> 1> if the user specifies a field, you'll have to strip that off if
>> >>>> they don't have rights,
>> >>>> i.e. q=field1:whatever whenever
>> >>>> ignores the qf parameter for "whatever" but does respect the qf param
>> >>>> for "whenever".
>> >>>> 2> If you have some kind of date field say that you want to facet
>> >>>> over, you'd have to control that.
>> >>>> 3> if you have a "bag of words" where you use copyField to add a bunch
>> >>>> of fields' data to an uber-field then the user can infer some things
>> >>>> from that info, so you probably want to be careful about what
>> >>>> copyFields you use.
>> >>>>
>> >>>> Best,
>> >>>> Erick
>> >>>>
>> >>>> On Wed, Jul 22, 2015 at 12:21 PM, Jamie Johnson <jej2003@gmail.com>
>> >>>> wrote:
>> >>>> > I am looking for a way to prevent fields that users shouldn't be
>> >>>> > able to know exist from contributing to the score.  The goal is to
>> >>>> > provide a way to essentially hide certain fields from requests based
>> >>>> > on an access level provided on the query.  I have managed to make
>> >>>> > terms that users shouldn't be able to see not impact the score by
>> >>>> > implementing a custom Similarity class that looks at the terms'
>> >>>> > payloads and returns 0 for the score if they shouldn't know the field
>> >>>> > exists.  The issue however is that I don't have access to the request
>> >>>> > at this point so getting the user's access level is proving
>> >>>> > problematic.  Is there a way to get the current request that is being
>> >>>> > processed via some thread local variable or something similar that
>> >>>> > Solr maintains?  If not is there another approach that I could be
>> >>>> > using to access information from the request within my Similarity
>> >>>> > implementation?  Any thoughts on this would be greatly appreciated.
>> >>>> >
>> >>>> > -Jamie
>> >>>>
>> >>>
>> >>>
>> >>
>>
