lucene-solr-user mailing list archives

From Steven White <swhite4...@gmail.com>
Subject Re: When is too many fields in "qf" is too many?
Date Thu, 28 May 2015 21:58:40 GMT
Hi Folks,

First, thanks for taking the time to read and reply to this thread; it is
much appreciated.  I have yet to come up with a final solution that
optimizes Solr.  To give you more context, here is the big picture of how
the application and the database, for which I'm trying to enable Solr
search, are structured.

Application: Has the concept of "views".  A view contains one or more
object types.  An object type may exist in any view.  An object type has
one or more field groups.  A field group has a set of fields.  A field
group can be used with any object type of any view.  Notice how field
groups are free standing, that they can be "linked" to an object type of
any view?

Here is a diagram of the above:

FieldGroup-#1 == Field-1, Field-2, Field-5, etc.
FieldGroup-#2 == Field-1, Field-5, Field-6, Field-7, Field-8, etc.
FieldGroup-#3 == Field-2, Field-5, Field-8, etc.

View-#1 == ObjType-#2 (using FieldGroup-#1 & #3)  +  ObjType-#4 (using
FieldGroup-#1)  +  ObjType-#5 (using FieldGroup-#1, #2, #3, etc).

View-#2 == ObjType-#1 (using FieldGroup-#3, #15, #16, #19, etc.)  +
 ObjType-#4 (using FieldGroup-#1, #4, #19, etc.)  +  etc.

View-#3 == ObjType-#1 (using FieldGroup-#1, & #8)  +  etc.

Do you see where this is heading?  To make it a bit more interesting,
ObjType-#4 (which is in View-#1 and View-#2 per the above) uses
FieldGroup-#1 in both views, but in one view it can be configured to have
its own fields off FieldGroup-#1.

With the above setting, a user is assigned a view and can be moved around
views but cannot be in multiple views at the same time.  Based on which
view that user is in, that user will see different fields of ObjType-#1
(the example I gave for FieldGroup-#1) or even not see an object type that
he was able to see in another view.

If I have not lost you with the above, you can see that per view, there can
be many fields.  To make it yet more interesting, a field in
FieldGroup-#1 may have the exact same name as a field in another FieldGroup
and the two could be of different types (one a date, the other a string).
Thus when I build my Solr doc object (and create the list of Solr
fields), those fields must be prefixed with the FieldGroup name; otherwise I
could end up overwriting the type of another field.
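
A minimal sketch of that prefixing, in Python.  The helper names
(`solr_field_name`, `build_solr_doc`), the `id` key, and the sample values
are assumptions made up for illustration; only the "FieldGroup name + dot +
field name" convention comes from the schema shown further down.

```python
def solr_field_name(field_group: str, field: str) -> str:
    """Qualify a field with its field group so same-named fields never collide."""
    return f"{field_group}.{field}"


def build_solr_doc(doc_id: str, groups: dict[str, dict[str, object]]) -> dict[str, object]:
    """Flatten {field_group: {field: value}} into one flat Solr document dict."""
    doc: dict[str, object] = {"id": doc_id}  # "id" as the unique key is an assumption
    for group, fields in groups.items():
        for field, value in fields.items():
            doc[solr_field_name(group, field)] = value
    return doc


# Two groups both define "Date", with different types (hypothetical sample data).
record = {
    "FieldGroup-1": {"Date": "2015-05-28T00:00:00Z"},  # a date in this group
    "FieldGroup-2": {"Date": "end of May"},            # a free-text string in this one
}
doc = build_solr_doc("rec-1", record)
print(sorted(doc))
```

Because each name carries its group, the two "Date" fields end up as
separate Solr fields and can keep separate types.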

Are you still with me?  :-)

Now you see how a view can end up with many fields (over 3500 in my case),
but a doc I post to Solr for indexing will have on average 50 fields, worst
case maybe 200 fields.  This is fine, and it is not my issue, but I want to
call it out to get it out of our way.

Another thing I need to mention is this (in case it is not clear from the
above).  Users create and edit records in the DB by an instance of
ObjType-#N.  Those object types that are created do NOT belong to a view,
in fact they do NOT have any view concept in them.  They simply have the
concept of what fields the user can see / edit based on which view that
user is in.  In effect, in the DB, we have instances of object types data.

One last thing I should point out is that views and field groups are
dynamic.  This month, View-#3 may have ObjType-#1, but next month it may
not, or a new object type may be added to it.

Still with me?  If so, you are my hero!!  :-)

So, I set up my Solr schema.xml to include all fields off each field group
that exists in the database, like so:

    <field name="FieldGroup-1.Headline" type="text" multiValued="true"
           indexed="true" stored="false" required="false"/>
    <field name="FieldGroup-1.Summary" type="text" multiValued="true"
           indexed="true" stored="false" required="false"/>
    <field name="FieldGroup-1. ... ... ... ... />
    <field name="FieldGroup-2.Headline" type="text" multiValued="true"
           indexed="true" stored="false" required="false"/>
    <field name="FieldGroup-2.Summary" type="text" multiValued="true"
           indexed="true" stored="false" required="false"/>
    <field name="FieldGroup-2.Date" type="text" multiValued="true"
           indexed="true" stored="false" required="false"/>
    <field name="FieldGroup-2. ... ... ... ... />
    <field name="FieldGroup-3. ... ... ... ... />
    <field name="FieldGroup-4. ... ... ... ... />

You get the idea.  Each record of an object type I index contains ALL the
fields off that object type REGARDLESS of which view that object type is
set to be in (remember, all that a view does is let you configure the list
of fields visible / accessible in that view).

Next, in Solr I created one request handler per view.  The request handler
uses "qf" to list all fields that are viewable in that view.  When a
user logs into the application, I know which view that user is in, so I
issue a search request against that view's handler; in effect, the search
runs against the list of fields of that view.
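
A rough sketch of how the "qf" value for one view could be derived from the
view / object type / field group model described earlier, in Python.  The
`VIEWS` and `FIELD_GROUPS` tables, their contents, and the function names are
illustrative assumptions, not the actual database or configuration.  The
sketch passes "qf" as a request parameter; the same value could equally sit
in the defaults of a per-view request handler.

```python
from urllib.parse import urlencode

# Hypothetical sample of the model: view -> object type -> linked field groups.
VIEWS = {
    "View-1": {
        "ObjType-2": ["FieldGroup-1", "FieldGroup-3"],
        "ObjType-4": ["FieldGroup-1"],
    },
}
# Hypothetical sample: field group -> the fields it contains.
FIELD_GROUPS = {
    "FieldGroup-1": ["Headline", "Summary"],
    "FieldGroup-3": ["Headline", "Date"],
}


def qf_for_view(view: str) -> str:
    """Return the space-separated, de-duplicated qf value for one view."""
    names = set()
    for groups in VIEWS[view].values():
        for group in groups:
            names.update(f"{group}.{field}" for field in FIELD_GROUPS[group])
    return " ".join(sorted(names))


def search_params(view: str, user_query: str) -> str:
    """URL-encode the edismax parameters for a search restricted to one view."""
    return urlencode({"q": user_query, "defType": "edismax", "qf": qf_for_view(view)})


print(qf_for_view("View-1"))
print(search_params("View-1", "apple"))
```

FieldGroup-1 is linked to two object types in the sample view, but its
fields appear only once in the resulting list because names are collected in
a set.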

Why not create a per-view pseudo Solr field, copyField the fields' data
into it, and then use that single field as the "qf" instead of hundreds of
fields?  Two reasons:

1) Like I said above, views are dynamic.  On a monthly basis, object
types or even field groups can be added to / removed from a view.  If I were
using copyField, I would have to reindex my entire database to reflect a
view change even when the actual data has not changed.

2) My Solr index size would be larger, because I would have to create a
pseudo Solr field, populated via copyField, for each view in my database.
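
To illustrate reason 1, here is a small Python sketch that diffs the field
list of a view before and after a change.  The function name `qf_change` and
the sample field lists are assumptions for illustration.  With the "qf"
approach, a change like this touches only the handler's field list; with a
per-view copyField target, the copying happens at index time, so documents
carrying the added or removed fields would need to be reindexed.

```python
def qf_change(old_qf: str, new_qf: str) -> tuple[set[str], set[str]]:
    """Return (added, removed) field names between two space-separated qf values."""
    old, new = set(old_qf.split()), set(new_qf.split())
    return new - old, old - new


# Hypothetical before/after field lists for one view.
added, removed = qf_change(
    "FieldGroup-1.Headline FieldGroup-1.Summary",
    "FieldGroup-1.Headline FieldGroup-2.Date",
)
print("added:", sorted(added), "removed:", sorted(removed))
```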

I have also considered creating multiple cores per view, but that still
doesn't solve the above two issues: it requires a reindex and increases the
index size.

Now that you see what my backend application is like, let me know if you
have any ideas on how you would solve this puzzle.

And if you have read this all the way to the end, I salute you!!

Steve


On Thu, May 28, 2015 at 4:23 PM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> I would reconsider the strategy of mashing so many different record types
> into one Solr collection. Sure, you get some advantage from denormalizing
> data, but if the downside cost gets too high, it may not make so much
> sense.
>
> I'd consider a collection per record type, or at least group similar record
> types, and then query as many collections - in parallel - as needed for a
> given user. That should also assure that a query for a given record type
> should be much faster as well.
>
> Surely you should be able to examine the query in the app and determine
> what record types it might apply to.
>
> When in doubt, make your schema as clean and simple as possible. Simplicity
> over complexity.
>
>
> -- Jack Krupansky
>
> On Thu, May 28, 2015 at 12:06 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > Gotta agree with Jack here. This is an insane number of fields, query
> > performance on any significant corpus will be "fraught" etc. The very
> > first thing I'd look at is having that many fields. You have 3,500
> > different fields! Whatever the motivation for having that many fields
> > is the place I'd start.....
> >
> > Best,
> > Erick
> >
> > On Thu, May 28, 2015 at 5:50 AM, Jack Krupansky
> > <jack.krupansky@gmail.com> wrote:
> > > This does not even pass a basic smell test for reasonability of matching
> > > the capabilities of Solr and the needs of your application. I'd like to
> > > hear from others, but I personally would be -1 on this approach to misusing
> > > qf. I'd simply say that you need to go back to the drawing board, and that
> > > your primary focus should be on working with your application product
> > > manager to revise your application requirements to more closely match the
> > > capabilities of Solr.
> > >
> > > To put it simply, if you have more than a dozen fields in qf, you're
> > > probably doing something wrong. In this case horribly wrong.
> > >
> > > Focus on designing your app to exploit the capabilities of Solr, not to
> > > misuse them.
> > >
> > > In short, to answer the original question, more than a couple dozen fields
> > > in qf is indeed too many. More than a dozen raises a yellow flag for me.
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Thu, May 28, 2015 at 8:13 AM, Steven White <swhite4141@gmail.com> wrote:
> > >
> > >> Hi Charles,
> > >>
> > >> That is what I have done.  At the moment, I have 22 request handlers; some
> > >> have 3490 field items in "qf" (that's the most, and the qf line spans over
> > >> 95,000 characters in the solrconfig.xml file) and the smallest one has 1341
> > >> fields.  I'm working on seeing if I can use copyField to copy the data of
> > >> that view's fields into a single pseudo-view-field and use that pseudo field
> > >> for "qf" of that view's request handler.  The issue I still have outstanding
> > >> with using copyField in this way is that it could lead to a complete
> > >> re-indexing of all the data in that view when a field is added to / removed
> > >> from that view.
> > >>
> > >> Thanks
> > >>
> > >> Steve
> > >>
> > >> On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles <
> > >> Charles.Reitzel@tiaa-cref.org> wrote:
> > >>
> > >> > One request handler per view?
> > >> >
> > >> > I think if you are able to make the actual view in use for the current
> > >> > request a single value (vs. all views that the user could use over time),
> > >> > it would keep the qf list down to a manageable size (e.g. specified within
> > >> > the request handler XML).  Not sure if this is feasible for you, but it
> > >> > seems like a reasonable approach given the use case you describe.
> > >> >
> > >> > Just a thought ...
> > >> >
> > >> > -----Original Message-----
> > >> > From: Steven White [mailto:swhite4141@gmail.com]
> > >> > Sent: Tuesday, May 26, 2015 4:48 PM
> > >> > To: solr-user@lucene.apache.org
> > >> > Subject: Re: When is too many fields in "qf" is too many?
> > >> >
> > >> > Thanks Doug.  I might have to take you up on the hangout offer.  Let me
> > >> > refine the requirement further and if I still see the need, I will let
> > >> > you know.
> > >> >
> > >> > Steve
> > >> >
> > >> > On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull <
> > >> > dturnbull@opensourceconnections.com> wrote:
> > >> >
> > >> > > How you have tie is fine. Setting tie to 1 might give you reasonable
> > >> > > results. You could easily still have scores that are just always an
> > >> > > order of magnitude or two higher, but try it out!
> > >> > >
> > >> > > BTW Anything you put in the URL can also be put into a request handler.
> > >> > >
> > >> > > If you ever just want to have a 15 minute conversation via hangout,
> > >> > > happy to chat with you :) Might be fun to think through your problem
> > >> > > together.
> > >> > >
> > >> > > -Doug
> > >> > >
> > >> > > On Tue, May 26, 2015 at 1:42 PM, Steven White <swhite4141@gmail.com> wrote:
> > >> > >
> > >> > > > Hi Doug,
> > >> > > >
> > >> > > > I'm back to this topic.  Unfortunately, due to my DB structure and
> > >> > > > business need, I will not be able to search against a single field
> > >> > > > (i.e., using copyField).  Thus, I have to use a list of fields via "qf".
> > >> > > > Given this, I see you said above to use "tie=1.0"; will that, more or
> > >> > > > less, address this scoring issue?  Should "tie=1.0" be set on the
> > >> > > > request handler like so:
> > >> > > >
> > >> > > >   <requestHandler name="/select" class="solr.SearchHandler">
> > >> > > >      <lst name="defaults">
> > >> > > >        <str name="echoParams">explicit</str>
> > >> > > >        <int name="rows">20</int>
> > >> > > >        <str name="defType">edismax</str>
> > >> > > >        <str name="qf">F1 F2 F3 F4 ... ... ...</str>
> > >> > > >        <float name="tie">1.0</float>
> > >> > > >        <str name="fl">_UNIQUE_FIELD_,score</str>
> > >> > > >        <str name="wt">xml</str>
> > >> > > >        <str name="indent">true</str>
> > >> > > >      </lst>
> > >> > > >   </requestHandler>
> > >> > > >
> > >> > > > Or must "tie" be passed as part of the URL?
> > >> > > >
> > >> > > > Thanks
> > >> > > >
> > >> > > > Steve
> > >> > > >
> > >> > > >
> > >> > > > On Wed, May 20, 2015 at 2:58 PM, Doug Turnbull <
> > >> > > > dturnbull@opensourceconnections.com> wrote:
> > >> > > >
> > >> > > > > Yeah a copyField into one could be a good space/time tradeoff. It
> > >> > > > > can be more manageable to use an all field for both relevancy and
> > >> > > > > performance, if you can handle the duplication of data.
> > >> > > > >
> > >> > > > > You could set tie=1.0, which effectively sums all the matches
> > >> > > > > instead of picking the best match. You'll still have cases where one
> > >> > > > > field's score might just happen to be far off of another, and thus
> > >> > > > > dominating the summation. But something easy to try if you want to
> > >> > > > > keep playing with dismax.
> > >> > > > >
> > >> > > > > -Doug
> > >> > > > >
> > >> > > > > On Wed, May 20, 2015 at 2:56 PM, Steven White
> > >> > > > > <swhite4141@gmail.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi Doug,
> > >> > > > > >
> > >> > > > > > Your blog write up on relevancy is very interesting, I didn't
> > >> > > > > > know this.  Looks like I have to go back to my drawing board and
> > >> > > > > > figure out an alternative solution: somehow get those
> > >> > > > > > group-based-fields data into a single field using copyField.
> > >> > > > > >
> > >> > > > > > Thanks
> > >> > > > > >
> > >> > > > > > Steve
> > >> > > > > >
> > >> > > > > > On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull <dturnbull@opensourceconnections.com> wrote:
> > >> > > > > >
> > >> > > > > > > Steven,
> > >> > > > > > >
> > >> > > > > > > I'd be concerned about your relevance with that many qf fields.
> > >> > > > > > > Dismax takes a "winner takes all" point of view to search. Field
> > >> > > > > > > scores can vary by an order of magnitude (or even two) despite the
> > >> > > > > > > attempts of query normalization. You can read more here:
> > >> > > > > > >
> > >> > > > > > > http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/
> > >> > > > > > >
> > >> > > > > > > I'm about to win the "blasphemer" merit badge, but ad-hoc all-field
> > >> > > > > > > like searching over many fields is actually a good use case for
> > >> > > > > > > Elasticsearch's cross field queries.
> > >> > > > > > >
> > >> > > > > > > https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html
> > >> > > > > > >
> > >> > > > > > > http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/
> > >> > > > > > >
> > >> > > > > > > It wouldn't be hard (and actually a great feature for the
> > >> > > > > > > project) to get the Lucene query associated with cross field
> > >> > > > > > > search into Solr. You could easily write a plugin to integrate it
> > >> > > > > > > into a query parser:
> > >> > > > > > >
> > >> > > > > > > https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java
> > >> > > > > > >
> > >> > > > > > > Hope that helps
> > >> > > > > > > -Doug
> > >> > > > > > > --
> > >> > > > > > > Doug Turnbull | Search Relevance Consultant | OpenSource Connections,
> > >> > > > > > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> > >> > > > > > > Author: Relevant Search <http://manning.com/turnbull> from
> > >> > > > > > > Manning Publications
> > >> > > > > > > This e-mail and all contents, including attachments, is considered
> > >> > > > > > > to be Company Confidential unless explicitly stated otherwise,
> > >> > > > > > > regardless of whether attachments are marked as such.
> > >> > > > > > >
> > >> > > > > > > On Wed, May 20, 2015 at 8:27 AM, Steven White <swhite4141@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi everyone,
> > >> > > > > > > >
> > >> > > > > > > > My solution requires that users in group-A can only search against a
> > >> > > > > > > > set of fields-A and users in group-B can only search against a set
> > >> > > > > > > > of fields-B, etc.  There can be several groups, as many as 100 or even
> > >> > > > > > > > more.  To meet this need, I build my search by passing in the list of
> > >> > > > > > > > fields via "qf".  What goes into "qf" can be large: as many as 1500
> > >> > > > > > > > fields, and each field name averages 15 characters long; in effect the
> > >> > > > > > > > data passed via "qf" will be over 20K characters.
> > >> > > > > > > >
> > >> > > > > > > > Given the above, besides the fact that a search for "apple" translates
> > >> > > > > > > > to 20K characters passing over the network, what else within Solr
> > >> > > > > > > > and Lucene should I be worried about, if anything?  Will I hit some
> > >> > > > > > > > kind of a limit?  Will each search now require more CPU cycles?
> > >> > > > > > > > Memory?  Etc.
> > >> > > > > > > >
> > >> > > > > > > > If the network traffic becomes an issue, my alternative solution is to
> > >> > > > > > > > create a /select handler for each group and in that handler list the
> > >> > > > > > > > fields under "qf".
> > >> > > > > > > >
> > >> > > > > > > > I have considered creating pseudo-fields for each group and then using
> > >> > > > > > > > copyField into that group.  During search, I then can "qf" against that
> > >> > > > > > > > one field.  Unfortunately, this is not ideal for my solution because the
> > >> > > > > > > > fields that go into each group dynamically change (at least once a
> > >> > > > > > > > month) and when they do change, I have to re-index everything (this I
> > >> > > > > > > > have to avoid) to sync that group-field.
> > >> > > > > > > >
> > >> > > > > > > > I'm using "qf" with edismax and my Solr version is 5.1.
> > >> > > > > > > >
> > >> > > > > > > > Thanks
> > >> > > > > > > >
> > >> > > > > > > > Steve
