lucene-solr-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Help with denormalizing issues
Date Wed, 07 Oct 2009 21:38:19 GMT
The separate SKU values do not become one long text string. They are
separate values in the same field. The relevance calculation is
completely separate per value.

The performance problem with the field collapsing patch is that it
does the same thing as a facet or sorting operation: it sweeps
through the index and builds a data structure whose size depends on
the size of the index. Faceting is not cached directly but still works
very quickly the second time. Sorting has its own cache and is very
slow (N log N) the first time and very fast afterwards. The field
collapsing patch does not cache any of its work and is almost as slow
the second time as the first time.
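
To make the caching difference concrete, here is a toy sketch in Python.
It is not Solr code: the ToySearcher class, its method names, and the
"price" field are all made up for illustration. It only counts how often
a full sweep over the "index" happens with and without a cache.

```python
# Toy model, NOT Solr code: every name here is hypothetical.
class ToySearcher:
    def __init__(self, field_values):
        self.field_values = field_values  # position = doc id, value = sort key
        self._sort_cache = {}             # assumed per-field cache of sorted doc ids
        self.sweeps = 0                   # number of full passes over the index

    def _sweep_and_sort(self):
        # One full pass over every document, then an O(N log N) sort.
        self.sweeps += 1
        return sorted(range(len(self.field_values)),
                      key=self.field_values.__getitem__)

    def sorted_docs_cached(self, field="price"):
        # Sorting-style behaviour: pay for the sweep once, reuse it afterwards.
        if field not in self._sort_cache:
            self._sort_cache[field] = self._sweep_and_sort()
        return self._sort_cache[field]

    def sorted_docs_uncached(self):
        # Collapsing-patch-style behaviour as described above: redo the work each time.
        return self._sweep_and_sort()


searcher = ToySearcher([30, 10, 20])
first = searcher.sorted_docs_cached()
second = searcher.sorted_docs_cached()
sweeps_with_cache = searcher.sweeps       # two cached calls, one sweep

searcher.sorted_docs_uncached()
searcher.sorted_docs_uncached()
sweeps_total = searcher.sweeps            # each uncached call adds a sweep
```

The second cached call returns the stored list without touching the
index again, while each uncached call repeats the whole sweep.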

On 10/7/09, Eric Reeves <ereeves@eline.com> wrote:
> Hi again, I'm gonna try this again with more focus this time :D
>
> 1) Ideally what we would like to do, is plug in an additional mechanism to
> filter the initial result set, because we can't find a way to implement our
> filtering needs as filter queries against a single index.  We would want to
> do this while maintaining support for paging.  Looking through the codebase
> it looks as if this would not be possible without major surgery, due to the
> paging support being implemented deep inside private methods of
> SolrIndexSearcher.  Does this sound accurate?
>
> 2) If we pursue the other option of indexing skus and collapsing the results
> based on product id using the field collapsing patch, is there any validity
> to my concerns about indexing the same content multiple times skewing the
> scoring?
>
> 3) Does anyone have experience using the field collapsing patch, and have
> any idea how much additional overhead it incurs?
>
> Thanks,
> Eric
>
> -----Original Message-----
> From: Eric Reeves
> Sent: Monday, October 05, 2009 6:19 PM
> To: solr-user@lucene.apache.org
> Subject: Help with denormalizing issues
>
> Hi there,
>
> I'm evaluating Solr as a replacement for our current search server, and am
> trying to determine what the best strategy would be to implement our
> business needs.  Our problem is that we have a catalog schema with products
> and skus, one to many.  The most relevant content being indexed is at the
> product level, in the name and description fields.  However we are
> interested in filtering by sku attributes, and in particular making multiple
> filters apply to a single sku.  For example, find a product that contains a
> sku that is both blue and on sale.  No approach I've tried at collapsing the
> sku data into the product document works for this.  If we put the data in
> separate fields, there's no way to apply multiple filters to the same sku,
> and if we concatenate all of the relevant sku data into a single multivalued
> field then as I understand it, this is just indexed as one large field with
> extra whitespace between the individual entries, so there's still no way to
> enforce that an AND filter query applies to the same sku.
>
> One approach I was considering was to create separate indexes for products
> and skus, and store the product IDs in the sku documents.  Then we could
> apply our own filters to the initially generated list, based on unique query
> parameters.  I thought creating a component between query and facet would be
> a good place to add such a filter, but further research seems to indicate
> that this would break paging and sorting.  The only other thing I can think
> of would be to subclass QueryComponent itself, which looks rather
> daunting: the process() method has no hooks for this sort of thing, so it
> seems I would have to copy the entire existing implementation and add them myself,
> which looks to be a fair chunk of work and brittle to changes in the trunk
> code.  Ideally it would be nice to be able to handle certain fq parameters
> in a completely different way, perhaps using a custom query parser, but I
> haven't wrapped my head around how those work.  Does any of this sound
> remotely doable?  Any advice?
>
> The other suggestion we are looking at was given to us by our current search
> provider, which is to index the skus themselves.  It looks as if we may be
> able to make this work using the field collapsing patch from SOLR-236.  I
> have some concerns about this approach though: 1) It will make for a much
> larger index and longer indexing times (products can have 10 or more skus in
> our catalog).  2) Because the indexing will be copying the description and
> name from the product it will be indexing the same content more than once,
> and the number of times per product will vary based on the number of skus.
> I'm concerned that this may skew the scoring algorithm, in particular the
> inverse frequency part.  3) I'm not sure about the performance of the field
> collapsing patch, I've read contradictory reports on the web.
>
> I apologize if this is a bit rambling.  If anyone has any advice for our
> situation it would be very helpful.
>
> Thanks,
> Eric
>
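
The cross-SKU filtering problem from the quoted message can be reproduced
in a small Python sketch. Everything in it is an assumption made for
illustration: the sample catalog, the field names, and the helper
functions. It models only the matching logic, not Solr's query handling
or the SOLR-236 patch itself.

```python
# Hypothetical catalog: product p1 has a blue SKU and a separate on-sale SKU.
skus = [
    {"product_id": "p1", "sku_id": "s1", "color": "blue", "on_sale": False},
    {"product_id": "p1", "sku_id": "s2", "color": "red", "on_sale": True},
    {"product_id": "p2", "sku_id": "s3", "color": "blue", "on_sale": True},
]


def flatten_by_product(skus):
    """One document per product, with SKU attributes merged into separate fields."""
    products = {}
    for sku in skus:
        doc = products.setdefault(sku["product_id"],
                                  {"colors": set(), "on_sale": set()})
        doc["colors"].add(sku["color"])
        doc["on_sale"].add(sku["on_sale"])
    return products


def match_flattened(products, color, on_sale):
    # Each filter is checked against the whole product, so the two
    # conditions may be satisfied by two different SKUs.
    return sorted(pid for pid, doc in products.items()
                  if color in doc["colors"] and on_sale in doc["on_sale"])


def match_sku_level(skus, color, on_sale):
    # One document per SKU: both filters must hold for the same SKU,
    # then the hits are collapsed on product_id.
    return sorted({s["product_id"] for s in skus
                   if s["color"] == color and s["on_sale"] == on_sale})


flattened_hits = match_flattened(flatten_by_product(skus), "blue", True)
sku_level_hits = match_sku_level(skus, "blue", True)
```

With the flattened documents, p1 matches "blue AND on sale" even though
no single SKU of p1 is both. Indexing one document per SKU and collapsing
on the product id returns only p2.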


-- 
Lance Norskog
goksron@gmail.com
