lucene-solr-user mailing list archives

From "Shalin Shekhar Mangar" <shalinman...@gmail.com>
Subject Re: Solr + Complex Legacy Schema -- Best Practices?
Date Fri, 11 Apr 2008 09:06:27 GMT
We had similar requirements in our project too. I won't go into details of
our schema but we had five to six tables containing information on which we
needed to search or show facets.

The first step here is to get your schema right. Multivalued fields let you
create extremely de-normalized documents in Solr, so it may be worth trying
to fit information from all tables into one schema first. Your prod_store
table could be used here to extract price ranges for a product across
stores, on which you can provide drill-downs (facet queries). We used such an
approach in our use case. We used DataImportHandler's transformer features
to check for and remove invalid/duplicate data before adding it to Solr.
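As a rough illustration of that flattening-plus-cleanup step (this is a sketch, not the actual project code; the table and field names follow the thread below, but the price-range boundaries and validity rules are invented for the example):

```python
# Sketch: collapse per-store prod_store rows into one set of price-range
# facet values per product, dropping invalid rows along the way -- the
# same kind of cleanup a DataImportHandler transformer would do before
# a row reaches Solr. Bucket boundaries here are arbitrary.

def price_bucket(price):
    """Map a raw price to a coarse range usable as a facet value."""
    if price < 5:
        return "0-5"
    if price < 20:
        return "5-20"
    return "20+"

def flatten_prod_store(rows):
    """Return {product_id: set of price-range strings}, skipping rows
    with missing or negative prices; the set absorbs duplicates."""
    docs = {}
    for row in rows:
        price = row.get("price")
        if price is None or price < 0:
            continue  # invalid row: drop it, as a transformer would
        docs.setdefault(row["product_id"], set()).add(price_bucket(price))
    return docs

rows = [
    {"product_id": 1, "store_id": 10, "price": 3.49},
    {"product_id": 1, "store_id": 11, "price": 6.99},
    {"product_id": 1, "store_id": 12, "price": 6.99},  # duplicate range
    {"product_id": 2, "store_id": 10, "price": None},  # invalid row
]
print(flatten_prod_store(rows))
```

The resulting ranges would go into a multivalued field, which is what makes the drill-down facet queries possible on a single denormalized document.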

If I understand you correctly, ultimately you want to show individual
products as result of the query. So start by thinking about one product (one
row in prod_detail), and write separate queries for each table which will
get data for that one row (using product_id). Try to reduce the levels of
query nesting so that you fire fewer queries to construct a Solr document
for a given product.
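In code, the "one product at a time" assembly could look something like this sketch. The per-table SQL is replaced by stub lookup functions so the shape is self-contained; the table and field names mirror the schema described below, but the stub return values are made up:

```python
# Sketch: build one flat, denormalized Solr document per product by
# firing one lookup per table, keyed on product_id. Multivalued fields
# absorb the one-to-many tables (brand words, group keywords).

def fetch_detail(product_id):
    # stands in for: SELECT name, size, brand_id FROM prod_detail WHERE product_id = ?
    return {"name": "Diet Coke", "size": "12 oz", "brand_id": 7}

def fetch_brand_words(brand_id):
    # stands in for a query against prod_brand's desc field
    return ["coca-cola", "soda"]

def fetch_group_keywords(product_id):
    # stands in for the prod_sku -> prod_grp / prod_dict lookups
    return ["diet", "cola"]

def build_solr_doc(product_id):
    """Assemble one document per product; each list-valued entry maps
    to a multiValued field in the Solr schema."""
    detail = fetch_detail(product_id)
    return {
        "product_id": product_id,
        "name": detail["name"],
        "size": detail["size"],
        "brand_words": fetch_brand_words(detail["brand_id"]),  # multivalued
        "group_keywords": fetch_group_keywords(product_id),    # multivalued
    }

print(build_solr_doc(1))
```

Keeping each lookup a single non-nested query per table is what keeps the per-product indexing cost predictable.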

On Fri, Apr 11, 2008 at 2:31 AM, Tkach <ntkach@peapod.com> wrote:

> I realize this is a really vague sort of question with a lot of what-ifs,
> so feel free to just say we'll just have to try implementing one version,
> test, and see if the results are acceptable. :)
>
> Well, our searches are really more along the lines of searching on product
> "details" (brand/key words/names), so that part's fairly straightforward.  I
> guess the main complication is how product keywords/details are mapped onto
> products through several different kinds of "groups", which makes them
> harder to flatten out, since not all products have records in all (or even
> many) of the groups.
>
> Basically we have users enter some text in our search box on the site.  It
> could be anything from a product name (such as Diet Coke) to a brand name
> (such as Kraft) to words describing the product (such as cheese).  From
> there we do a search (see details below), and come up with a list of either
> some categories to "drill down into" (such as Diet Sodas or Coke) or else we
> might have a small enough result here to just list the products outright.
>  Among other things we're looking to use Solr/Lucene to tighten things up
> here a little, letting the search engine deal with "grouping" (as in the
> implicit groups formed when you search on something like "pepsi") and taking
> some of these "category" tables out.  We're looking into faceting and
> spell-checking too, but those are more secondary concerns.
>
> There are 6 tables, I'll call them prod_detail, prod_store, prod_dict,
> prod_sku, prod_grp, and prod_brand.
>
> prod_detail has a series of records (unique across all stores) on each
> product such as size, name, and a couple of text fields for words describing
> the product (we'd be searching on these desc fields and the name).
>  prod_detail does have a nice, neat unique integer key, product_id. It also
> has an integer field that is a foreign key (figuratively, not an actual DB
> constraint) pointing to a record in prod_brand.
>
> prod_brand is a table that associates a brand with a group of products.
>  It has a product_group, store_id, brand_id, and a desc field of words
> describing the brand that we'd want to be able to search.  However it's only
> unique across the combination of product_group, brand_id, and store_id.
>  The brand_id here does correspond to a brand_id in the prod_detail table.
>  prod_brand has a many-to-many mapping with prod_detail (one brand can
> correspond to many products and one product can be in many brands).
>
> prod_store associates a product detail record with details for a given
> store (mostly pricing).  It's keyed off of a product_id and store_id
> combination.  This one has no fields we'd be wanting to search directly.  It
> has a many-to-one mapping with prod_detail (many prod_store records use one
> prod_detail).
>
> prod_dict associates some key words/phrases with a group of products based
> on a combination of store_id and product_group_id.  We'd be wanting to
> search the key words here too.
>
> prod_sku associates a product detail record with a product group for a
> given store.  There are no fields here we'd be searching on.  It's just used
> as a lookup for group <-> prod mappings.
>
> Finally, prod_grp associates a product group with a couple of key words
> for a given store.  We'd be wanting to search these key words as well.
>
> Hopefully that makes some sense.  We plan to just do a dump of the
> information regularly (say, daily) and index that.  The question is, given
> the constraints of the data, would we be better off doing a multi-core setup
> and dealing with multiple (potential) hits to the server for lookups, or
> making a big join (including all sorts of duplicate information) and having
> the server/client code sort out what goes where?  I'm
> certainly open to researching this more on my own (I know it's a huge,
> loaded question for a mailing list) if anyone can suggest someplace to get a
> better picture of efficient information retrieval.
>
> The more I look at this, the more I think it makes sense to do a
> multi-core setup, since a huge join just takes us from nice, neat data to a
> big join with all sorts of duplicates/nulls, and back to neat data again, but
> that's just a guess at this point.
>
> I'll have to be sure to check out this DataImportHandler.  It definitely
> does sound close to the sort of thing we're looking at.
>
> ----- Original Message -----
> From: "Chris Hostetter" <hossman_lucene@fucit.org>
> To: "solr-user" <solr-user@lucene.apache.org>
> Sent: Wednesday, April 9, 2008 1:00:33 AM GMT -06:00 US/Canada Central
> Subject: Re: Solr + Complex Legacy Schema -- Best Practices?
>
>
> : I just was wondering, has anybody dealt with trying to "translate" the
> : data from a big, legacy DB schema to a Solr installation?  What I mean
>
> there's really no general answer to that question -- it all comes down to
> what you want to query on, and what kinds of results you want to get
> out... if you want your queries to result in lists of "products" then you
> should have one Document per product -- if you want to be able to query on
> the text of user reviews then you need to flatten all the user reviews for
> each product into the Document for each product.  Sometimes you'll want
> two types of Documents ... one Document per product, containing all the
> text of all the user reviews, and one Document per user review, with the
> Product information duplicated in each so you can search for...
>
>  q=reviewtext:solr&fq=doctype:product&fq=producttype:camera
>
> ...to get a list of all the products that are cameras that contain the
> word solr in the text of *a* review, or you can search for...
>
>  q=reviewtext:solr&fq=doctype:review&fq=producttype:camera
>
> ...to get a list of all the reviews that contain the word solr, and
> are about products that are cameras.
>
> Your use cases and goals will be different than everyone else's.
>
>
> -Hoss
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.
