lucene-java-user mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search
Date Mon, 04 May 2009 18:27:09 GMT
Christian,

I suppose each ASIN represents a product by key,value pairs and
an HTS code?

In that case you may want to denormalize and index each ASIN as
a Lucene document. Then search for the most similar products in your queries
by key/value pairs, using each key as a Lucene field.
Such keys would likely not need a stored norm in the Lucene index.
The result of the query would be a series of HTS codes
(non unique), weighted by the score value. To get a score
for each HTS code, you might need your own HitCollector
and a field cache for the HTS codes.
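
As a rough illustration of that per-HTS scoring step, here is a sketch in
plain Java. It does not use the Lucene HitCollector or FieldCache API. The
class name HtsScoreAggregator, the array that stands in for a field cache,
the codes "HTS-A" and "HTS-B", and the decision to sum the scores are all
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: accumulates one score per HTS code from per-ASIN hits.
class HtsScoreAggregator {
    // Stand-in for a field cache: htsByDoc[docId] is the HTS code of that ASIN document.
    private final String[] htsByDoc;
    private final Map<String, Float> scoreByHts = new HashMap<>();

    HtsScoreAggregator(String[] htsByDoc) {
        this.htsByDoc = htsByDoc;
    }

    // Plays the role of a collector callback: called once per matching document.
    void collect(int docId, float score) {
        // Assumption: scores of ASINs sharing an HTS code are summed.
        scoreByHts.merge(htsByDoc[docId], score, Float::sum);
    }

    // HTS codes ordered by accumulated score, best first.
    List<Map.Entry<String, Float>> ranked() {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scoreByHts.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Three ASIN documents; the first two share an HTS code.
        HtsScoreAggregator agg =
            new HtsScoreAggregator(new String[] {"HTS-A", "HTS-A", "HTS-B"});
        agg.collect(0, 1.5f);
        agg.collect(1, 0.5f);
        agg.collect(2, 1.75f);
        for (Map.Entry<String, Float> e : agg.ranked()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Summing favours HTS codes that hold many ASINs. Taking the maximum or the
mean per code is just as plausible and would have to be checked against the
items you have already classified.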

You'll probably need to use a custom (who are the users again?)
lucene similarity function to lower the weight for the description,
and to increase the influence of the coordination factor so more
matches in different keys have a bigger influence on the result.
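
To make the effect of those two adjustments concrete, here is a toy scorer
in plain Java. It is not a Lucene Similarity subclass. The field weights,
the squared coordination factor, and the exact-match comparison (a real
index would tokenize the description) are assumptions chosen only to show
the direction of the change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy scorer: weighted field matches times a boosted coordination factor.
class FieldMatchScorer {
    // Assumption: the description counts for a quarter of any other key.
    static double weight(String field) {
        return "description".equals(field) ? 0.25 : 1.0;
    }

    // overlap / maxOverlap is the usual coordination fraction; squaring it
    // (an assumption) penalizes documents that match only a few of the query keys.
    static double coord(int overlap, int maxOverlap) {
        double fraction = (double) overlap / maxOverlap;
        return fraction * fraction;
    }

    // Scores a document against a query, both given as key -> value maps.
    static double score(Map<String, String> query, Map<String, String> doc) {
        if (query.isEmpty()) {
            return 0.0;
        }
        int overlap = 0;
        double sum = 0.0;
        for (Map.Entry<String, String> e : query.entrySet()) {
            String docValue = doc.get(e.getKey());
            // Simplification: a key matches only on a case-insensitive exact value.
            if (docValue != null && docValue.equalsIgnoreCase(e.getValue())) {
                overlap++;
                sum += weight(e.getKey());
            }
        }
        return sum * coord(overlap, query.size());
    }

    public static void main(String[] args) {
        Map<String, String> query = new HashMap<>();
        query.put("material", "leather");
        query.put("gender", "women");
        query.put("description", "short leather shoes");

        Map<String, String> doc = new HashMap<>();
        doc.put("material", "Leather");
        doc.put("gender", "women");
        doc.put("description", "stylish boots");

        System.out.println(score(query, doc));
    }
}
```

With these numbers a document that matches two keys scores well above one
that matches a single key, and a match on the description alone scores
lowest.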

And have a look at Solr before starting to code this. The facets
there might be of help during interactive retrieval. Your application
is not really a web shop, but there are (at least) some overlaps.

Regards,
Paul Elschot


On Monday 04 May 2009 19:16:10 Christian Bongiorno wrote:
> I am trying to build a search (have been experimenting with using Lucene)
> and someone suggested contacting your team
> 
> Background:
> Currently the service I am working on applies taxing/duties to products for
> international shipping by looking up something called an HTS code (a
> universally recognized taxation code for duty/tariff). We already have
> almost a million items classified by HTS code. As many as 50k items fall
> into the same HTS code.
> 
> For purposes of HTS classification
> Description is only important if no other field exists. But taxation is
> based on things like material (leather, cloth, etc) and product
> (shoes/bags/toys). Color is of fair relevancy as well (to a customs official
> black boots or brown make no difference; it wasn’t made here so it must be
> taxed)
> 
> The idea is to turn our entire existing knowledge base into an index, then
> when we get a new item that needs classification, we “search” for the
> “Document(hts)” that best matches by using the new item attributes for the
> item to be classified as the search query.
> 
> The document structure, as I see it, should be:
> 
> Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
> {Key,value},{Key,value},…} …}
> 
> There are 1788 documents. Up to 50k ASINs and their attributes may fall into
> a single document.
> 
> On some fields, they are straightforward and very good indicators of match.
> Such as
> 
> Material -> “leather”
> Gender -> “women”
> 
> Others are fuzzier
> 
> Description -> “Stylish full calf leather boots. Sleek Italian leather,
> designer”
> 
> So for a query of:
> “Material” -> ”Leather”
> “Gender” -> ”womAn”
> “Description” -> ”Short leather shoes, Made in Denmark”
> 
> I would expect a very high match here since the first 2 fields, which don’t
> vary much, are good indicators for HTS.
> 
> I have searched through the archives and I don't see anything like what I am
> looking for.
> 
> Basically, every item will have attributes which I am treating as
> "Field(item.key, item.value)". I think that's the right approach but
> multi-field query queries your terms across all fields in the search. That
> isn't what I need. I very clearly know my fields and values and that should
> give me enormous leverage when querying if I could build a query to do that
> 
> 
> Christian
> 
> -- 
> Christian Bongiorno
> 

