lucene-java-user mailing list archives

From Christian Bongiorno
Subject multi-field index and search (Not MultiFieldQuery). Help setting up index and search
Date Mon, 04 May 2009 17:16:10 GMT
I am trying to build a search feature (I have been experimenting with Lucene),
and someone suggested contacting your team.

Currently the service I am working on applies taxes and duties to products
for international shipping by looking up something called an HTS code (a
universally recognized taxation code for duty/tariff). We already have
almost a million items classified by HTS code. As many as 50k items fall
into the same HTS code.

For purposes of HTS classification, the description is only important if no
other field exists. Taxation is based on things like material (leather,
cloth, etc.) and product (shoes/bags/toys). Color is of fair relevancy as
well (to a customs official
black boots or brown make no difference; it wasn’t made here so it must be

The idea is to turn our entire existing knowledge base into an index. Then,
when we get a new item that needs classification, we “search” for the
“Document(hts)” that best matches, using the attributes of the item to be
classified as the search query.

The document structure, as I see it, should be:

Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
{Key,value},{Key,value},…} …}

There are 1788 documents. Up to 50k ASINs and their attributes may fall into
a single document.
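
As a rough illustration of the structure described above, here is a minimal Java sketch that groups classified items by HTS code and collects every attribute value under its field name. The class and method names (`HtsDocumentSketch`, `Item`, `groupByHts`) and the placeholder codes `HTS-A`/`HTS-B` are hypothetical, not part of the original service or of Lucene; the sketch only shows the grouping, not the indexing itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HtsDocumentSketch {

    /** One already-classified item: its ASIN, its HTS code, and its key/value attributes. */
    static final class Item {
        final String asin;
        final String hts;
        final Map<String, String> attributes;

        Item(String asin, String hts, Map<String, String> attributes) {
            this.asin = asin;
            this.hts = hts;
            this.attributes = attributes;
        }
    }

    /**
     * Groups items into one "document" per HTS code.
     * Result shape: HTS code -> field name -> all values seen for that field.
     */
    static Map<String, Map<String, List<String>>> groupByHts(List<Item> items) {
        Map<String, Map<String, List<String>>> docs = new TreeMap<>();
        for (Item item : items) {
            Map<String, List<String>> fields =
                    docs.computeIfAbsent(item.hts, k -> new TreeMap<>());
            for (Map.Entry<String, String> attr : item.attributes.entrySet()) {
                fields.computeIfAbsent(attr.getKey(), k -> new ArrayList<>())
                      .add(attr.getValue());
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        // "HTS-A" and "HTS-B" are placeholder codes, not real HTS codes.
        List<Item> items = Arrays.asList(
                new Item("ASIN1", "HTS-A", Collections.singletonMap("Material", "leather")),
                new Item("ASIN2", "HTS-A", Collections.singletonMap("Material", "suede")),
                new Item("ASIN3", "HTS-B", Collections.singletonMap("Material", "cloth")));
        System.out.println(groupByHts(items));
    }
}
```

Each inner map could then become the fields of one indexed document, with the field's values concatenated or added as repeated fields; which of the two works better here is an open question.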

Some fields are straightforward and very good indicators of a match, such
as:

Material -> “leather”
Gender -> “women”

Others are fuzzier:

Description -> “Stylish full calf leather boots. Sleek Italian leather,

So for a query of:
“Material” -> ”Leather”
“Gender” -> ”womAn”
“Description” -> ”Short leather shoes, Made in Denmark”

I would expect a very high match here since the first 2 fields, which don’t
vary much, are good indicators for HTS.
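
To make the expectation above concrete, here is a small Java sketch of a weighted field-overlap score. It is not Lucene's scoring formula; it is a simplified stand-in, and the class name `WeightedMatchSketch` and the weights (4 for Material and Gender, 1 for Description) are assumptions chosen only for illustration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WeightedMatchSketch {

    /**
     * Returns a value in [0, 1]: the weighted average, over the query's fields,
     * of the fraction of query terms found in the same field of the document.
     * Fields without an explicit weight default to 1.0.
     */
    static double score(Map<String, String> query,
                        Map<String, Set<String>> doc,
                        Map<String, Double> weights) {
        double matched = 0.0;
        double total = 0.0;
        for (Map.Entry<String, String> e : query.entrySet()) {
            double weight = weights.getOrDefault(e.getKey(), 1.0);
            Set<String> docTerms = doc.getOrDefault(e.getKey(), Collections.emptySet());
            // Lowercase and split on anything that is not a letter or digit.
            String[] terms = e.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            int hits = 0;
            int count = 0;
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                count++;
                if (docTerms.contains(term)) {
                    hits++;
                }
            }
            if (count == 0) {
                continue; // a field with no usable terms does not affect the score
            }
            matched += weight * hits / count;
            total += weight;
        }
        return total == 0.0 ? 0.0 : matched / total;
    }
}
```

With these assumed weights, a query that matches Material and Gender exactly but only one of three Description terms scores well above the same query with all fields weighted equally, which is the behaviour the paragraph above asks for. Note that "womAn" would not match an indexed "women" without some normalization step such as stemming or a synonym list.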

I have searched through the archives and I don't see anything like what I am
looking for.

Basically, every item will have attributes, which I am treating as
"Field(item.key, item.value)". I think that's the right approach, but a
multi-field query searches for your terms across all fields. That isn't
what I need. I know my fields and values exactly, and that should give me
enormous leverage when querying, if I could build a query to do that.

Christian Bongiorno
