lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search
Date Mon, 04 May 2009 18:40:36 GMT
Hmmmm, tricky. Let's see if I understand your problem.

Basically, you have a bunch of HTS codes that have had
some number of items arbitrarily assigned to them, and
you want to see if you can make Lucene behave as a kind
of expert system to help you classify the next item.

I *think* you'd get better results by indexing each item
along with its HTS code as a separate document. Because
what you really want to ask is "given the attributes of my
new item, which existing items are most similar to it?" and
then present the HTS codes from those items to the classifier
(perhaps a person?).
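
To make the item-per-document idea concrete, here is a minimal
sketch in plain Java. It is NOT Lucene code: the class name, the
boost values and the "HTS-A"/"HTS-B" codes are all made up for
illustration, and the scoring is simple token overlap rather than
Lucene's scoring. What it shows is that a query term only counts
when it matches in the same field, and that the result is a ranked
list of HTS codes taken from the most similar items. In Lucene the
rough equivalent would be a BooleanQuery built from one boosted
TermQuery per field/value pair.

```java
import java.util.*;

public class ItemSimilaritySketch {
    // One "document" per classified item: its HTS code plus field -> value attributes.
    public static class Item {
        final String hts;
        final Map<String, String> fields;
        public Item(String hts, Map<String, String> fields) {
            this.hts = hts;
            this.fields = fields;
        }
    }

    // Lowercased alphanumeric tokens of a field value (no stemming, so "woman" != "women").
    static Set<String> tokens(String value) {
        Set<String> out = new HashSet<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Field-scoped overlap: a query token counts only if it occurs in the SAME field of the item.
    public static double score(Map<String, String> query, Item item, Map<String, Double> boosts) {
        double total = 0;
        for (Map.Entry<String, String> e : query.entrySet()) {
            String itemValue = item.fields.get(e.getKey());
            if (itemValue == null) continue;
            Set<String> shared = tokens(e.getValue());
            shared.retainAll(tokens(itemValue));
            total += boosts.getOrDefault(e.getKey(), 1.0) * shared.size();
        }
        return total;
    }

    // Rank HTS codes by the best-scoring item that carries each code; ties break alphabetically.
    public static List<String> suggest(Map<String, String> query, List<Item> items,
                                       Map<String, Double> boosts) {
        Map<String, Double> best = new HashMap<>();
        for (Item item : items) {
            double s = score(query, item, boosts);
            if (s > 0) best.merge(item.hts, s, Math::max);
        }
        List<String> codes = new ArrayList<>(best.keySet());
        codes.sort(Comparator.comparingDouble((String c) -> best.get(c)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return codes;
    }

    public static void main(String[] args) {
        // Hypothetical items and codes, loosely modelled on the examples in the question.
        List<Item> items = List.of(
            new Item("HTS-A", Map.of("Material", "leather", "Gender", "women",
                                     "Description", "Stylish full calf leather boots")),
            new Item("HTS-B", Map.of("Material", "cloth", "Gender", "women",
                                     "Description", "Canvas bag with leather trim")));
        Map<String, String> query = Map.of("Material", "Leather", "Gender", "Women",
                                           "Description", "Short leather shoes");
        // Assumed weights: the low-variance fields count for more than the free-text one.
        Map<String, Double> boosts = Map.of("Material", 3.0, "Gender", 2.0);
        System.out.println(suggest(query, items, boosts));
    }
}
```

Note that this sketch does no stemming, so a query value of
"womAn" would not match "women"; in a real index that would be
up to whatever analyzer you pick for the field.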

I'm going to assume further that each HTS code has
some data associated with it that describes the
class, and that this data needs to be available to
the user to see if your suggestions are appropriate.
You could either index the HTS codes in another index
OR put them in the same index but simply store
the data (don't index it), so that the HTS documents won't
interfere with your searches on "similar items".
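
As a rough illustration of "store it but don't search it", the
sketch below keeps the HTS descriptions in a separate lookup that
is consulted only after the similarity search has produced its
ranked codes. Again this is plain Java rather than Lucene, and the
class name, codes and descriptions are invented for the example.
In a Lucene index of that era the stored-only variant would be a
field created with Field.Store.YES and Field.Index.NO.

```java
import java.util.*;

public class HtsCatalogSketch {
    // Descriptive data per HTS code: kept for display only, never searched.
    private final Map<String, String> descriptions = new HashMap<>();

    public void put(String hts, String description) {
        descriptions.put(hts, description);
    }

    // Attach the stored description to each suggested code, preserving rank order.
    public List<String> describe(List<String> rankedCodes) {
        List<String> out = new ArrayList<>();
        for (String code : rankedCodes) {
            out.add(code + ": " + descriptions.getOrDefault(code, "(no description on file)"));
        }
        return out;
    }

    public static void main(String[] args) {
        HtsCatalogSketch catalog = new HtsCatalogSketch();
        catalog.put("HTS-A", "hypothetical class: leather footwear"); // made-up entry
        // Codes would come from the similarity search; hard-coded here.
        for (String line : catalog.describe(List.of("HTS-A", "HTS-B"))) {
            System.out.println(line);
        }
    }
}
```

Because the descriptions never enter the similarity scoring, they
cannot pull a new item toward the wrong class.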

Mostly, this is just trying to see if I understand what
you're trying to accomplish. This may be gibberish, but
it's a start <G>.

Best
Erick


On Mon, May 4, 2009 at 1:16 PM, Christian Bongiorno <christian@bongiorno.org> wrote:

> I am trying to build a search (have been experimenting with using Lucene),
> and someone suggested contacting your team.
>
> Background:
> Currently the service I am working on applies taxing/duties to products for
> international shipping by looking up something called an HTS code (a
> universally recognized taxation code for duty/tariff). We already have
> almost a million items classified by HTS code. As many as 50k items fall
> into the same HTS code.
>
> For purposes of HTS classification, the description is only important if
> no other field exists. But taxation is based on things like material
> (leather, cloth, etc.) and product (shoes/bags/toys). Color is of fair
> relevancy as well (to a customs official, black boots or brown make no
> difference; it wasn’t made here so it must be taxed).
>
> The idea is to turn our entire existing knowledge base into an index, then
> when we get a new item that needs classification, we “search” for the
> “Document(hts)” that best matches by using the new item attributes for the
> item to be classified as the search query.
>
> The document structure, as I see it, should be:
>
> Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
> {Key,value},{Key,value},…} …}
>
> There are 1788 documents. Up to 50k ASINs and their attributes may fall
> into a single document.
>
> On some fields, they are straightforward and very good indicators of match.
> Such as
>
> Material -> “leather”
> Gender -> “women”
>
> Others are fuzzier
>
> Description -> “Stylish full calf leather boots. Sleek Italian leather,
> designer”
>
> So for a query of:
> “Material” -> ”Leather”
> “Gender” -> ”womAn”
> “Description” -> ”Short leather shoes, Made in Denmark”
>
> I would expect a very high match here since the first 2 fields, which don’t
> vary much, are good indicators for HTS.
>
> I have searched through the archives and I don't see anything like what I
> am looking for.
>
> Basically, every item will have attributes which I am treating as
> "Field(item.key, item.value)". I think that's the right approach, but a
> multi-field query runs your terms across all fields in the search. That
> isn't what I need. I know my fields and values very clearly, and that should
> give me enormous leverage when querying, if I could build a query to do that.
>
>
> Christian
>
> --
> Christian Bongiorno
>
