lucene-java-user mailing list archives

From Christian Bongiorno <christ...@bongiorno.org>
Subject Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search
Date Mon, 04 May 2009 19:33:56 GMT
Yeah, you definitely got the idea. You're the second person to recommend
putting each item in its own document and just storing the HTS code (which is
easy for me). The HTS code actually comes with no extra info. I mean, there
is info, but we don't store any of it.

I will try as you and Paul have recommended. Once that's done, would I then
need a MultiFieldQuery? Forgive me, but the queries confuse me.
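
To make the item-per-document idea and the per-field query concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's API. The names HtsSuggest, Item, Clause, tokens, score and suggest, the boost weights, and the scoring formula are all assumptions made for illustration, and Lucene's real scoring differs. Each item is one document carrying its attributes plus its stored HTS code, each query clause is matched only against its own field, and the HTS codes of the top hits are tallied into suggestions.

```java
import java.util.*;

class HtsSuggest {
    // One "document" per item: its attribute fields plus the stored HTS code.
    static final class Item {
        final String hts;
        final Map<String, String> fields;
        Item(String hts, Map<String, String> fields) { this.hts = hts; this.fields = fields; }
    }

    // One query clause: the field to search, the value wanted, and a weight (boost).
    static final class Clause {
        final String field;
        final String value;
        final double boost;
        Clause(String field, String value, double boost) {
            this.field = field; this.value = value; this.boost = boost;
        }
    }

    // Lower-case and split on non-alphanumerics; a crude stand-in for an analyzer.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Score = sum over clauses of boost * (fraction of the clause's tokens
    // found in the SAME field of the item). Other fields are never consulted.
    static double score(Item item, List<Clause> query) {
        double total = 0;
        for (Clause c : query) {
            String stored = item.fields.get(c.field);
            if (stored == null) continue;
            Set<String> want = tokens(c.value);
            if (want.isEmpty()) continue;
            Set<String> have = tokens(stored);
            long hits = want.stream().filter(have::contains).count();
            total += c.boost * hits / want.size();
        }
        return total;
    }

    // Rank all items, keep the top k, and sum their scores per HTS code.
    // Returns the candidate HTS codes, best first.
    static List<String> suggest(List<Item> index, List<Clause> query, int k) {
        List<Item> ranked = new ArrayList<>(index);
        ranked.sort(Comparator.comparingDouble((Item i) -> score(i, query)).reversed());
        Map<String, Double> votes = new HashMap<>();
        for (Item i : ranked.subList(0, Math.min(k, ranked.size()))) {
            votes.merge(i.hts, score(i, query), Double::sum);
        }
        List<String> codes = new ArrayList<>(votes.keySet());
        codes.sort(Comparator.comparingDouble((String c) -> votes.get(c)).reversed());
        return codes;
    }
}
```

Giving Material and Gender a higher boost than Description reflects the point that the first two fields are the stronger indicators. The sketch only lower-cases and does no stemming, so "womAn" would not match "women" here; that would need an analyzer that handles such variants. In Lucene the closest analogue is probably a BooleanQuery with one optional (SHOULD) clause per field, each built from a Term that names its field, rather than a parser that spreads every term across all fields.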

Rebuilding my index will take some time, but I appreciate everyone's help.

Christian

On Mon, May 4, 2009 at 11:40 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> Hmmmm, tricky. Let's see if I understand your problem.
>
> Basically, you have a bunch of HTSs that have had
> some number of items arbitrarily assigned to them, and
> you want to see if you can make Lucene behave as a kind
> of expert system to help you classify the next item.
>
> I *think* you'd get better results by indexing each item
> along with its HTS code as a separate document. Because
> what you really want to ask is "given the attributes of my
> new item, what other items are 'most similar' to it?" and then
> present the HTSs from these items to the classifier
> (perhaps a person?).
>
> I'm going to assume further that the HTS code has
> some data associated with it that describes the
> class, and that this data needs to be available to
> the user to see if your suggestions are appropriate.
> You could either index the HTSs in another index
> OR index them in the same index but simply store
> the data (don't index it), so the HTS documents won't
> interfere with your searches on "similar items".
>
> Mostly, this is just trying to see if I understand what
> you're trying to accomplish. This may be gibberish, but
> it's a start <G>.
>
> Best
> Erick
>
>
> On Mon, May 4, 2009 at 1:16 PM, Christian Bongiorno <christian@bongiorno.org> wrote:
>
> > I am trying to build a search (I have been experimenting with Lucene)
> > and someone suggested contacting your team.
> >
> > Background:
> > Currently the service I am working on applies taxes/duties to products for
> > international shipping by looking up something called an HTS code (a
> > universally recognized taxation code for duty/tariff). We already have
> > almost a million items classified by HTS code. As many as 50k items fall
> > into the same HTS code.
> >
> > For purposes of HTS classification, description is only important if no
> > other field exists. But taxation is based on things like material (leather,
> > cloth, etc.) and product (shoes/bags/toys). Color is of fair relevancy as
> > well (to a customs official, black boots or brown make no difference; it
> > wasn’t made here so it must be taxed).
> >
> > The idea is to turn our entire existing knowledge base into an index, then
> > when we get a new item that needs classification, we “search” for the
> > “Document(hts)” that best matches by using the new item attributes for the
> > item to be classified as the search query.
> >
> > The document structure, as I see it, should be:
> >
> > Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
> > {Key,value},{Key,value},…} …}
> >
> > There are 1788 documents. Up to 50k ASINs and their attributes may fall into
> > a single document.
> >
> > Some fields are straightforward and very good indicators of a match, such
> > as:
> >
> > Material -> “leather”
> > Gender -> “women”
> >
> > Others are fuzzier
> >
> > Description -> “Stylish full calf leather boots. Sleek Italian leather,
> > designer”
> >
> > So for a query of:
> > “Material” -> ”Leather”
> > “Gender” -> ”womAn”
> > “Description” -> ”Short leather shoes, Made in Denmark”
> >
> > I would expect a very high match here since the first 2 fields, which don’t
> > vary much, are good indicators for HTS.
> >
> > I have searched through the archives and I don't see anything like what I am
> > looking for.
> >
> > Basically, every item will have attributes which I am treating as
> > "Field(item.key, item.value)". I think that's the right approach, but a
> > multi-field query runs your terms across all fields in the search. That
> > isn't what I need. I very clearly know my fields and values, and that should
> > give me enormous leverage when querying if I could build a query to do that.
> >
> >
> > Christian
> >
> > --
> > Christian Bongiorno
> >
>



-- 
Christian Bongiorno
