lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search
Date Mon, 04 May 2009 19:51:19 GMT
MultiFieldQueryParser (MFQ) essentially, if I have this right, forms a
"cross product" of terms and fields. I.e. you are NOT required to say
which value belongs to which field. MFQ helps form queries expressing
something like "does any term appear in any field of a hit" or "does every
term appear in some field of a hit, regardless of which field and not
necessarily the same field" (depending upon whether the default operator
is OR or AND). You can get something of the same effect by creating a
special field that is the concatenation of all the other fields and
searching that concatenated field with AND/OR (except that MFQ does
interesting things with boosting).
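
To make the "cross product" concrete, here is a minimal sketch in plain
Java (no Lucene dependency) of how terms and fields combine. The class
CrossProductSketch and the method expand() are hypothetical names made up
for illustration. The real parser builds query objects rather than strings
and also deals with analysis and boosts, which this ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the shape of the query a multi-field parser
// produces. Not Lucene code; all names here are hypothetical.
final class CrossProductSketch {

    // For each term, build a group that ORs the term across every field.
    // With requireAll == true (default operator AND) each group gets a
    // leading '+', so every term must match in at least one field.
    static String expand(List<String> fields, List<String> terms, boolean requireAll) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            List<String> clauses = new ArrayList<>();
            for (String field : fields) {
                clauses.add(field + ":" + term);
            }
            String group = "(" + String.join(" ", clauses) + ")";
            groups.add(requireAll ? "+" + group : group);
        }
        return String.join(" ", groups);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("material", "description");
        List<String> terms = Arrays.asList("leather", "boots");
        // Default operator OR:
        // (material:leather description:leather) (material:boots description:boots)
        System.out.println(expand(fields, terms, false));
        // Default operator AND:
        // +(material:leather description:leather) +(material:boots description:boots)
        System.out.println(expand(fields, terms, true));
    }
}
```

Note that nothing in either form ties "leather" to the material field,
which is why this is the wrong tool when you already know the field.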

But if you know exactly which terms you require in which field, the
standard query parser is fine. E.g. +material:leather +gender:female
will look for "leather" ONLY in material and "female" ONLY in gender.
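
When the field for each value is known, the query string can be put
together directly from the item's attributes. A minimal sketch, assuming
single-word values that need no escaping; FieldedQuerySketch and
required() are hypothetical names, and the resulting string is what you
would hand to the standard query parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustration only: plain string building, no Lucene classes involved.
final class FieldedQuerySketch {

    // Turn known field -> value pairs into required clauses.
    // Assumes each value is a single word without query-syntax characters.
    static String required(Map<String, String> attributes) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            query.add("+" + entry.getKey() + ":" + entry.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the clauses in insertion order.
        Map<String, String> item = new LinkedHashMap<>();
        item.put("material", "leather");
        item.put("gender", "female");
        System.out.println(required(item)); // +material:leather +gender:female
    }
}
```

Dropping the '+' from a clause would make that field optional, so it
only contributes to the score instead of being required. Multi-word
values such as a description would need quoting or separate handling.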

HTH
Erick.

P.S. A tip for you: If anything I say contradicts something Paul says,
listen to Paul <G>...


On Mon, May 4, 2009 at 3:33 PM, Christian Bongiorno <christian@bongiorno.org> wrote:

> Yeah, you definitely got the idea. You're the second person to recommend
> putting each item in its own document and just storing the HTS code (which
> is easy for me). The HTS code actually comes with no extra info. I mean,
> there is info, but we don't store any of it.
>
> I will try as you and Paul have recommended. Once done, then I would need a
> MultiFieldQuery? Forgive me but the queries confuse me.
>
> Rebuilding my index will take some time, but I appreciate everyone's help
>
> Christian
>
> On Mon, May 4, 2009 at 11:40 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > Hmmmm, tricky. Let's see if I understand your problem.
> >
> > Basically, you have a bunch of HTS codes that have had
> > some number of items arbitrarily assigned to them, and
> > you want to see if you can make Lucene behave as a kind
> > of expert system to help you classify the next item.
> >
> > I *think* you'd get better results by indexing each item
> > along with its HTS code as a separate document. Because
> > what you really want to ask is "given the attributes of my
> > new item, what other items are 'most similar' to it?" and then
> > present the HTS codes from these items to the classifier
> > (perhaps a person?).
> >
> > I'm going to assume further that the HTS code has
> > some data associated with it that describes the
> > class, and that this data needs to be available to
> > the user to see if your suggestions are appropriate.
> > You could either index the HTS codes in another index
> > OR index them in the same index but simply store
> > the data (don't index it), so the HTS documents won't
> > interfere with your searches on "similar items".
> >
> > Mostly, this is just trying to see if I understand what
> > you're trying to accomplish. This may be gibberish, but
> > it's a start <G>.
> >
> > Best
> > Erick
> >
> >
> > On Mon, May 4, 2009 at 1:16 PM, Christian Bongiorno <christian@bongiorno.org> wrote:
> >
> > > I am trying to build a search (have been experimenting with using
> > > Lucene) and someone suggested contacting your team.
> > >
> > > Background:
> > > Currently the service I am working on applies taxes/duties to products
> > > for international shipping by looking up something called an HTS code
> > > (a universally recognized taxation code for duty/tariff). We already
> > > have almost a million items classified by HTS code. As many as 50k
> > > items fall into the same HTS code.
> > >
> > > For purposes of HTS classification:
> > > Description is only important if no other field exists. But taxation is
> > > based on things like material (leather, cloth, etc.) and product
> > > (shoes/bags/toys). Color is of fair relevancy as well (to a customs
> > > official, black boots or brown make no difference; it wasn’t made here
> > > so it must be taxed).
> > >
> > > The idea is to turn our entire existing knowledge base into an index,
> > > then when we get a new item that needs classification, we “search” for
> > > the “Document(hts)” that best matches, using the attributes of the
> > > item to be classified as the search query.
> > >
> > > The document structure, as I see it, should be:
> > >
> > > Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
> > > {Key,value},{Key,value},…} …}
> > >
> > > There are 1788 documents. Up to 50k ASINs and their attributes may fall
> > > into a single document.
> > >
> > > Some fields are straightforward and very good indicators of a match,
> > > such as:
> > >
> > > Material -> “leather”
> > > Gender -> “women”
> > >
> > > Others are fuzzier
> > >
> > > Description -> “Stylish full calf leather boots. Sleek Italian leather,
> > > designer”
> > >
> > > So for a query of:
> > > “Material” -> ”Leather”
> > > “Gender” -> ”womAn”
> > > “Description” -> ”Short leather shoes, Made in Denmark”
> > >
> > > I would expect a very high match here since the first 2 fields, which
> > > don’t vary much, are good indicators for HTS.
> > >
> > > I have searched through the archives and I don't see anything like what
> > > I am looking for.
> > >
> > > Basically, every item will have attributes which I am treating as
> > > "Field(item.key, item.value)". I think that's the right approach, but
> > > a multi-field query searches your terms across all fields.
> > > That isn't what I need. I know my fields and values very clearly, and
> > > that should give me enormous leverage when querying, if I could build
> > > a query to do that.
> > >
> > > Christian
> > >
> > > --
> > > Christian Bongiorno
> > >
> >
>
>
>
> --
> Christian Bongiorno
>
