lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search
Date Mon, 04 May 2009 23:53:15 GMT
Yes, SHOULD is what you want I think here.
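
To illustrate the difference, here is a toy model in plain Java of how MUST and SHOULD clauses decide whether a document is a hit. It is not the Lucene API: `OccurSketch`, `Clause`, and `score` are hypothetical names, matching is assumed to be exact and case-insensitive, and the "score" is just a count of matching clauses rather than a real relevance score.

```java
import java.util.List;
import java.util.Map;

// Toy model of MUST vs. SHOULD clause semantics; this is NOT the Lucene API.
final class OccurSketch {
    enum Occur { MUST, SHOULD }

    // One "field equals value" clause (hypothetical helper type).
    static final class Clause {
        final String field;
        final String value;
        final Occur occur;

        Clause(String field, String value, Occur occur) {
            this.field = field;
            this.value = value;
            this.occur = occur;
        }

        boolean matches(Map<String, String> doc) {
            // equalsIgnoreCase(null) is false, so a missing field never matches
            return value.equalsIgnoreCase(doc.get(field));
        }
    }

    // Returns the number of matching clauses, or -1 if the document is not a hit.
    static int score(Map<String, String> doc, List<Clause> clauses) {
        int matched = 0;
        boolean hasMust = false;
        for (Clause c : clauses) {
            boolean hit = c.matches(doc);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return -1; // a failed MUST clause rejects the document
                }
            }
            if (hit) {
                matched++;
            }
        }
        // With no MUST clauses, at least one SHOULD clause has to match.
        if (!hasMust && matched == 0) {
            return -1;
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of("material", "leather", "gender", "men");
        List<Clause> must = List.of(
            new Clause("material", "leather", Occur.MUST),
            new Clause("gender", "women", Occur.MUST));
        List<Clause> should = List.of(
            new Clause("material", "leather", Occur.SHOULD),
            new Clause("gender", "women", Occur.SHOULD));
        System.out.println(score(doc, must));   // all MUST: document rejected
        System.out.println(score(doc, should)); // all SHOULD: partial match kept
    }
}
```

With all-MUST clauses the partially matching document drops out entirely; with all-SHOULD clauses it stays in the results and simply ranks below documents that match more clauses.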

Best
Erick


On Mon, May 4, 2009 at 6:41 PM, Christian Bongiorno <christian@bongiorno.org> wrote:

> You mean to use
> BooleanQuery bq = new BooleanQuery();
> bq.add(new TermQuery(new Term("key", "value")), BooleanClause.Occur.MUST);
> // above is Erick's suggestion.
>
> If so, doesn't that mean that if they don't all match, I won't get a result?
> Wouldn't it be better to use SHOULD? The documentation doesn't give
> extra insight on that.
>
> As for fields where I expect looser matches, such as description, I should
> boost the other fields.
>
> Thanks again
>
> On Mon, May 4, 2009 at 1:32 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>
> > In the case of such queries with keywords (not analyzed tokens), I would
> > directly create the appropriate TermQuerys and combine them with
> > BooleanQuery. QueryParser is normally not for program-internal queries,
> > but rather for queries the user has entered. For your use case, it seems
> > better to just create the correct Query objects using standard
> > instantiation from the Java code.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > > Sent: Monday, May 04, 2009 9:51 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: multi-field index and search (Not MultiFieldQuery). Help
> > > setting up index and search
> > >
> > > MultiFieldQuery essentially (if I have this right) forms a "cross
> > > product". I.e., it is NOT required to specify specific values for
> > > discrete fields. MFQ helps form queries expressing something like
> > > "does any term appear in any field in a hit" or "does every term
> > > appear in some field of a hit, regardless of which field and not
> > > necessarily the same field" (depending upon whether the default
> > > operator is OR or AND). You can get something of the same effect by
> > > creating a special field that is the concatenation of all the other
> > > fields and searching that concatenated field with and/or (except that
> > > MFQ does interesting things with boosting).
> > >
> > > But if you know exactly what terms you require in which field, the
> > > standard query parser is fine. i.e. +material:leather +gender:female
> > > will look for "leather" ONLY in material and "female" ONLY in gender.
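
The `+field:term` query-parser syntax mentioned above can be generated from known field/value pairs. The following is a minimal sketch in plain Java; `FieldQueryStringSketch` and `build` are hypothetical names, no Lucene classes are used, and values are assumed to be single tokens with no characters that need escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Builds a query-parser style string from known field/value pairs.
// Assumption: values are single tokens that need no escaping or quoting.
final class FieldQueryStringSketch {
    // required = true prefixes each clause with '+', making it mandatory;
    // required = false leaves the clauses optional.
    static String build(Map<String, String> fields, boolean required) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String clause = e.getKey() + ":" + e.getValue();
            joiner.add(required ? "+" + clause : clause);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the clauses in insertion order
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("material", "leather");
        fields.put("gender", "female");
        System.out.println(build(fields, true));  // +material:leather +gender:female
        System.out.println(build(fields, false)); // material:leather gender:female
    }
}
```

The resulting string would then be handed to the query parser; for purely program-generated queries, building the query objects directly avoids the escaping concerns this sketch ignores.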
> > >
> > > HTH
> > > Erick.
> > >
> > > P.S. A tip for you: If anything I say contradicts something Paul says,
> > > listen to Paul <G>...
> > >
> > >
> > > On Mon, May 4, 2009 at 3:33 PM, Christian Bongiorno
> > > <christian@bongiorno.org> wrote:
> > >
> > > > Yeah, you definitely got the idea. You're the second person to
> > > > recommend putting each item in its own document and just storing the
> > > > HTS code (which is easy for me). The HTS code actually comes with no
> > > > extra info. I mean, there is info, but we don't store any of it.
> > > >
> > > > I will try as you and Paul have recommended. Once done, would I then
> > > > need a MultiFieldQuery? Forgive me, but the queries confuse me.
> > > >
> > > > Rebuilding my index will take some time, but I appreciate everyone's
> > > > help.
> > > >
> > > > Christian
> > > >
> > > > On Mon, May 4, 2009 at 11:40 AM, Erick Erickson
> > > > <erickerickson@gmail.com> wrote:
> > > >
> > > > > Hmmmm, tricky. Let's see if I understand your problem.
> > > > >
> > > > > Basically, you have a bunch of HTS codes that have had
> > > > > some number of items arbitrarily assigned to them, and
> > > > > you want to see if you can make Lucene behave as a kind
> > > > > of expert system to help you classify the next item.
> > > > >
> > > > > I *think* you'd get better results by indexing each item
> > > > > along with its HTS code as a separate document. Because
> > > > > what you really want to ask is "given the attributes of my
> > > > > new item, what other items are most similar to it?" and then
> > > > > present the HTS codes from these items to the classifier
> > > > > (perhaps a person?).
> > > > >
> > > > > I'm going to assume further that the HTS code has
> > > > > some data associated with it that describes the
> > > > > class, and that this data needs to be available to
> > > > > the user to see if your suggestions are appropriate.
> > > > > You could either index the HTS codes in another index
> > > > > OR index them in the same index but simply store
> > > > > the data (don't index it), so that the HTS documents won't
> > > > > interfere with your searches on "similar items".
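
The per-item approach described above can be sketched without Lucene. In this toy Java model, each item is its own "document" carrying a stored HTS code, and a new item's attributes are compared against every indexed item. The names (`SimilarItemSketch`, `Item`, `suggestHts`), the field weights, and the placeholder codes such as `HTS-A` are all made up for illustration; exact case-insensitive matching stands in for real text scoring.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy model: one "document" per item, each carrying its stored HTS code.
// Plain Java for illustration only; this is not the Lucene API.
final class SimilarItemSketch {
    static final class Item {
        final String hts;                // stored only, never searched
        final Map<String, String> attrs; // searchable attributes

        Item(String hts, Map<String, String> attrs) {
            this.hts = hts;
            this.attrs = attrs;
        }
    }

    // Sums the weights of query fields whose value equals the item's value
    // (case-insensitive). Fields without an explicit weight count as 1.0.
    static double similarity(Map<String, String> query, Item item,
                             Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, String> e : query.entrySet()) {
            if (e.getValue().equalsIgnoreCase(item.attrs.get(e.getKey()))) {
                total += weights.getOrDefault(e.getKey(), 1.0);
            }
        }
        return total;
    }

    // Returns up to k distinct HTS codes, taken from the best-scoring items first.
    static List<String> suggestHts(Map<String, String> query, List<Item> items,
                                   Map<String, Double> weights, int k) {
        List<Item> hits = new ArrayList<>();
        for (Item item : items) {
            if (similarity(query, item, weights) > 0.0) {
                hits.add(item); // items with no matching field are not hits
            }
        }
        hits.sort(Comparator.comparingDouble(
            (Item it) -> similarity(query, it, weights)).reversed());
        List<String> codes = new ArrayList<>();
        for (Item item : hits) {
            if (codes.size() == k) {
                break;
            }
            if (!codes.contains(item.hts)) {
                codes.add(item.hts);
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        List<Item> items = List.of(
            new Item("HTS-A", Map.of("material", "leather", "gender", "women")),
            new Item("HTS-B", Map.of("material", "cloth", "gender", "women")),
            new Item("HTS-C", Map.of("material", "rubber", "gender", "men")));
        Map<String, String> query = Map.of("material", "Leather", "gender", "women");
        Map<String, Double> weights = Map.of("material", 2.0, "gender", 1.0);
        System.out.println(suggestHts(query, items, weights, 2)); // [HTS-A, HTS-B]
    }
}
```

The candidate codes come back ranked, which fits the idea of presenting suggestions to a classifier (perhaps a person) rather than assigning a code automatically.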
> > > > >
> > > > > Mostly, this is just trying to see if I understand what
> > > > > you're trying to accomplish. This may be gibberish, but
> > > > > it's a start <G>.
> > > > >
> > > > > Best
> > > > > Erick
> > > > >
> > > > >
> > > > > On Mon, May 4, 2009 at 1:16 PM, Christian Bongiorno
> > > > > <christian@bongiorno.org> wrote:
> > > > >
> > > > > > I am trying to build a search (have been experimenting with using
> > > > > > Lucene) and someone suggested contacting your team.
> > > > > >
> > > > > > Background:
> > > > > > Currently the service I am working on applies taxes/duties to
> > > > > > products for international shipping by looking up something called
> > > > > > an HTS code (a universally recognized taxation code for
> > > > > > duty/tariff). We already have almost a million items classified by
> > > > > > HTS code. As many as 50k items fall into the same HTS code.
> > > > > >
> > > > > > For purposes of HTS classification, Description is only important
> > > > > > if no other field exists. But taxation is based on things like
> > > > > > material (leather, cloth, etc.) and product (shoes/bags/toys).
> > > > > > Color is of fair relevancy as well (to a customs official, black
> > > > > > boots or brown make no difference; it wasn't made here, so it must
> > > > > > be taxed).
> > > > > >
> > > > > > The idea is to turn our entire existing knowledge base into an
> > > > > > index; then, when we get a new item that needs classification, we
> > > > > > "search" for the "Document(hts)" that best matches, using the
> > > > > > attributes of the item to be classified as the search query.
> > > > > >
> > > > > > The document structure, as I see it, should be:
> > > > > >
> > > > > > Document(HTS) -> {{ASIN1: {Key,value},{Key,value},...}, {ASIN2:
> > > > > > {Key,value},{Key,value},...} ...}
> > > > > >
> > > > > > There are 1788 documents. Up to 50k ASINs and their attributes may
> > > > > > fall into a single document.
> > > > > >
> > > > > > Some fields are straightforward and very good indicators of a
> > > > > > match, such as
> > > > > >
> > > > > > Material -> "leather"
> > > > > > Gender -> "women"
> > > > > >
> > > > > > Others are fuzzier
> > > > > >
> > > > > > Description -> "Stylish full calf leather boots. Sleek Italian
> > > > > > leather, designer"
> > > > > >
> > > > > > So for a query of:
> > > > > > "Material" -> "Leather"
> > > > > > "Gender" -> "womAn"
> > > > > > "Description" -> "Short leather shoes, Made in Denmark"
> > > > > >
> > > > > > I would expect a very high match here since the first 2 fields,
> > > > > > which don't vary much, are good indicators for HTS.
> > > > > >
> > > > > > I have searched through the archives and I don't see anything like
> > > > > > what I am looking for.
> > > > > >
> > > > > > Basically, every item will have attributes which I am treating as
> > > > > > "Field(item.key, item.value)". I think that's the right approach,
> > > > > > but a multi-field query runs your terms across all fields in the
> > > > > > search. That isn't what I need. I very clearly know my fields and
> > > > > > values, and that should give me enormous leverage when querying, if
> > > > > > I could build a query to do that.
> > > > > >
> > > > > > Christian
> > > > > >
> > > > > > --
> > > > > > Christian Bongiorno
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Christian Bongiorno
> > > >
> >
> >
> >
> >
>
>
> --
> Christian Bongiorno
>
