Date: Mon, 4 May 2009 15:51:19 -0400
Subject: Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search
From: Erick Erickson
To: java-user@lucene.apache.org

MultiFieldQuery essentially (if I have this right) forms a "cross product".
I.e. it is NOT required to specify specific values for discrete fields. MFQ
helps form queries expressing something like "does any term appear in any
field in a hit" or "Does every term appear in some field of a hit, regardless
of which field and not necessarily the same field" (depending upon whether the
default operator is OR or AND).

You can get something of the same effect by creating a special field that
is the concatenation of all the other fields and searching that concatenated
field with and/or (except that MFQ does interesting things with boosting).

But if you know exactly what terms you require in which field, the standard
query parser is fine. i.e. +material:leather +gender:female
will look for "leather" ONLY in material and "female" ONLY in gender.

HTH
Erick.

P.S.
A tip for you: If anything I say contradicts something Paul says, listen to
Paul ...

On Mon, May 4, 2009 at 3:33 PM, Christian Bongiorno wrote:

> Yeah, you definitely got the idea. You're the second person to recommend
> putting each item in its own document and just store the HTS code (which is
> easy for me). The HTS code actually comes with no extra info. I mean, there
> is info, but we don't store any of it.
>
> I will try as you and Paul have recommended. Once done, then I would need a
> MultiFieldQuery? Forgive me but the queries confuse me.
>
> Rebuilding my index will take some time, but I appreciate everyone's help
>
> Christian
>
> On Mon, May 4, 2009 at 11:40 AM, Erick Erickson wrote:
>
> > Hmmmm, tricky. Let's see if I understand your problem.
> >
> > Basically, you have a bunch of HTS codes that have had
> > some number of items arbitrarily assigned to them, and
> > you want to see if you can make Lucene behave as a kind
> > of expert system to help you classify the next item.
> >
> > I *think* you'd get better results by indexing each item
> > along with its HTS code as a separate document. Because
> > what you really want to ask is: given the attributes of my
> > new item, what other items are "most similar" to it? Then
> > present the HTS codes from those items to the classifier
> > (perhaps a person?).
> >
> > I'm going to assume further that the HTS code has
> > some data associated with it that describes the
> > class, and that this data needs to be available to
> > the user to see if your suggestions are appropriate.
> > You could either index the HTS codes in another index
> > OR index them in the same index but simply store
> > the data (don't index it) and the HTS documents won't
> > interfere with your searches on "similar items".
> >
> > Mostly, this is just trying to see if I understand what
> > you're trying to accomplish. This may be gibberish, but
> > it's a start.
> >
> > Best
> > Erick
> >
> > On Mon, May 4, 2009 at 1:16 PM, Christian Bongiorno <christian@bongiorno.org> wrote:
> >
> > > I am trying to build a search (have been experimenting with using Lucene)
> > > and someone suggested contacting your team
> > >
> > > Background:
> > > Currently the service I am working on applies taxing/duties to products for
> > > international shipping by looking up something called an HTS code (a
> > > universally recognized taxation code for duty/tariff). We already have
> > > almost a million items classified by HTS code. As many as 50k items fall
> > > into the same HTS code.
> > >
> > > For purposes of HTS classification
> > > Description is only important if no other field exists. But taxation is
> > > based on things like material (leather, cloth, etc) and product
> > > (shoes/bags/toys). Color is of fair relevancy as well (to a customs official
> > > black boots or brown make no difference; it wasn’t made here so it must be
> > > taxed)
> > >
> > > The idea is to turn our entire existing knowledge base into an index, then
> > > when we get a new item that needs classification, we “search” for the
> > > “Document(hts)” that best matches by using the new item attributes for the
> > > item to be classified as the search query.
> > >
> > > The document structure, as I see it, should be:
> > >
> > > Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
> > > {Key,value},{Key,value},…} …}
> > >
> > > There are 1788 documents. Up to 50k ASINs and their attributes may fall into
> > > a single document.
> > >
> > > On some fields, they are straightforward and very good indicators of match.
> > > Such as
> > >
> > > Material -> “leather”
> > > Gender -> “women”
> > >
> > > Others are fuzzier
> > >
> > > Description -> “Stylish full calf leather boots. Sleek Italian leather,
> > > designer”
> > >
> > > So for a query of:
> > > “Material” -> ”Leather”
> > > “Gender” -> ”womAn”
> > > “Description” -> ”Short leather shoes, Made in Denmark”
> > >
> > > I would expect a very high match here since the first 2 fields, which don’t
> > > vary much, are good indicators for HTS.
> > >
> > > I have searched through the archives and I don't see anything like what I am
> > > looking for.
> > >
> > > Basically, every item will have attributes which I am treating as
> > > "Field(item.key, item.value)". I think that's the right approach but
> > > multi-field query queries your terms across all fields in the search. That
> > > isn't what I need. I very clearly know my fields and values and that should
> > > give me enormous leverage when querying if I could build a query to do that
> > >
> > > Christian
> > >
> > > --
> > > Christian Bongiorno
>
> --
> Christian Bongiorno
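
A minimal sketch, in plain Java with no Lucene dependency, of how the fielded
query from the top of this message (+material:leather +gender:female) could be
assembled from an item's known key/value attributes. FieldedQueryBuilder and
its methods are hypothetical names, not Lucene API. The split into "required"
and "optional" fields is an assumption about how one might treat strong
indicators such as material versus fuzzier ones such as description. The
resulting string is meant to be handed to the standard query parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not part of Lucene): builds a fielded query string. */
class FieldedQueryBuilder {

    // Characters the classic query syntax treats as special ('/' only matters
    // in later Lucene versions; escaping it elsewhere is assumed harmless).
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    /** Backslash-escapes special characters so the value is read literally. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** One clause: prefix ("+" or ""), field name, and the escaped value. */
    static String clause(String prefix, String field, String value) {
        String escaped = escape(value.trim());
        // Group multi-word values so every word is searched in this field only.
        String body = escaped.contains(" ") ? "(" + escaped + ")" : escaped;
        return prefix + field + ":" + body;
    }

    /** Required fields get a leading '+'; optional fields only affect scoring. */
    static String build(Map<String, String> required, Map<String, String> optional) {
        StringJoiner query = new StringJoiner(" ");
        required.forEach((field, value) -> query.add(clause("+", field, value)));
        optional.forEach((field, value) -> query.add(clause("", field, value)));
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> required = new LinkedHashMap<>();
        required.put("material", "leather");
        required.put("gender", "female");

        Map<String, String> optional = new LinkedHashMap<>();
        optional.put("description", "Short leather shoes, Made in Denmark");

        // Prints:
        // +material:leather +gender:female description:(Short leather shoes, Made in Denmark)
        System.out.println(build(required, optional));
    }
}
```

In real code, Lucene's own QueryParser.escape would likely replace the
hand-rolled escape above, and the field names are assumed to be simple
identifiers that need no escaping.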