From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: Design Consideration for lucene index
Date: Fri, 6 Oct 2006 15:34:16 -0400

If you're *sure* that your database solution isn't adequate .... see below.

On 10/6/06, smathews@funmobility.com wrote:
>
> I am a newbie to the lucene search area. I would like to know the best way
> to do the following using lucene in terms of efficiency and the size of the
> index.
>
> Question : #1
> I have a table that contains some tags. These tags are tagged against
> multiple images that are in a different table (potentially 20 to 30,000
> images). If I am searching for a tag phrase and get the corresponding
> images, the approach that I was thinking is to join these two tables and
> index the result set.
> For example:
> Tag(abc)- ImageId1, Tag(abc)-ImageId2, Tag(abc)-ImageId3 etc. Hence this
> is a fairly fat join. Assuming that we are doing like this how is the
> performance on lucene? If it is a bad design, what should be a better
> way of doing this? Looking forward to your valuable suggestions.

So, really, you're de-normalizing your database into an index. It seems that
what you're really doing here is, for each tag, storing a list of images.
Then, given a tag, you want all the images. What do you think about
something like this....

    Document doc = new Document();
    // IDs are often best untokenized, since you really don't want to split them up.
    doc.add(new Field("ID", "Tag(abc)", Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Stored, but not indexed.
    doc.add(new Field("images", "ImageId1", Field.Store.YES, Field.Index.NO));
    doc.add(new Field("images", "ImageId2", Field.Store.YES, Field.Index.NO));
    .
    .
    .
    writer.addDocument(doc);

Now, to get the images associated with a tag, you just search for the doc
whose ID is your tag, get the doc and read the stored images field. You'll
have to parse the image IDs out, but that should be trivial. The search
should be extremely fast since one and only one "document" matches.

There's no problem storing multiple values in the same document field. Or
you could assemble the whole list of IDs into a string and add the "images"
field only once.

or.....

You can vary this as you see fit. For instance, you could store each image
in its own field in the doc. There are ways to enumerate the fields in a
given document, so once your search was satisfied by tag id, you'd be off
and running.

    // Stored, but not indexed.
    doc.add(new Field("image1", "ImageId1", Field.Store.YES, Field.Index.NO));
    doc.add(new Field("image2", "ImageId2", Field.Store.YES, Field.Index.NO));

NOTE: there is no requirement that each document in a lucene index have the
same number or name of fields. In fact, you could create an index for which
no two documents had any field in common. Not, perhaps, a *useful* index,
but you could do it. If your head is in the DB table world, this may not
immediately occur to you ....

Don't know if this helps, but I thought I'd mention it.

> Question : #2
> I need to search the multiple fields from a table. The search phrase
> needs to look for the fields DESCRIPTION1 and DESCRIPTION2 in the table.
> I have done something like this:
> while (rs.next()) {
>     Document doc = new Document();
>     doc.add(new Field("ID", String.valueOf(rs.getInt("ID")),
>         Field.Store.YES, Field.Index.UN_TOKENIZED));
>     doc.add(new Field("Description1", rs.getString("Description1"),
>         Field.Store.YES, Field.Index.TOKENIZED));
>     doc.add(new Field("Description2", rs.getString("Description2"),
>         Field.Store.YES, Field.Index.TOKENIZED));
>     String content = rs.getString("Description1") + " " +
>         rs.getString("Description2");
>     doc.add(new Field("cContent", content, Field.Store.YES,
>         Field.Index.TOKENIZED));
>     list[0].add(doc);
> }
>
> Do I need to do the cContent part for searching? Is this increasing the
> size of the index? Is it better to create a dynamic query that looks for
> the description1 description2 field or use the cContent?

No, you do not need the cContent part for searching. Yes, it'll increase the
size of your index to include both (how could it not?).

Whether you should store description1 and description2, or just the
combination of the two, depends upon whether you ever expect to need to
distinguish between them during searching. All other things being equal, I
tend to favor leaving them in two distinct fields, as I don't believe
there's a noticeable penalty for searching both, and you preserve
information.

OTOH, it depends also on how you want to search your data. Let's say you
want to ask "Are terms A and B in the description fields?" If you store
them as distinct fields, you need to form something like

    (A is in description1 or description2) and
    (B is in description1 or description2)

Whereas if they are combined, all you have to ask is

    (A and B are in combined)

Also, suppose you have two description fields only "because we had to split
them up to fit them in fixed length columns in the DB". Putting them back
together actually makes the index representation of the problem truer to
the real problem space, so that's yet another consideration.....
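
To make the two query shapes above concrete, here is a minimal sketch that
only builds the query strings, written in Lucene's query-parser syntax
(`field:term`, with `+` marking a required clause). The class and method
names are hypothetical, and it assumes each term is a single plain word that
needs no escaping and that the parser's default operator inside the
parentheses is OR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: it only assembles query strings.
class DescriptionQueries {

    // Every term is required, but each term may match in any of the fields:
    // +(f1:a f2:a) +(f1:b f2:b)
    static String acrossFields(List<String> fields, List<String> terms) {
        return terms.stream()
            .map(term -> fields.stream()
                .map(field -> field + ":" + term)
                .collect(Collectors.joining(" ", "+(", ")")))
            .collect(Collectors.joining(" "));
    }

    // Every term is required in the single combined field: +f:a +f:b
    static String combinedField(String field, List<String> terms) {
        return terms.stream()
            .map(term -> "+" + field + ":" + term)
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Example terms are made up for illustration.
        List<String> terms = Arrays.asList("sunset", "beach");
        System.out.println(
            acrossFields(Arrays.asList("Description1", "Description2"), terms));
        System.out.println(combinedField("cContent", terms));
    }
}
```

The first string grows with the number of fields times the number of terms,
while the second grows only with the number of terms, which is the
convenience the combined field buys you at the cost of a larger index.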
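
Going back to question #1, the "one document per tag" idea amounts to
collapsing the joined (tag, image) rows into one list per tag before
indexing, and then parsing that list back out after the search. A rough
sketch of just that bookkeeping in plain Java, with no Lucene calls, is
below. The class and method names are hypothetical, and the space delimiter
assumes image IDs never contain spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the data shaping only.
class TagImageRows {

    // Collapses (tag, imageId) join rows into one entry per tag,
    // keeping tags and image IDs in the order the rows arrive.
    static Map<String, List<String>> groupByTag(List<String[]> rows) {
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        for (String[] row : rows) {
            byTag.computeIfAbsent(row[0], tag -> new ArrayList<>()).add(row[1]);
        }
        return byTag;
    }

    // Packs the IDs into one string, suitable for a single stored field value.
    static String pack(List<String> imageIds) {
        return String.join(" ", imageIds);
    }

    // Parses a packed string back into the individual image IDs.
    static List<String> unpack(String stored) {
        if (stored.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(stored.split(" ")));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"abc", "ImageId1"},
            new String[] {"abc", "ImageId2"},
            new String[] {"xyz", "ImageId3"});
        for (Map.Entry<String, List<String>> e : groupByTag(rows).entrySet()) {
            String stored = pack(e.getValue());
            System.out.println(e.getKey() + " -> " + unpack(stored));
        }
    }
}
```

Each map entry would become one document: the key goes in the untokenized
"ID" field and the packed string (or each ID separately) goes in the stored
"images" field.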

Hope this helps
Erick

> Please help me in figuring out these things.
> Thanks
>
> Mathews