Date: Sun, 7 Jan 2007 21:22:04 -0500
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: When to use HitCollector?

I've had to struggle with the reflex from RDBMS days that makes me want to
normalize data, which sounds like what tripped you up. It's just plain WRONG
to repeat the data and eat up space. Sound familiar? Except, of course,
it's not wrong in a search application. At least not necessarily...

Good luck. If you can get away with it, you might have a decent chance of
getting the index rebuilt in about the same time it would take to write more
efficient logic for your current system.

Best
Erick

On 1/7/07, Michael J. Prichard wrote:
>
> Hey Erick (and List).
>
> Yeah you have it pretty correct. I am totally kicking myself in the ass
> because I wrote all the indexing stuff! Ugh...one of our indexes is
> about 8GB so it will take a while to rebuild. I figure I can put a
> temporary fix out and then refactor the index with the right info and
> make it more efficient.
>
> I basically wrote the logic to search email with a FilteredQuery so it
> can remove what I don't want and then a separate document query. This
> is all in a BooleanQuery. I then wrote my own code to parse through
> the hits. It is not the best thing in the world but it seems to work ok.
>
> I appreciate your help!
>
> Thanks,
> -Michael
>
> Erick Erickson wrote:
>
> > I'm a little fuzzy on the structure of your index, but here's a stab....
> >
> > First, let me see if I understand your problem...
> > For an e-mail, you have a body and an attachment that are indexed as
> > separate Lucene documents.
> > For the body, you include from, to, cc (in other words, meta-data).
> > For the attachment, you do NOT include the meta-data.
> > For both the body and the attachment, you have an ID for the parent
> > e-mail that is the same for a body and attachment if they are from the
> > same e-mail (otherwise I don't see how you "determine the email to
> > display" for an attachment).
> >
> > You've got a couple of problems here. Anything you do to break up the
> > clauses into separate queries will do bad things for your relevance
> > scoring. That is, one query on the body and one on the attachment will
> > give you two lists that you'll then have to manually reconcile if
> > relevancy matters.
> >
> > Depending upon how many emails and attachments you get hits for, you
> > could do something like
> > 1> search for the body elements with the to/from/cc. Use the return
> > (perhaps with a HitCollector (definitely NOT a Hits object)) to
> > assemble a clause like ID=52343 or ID=985 or ID=8910 .... and then
> > re-submit some query like
> >
> > "text contains "search data" and (ID=52343 or ID=985 or ID=8910 ....)"
> >
> > BEWARE that, depending upon how many e-mails you get, you'll run afoul
> > of TooManyClauses exceptions. The default is 1,024 but you can make it
> > as big as memory/time allows. And, as you say, this is temporary until
> > you reconstruct your index.
> >
> > If this is totally irrelevant, perhaps you could add some more
> > detail....
> >
> > Best
> > Erick
> >
> >
> > On 1/7/07, Michael J. Prichard wrote:
> >
> >>
> >> I have an index which has email and their attachments indexed. This is
> >> ok but the issue I am having is when I am trying to filter the
> >> searches. For example I can search the content of the email and the
> >> document (i.e. the attachment) and return the right results.
> >> Basically, if it is a document I check the DB to see its parent and
> >> determine the email to display. The problem comes in when I try to
> >> use to, from and/or cc in my searches. It will only return emails
> >> since we did not index those fields along with the attachments.
> >> Ideally we would reindex and add those but I need a temporary fix
> >> until we can do that. SO...I tried a few various things including a
> >> basic search and then filtering on my own but that seriously slowed
> >> our interface since I had to check each result, etc. SO...I broke the
> >> query into two...search the docs and emails separately and only check
> >> the documents on return. That is ok.
> >>
> >> I was wondering...would HitCollector be something I should use?
> >> Basically have the searcher check documents to make sure they are ok
> >> to go (i.e. to, from, etc. is correct)?
> >>
> >> Make sense?
> >>
> >> Thanks!
> >> Michael
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
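
The two-pass workaround quoted above (collect parent e-mail IDs from the
metadata query, then re-submit the text query restricted to those IDs) can
be sketched roughly as follows. This is only an illustration in plain Java
with no Lucene dependency: the class IdClauseBatcher and the field names
"id" and "text" are made up for the example and are not from the thread or
from Lucene. In Lucene itself, the IDs would presumably be gathered in a
HitCollector's collect(int doc, float score) callback, and the clause limit
can be raised with BooleanQuery.setMaxClauseCount(int). The sketch instead
splits the IDs into batches so each query stays under a chosen limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): turns the parent-email IDs
// gathered in a first pass into one or more second-pass query strings.
class IdClauseBatcher {

    // The default clause limit mentioned in the thread.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    /**
     * Builds query strings of the form
     *   (textQuery) AND (id:A OR id:B ...)
     * with at most maxIdsPerQuery ID clauses in each string.
     * The field name "id" is an assumption made for this sketch.
     */
    static List<String> buildQueries(String textQuery, List<String> ids, int maxIdsPerQuery) {
        if (maxIdsPerQuery < 1) {
            throw new IllegalArgumentException("maxIdsPerQuery must be at least 1");
        }
        List<String> queries = new ArrayList<String>();
        // Walk the ID list in slices of maxIdsPerQuery.
        for (int start = 0; start < ids.size(); start += maxIdsPerQuery) {
            int end = Math.min(start + maxIdsPerQuery, ids.size());
            StringBuilder idClause = new StringBuilder();
            for (String id : ids.subList(start, end)) {
                if (idClause.length() > 0) {
                    idClause.append(" OR ");
                }
                idClause.append("id:").append(id);
            }
            queries.add("(" + textQuery + ") AND (" + idClause + ")");
        }
        return queries;
    }

    public static void main(String[] args) {
        // IDs as they might come back from the to/from/cc search.
        List<String> ids = Arrays.asList("52343", "985", "8910");
        // A tiny batch size of 2 just to show the splitting.
        for (String q : buildQueries("text:\"search data\"", ids, 2)) {
            System.out.println(q);
        }
    }
}
```

Results from separate batches would still have to be merged by hand, with
the same relevance caveat raised in the quoted message, so a single query
with a raised clause limit may be simpler when the ID list is not huge.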