Date: Thu, 7 Apr 2011 21:17:31 -0700
Subject: some basic questions on how Lucene/search engines work
From: Yang
To: java-user@lucene.apache.org

I'm new to Lucene and search engines, and have been struggling with these questions recently. I'd appreciate it a lot if you could shed some light on them.

Let's say I do a query on

    dog greyhound

Note that I did not quote the terms, i.e. this is not a phrase search.

What happens under the hood? Which term does Lucene use to look up the inverted index? I read somewhere that Lucene uses the term with the higher IDF (i.e. the more distinguishing term), in this case "greyhound", but what about "dog"? Does Lucene traverse the doclist of "dog" at all? If I provide multiple terms in my query, how does Lucene generally decide how many doclists to traverse?

I read that Lucene uses a combination of the "binary model" and the VSM. It then seems that, in the above case, it finds the full doclist of "dog" and that of "greyhound" (the binary model part), then finds the docs common to the two doclists, then orders them by score (the VSM part). Is it true that the FULL doclists are fetched first, or is some pruning done on the individual doclists? The talk at http://www.slideshare.net/abial/eurocon2010 covers pruning and tiered search, but is this the default behavior of Lucene?

How are the doclists sorted? (By IDF?? Sorry, I'm just beginning to sift through a lot of docs online and somehow got this impression, but I can't form a precise conclusion.)

Also, could you please recommend some good articles on how Lucene and search engines work? I've read the "anatomy of a search engine" paper (the Google paper by Sergey Brin & Larry Page), "Introduction to Information Retrieval" (Manning et al.), and "Lucene in Action".

Thanks
Yang
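
P.S. To make the model I described above concrete, here is a minimal Python sketch of what I imagine happens for "dog greyhound". Everything in it is my own assumption and not Lucene's actual code: the toy documents, the names build_index, intersect and search_and, and the plain tf * idf score are all made up for illustration. It fetches the FULL doclist of each term, keeps only the docs common to both (the binary part), then ranks the survivors by summed tf * idf (the VSM part). I sorted the doclists by doc id only because the merge needs some order; that is exactly one of the things I'm unsure about.

```python
import math
from collections import defaultdict

# Toy corpus (hypothetical data, for illustration only).
DOCS = [
    "the dog chased the cat",          # doc 0: dog only
    "a greyhound is a fast dog",       # doc 1: both terms
    "greyhound greyhound dog racing",  # doc 2: both terms, greyhound twice
    "cats sleep all day",              # doc 3: neither term
]

def build_index(docs):
    """Map each term to a doclist: [(doc_id, term_freq), ...] sorted by doc_id."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(p.items()) for term, p in postings.items()}

def idf(index, term, num_docs):
    """Plain log(N / df); an assumed weighting, not Lucene's formula."""
    df = len(index.get(term, []))
    return math.log(num_docs / df) if df else 0.0

def intersect(a, b):
    """Merge-walk two doclists sorted by doc_id; keep docs present in both."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            common.append((a[i][0], a[i][1], b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return common

def search_and(index, num_docs, t1, t2):
    """Binary part: intersect full doclists. VSM part: rank by summed tf * idf."""
    w1 = idf(index, t1, num_docs)
    w2 = idf(index, t2, num_docs)
    hits = intersect(index.get(t1, []), index.get(t2, []))
    scored = [(doc_id, tf1 * w1 + tf2 * w2) for doc_id, tf1, tf2 in hits]
    # Highest score first; ties broken by doc_id.
    return sorted(scored, key=lambda hit: (-hit[1], hit[0]))

index = build_index(DOCS)
for doc_id, score in search_and(index, len(DOCS), "dog", "greyhound"):
    print(doc_id, round(score, 3))
```

In this sketch both doclists are walked in full and no pruning happens, and "greyhound" only matters more because its idf weight is larger. My questions are whether Lucene really works this way by default, or whether it skips part of the "dog" doclist.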