From: Илья Казначеев
Organization: UlterWest
To: java-user@lucene.apache.org
Subject: Implementing CMS search function using Lucene
Date: Thu, 3 Apr 2008 16:24:15 +0400

Hello.

We are designing a CMS in Java, and I am trying to implement a site search function using Lucene.

The basic idea is this:
- The site features numerous objects that we would like to put into the index: pages, various text blocks on those pages, descriptions and keyword lists of those pages, static bits of HTML, goods sections with goods inside them, and so on.
- There will be a search form that site visitors use occasionally.
- Visitors are highly unlikely to use advanced queries. I assume 95% of queries will be either a few keywords or a phrase. We have to find the best matches for such queries.
- The one thing I want to introduce is "phrase in quotes" to search for an exact phrase.

Also, most of our sites are in Russian, so some support for Russian morphology, even if rudimentary, is a plus.

I have dug into the examples and have the following questions:
- Our objects are fairly structured, so I would like to introduce a lot of fields, something like five different ones for each object type. But, as far as I can see, every Query searches only one field. This is certainly bad, because users surely want to search *all* the fields at once. They aren't going to bother with queries. Maybe I can add queries over every field, joined by an 'or' operation, but wouldn't that be too slow? I don't want a search to take more than half a second on a reasonably sized index. Also, I don't want to hard-code the exact list of fields, since I might add more as I develop the system. Is this doable, and would it work? Or will I have to stuff all the text content of an object into one blob field and query that? Which way is more reasonable?
- Our objects have a hierarchy, e.g. blocks belong to a page. Is there a way to make Lucene handle the parent-child relation, somehow summing the hits in all children to find the best-matching parent? I assume not; in that case, is there a way for me to go through the list of matching documents and reduce it by adding up the blocks' scores to find the best-matching page?
- Is there a way to set weights for different fields?
  Let's say content has a weight of 1, the title has a weight of 5 and a picture caption has a weight of 0.5. If not, can I do that by hand?
- Is there something to support Russian morphology (roughly, "the last n letters of a word might change, and we should match all forms") in either the indexer or the searcher? Maybe "inexact match", QueryParser's ~ operator, would be enough? I heard that the Nutch project has something like that, but I wonder whether I would be able to reuse parts of Nutch, since I surely can't use Nutch as a whole.

If there are any other considerations, they are welcome.

Thanks in advance for your replies.
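
P.S. To make the parent-child question more concrete, here is a minimal sketch of the "add up the blocks' scores per page" step I have in mind. It is plain Java and calls nothing from Lucene. `PageScoreRollUp`, `BlockHit`, `rankPages` and the page ids are names I made up for illustration, and it assumes I can somehow get a page id and a score for every matching block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical post-processing of search hits; nothing here is Lucene API. */
class PageScoreRollUp {

    /** One matching block: the page it belongs to and its raw score (made-up type). */
    static final class BlockHit {
        final String pageId;
        final double score;

        BlockHit(String pageId, double score) {
            this.pageId = pageId;
            this.score = score;
        }
    }

    /** Sums block scores per page and returns the pages ordered best-first. */
    static List<Map.Entry<String, Double>> rankPages(List<BlockHit> hits) {
        Map<String, Double> totals = new HashMap<>();
        for (BlockHit hit : hits) {
            // Add this block's score to the running total of its parent page.
            totals.merge(hit.pageId, hit.score, Double::sum);
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(totals.entrySet());
        // Highest total first; the order of pages with equal totals is unspecified.
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<BlockHit> hits = Arrays.asList(
                new BlockHit("page-a", 0.5),
                new BlockHit("page-b", 1.25),
                new BlockHit("page-a", 1.0));
        for (Map.Entry<String, Double> entry : rankPages(hits)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

With the sample hits in `main`, the two blocks of page-a add up to 1.5, so page-a is listed ahead of page-b (1.25). Whether a plain sum is the right way to combine scores, rather than a maximum or an average, is part of what I am asking.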