Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 43211 invoked from network); 30 Apr 2010 21:17:21 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 30 Apr 2010 21:17:21 -0000 Received: (qmail 55800 invoked by uid 500); 30 Apr 2010 21:17:19 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 55752 invoked by uid 500); 30 Apr 2010 21:17:18 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 55744 invoked by uid 99); 30 Apr 2010 21:17:18 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 Apr 2010 21:17:18 +0000 X-ASF-Spam-Status: No, hits=4.2 required=10.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,SPF_HELO_PASS,SPF_NEUTRAL,T_TO_NO_BRKTS_FREEMAIL,URI_HEX X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [216.139.236.158] (HELO kuber.nabble.com) (216.139.236.158) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 Apr 2010 21:17:11 +0000 Received: from ben.nabble.com ([192.168.236.152]) by kuber.nabble.com with esmtp (Exim 4.63) (envelope-from ) id 1O7xZq-00052n-Ek for java-user@lucene.apache.org; Fri, 30 Apr 2010 14:16:50 -0700 Date: Fri, 30 Apr 2010 14:16:50 -0700 (PDT) From: MitchK To: java-user@lucene.apache.org Message-ID: <1272662210449-768902.post@n3.nabble.com> In-Reply-To: References: Subject: Re: Relevancy Practices MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org I found your thread at the Solr-user-list. However, it seems like your topic belongs more to Lucene in general? I copy my posting from there, so that everything is accessible by one thread. -------------------------------------------------------------------------- I think the problems one has to solve are depending on the usecases one has to deal with. It makes a difference whether I got much documents that are bloody similar but with different contexts and I have to determine what query applies to what context in what probability for which document - or if I have lots of editorialy managed documents with relatively clear contexts, because they offer human-created tags etc. I haven't made much experiences with Solr (and no experiences in a productive environment). However, those experiences I have made show that spliting the document's context in as small parts as possible is always a good idea. I don't mean splitting in a sense of making the part's of a document smaller. I mean that in a way of making it easier to decide which part of a document is more important than another. e.g.: I got a social network and every user is able to create his or her own blog - as a corporation I want to make them all searchable. It would be beneficial for high-quality search, if I am able to extract the introduction, the category (maybe added by the author). According to this: If this is not done by people, or not well done enough, than I need to do so algorithmically. e.g.: If I got a dictionary of person-names, than I could use the keepWordFilter to create a field I can facet *and* boost on. Let's say the user writes about Paris Hilton, Barrack Obama or any other well known person, than I can extract their names from the content in an easy way - of course this could be done better, but that's not the point here. If I search for "Obama's speech" all documents with "Obama" could get a boost. The difference between the solution without this keepWordFilter-feature would be, that Solr does not know that the most important word in this query is "Obama". It is only a shortcut of some ideas on how one can improve the relevancy with several features that Solr offers out-of-the-box. Some of them could be improved with external NLP-tools. My biggest problem with relevancy is, that I can't work with metadata computed on the fly or every hour out of the box (okay, you mentioned at the discussion on the dev-list that it may be possible, however I answered that the feature you talked about is not well documented, so that I don't know if it fits my needs or how to use it). How to avoid over- or under-tuning? Easily: Testing every change I made on scoring-factors against a lot of queries. If it looks good in 9 of 10 cases in a real good way, than the 10th case runs against a really bad query or could be solved with a facet or... there are a lot of ideas how to solve this. What I really want to say is: Test as much as you can and try to realize what your changes really mean (for example I can make a boost on the title of a document with a value of 1.000, every other field has got a boost-value between 1 and 10. I am relatively sure that this meets the needs for some queries but works catastrophal with the rest). It really helps to understand how Lucene's similarity works and what those factors mean in reality to your existing data. Maybe you need to change the smiliarity, because you don't want that the length of a document influences the score of it. Just some thougths. I don't think that I tell you much new stuff, however, if you got any questions or want to know more about this or that, please ask. Unfortunately I can't go to the ApacheCon, but hopefully it helps to give a good presentation. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Relevancy-Practices-tp765363p768902.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org