From: Chuck Williams
Date: Thu, 01 Jun 2006 07:05:27 -1000
To: java-dev@lucene.apache.org
Subject: Re: Lexicon access questions

This approach comes to mind.
You could model your semantic tags as tokens and index them at the same positions as the words or phrases to which they apply. This is particularly easy if you can integrate your taggers with your Analyzer. You would probably want to create one or more new Query subclasses to facilitate certain types of matching, making it easy to associate terms/phrases with different tags (e.g., OverlappingQuery).

This approach would support generation of queries that are tag-dependent, but would not directly help with using tags in a ranking algorithm for tag-independent queries. As an off-hand thought, you might be able to extend the idea to support this by naming your tags something like TERM_TAG, where TERM is the term they apply to (best if the character used for '_' cannot occur in any term). Then something like a TaggedTermQuery could easily find the tags relevant to a term in the query and iterate their docs/positions in parallel with those of the term (roughly equivalent to OverlappingQuery(term, PrefixQuery(term_*))).

Top-of-mind thoughts,

Chuck

eks dev wrote on 06/01/2006 12:10 AM:
> We have faced the following use case:
>
> In order to optimize performance and, more importantly, the quality of search results, we are forced to attach more attributes to particular words (Terms). Generic attributes like TF and IDF are useful for modelling our "similarity" only up to some level.
>
> Examples:
> 1. Is a Term a first or last name? (E.g., we have a comprehensive list of such words.) This enables us to make smarter (faster and better) queries in case someone has multiple first names, and it influences ranking...
> 2. Agreement weight and disagreement weight of some words are modelled differently.
> 3. Semantic classes of words influence ranking (whether something is a verb or a noun changes search strategy and ranking radically).
>
> On top of that, we can afford to load all terms in memory, in order to allow fast string distance calculations and some limited pattern matching using some strange Tries.
>
> Today, we solve these things by implementing totally redundant data structures that keep some kind of map Term->ValuesObject, which duplicates Lucene's Lexicon storage. Instead of "one access gets all" we access terms twice, using two different access paths: once using our dictionary and a second time implicitly via Query or so... So we introduce performance/memory penalties. (Please do not forget, we need to access a copy of the analyzed document in order to attach "additional info" to Terms.)
>
> I guess we are not the only ones to face such a case, as an increase in precision above TF/IDF can only be achieved by introducing some "domain semantics" where available. For this, "attaching" domain-specific info to a Term would be the perfect solution. Also, enabling flexible implementations for Lexicon access could give us some flexibility (e.g., the implementation in mg4j goes in that direction).
>
> Could somebody imagine a 2.x version of Lucene having some Interface, to be implemented with a clear contract, that would enable us to attach our own implementation for accessing the lexicon?
>
> Or even better, some hints on how I can do it today :)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
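
To make the reply's suggestion concrete (tags indexed as tokens at the same positions as their terms, named TERM_TAG so a prefix scan finds them), here is a minimal sketch. It assumes a toy in-memory positional index and does not use Lucene at all. The class TagOverlaySketch, its methods, and the separator character are hypothetical names chosen for illustration; OverlappingQuery and TaggedTermQuery from the reply are not implemented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy positional index (NOT Lucene) illustrating tags stored at term positions. */
class TagOverlaySketch {
    // Hypothetical separator playing the role of '_' in TERM_TAG;
    // assumed never to occur inside a real term.
    static final char SEP = '\u001F';

    // Sorted lexicon: token -> docId -> sorted positions.
    private final TreeMap<String, TreeMap<Integer, TreeSet<Integer>>> postings =
            new TreeMap<>();

    /**
     * Indexes one document. words[i] is stored at position i, and every tag in
     * tags.get(i) is stored at the same position under the name TERM + SEP + TAG.
     */
    public void addDocument(int docId, String[] words, List<List<String>> tags) {
        for (int pos = 0; pos < words.length; pos++) {
            add(words[pos], docId, pos);
            for (String tag : tags.get(pos)) {
                add(words[pos] + SEP + tag, docId, pos);
            }
        }
    }

    private void add(String token, int docId, int pos) {
        postings.computeIfAbsent(token, k -> new TreeMap<>())
                .computeIfAbsent(docId, k -> new TreeSet<>())
                .add(pos);
    }

    /** Tag-dependent lookup: documents in which the term occurs carrying the tag. */
    public List<Integer> docsWithTaggedTerm(String term, String tag) {
        TreeMap<Integer, TreeSet<Integer>> docs = postings.get(term + SEP + tag);
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs.keySet());
    }

    /**
     * Tag-independent lookup: all tags attached to one occurrence of a term,
     * found by a prefix range scan over the sorted lexicon.
     */
    public List<String> tagsAt(String term, int docId, int pos) {
        String prefix = term + SEP;
        // Every key that starts with prefix lies in [prefix, term + (SEP + 1)).
        String upper = term + (char) (SEP + 1);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, TreeMap<Integer, TreeSet<Integer>>> e
                : postings.subMap(prefix, upper).entrySet()) {
            TreeSet<Integer> positions = e.getValue().get(docId);
            if (positions != null && positions.contains(pos)) {
                result.add(e.getKey().substring(prefix.length()));
            }
        }
        return result;
    }
}
```

Because the lexicon is sorted, all tag tokens for a term sit next to each other, so the tags for an occurrence come from one range scan rather than from a separate Term->ValuesObject map. The separator is what keeps a term such as "john" from picking up the tags of "johnson".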