Message-ID: <4EDE24C6.3040909@gmail.com>
Date: Tue, 06 Dec 2011 15:20:54 +0100
From: "E. van Chastelet"
To: java-user@lucene.apache.org
Subject: Re: Spell check on a subset of an index ( 'namespace' aware spell checker)
References: <4EBBC08E.4020902@gmail.com> <4ECD031A.7060309@gmail.com>
In-Reply-To: <4ECD031A.7060309@gmail.com>

I'm still struggling with this. I've tried to implement the solution
mentioned in the previous reply, but unfortunately there is a blocking
issue: I cannot find a way to create another index from the source
index such that the new index contains the field values.

The only way to copy a document's field values from one index to
another is to use stored fields. But stored fields hold "the original
String in its entirety", not the analyzed String, which is what I need.

Is there another way to copy documents (with at least the spellcheck
field) from one index to another?

Recap:
I have a source index holding documents for different namespaces. Each
document has one (analyzed) field that should be used for spell
checking. I want to construct a spellchecker index for each namespace
separately. To accomplish this, I first get the list of namespaces
(each document has a namespace field in the original index). Then, for
each namespace, I get the list of documents that match this namespace.
Then I'd like to use this subset to construct a spellchecker index.

Regards,
Elmer


On 11/23/2011 03:28 PM, E. van Chastelet wrote:
> I currently have an idea to get it done, but it's not a nice solution.
>
> If we have an index Q with all documents for all namespaces, we first
> extract the list of all terms that appear for the field namespace in Q
> (this field indicates the namespace of the document).
>
> Then, for each namespace n in the terms list:
> - Get all docs from Q that match +namespace:n
> - Construct a temporary index from these docs
> - Use this temporary index to construct the dictionary, which the
>   SpellChecker can use as input.
> - Call indexDictionary on SpellChecker to create the spellcheck index
>   for the current namespace.
> - Delete the temporary index
>
> We now have separate spell check indexes for each namespace.
>
> Any suggestions for a cleaner solution?
>
> Regards,
> Elmer van Chastelet
>
>
>
> On 11/10/2011 01:16 PM, E. van Chastelet wrote:
>> Hi all,
>>
>> In our project we would like to have the ability to get search
>> results scoped to one 'namespace' (as we call it). This can easily
>> be achieved by using a filter or just an additional must-clause.
>> For the spellchecker (and our autocompletion, which is a modified
>> spellchecker), the story seems different. The spell checker index is
>> created using a LuceneDictionary, which has an IndexReader as source.
>> We would like to get (spellcheck/autocomplete) suggestions that are
>> scoped to one namespace (i.e. field 'namespace' should have a
>> particular value).
>> With a single source index containing docs for all namespaces, it
>> does not seem possible to create a spellcheck index for each
>> namespace the ordinary way.
>> Q1: Is there a way to construct a LuceneDictionary from a subset of a
>> single source index (all terms where namespace = %value%)?
>>
>> Another, maybe better solution is to customize the spellchecker by
>> adding an additional namespace field to the spellchecker index. At
>> query time, an additional must-clause is added, scoping the
>> suggestions to one (or more) namespace(s). The advantage of this is
>> to have a singleton spellchecker (or at least the index reader) for
>> all namespaces. This also means fewer open files for our application
>> (imagine if there are over 1000 namespaces).
>> Q2: Will there be a significant penalty (say more than 50% slower)
>> for the additional must-clause at query time?
>>
>> Q3: Or can you think of a better solution for this problem? :)
>>
>> How we currently do it: we currently use Lucene 3.1 with Hibernate
>> Search and we actually already have auto completion and spell
>> checking scoped to one namespace. This is currently achieved by using
>> index sharding, so each namespace has its own index and reader, and
>> another for spell check and auto completion. Unfortunately there are
>> some downsides to this:
>> - Our faceting engine has no good support for multiple indexes, so
>>   faceting only works on a single namespace
>> - Needs administration for mapping namespace identifier (String) to
>>   index number (integer)
>> - The number of shards (and thus namespaces) is currently hardcoded.
>>   At this moment it is set to 100, and this means Hibernate Search
>>   opens up 100 index readers/writers, while only n<100 are in use,
>>   and therefore:
>>   - Many open file descriptors
>>   - Hard limit on number of namespaces
>>
>> Therefore it seems better to switch back to having a single index for
>> all namespaces.
>>
>> Thanks!
>>
>> Regards,
>> Elmer van Chastelet
>>
>
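
The per-namespace dictionary construction discussed in this thread can
be illustrated without copying documents or relying on stored fields.
The following is only a minimal sketch in plain Java with no Lucene
dependency. It assumes the postings of the namespace field and of the
analyzed spellcheck field are available as in-memory maps from term to
document ids, which stands in for walking the index's terms and their
document lists. The class name NamespaceDictionaries and the method
build are hypothetical and not part of any library.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: models an inverted index as term -> document ids.
final class NamespaceDictionaries {

    /**
     * Builds one word list per namespace.
     *
     * @param namespacePostings  namespace value -> ids of docs in that namespace
     * @param spellcheckPostings analyzed term of the spellcheck field -> ids of
     *                           docs that contain the term
     * @return namespace -> sorted analyzed terms occurring in that namespace
     */
    static Map<String, SortedSet<String>> build(
            Map<String, Set<Integer>> namespacePostings,
            Map<String, Set<Integer>> spellcheckPostings) {
        Map<String, SortedSet<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> ns : namespacePostings.entrySet()) {
            SortedSet<String> words = new TreeSet<>();
            for (Map.Entry<String, Set<Integer>> term : spellcheckPostings.entrySet()) {
                // Keep the term if at least one doc of this namespace contains it.
                if (!Collections.disjoint(term.getValue(), ns.getValue())) {
                    words.add(term.getKey());
                }
            }
            result.put(ns.getKey(), words);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: docs 0 and 1 are in "wiki", doc 2 is in "shop".
        Map<String, Set<Integer>> namespaces = new HashMap<>();
        namespaces.put("wiki", new HashSet<>(Arrays.asList(0, 1)));
        namespaces.put("shop", new HashSet<>(Arrays.asList(2)));

        Map<String, Set<Integer>> terms = new HashMap<>();
        terms.put("lucene", new HashSet<>(Arrays.asList(0, 2)));
        terms.put("index", new HashSet<>(Arrays.asList(1)));
        terms.put("price", new HashSet<>(Arrays.asList(2)));

        // prints {shop=[lucene, price], wiki=[index, lucene]}
        System.out.println(build(namespaces, terms));
    }
}
```

Each resulting word list contains analyzed terms rather than original
field text, so it could in principle serve as the dictionary input for
one namespace's spellcheck index. The nested loop costs roughly
(number of namespaces) x (number of terms) set comparisons, which may
matter with over 1000 namespaces.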