lucene-java-user mailing list archives

From Michael Sokolov <soko...@ifactory.com>
Subject Re: Spell check on a subset of an index ( 'namespace' aware spell checker)
Date Wed, 23 Nov 2011 19:04:30 GMT
You could simply index every term with a namespace prefix, like:

Q::term

where Q is the namespace and term the term?

Then, when you do spell corrections, submit each candidate term with the 
namespace prefix prepended.
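
To make the prefix idea concrete, here is a minimal sketch in plain Java. It 
does not use Lucene's SpellChecker; a sorted in-memory set stands in for the 
spellcheck index and a plain Levenshtein distance stands in for the 
similarity measure. The class name NamespacedSpellSketch, its methods, and 
the sample terms are all hypothetical. It assumes namespace names never 
contain the "::" separator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Illustration only: not Lucene's SpellChecker, just the prefix idea.
class NamespacedSpellSketch {
    // Separator between namespace and term, as in "Q::term".
    static final String SEP = "::";

    // Sorted "namespace::term" keys; stands in for the spellcheck index.
    private final TreeSet<String> dictionary = new TreeSet<>();

    void add(String namespace, String term) {
        dictionary.add(namespace + SEP + term);
    }

    // Returns terms of the given namespace within maxDistance edits of word,
    // closest first, ties broken alphabetically.
    List<String> suggest(String namespace, String word, int maxDistance) {
        String prefix = namespace + SEP;
        List<String> hits = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so we can
        // start at the prefix and stop at the first key that lacks it.
        for (String key : dictionary.tailSet(prefix)) {
            if (!key.startsWith(prefix)) {
                break;
            }
            String term = key.substring(prefix.length());
            if (editDistance(word, term) <= maxDistance) {
                hits.add(term);
            }
        }
        hits.sort(Comparator
                .comparingInt((String t) -> editDistance(word, t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return hits;
    }

    // Classic two-row Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        NamespacedSpellSketch sketch = new NamespacedSpellSketch();
        sketch.add("docs", "lucene");
        sketch.add("docs", "lucent");
        sketch.add("blog", "lucerne");
        // Only terms indexed under "docs" are considered here.
        System.out.println(sketch.suggest("docs", "lucenr", 1));
    }
}
```

In this sketch the prefix is used only to scope the scan, and the distance is 
computed on the bare term. With a real spellcheck index the prefix would be 
part of the indexed word itself, so it would also count toward the similarity 
score; that may matter for short terms and should be checked.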

-Mike

On 11/23/2011 9:28 AM, E. van Chastelet wrote:
> I currently have an idea to get it done, but it's not a nice solution.
>
> If we have an index Q with all documents for all namespaces, we first 
> extract the list of all terms that appear for the field namespace in Q 
> (this field indicates the namespace of the document).
>
> Then, for each namespace n in the terms list:
>  - Get all docs from Q that match +namespace:n
>  - Construct a temporary index from these docs
>  - Use this temporary index to construct the dictionary, which the 
> SpellChecker can use as input.
>  - Call indexDictionary on SpellChecker to create spellcheck index for 
> current namespace.
>  - Delete temporary index
>
> We now have separate spell check indexes for each namespace.
>
> Any suggestions for a cleaner solution?
>
> Regards,
> Elmer van Chastelet
>
>
>
> On 11/10/2011 01:16 PM, E. van Chastelet wrote:
>> Hi all,
>>
>> In our project we would like the ability to get search results 
>> scoped to one 'namespace' (as we call it). This can easily be 
>> achieved by using a filter or just an additional must-clause.
>> For the spellchecker (and our autocompletion, which is a modified 
>> spellchecker), the story seems different. The spell checker index is 
>> created using a LuceneDictionary, which has an IndexReader as source. 
>> We would like to get (spellcheck/autocomplete) suggestions that are 
>> scoped to one namespace (i.e. field 'namespace' should have a 
>> particular value).
>> With a single source index containing docs for all namespaces, it 
>> does not seem possible to create a spellcheck index for each namespace 
>> the ordinary way.
>> Q1: Is there a way to construct a LuceneDictionary from a subset of a 
>> single source index (all terms where namespace = %value%) ?
>>
>> Another, maybe better solution is to customize the spellchecker by 
>> adding an additional namespace field to the spellchecker index. At 
>> query-time, an additional must-clause is added, scoping the 
>> suggestions to one (or more) namespace(s). The advantage of this is 
>> to have a singleton spellchecker (or at least the index reader) for 
>> all namespaces. This also means fewer open files for our application 
>> (imagine if there are over 1000 namespaces).
>> Q2: Will there be a significant penalty (say more than 50% slower) 
>> for the additional must-clause at query time?
>>
>> Q3: Or can you think of a better solution for this problem? :)
>>
>> How we currently do it: we currently use Lucene 3.1 with Hibernate 
>> Search and we actually already have auto completion and spell 
>> checking scoped to one namespace. This is currently achieved by using 
>> index sharding, so each namespace has its own index and reader, and 
>> another for spell check and auto completion. Unfortunately there are 
>> some downsides to this:
>> - Our faceting engine has no good support for multiple indexes, so 
>> faceting only works on a single namespace
>> - Needs administration for mapping namespace identifier (String) to 
>> index number (integer)
>> - The number of shards (and thus namespaces) is currently hardcoded. 
>> At this moment it is set to 100, which means Hibernate Search 
>> opens 100 index readers/writers while only n<100 are in use, and 
>> therefore:
>> - Many open file descriptors
>> - Hard limit on the number of namespaces
>>
>> Therefore it seems better to switch back to having a single index for 
>> all namespaces.
>>
>> Thanks!
>>
>> Regards,
>> Elmer van Chastelet
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

