lucene-java-user mailing list archives

From "E. van Chastelet" <evanchaste...@gmail.com>
Subject Re: Spell check on a subset of an index ( 'namespace' aware spell checker)
Date Wed, 23 Nov 2011 14:28:42 GMT
I have an idea for getting this done, but it is not a clean solution.

If we have an index Q with all documents for all namespaces, we first 
extract the list of all terms that appear for the field namespace in Q 
(this field indicates the namespace of the document).

Then, for each namespace n in the terms list:
  - Get all docs from Q that match +namespace:n
  - Construct a temporary index from these docs
  - Use this temporary index to construct the dictionary that the 
SpellChecker takes as input.
  - Call indexDictionary on the SpellChecker to create the spellcheck 
index for the current namespace.
  - Delete the temporary index
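
To make the steps above concrete, here is a minimal plain-Java model of the partitioning. It deliberately contains no Lucene calls: the Doc class, the whitespace/punctuation tokenizer and all method names are hypothetical stand-ins, and each resulting term set only plays the role of the dictionary that would be handed to the SpellChecker. Treat it as a sketch of the idea, not as the actual implementation.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of "one dictionary per namespace"; no Lucene involved. */
final class NamespaceDictionaries {

    /** Hypothetical stand-in for a document in Q: its namespace plus the text to spell check. */
    public static final class Doc {
        public final String namespace;
        public final String text;

        public Doc(String namespace, String text) {
            this.namespace = namespace;
            this.text = text;
        }
    }

    /** First step: collect every distinct value of the namespace field. */
    public static SortedSet<String> namespaces(List<Doc> index) {
        SortedSet<String> result = new TreeSet<>();
        for (Doc d : index) {
            result.add(d.namespace);
        }
        return result;
    }

    /**
     * Per-namespace step: keep only docs matching namespace:n and collect their terms.
     * With Lucene, this set corresponds to the dictionary built from the temporary index.
     */
    public static SortedSet<String> dictionaryFor(List<Doc> index, String n) {
        SortedSet<String> terms = new TreeSet<>();
        for (Doc d : index) {
            if (!d.namespace.equals(n)) {
                continue; // document belongs to another namespace
            }
            // Naive tokenizer (assumption): lowercase, split on non-word characters.
            for (String t : d.text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!t.isEmpty()) {
                    terms.add(t);
                }
            }
        }
        return terms;
    }

    /** Runs the per-namespace step for every namespace found in the index. */
    public static Map<String, SortedSet<String>> buildAll(List<Doc> index) {
        Map<String, SortedSet<String>> result = new TreeMap<>();
        for (String n : namespaces(index)) {
            result.put(n, dictionaryFor(index, n));
        }
        return result;
    }
}
```

In this model each namespace costs one full pass over the documents, which mirrors the cost of building one temporary index per namespace.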

We now have separate spell check indexes for each namespace.

Any suggestions for a cleaner solution?

Regards,
Elmer van Chastelet



On 11/10/2011 01:16 PM, E. van Chastelet wrote:
> Hi all,
>
> In our project we would like to be able to get search results scoped 
> to one 'namespace' (as we call it). This can easily be achieved by 
> using a filter or just an additional must-clause.
> For the spellchecker (and our autocompletion, which is a modified 
> spellchecker), the story seems different. The spell checker index is 
> created using a LuceneDictionary, which has an IndexReader as its source. 
> We would like to get (spellcheck/autocomplete) suggestions that are 
> scoped to one namespace (i.e. field 'namespace' should have a 
> particular value).
> With a single source index containing docs for all namespaces, it 
> does not seem possible to create a spellcheck index for each namespace 
> the ordinary way.
> Q1: Is there a way to construct a LuceneDictionary from a subset of a 
> single source index (all terms where namespace = %value%) ?
>
> Another, maybe better solution is to customize the spellchecker by 
> adding an additional namespace field to the spellchecker index. At 
> query-time, an additional must-clause is added, scoping the 
> suggestions to one (or more) namespace(s). The advantage of this is to 
> have a singleton spellchecker (or at least the index reader) for all 
> namespaces. This also means fewer open files for our application 
> (imagine if there are over 1000 namespaces).
> Q2: Will there be a significant penalty (say more than 50% slower) for 
> the additional must-clause at query time?
>
> Q3: Or can you think of a better solution for this problem? :)
>
> How we currently do it: we currently use Lucene 3.1 with Hibernate 
> Search and we actually already have auto completion and spell checking 
> scoped to one namespace. This is currently achieved by using index 
> sharding, so each namespace has its own index and reader, and another 
> for spell check and auto completion. Unfortunately there are some 
> downsides to this:
> - Our faceting engine has no good support for multiple indexes, so 
> faceting only works on a single namespace
> - Requires bookkeeping to map each namespace identifier (String) to 
> an index number (integer)
> - The number of shards (and thus namespaces) is currently hardcoded. 
> At this moment it is set to 100, which means Hibernate Search opens 
> 100 index readers/writers while only n<100 are in use, and therefore:
> - Many open file descriptors
> - Hard limit on the number of namespaces
>
> Therefore it seems better to switch back to having a single index for 
> all namespaces.
>
> Thanks!
>
> Regards,
> Elmer van Chastelet
>
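
The single-index alternative from the quoted message (one spellcheck index whose entries carry a namespace field, filtered at query time) could be modeled roughly as below. This is a plain-Java sketch under loud assumptions: the class and method names are hypothetical, and a linear scan with Levenshtein distance stands in for the real SpellChecker lookup, so it says nothing about the actual query-time cost asked about in Q2.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of one shared suggestion index with a namespace tag per entry. */
final class NamespacedSuggester {

    /** One entry of the shared index: a word plus the namespace it was seen in. */
    private static final class Entry {
        final String namespace;
        final String word;

        Entry(String namespace, String word) {
            this.namespace = namespace;
            this.word = word;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Index time: store the word together with its namespace. */
    public void add(String namespace, String word) {
        entries.add(new Entry(namespace, word));
    }

    /** Query time: the namespace check plays the role of the additional must-clause. */
    public List<String> suggest(String namespace, String input, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.namespace.equals(namespace)) {
                continue; // entry is outside the requested namespace
            }
            if (editDistance(e.word, input) <= maxDistance) {
                result.add(e.word);
            }
        }
        return result;
    }

    /** Levenshtein distance, computed row by row. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

A single instance serves every namespace here, which is the property that would keep the number of open readers and files constant as namespaces are added.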

