lucene-java-user mailing list archives

From "E. van Chastelet" <evanchaste...@gmail.com>
Subject Spell check on a subset of an index ( 'namespace' aware spell checker)
Date Thu, 10 Nov 2011 12:16:14 GMT
Hi all,

In our project we would like to be able to scope search results to one 
'namespace' (as we call it). This is easily achieved with a filter or 
an additional must-clause.
For the spellchecker (and our autocompletion, which is a modified 
spellchecker), the story seems different. The spell checker index is 
created using a LuceneDictionary, which has an IndexReader as source. We 
would like to get (spellcheck/autocomplete) suggestions that are scoped 
to one namespace (i.e. field 'namespace' should have a particular value).
With a single source index containing docs for all namespaces, it does 
not seem possible to create a spellcheck index for each namespace the 
ordinary way.
Q1: Is there a way to construct a LuceneDictionary from a subset of a 
single source index (all terms where namespace = %value%)?
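
To make Q1 concrete, here is a minimal sketch in plain Java of the word list such a scoped dictionary would have to produce. It makes no Lucene calls: `Doc` and `NamespaceWords` are hypothetical names, and an in-memory list of documents stands in for the source index. Wiring the resulting iterator into a custom spellchecker dictionary is an assumption and is not shown.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stand-in for an indexed document: a namespace value plus its terms. */
final class Doc {
    final String namespace;
    final List<String> terms;

    Doc(String namespace, String... terms) {
        this.namespace = namespace;
        this.terms = Arrays.asList(terms);
    }
}

/** Hypothetical helper: yields the distinct terms of all documents in one namespace. */
final class NamespaceWords {
    static Iterator<String> wordsFor(String namespace, Iterable<Doc> docs) {
        // A sorted set removes duplicates and gives a stable iteration order.
        SortedSet<String> words = new TreeSet<String>();
        for (Doc doc : docs) {
            // Only documents where namespace = %value% contribute terms.
            if (doc.namespace.equals(namespace)) {
                words.addAll(doc.terms);
            }
        }
        return words.iterator();
    }
}
```

Calling `NamespaceWords.wordsFor("a", docs)` once per namespace would give one word list per namespace, which is exactly the per-namespace spellcheck index the single source index makes awkward to build.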

Another, maybe better solution is to customize the spellchecker by 
adding an additional namespace field to the spellchecker index. At 
query-time, an additional must-clause is added, scoping the suggestions 
to one (or more) namespace(s). The advantage is that a singleton 
spellchecker (or at least a single index reader) serves all 
namespaces. This also means fewer open files for our application (imagine 
if there are over 1000 namespaces).
Q2: Will there be a significant penalty (say more than 50% slower) for 
the additional must-clause at query time?
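
The single-index idea can be sketched as follows, again in plain Java without Lucene. `ScopedSuggester` and its methods are hypothetical names. A linear scan with Levenshtein distance stands in for the real spellcheck index lookup, and the namespace comparison plays the role of the additional must-clause. Because of that simplification the sketch only illustrates the data model (one entry per word and namespace) and says nothing about the query-time cost asked about in Q2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one shared suggestion index with a namespace field per entry. */
final class ScopedSuggester {
    /** One entry: a word plus the namespace it occurs in. */
    private static final class Entry {
        final String word;
        final String namespace;

        Entry(String word, String namespace) {
            this.word = word;
            this.namespace = namespace;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    void add(String word, String namespace) {
        entries.add(new Entry(word, namespace));
    }

    /** Returns words from the given namespace within maxDistance edits of the input. */
    List<String> suggest(String input, String namespace, int maxDistance) {
        List<String> result = new ArrayList<String>();
        for (Entry e : entries) {
            // The namespace check corresponds to the additional must-clause.
            if (e.namespace.equals(namespace)
                    && !e.word.equals(input)
                    && levenshtein(input, e.word) <= maxDistance) {
                result.add(e.word);
            }
        }
        return result;
    }

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A word that occurs in several namespaces gets one entry per namespace in this model, so the shared index grows with the number of (word, namespace) pairs rather than with the number of distinct words.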

Q3: Or can you think of a better solution for this problem? :)

How we do it now: we use Lucene 3.1 with Hibernate Search, and we 
already have auto completion and spell checking scoped to one 
namespace. This is achieved by index sharding: each namespace has its 
own index and reader, plus another for spell check and auto 
completion. Unfortunately there are some downsides to this:
- Our faceting engine has no good support for multiple indexes, so 
faceting only works on a single namespace
- Requires administration to map each namespace identifier (String) to 
an index number (integer)
- The number of shards (and thus namespaces) is currently hardcoded. At 
the moment it is set to 100, which means Hibernate Search opens 
100 index readers/writers while only n<100 are in use, and therefore:
  - Many open file descriptors
  - Hard limit on number of namespaces

Therefore it seems better to switch back to having a single index for 
all namespaces.

Thanks!

Regards,
Elmer van Chastelet



