For the Japanese/English/French/German/Dutch/Russian/Spanish/Portuguese
dictionary (with lots of searchable metadata) that I am developing for
Android, I'm using a multi-field index that is queried from human input
(a single string), and I have to:
USE 1 : guess/associate each term/range to one (or more) relevant fields
(field disambiguation)
USE 2 : suggest relevant terms for a given field
I managed to make it work satisfactorily for plain prefixes with
WFSTCompletionLookup
<http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html>
structures, but much less satisfactorily for terms with wildcards/regexes
and for ranges, mainly because the Lookup interface is too limited
for my use cases. So I'm looking for something better.
for USE 1 :
I need to QUICKLY know
option A : if there are documents (a boolean)
option B : the number of documents (an int)
that
1) are in a range like "{an TO bam]"
2) have a specified term (like "an*", "an~1" or "/[ms]ad/")
for USE 2:
I need to QUICKLY get
option C : a most frequent completion (with number of docs) for a given
term like "an?b*" (WFSTCompletionLookup
<http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html>
only does an*)
option D : the set of terms (with number of docs) that match a regex
or a given term like "an?b*"
option E : the set of terms (with number of docs) that satisfy a range
"{an TO bam]"
Basically, options D and E give me all I would need, and both are
already possible with queries.
But I need it to be suggestion-quick (on a mobile phone, a 1s wait is
too much), not query-quick.
I think that the perfect structure for this would be :
for each field, a simple (in RAM) ordered list/navigable tree of term
+ number of docs, that would work well with automatons (FST ?),
with the right interface, say
getTerms(term : String) : ArrayList<Pair<String, Int>>
getTermsForRange(termA:String, termB:String, aIncluded : Boolean,
bIncluded : Boolean): ArrayList<Pair<String, Int>>
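
For illustration only, the interface above could be sketched in plain
Java on top of a java.util.TreeMap. This is not a Lucene class: the
class name FieldTermDictionary and the methods add and hasTermsInRange
are hypothetical, and the wildcard syntax is assumed to be just "?"
(one character) and "*" (any run of characters). The literal prefix of
the pattern is used to limit the scan to one contiguous slice of the
sorted terms.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical per-field dictionary (not a Lucene class): sorted term -> number of docs.
class FieldTermDictionary {
    private final TreeMap<String, Integer> docCounts = new TreeMap<String, Integer>();

    // Registers a term; counts for a repeated term are added together.
    public void add(String term, int docCount) {
        Integer old = docCounts.get(term);
        docCounts.put(term, old == null ? docCount : old + docCount);
    }

    // Option E: terms in a range such as {an TO bam]. Requires termA <= termB.
    public List<Map.Entry<String, Integer>> getTermsForRange(
            String termA, String termB, boolean aIncluded, boolean bIncluded) {
        return copy(docCounts.subMap(termA, aIncluded, termB, bIncluded));
    }

    // Option A: true if at least one term (hence at least one doc) falls in the range.
    public boolean hasTermsInRange(
            String termA, String termB, boolean aIncluded, boolean bIncluded) {
        return !docCounts.subMap(termA, aIncluded, termB, bIncluded).isEmpty();
    }

    // Option D: terms matching a wildcard such as "an?b*".
    public List<Map.Entry<String, Integer>> getTerms(String wildcard) {
        // Literal prefix before the first wildcard character.
        int cut = 0;
        while (cut < wildcard.length()
                && wildcard.charAt(cut) != '?' && wildcard.charAt(cut) != '*') {
            cut++;
        }
        String prefix = wildcard.substring(0, cut);

        // Translate the wildcard into a regex, quoting every literal character.
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '?') regex.append('.');
            else if (c == '*') regex.append(".*");
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        Pattern pattern = Pattern.compile(regex.toString(), Pattern.DOTALL);

        // Terms sharing the prefix are contiguous in sorted order, so stop at the first miss.
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, Integer> e : docCounts.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            if (pattern.matcher(e.getKey()).matches()) {
                out.add(new SimpleImmutableEntry<String, Integer>(e));
            }
        }
        return out;
    }

    // Copies a live map view into an independent list of immutable entries.
    private static List<Map.Entry<String, Integer>> copy(Map<String, Integer> view) {
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, Integer> e : view.entrySet()) {
            out.add(new SimpleImmutableEntry<String, Integer>(e));
        }
        return out;
    }
}
```

Note that summing the per-term counts over a range would only give an
upper bound for option B, since a document containing several terms of
the range would be counted once per term. Also, a pattern starting with
a wildcard has an empty prefix and falls back to a full scan of the
field's terms.
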
Does something like this exist in Lucene (an in-memory term dictionary)?
If not, I will have to code one. What would be the nearest class I could
use as a base for this structure ?
This structure could be built once from the index, with a filter to
remove docs that are not needed (for example, those that don't have
English translations, for German users...), and saved to disk/restored
from disk (to avoid heavy processing on an Android phone as much as
possible).
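
As a rough sketch of the save/restore step, assuming the structure is
simply a sorted map of term to number of docs, a flat binary dump with
the standard java.io data streams would be enough. The class name
TermDictionaryIO and the file layout (entry count, then term/count
pairs) are made up for this example and are not a Lucene format.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a Lucene class): dumps and reloads a term -> doc count map.
class TermDictionaryIO {

    // Layout: int entry count, then for each entry a UTF string and an int.
    public static void save(TreeMap<String, Integer> terms, OutputStream out)
            throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        data.writeInt(terms.size());
        for (Map.Entry<String, Integer> e : terms.entrySet()) {
            data.writeUTF(e.getKey());      // limited to 65535 encoded bytes per term
            data.writeInt(e.getValue());
        }
        data.flush();
    }

    // Reads back what save() wrote; entries arrive already sorted.
    public static TreeMap<String, Integer> load(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        int size = data.readInt();
        TreeMap<String, Integer> terms = new TreeMap<String, Integer>();
        for (int i = 0; i < size; i++) {
            String term = data.readUTF();
            terms.put(term, data.readInt());
        }
        return terms;
    }
}
```

The filtering would happen when the map is built from the index, so
the file on the phone only contains the terms that survive the filter.
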
Best regards,
Olivier