lucene-java-user mailing list archives

From Olivier Binda <olivier.bi...@wanadoo.fr>
Subject Suggesters
Date Sun, 06 Apr 2014 15:16:40 GMT
For the Japanese/English/French/German/Dutch/Russian/Spanish/Portuguese 
dictionary (with lots of searchable metadata) that I am developing for 
Android, I'm using a multi-field index that takes human input (a single 
string), and I have to

USE 1 : guess/associate each term/range with one (or more) relevant fields 
(field disambiguation)
USE 2 : suggest relevant terms for a given field

I managed to make it work in a satisfying way with WFSTCompletionLookup 
<http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html>
structures, and in a not so satisfying way for terms with wildcards/regex 
and for ranges, mainly because the Lookup interface is much too limited 
for my use cases. So I'm looking for something better.


for USE 1 :

I need to QUICKLY know

option A : if there are documents  (a boolean)
option B : the  number of documents (an int)

that

1) are in a range like "{an TO bam]"
2) have a specified term (like "an*", "an~1" or "/[ms]ad/")
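
A minimal sketch of options A and B for the range case, assuming the terms of one field and their document frequencies have already been loaded into a sorted in-memory map. It uses only java.util.TreeMap, not any Lucene API, and the names RangeProbe, add, hasTermInRange and docFreqSumInRange are hypothetical. Note that the sum of per-term document frequencies equals the number of matching documents only if each document carries at most one term of that field; otherwise it is an upper bound.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory term dictionary for one field: term -> document frequency. */
class RangeProbe {
    private final NavigableMap<String, Integer> docFreqByTerm = new TreeMap<String, Integer>();

    void add(String term, int docFreq) {
        docFreqByTerm.put(term, docFreq);
    }

    /** Sorted view of the terms inside the range; nothing is copied. */
    private NavigableMap<String, Integer> range(String lower, boolean lowerIncluded,
                                                String upper, boolean upperIncluded) {
        // Throws IllegalArgumentException if lower sorts after upper.
        return docFreqByTerm.subMap(lower, lowerIncluded, upper, upperIncluded);
    }

    /** Option A: is there at least one indexed term in the range? */
    boolean hasTermInRange(String lower, boolean lowerIncluded,
                           String upper, boolean upperIncluded) {
        return !range(lower, lowerIncluded, upper, upperIncluded).isEmpty();
    }

    /** Option B (upper bound): sum of the document frequencies of the terms in the range. */
    int docFreqSumInRange(String lower, boolean lowerIncluded,
                          String upper, boolean upperIncluded) {
        int sum = 0;
        for (int docFreq : range(lower, lowerIncluded, upper, upperIncluded).values()) {
            sum += docFreq;
        }
        return sum;
    }
}
```

With this layout, "{an TO bam]" maps to hasTermInRange("an", false, "bam", true). Iterating over the same subMap view would also yield the terms of a range together with their counts.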


for USE 2:

I need to QUICKLY get

option C : the most frequent completion (with number of docs) for a given 
term like "an?b*" (WFSTCompletionLookup 
<http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html>
only does an*)
option D : the set of terms (with number of docs) that satisfy a regex 
or a given term like "an?b*"
option E : the set of terms (with number of docs) that satisfy a range 
"{an TO bam]"
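
A minimal sketch of options C and D for a wildcard term such as "an?b*", again assuming a sorted in-memory map of term to document frequency. The idea is to seek to the literal prefix before the first wildcard ("an"), scan only the block of terms sharing that prefix, and filter them with a compiled pattern. This is plain java.util code rather than Lucene's automaton machinery, and the names WildcardTerms, literalPrefix, toPattern, getTerms and mostFrequent are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical wildcard lookup over a sorted term -> document frequency map. */
class WildcardTerms {
    private final NavigableMap<String, Integer> docFreqByTerm = new TreeMap<String, Integer>();

    void add(String term, int docFreq) {
        docFreqByTerm.put(term, docFreq);
    }

    /** Literal characters before the first wildcard, e.g. "an" for "an?b*". */
    static String literalPrefix(String wildcard) {
        int i = 0;
        while (i < wildcard.length() && wildcard.charAt(i) != '?' && wildcard.charAt(i) != '*') {
            i++;
        }
        return wildcard.substring(0, i);
    }

    /** Maps '?' to one character and '*' to any run of characters; other characters are quoted. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Option D: all matching terms with their document frequency, in term order. */
    List<Map.Entry<String, Integer>> getTerms(String wildcard) {
        String prefix = literalPrefix(wildcard);
        Pattern pattern = toPattern(wildcard);
        List<Map.Entry<String, Integer>> hits = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, Integer> e : docFreqByTerm.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted order: we have left the block of terms sharing the prefix
            }
            if (pattern.matcher(e.getKey()).matches()) {
                hits.add(new SimpleImmutableEntry<String, Integer>(e));
            }
        }
        return hits;
    }

    /** Option C: the matching term with the highest document frequency, or null if none. */
    Map.Entry<String, Integer> mostFrequent(String wildcard) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : getTerms(wildcard)) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }
}
```

The cost of a lookup is proportional to the number of terms sharing the literal prefix, so a pattern that starts with a wildcard degrades to a full scan of the field's terms.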


Basically, options D and E give me all I would need, and they are possible 
now with queries. But I need it to be suggestion-quick (on a mobile phone, 
a 1 s wait is too much), not query-quick.

I think that the perfect structure for this would be :

for each field, a simple in-RAM ordered list/navigable tree of term 
+ number of docs, that would work well with automatons (FST?), 
with the right interface, say

getTerms(term : String) : ArrayList<Pair<String, Int>>
getTermsForRange(termA : String, termB : String, aIncluded : Boolean, 
                 bIncluded : Boolean) : ArrayList<Pair<String, Int>>


Does something like this exist in Lucene (an in-memory term dictionary)? 
If not, I will have to code one. What would be the nearest class I could 
use to base this structure on?

This structure could be built once from the index, with a filter to 
remove the docs that are not needed (for example, those that don't have 
English translations for German users...), and then saved to and restored 
from disk (to avoid heavy processing on an Android phone as much as possible).
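
A minimal sketch of the save/restore step, assuming the per-field structure is a sorted map of term to document frequency that was built and filtered elsewhere. The file layout (an entry count followed by term/count pairs) and the names TermDictFile, save and load are hypothetical. DataOutputStream.writeUTF limits each term to 65535 encoded bytes, which is assumed to be enough for dictionary terms.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical on-disk format: entry count, then (term, docFreq) pairs in term order. */
class TermDictFile {

    static void save(NavigableMap<String, Integer> terms, File file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(terms.size());
            for (Map.Entry<String, Integer> e : terms.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue());
            }
        }
    }

    static NavigableMap<String, Integer> load(File file) throws IOException {
        NavigableMap<String, Integer> terms = new TreeMap<String, Integer>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String term = in.readUTF();
                terms.put(term, in.readInt());
            }
        }
        return terms;
    }
}
```

Loading still rebuilds the tree entry by entry, so this keeps the heavy work (reading the index and filtering documents) off the phone but does not make startup free.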

Best regards,
Olivier
