lucene-java-user mailing list archives

From 李文海 <...@whu.edu.cn>
Subject Question about high-performance methods to drive TFIDF queries.
Date Sun, 07 Jan 2018 13:52:21 GMT
Hi, all.
   Recently, we have been running experiments on Lucene based on TFIDF.
   We want to retrieve from the corpus all documents whose similarity to a given query (q)
is no less than a threshold. The similarity between each document (d) and the query is
computed with the following scoring function.

   sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q)),

   where norm(d) is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ), and norm(q)
is defined analogously with tf(t,q).
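
   For concreteness, here is a minimal sketch of the scoring function above in plain Java. It does not use any Lucene API: the sparse vectors are ordinary maps from term to term frequency, and the class and method names (TfIdfCosine, norm, cosine) are hypothetical names chosen for illustration only.

```java
import java.util.Map;

class TfIdfCosine {
    // norm(v) = sqrt( sum over t of (tf(t,v) * idf(t))^2 )
    static double norm(Map<String, Integer> tf, Map<String, Double> idf) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Terms without an idf entry are assumed to have weight 0.
            double w = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // sum(tf(t,d) * idf(t) * tf(t,q) * idf(t)) / (norm(d) * norm(q))
    static double cosine(Map<String, Integer> doc,
                         Map<String, Integer> query,
                         Map<String, Double> idf) {
        double dot = 0.0;
        // Only terms shared by the query and the document contribute.
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            Integer tfD = doc.get(e.getKey());
            if (tfD == null) continue;
            double w = idf.getOrDefault(e.getKey(), 0.0);
            dot += (tfD * w) * (e.getValue() * w);
        }
        double denom = norm(doc, idf) * norm(query, idf);
        return denom == 0.0 ? 0.0 : dot / denom;
    }
}
```

   For example, with idf(a) = 1 and idf(b) = 2, a document {a: 2, b: 1} has norm sqrt(8), and its similarity to the query {a: 1} is 2 / sqrt(8), about 0.707.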

  We run this query by scanning the postings (the related docIds) of every term in the query.
The postings of each term are obtained with PostingsEnum docEnum = MultiFields.getTermDocsEnum(indexReader,
"text", term.bytes()). After the inner products of these related documents have been accumulated,
the final similarities are computed by dividing each inner product by norm(d) * norm(q).
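
   The scan described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not our actual code and not a Lucene API: an in-memory map of postings (each entry a {docId, tf} pair) stands in for the PostingsEnum, the document norms are assumed to be precomputed, and ThresholdScan and similarDocs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ThresholdScan {
    // postings: term -> array of {docId, tf} pairs (stand-in for a postings enum).
    // docNorms: precomputed norm(d), indexed by docId.
    static List<Integer> similarDocs(Map<String, int[][]> postings,
                                     Map<String, Double> idf,
                                     double[] docNorms,
                                     Map<String, Integer> queryTf,
                                     double threshold) {
        double[] dot = new double[docNorms.length]; // one accumulator per document
        double qNormSq = 0.0;
        // Term-at-a-time: walk each query term's postings and add its contribution.
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            double w = idf.getOrDefault(q.getKey(), 0.0);
            double qWeight = q.getValue() * w;
            qNormSq += qWeight * qWeight;
            int[][] list = postings.get(q.getKey());
            if (list == null) continue; // term does not occur in the corpus
            for (int[] p : list) {
                dot[p[0]] += (p[1] * w) * qWeight;
            }
        }
        double qNorm = Math.sqrt(qNormSq);
        List<Integer> hits = new ArrayList<>();
        if (qNorm == 0.0) return hits;
        // Normalize each inner product and keep documents at or above the threshold.
        for (int d = 0; d < dot.length; d++) {
            if (dot[d] > 0.0 && docNorms[d] > 0.0
                    && dot[d] / (docNorms[d] * qNorm) >= threshold) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

   The cost of this sketch grows with the total length of the postings lists touched by the query plus one pass over the accumulators, which is the merge cost that hurts us on a large corpus.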

   However, when the corpus scales up, e.g., to more than ten million Twitter titles, each with
about 10 terms on average in its text field, the runtime becomes unacceptable (more than ten
seconds), since we always need to merge 0.5 to 2 million documents to generate the inner products.
Does Lucene provide a more efficient interface for generating ranked results based on TFIDF, or for
directly filtering out the dissimilar documents (in Lucene core) given a threshold in the range (0, 1)?

Best,
Wenhai 