lucene-dev mailing list archives

From "Wenhai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-8123) Question about how to retrieve by TFIDFSimilarity query on lucene
Date Sun, 07 Jan 2018 08:25:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenhai updated LUCENE-8123:
---------------------------
    Description: 
Hi all,
     We have recently been running experiments on Lucene with TF-IDF scoring.
     We want to retrieve every document in the corpus whose similarity to a given query (q)
is no less than a threshold. For each document (d), we use the following scoring function:
    sum(tf(t,d) * idf(t) * tf(t,q) * idf(t)) / (norm(d) * norm(q))
    where norm(d) is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ), and norm(q)
is computed the same way from tf(t,q).
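
As a rough illustration of the scoring function above, here is a minimal sketch in plain Java. It does not use Lucene: the class name TfIdfCosine, the method names, and the in-memory maps holding tf and idf values are all hypothetical, chosen only to make the formula concrete.

```java
import java.util.Map;

public class TfIdfCosine {

    // norm(v) = sqrt( sum over t of (tf(t,v) * idf(t))^2 )
    public static double norm(Map<String, Integer> tf, Map<String, Double> idf) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double weight = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }

    // score(d, q) = sum(tf(t,d) * idf(t) * tf(t,q) * idf(t)) / (norm(d) * norm(q))
    public static double score(Map<String, Integer> docTf,
                               Map<String, Integer> queryTf,
                               Map<String, Double> idf) {
        double dot = 0.0;
        // Only terms present in both the query and the document contribute.
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            Integer tfInDoc = docTf.get(q.getKey());
            if (tfInDoc == null) {
                continue;
            }
            double termIdf = idf.getOrDefault(q.getKey(), 0.0);
            dot += tfInDoc * termIdf * q.getValue() * termIdf;
        }
        double denom = norm(docTf, idf) * norm(queryTf, idf);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        Map<String, Double> idf = Map.of("a", 1.0, "b", 2.0);
        Map<String, Integer> doc = Map.of("a", 1, "b", 1);
        Map<String, Integer> query = Map.of("a", 1);
        System.out.println(score(doc, query, idf));
    }
}
```

With these made-up values the inner product is 1, norm(d) is sqrt(5) and norm(q) is 1, so the score is 1/sqrt(5).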

    We run this query by scanning the postings (docIds) of every term in the query. The
postings for each term are obtained with  PostingsEnum docEnum = MultiFields.getTermDocsEnum(indexReader,
"text", terms.get(i).bytes()) . Once the inner products for these documents have been
accumulated, the final similarities are obtained by dividing each inner product by the corresponding norms.
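
The scan described above can be sketched roughly as follows. This is a simplified stand-in, not the actual code from our experiment: plain int arrays of {docId, freq} pairs take the place of the PostingsEnum returned by MultiFields.getTermDocsEnum, document norms are assumed to be precomputed into an array indexed by docId, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class TermAtATimeScan {

    // Returns docId -> score for every document whose score is >= threshold.
    // postings: term -> array of {docId, freq} pairs (stand-in for a PostingsEnum).
    // docNorms: assumed precomputed norm(d), indexed by docId.
    public static Map<Integer, Double> search(Map<String, int[][]> postings,
                                              Map<String, Double> idf,
                                              double[] docNorms,
                                              Map<String, Integer> queryTf,
                                              double threshold) {
        Map<Integer, Double> innerProducts = new HashMap<>();
        double queryNormSquared = 0.0;

        // Term at a time: walk the postings of each query term and
        // accumulate that term's contribution to each document's inner product.
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            double termIdf = idf.getOrDefault(q.getKey(), 0.0);
            double queryWeight = q.getValue() * termIdf;
            queryNormSquared += queryWeight * queryWeight;

            int[][] termPostings = postings.get(q.getKey());
            if (termPostings == null) {
                continue; // term does not occur in the corpus
            }
            for (int[] posting : termPostings) {
                int docId = posting[0];
                int freq = posting[1];
                innerProducts.merge(docId, freq * termIdf * queryWeight, Double::sum);
            }
        }

        // Normalize each accumulated inner product and apply the threshold.
        double queryNorm = Math.sqrt(queryNormSquared);
        Map<Integer, Double> hits = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : innerProducts.entrySet()) {
            double denom = docNorms[e.getKey()] * queryNorm;
            double score = denom == 0.0 ? 0.0 : e.getValue() / denom;
            if (score >= threshold) {
                hits.put(e.getKey(), score);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up two-document corpus, for illustration only.
        Map<String, int[][]> postings = Map.of(
                "a", new int[][] {{0, 1}, {1, 2}},
                "b", new int[][] {{0, 1}});
        Map<String, Double> idf = Map.of("a", 1.0, "b", 2.0);
        double[] docNorms = {Math.sqrt(5.0), 2.0};
        System.out.println(search(postings, idf, docNorms, Map.of("a", 1), 0.5));
    }
}
```

The cost of this approach grows with the total length of the postings lists of the query terms, since every posting is visited and an accumulator is kept for every matching document.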

    However, when the corpus grows, e.g., to more than ten million documents, the runtime
becomes unacceptable (more than ten seconds). Does Lucene provide a more efficient interface for
generating ranked results based on TF-IDF?

Best
Wenhai 


> Question about how to retrieve by TFIDFSimilarity query on lucene
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 7.2
>            Reporter: Wenhai
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
