lucene-solr-user mailing list archives

From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject Re: What are the options for obtaining IDF at interactive speeds?
Date Wed, 03 Jul 2013 18:46:14 GMT
Katie,

I'm finding this use case hard to follow. Let me offer a counterexample
so that you can explain the problem better by pointing out the gap.
debugQuery=true already provides tf and idf for the terms and documents on
the requested page of results. Why can't you use the explain output to
solve the problem?
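
As a rough illustration of what "use explain" could look like on the client
side, here is a minimal sketch that pulls per-term idf values out of one
explain string. It assumes the explain text has the shape printed by
DefaultSimilarity in Lucene 4.x, with lines such as "weight(text:foo in 0)"
and "1.5 = idf(docFreq=3, maxDocs=10)"; check this against your own
debugQuery output before relying on it. The class ExplainIdf and the method
idfByTerm are hypothetical names, not part of Solr or SolrJ, and the sketch
only handles single-term clauses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Solr/SolrJ class): extracts idf per term from
// one explain string. The line formats matched here are an assumption based
// on Lucene 4.x DefaultSimilarity explain output.
class ExplainIdf {
    // Assumed shape: "weight(field:term in 42)"
    private static final Pattern WEIGHT =
        Pattern.compile("weight\\(([^:()\\s]+):(\\S+) in \\d+\\)");
    // Assumed shape: "1.5 = idf(docFreq=3, maxDocs=10)"
    private static final Pattern IDF =
        Pattern.compile("([0-9.Ee+-]+) = idf\\(docFreq=(\\d+), maxDocs=(\\d+)\\)");

    static Map<String, Double> idfByTerm(String explain) {
        Map<String, Double> result = new LinkedHashMap<>();
        String current = null; // most recent "field:term" seen
        for (String line : explain.split("\n")) {
            Matcher w = WEIGHT.matcher(line);
            if (w.find()) {
                current = w.group(1) + ":" + w.group(2);
            }
            Matcher i = IDF.matcher(line);
            if (i.find() && current != null) {
                // The same term may report idf more than once; the value is identical.
                result.put(current, Double.parseDouble(i.group(1)));
            }
        }
        return result;
    }
}
```

Running this over the explain string of each document on the page and
merging the maps would give one idf per term without extra round trips,
provided the terms you care about are the query terms.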


On Wed, Jul 3, 2013 at 1:06 AM, Kathryn Mazaitis
<kathryn.rivard@gmail.com> wrote:

> Hi,
>
> I'm using SOLRJ to run a query, with the goal of obtaining:
>
> (1) the retrieved documents,
> (2) the TF of each term in each document,
> (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> fine too)
>
> ...all at interactive speeds, or <10s per query. This is a demo, so if all
> else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
>
> (1) and (2) are working; I completed the patch posted in the following
> issue:
> https://issues.apache.org/jira/browse/SOLR-949
> and am just setting tv=true&tv.tf=true for my query. This way I get the
> documents and the tf information all in one go.
>
> With (3) I'm running into trouble. I have found 2 ways to do it so far:
>
> Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
> information along with the documents and tf information. Since each term
> may appear in multiple documents, this means retrieving idf information for
> each term about 20 times, and takes over a minute to do.
>
> Option B: After I've gathered the tf information, run through the list of
> terms used across the set of retrieved documents, and for each term, run a
> query like:
> {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> ...while this retrieves idf information only once for each term, the added
> latency for doing that many queries piles up to almost two minutes on my
> current corpus.
>
> Is there anything I didn't think of -- a way to construct a query to get
> idf information for a set of terms all in one go, outside the bounds of
> what terms happen to be in a document?
>
> Failing that, does anyone have a sense for how far I'd have to scale down a
> corpus to approach interactive speeds, if I want this sort of data?
>
> Katie
>
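
Regarding Option A in the quoted message: if tv.df=true already returns the
document frequency of each term, idf could be derived on the client and
computed once per distinct term instead of once per occurrence. The sketch
below uses the formula from Lucene's DefaultSimilarity,
1 + ln(numDocs / (docFreq + 1)). It assumes you know the total document
count (numDocs) from somewhere else, and the class IdfCache is a hypothetical
helper, not part of Solr or SolrJ. Lucene returns a float; this sketch keeps
a double.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Solr/SolrJ class): derives idf from df on the
// client and remembers it per term.
class IdfCache {
    private final long numDocs; // assumed to be supplied by the caller
    private final Map<String, Double> cache = new HashMap<>();

    IdfCache(long numDocs) {
        this.numDocs = numDocs;
    }

    // Formula used by Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Computes idf the first time a term is seen; later calls reuse the stored value.
    double idfFor(String term, long docFreq) {
        return cache.computeIfAbsent(term, t -> idf(docFreq, numDocs));
    }

    int size() {
        return cache.size();
    }
}
```

Note that this gives index-wide idf. If idf over only the retrieved set is
wanted, df would have to be counted over those documents instead.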



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>
