lucene-solr-user mailing list archives

From Mikhail Khludnev
Subject Re: What are the options for obtaining IDF at interactive speeds?
Date Wed, 03 Jul 2013 18:46:14 GMT

This case is actually really hard to understand. Let me offer a
counter-example, so that you can explain the problem better by pointing out
what it fails to cover. debugQuery=true already provides tf and idf for the
terms and documents on the requested page of results. Why can't you use the
explain output to solve the problem?
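
To make the suggestion concrete, here is a minimal sketch of the request
parameters I have in mind. The class and method names are made up for
illustration, and the query text:solr is only a placeholder; the one part that
matters is debugQuery=true, which makes Solr return an "explain" entry (with
the tf and idf factors) for each document on the requested page.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the parameter string for a search request
// that also asks Solr for per-document score explanations.
class ExplainRequest {

    // URL-encodes a single parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    // start/rows select the page of results; explanations are returned
    // only for the documents on that page.
    static String params(String userQuery, int start, int rows) {
        return "q=" + enc(userQuery)
                + "&start=" + start
                + "&rows=" + rows
                + "&debugQuery=true";
    }

    public static void main(String[] args) {
        System.out.println(params("text:solr", 0, 10));
    }
}
```

With SolrJ the same parameters can be set on the query object, and the
explanations should then be readable from the response (getExplainMap() on
QueryResponse, if I remember the name correctly).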

On Wed, Jul 3, 2013 at 1:06 AM, Kathryn Mazaitis wrote:

> Hi,
> I'm using SOLRJ to run a query, with the goal of obtaining:
> (1) the retrieved documents,
> (2) the TF of each term in each document,
> (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> fine too)
> ...all at interactive speeds, or <10s per query. This is a demo, so if all
> else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
> (1) and (2) are working; I completed the patch posted in the following
> issue:
> and am just setting tv=true& for my query. This way I get the
> documents and the tf information all in one go.
> With (3) I'm running into trouble. I have found 2 ways to do it so far:
> Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
> information along with the documents and tf information. Since each term
> may appear in multiple documents, this means retrieving idf information for
> each term about 20 times, and takes over a minute to do.
> Option B: After I've gathered the tf information, run through the list of
> terms used across the set of retrieved documents, and for each term, run a
> query like:
> {!func}idf(text,'the_term')&defType=func&fl=score&rows=1
> ...while this retrieves idf information only once for each term, the added
> latency for doing that many queries piles up to almost two minutes on my
> current corpus.
> Is there anything I didn't think of -- a way to construct a query to get
> idf information for a set of terms all in one go, outside the bounds of
> what terms happen to be in a document?
> Failing that, does anyone have a sense for how far I'd have to scale down a
> corpus to approach interactive speeds, if I want this sort of data?
> Katie
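
On the "all in one go" question, one thing that may be worth trying is to ask
for several idf() values as pseudo-fields in the fl parameter of a single
request, rather than one request per term. The sketch below only builds the
parameter string. It assumes a Solr version that accepts function queries in
fl (4.0 or later, as far as I know) and that a backslash escapes a quote
inside a function argument. The class name, the idf_N aliases and the match-all
query are illustrative choices, not anything from your setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: one request that returns idf(field, term) for many
// terms, each as an aliased pseudo-field on the single returned document.
class BatchIdfQuery {

    // Builds e.g. idf_0:idf(text,'solr'),idf_1:idf(text,'lucene').
    // Aliases are positional so that odd characters in a term never end up
    // in a field name; the caller maps idf_N back to terms.get(N).
    static String idfFieldList(String field, List<String> terms) {
        StringBuilder fl = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                fl.append(',');
            }
            // Assumption: backslash-escaping is honoured inside quoted arguments.
            String escaped = terms.get(i).replace("\\", "\\\\").replace("'", "\\'");
            fl.append("idf_").append(i)
              .append(":idf(").append(field).append(",'").append(escaped).append("')");
        }
        return fl.toString();
    }

    // rows=1 because idf does not depend on the document; any one match will do.
    static String queryString(String field, List<String> terms) {
        try {
            return "q=" + URLEncoder.encode("*:*", "UTF-8")
                    + "&rows=1&fl=" + URLEncoder.encode(idfFieldList(field, terms), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(queryString("text", Arrays.asList("solr", "lucene")));
    }
}
```

For a long term list the URL can get large, so sending the request as a POST,
or splitting the list into a few batches, is probably sensible.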

Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

