lucene-solr-user mailing list archives

From Aki Balogh <...@marketmuse.com>
Subject Re: Does docValues impact termfreq ?
Date Sat, 24 Oct 2015 20:05:47 GMT
Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're doing
this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in part
because Solr splits the word count up by document and generates a large
response. We then receive that response and just sum it all up. I'm wondering
if there's a more direct way.
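
For concreteness, the client-side summing described above might look roughly like the sketch below. It assumes termfreq() is requested as a pseudo-field in `fl` and that the response uses Solr's standard JSON layout (`response.docs`). The alias `tf`, the field name `text`, and both helper names are made up for illustration and are not taken from our actual code.

```python
from typing import Any, Dict


def build_termfreq_params(field: str, term: str, query: str = "*:*",
                          rows: int = 1000) -> Dict[str, str]:
    """Build query params asking Solr for termfreq() per document.

    'tf' is an alias chosen for this sketch. The term is inlined
    naively, so a term containing a single quote would need escaping.
    """
    return {
        "q": query,
        "rows": str(rows),
        "wt": "json",
        "fl": f"id,tf:termfreq({field},'{term}')",
    }


def sum_termfreq(response: Dict[str, Any], key: str = "tf") -> int:
    """Sum the per-document term frequencies in a parsed JSON response."""
    docs = response.get("response", {}).get("docs", [])
    # Documents missing the pseudo-field are counted as zero.
    return sum(int(doc.get(key, 0)) for doc in docs)
```

The cost here grows with the number of matching documents, since every document contributes one entry to the response before anything is summed.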
On Oct 24, 2015 4:00 PM, "Upayavira" <uv@odoko.co.uk> wrote:

> Can you explain more what you are using TF for? Because it sounds rather
> like scoring. You could disable field norms and IDF and scoring would be
> mostly TF, no?
>
> Upayavira
>
> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > Thanks, let me think about that.
> >
> > We're using termfreq to get the TF score, but we don't know which term
> > we'll need the TF for. So we'd have to do a corpuswide summing of termfreq
> > for each potential term across all documents in the corpus. It seems like
> > it'd require some development work to compute that, and our code would be
> > fragile.
> >
> > Let me think about that more.
> >
> > It might make sense to just move to solrcloud, it's the right architectural
> > decision anyway.
> >
> >
> > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <uv@odoko.co.uk> wrote:
> >
> > > If you just want word length, then do work during indexing - index a
> > > field for the word length. Then, I believe you can do faceting - e.g.
> > > with the json faceting API I believe you can do a sum() calculation on
> > > a field rather than the more traditional count.
> > >
> > > Thinking aloud, there might be an easier way - index a field that is
> > > the same for all documents, and facet on it. Instead of counting the
> > > number of documents, calculate the sum() of your word count field.
> > >
> > > I *think* that should work.
> > >
> > > Upayavira
> > >
> > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > Hi Jack,
> > > >
> > > > I'm just using solr to get word count across a large number of
> > > > documents.
> > > >
> > > > It's somewhat non-standard, because we're ignoring relevance, but it
> > > > seems to work well for this use case otherwise.
> > > >
> > > > My understanding then is:
> > > > 1) since termfreq is pre-processed and fetched, there's no good way
> > > > to speed it up (except by caching earlier calculations)
> > > >
> > > > 2) there's no way to have solr sum up all of the termfreqs across all
> > > > documents in a search and just return one number for total termfreqs
> > > >
> > > >
> > > > Are these correct?
> > > >
> > > > Thanks,
> > > > Aki
> > > >
> > > >
> > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky <jack.krupansky@gmail.com> wrote:
> > > >
> > > > > That's what a normal query does - Lucene takes all the terms used in
> > > > > the query and sums them up for each document in the response,
> > > > > producing a single number, the score, for each document. That's the
> > > > > way Solr is designed to be used. You still haven't elaborated why you
> > > > > are trying to use Solr in a way other than it was intended.
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <aki@marketmuse.com> wrote:
> > > > >
> > > > > > Gotcha - that's disheartening.
> > > > > >
> > > > > > One idea: when I run termfreq, I get all of the termfreqs for each
> > > > > > document one-by-one.
> > > > > >
> > > > > > Is there a way to have solr sum it up before creating the request,
> > > > > > so I only receive one number in the response?
> > > > > >
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <uv@odoko.co.uk> wrote:
> > > > > >
> > > > > > > If you mean using the term frequency function query, then I'm not
> > > > > > > sure there's a huge amount you can do to improve performance.
> > > > > > >
> > > > > > > The term frequency is a number that is used often, so it is stored
> > > > > > > in the index pre-calculated. Perhaps, if your data is not changing,
> > > > > > > optimising your index would reduce it to one segment, and thus
> > > > > > > might ever so slightly speed the aggregation of term frequencies,
> > > > > > > but I doubt it'd make enough difference to make it worth doing.
> > > > > > >
> > > > > > > Upayavira
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > > Thanks, Jack. I did some more research and found similar results.
> > > > > > > >
> > > > > > > > In our application, we are making multiple (think: 50) concurrent
> > > > > > > > requests to calculate term frequency on a set of documents in
> > > > > > > > "real-time". The faster that results return, the better.
> > > > > > > >
> > > > > > > > Most of these requests are unique, so cache only helps slightly.
> > > > > > > >
> > > > > > > > This analysis is happening on a single solr instance.
> > > > > > > >
> > > > > > > > Other than moving to solr cloud and splitting out the processing
> > > > > > > > onto multiple servers, do you have any suggestions for what might
> > > > > > > > speed up termfreq at query time?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Aki
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky <jack.krupansky@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Term frequency applies only to the indexed terms of a tokenized
> > > > > > > > > field. DocValues is really just a copy of the original source
> > > > > > > > > text and is not tokenized into terms.
> > > > > > > > >
> > > > > > > > > Maybe you could explain how exactly you are using term frequency
> > > > > > > > > in function queries. More importantly, what is so "heavy" about
> > > > > > > > > your usage? Generally, moderate use of a feature is much more
> > > > > > > > > advisable than heavy usage, unless you don't care about
> > > > > > > > > performance.
> > > > > > > > >
> > > > > > > > > -- Jack Krupansky
> > > > > > > > >
> > > > > > > > > On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <aki@marketmuse.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > In our solr application, we use a Function Query (termfreq) very
> > > > > > > > > > heavily.
> > > > > > > > > >
> > > > > > > > > > Index time and disk space are not important, but we're looking to
> > > > > > > > > > improve performance on termfreq at query time.
> > > > > > > > > > I've been reading up on docValues. Would this be a way to improve
> > > > > > > > > > performance?
> > > > > > > > > >
> > > > > > > > > > I had read that Lucene uses Field Cache for Function Queries, so
> > > > > > > > > > performance may not be affected.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > And, any general suggestions for improving query performance on
> > > > > > > > > > Function Queries?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Aki
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
>
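
Upayavira's suggestion quoted above (having the JSON Facet API compute a sum() over a field instead of a count) could be sketched roughly as follows. This only builds the request parameters and reads the aggregate back out of a parsed response, without contacting Solr. The field name `word_count`, the key `total_words`, and both helper names are assumptions for illustration. Whether a function such as termfreq(...) can be placed inside sum() on a given Solr version is something to verify rather than something this sketch establishes.

```python
import json
from typing import Any, Dict


def build_sum_facet_params(field: str = "word_count",
                           query: str = "*:*",
                           key: str = "total_words") -> Dict[str, str]:
    """Build params that ask the JSON Facet API for sum(field).

    rows=0 because only the aggregate is wanted, not the documents.
    'word_count' is a hypothetical numeric field filled at index time.
    """
    facet = {key: f"sum({field})"}
    return {
        "q": query,
        "rows": "0",
        "wt": "json",
        "json.facet": json.dumps(facet),
    }


def read_total(response: Dict[str, Any], key: str = "total_words") -> float:
    """Read the aggregate from the 'facets' section of the response."""
    # Treat a missing key as zero.
    return float(response.get("facets", {}).get(key, 0))
```

If this works as hoped, the response carries one number per request rather than one entry per document, which is the "more direct way" asked about at the top of this message.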
