lucene-solr-user mailing list archives

From Aki Balogh <...@marketmuse.com>
Subject Re: Does docValues impact termfreq ?
Date Sat, 24 Oct 2015 20:28:33 GMT
Certainly, yes. I'm just doing a word count, i.e. how often does a specific
term come up in the corpus?
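
For reference, here is a minimal sketch in Python of the kind of
single-number request I'm after. It assumes the StatsComponent accepts a
function through stats.field={!func}... (I believe recent Solr 5.x releases
do, but I have not checked every version). The field name "text", the term
"memory" and both helper names are made up for illustration.

```python
from urllib.parse import urlencode


def termfreq_sum_params(field, term, query="*:*"):
    """Build query parameters asking Solr for one aggregated termfreq sum.

    rows=0 suppresses the per-document list; the stats section of the
    response would then carry the sum of termfreq(field, term) over all
    documents matching `query` (assumption: stats over functions is enabled).
    """
    # Escape backslashes and single quotes so the term stays one literal.
    escaped = term.replace("\\", "\\\\").replace("'", "\\'")
    return {
        "q": query,
        "rows": 0,
        "stats": "true",
        "stats.field": "{!func}termfreq(%s,'%s')" % (field, escaped),
        "wt": "json",
    }


def sum_per_document(docs, key="tf"):
    """The current approach: add up per-document termfreq values client-side."""
    return sum(doc.get(key, 0) for doc in docs)


# Hypothetical field and term, for illustration only.
params = termfreq_sum_params("text", "memory")
print(urlencode(params))
```

If the count is needed over the whole index rather than a filtered result
set, the totaltermfreq (ttf) function query may be a more direct route. I
have not measured either option.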
On Oct 24, 2015 4:20 PM, "Upayavira" <uv@odoko.co.uk> wrote:

> Yes, but what do you want to do with the TF? What problem are you
> solving with it? If you are able to share that...
>
> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> > Yes, sorry, I am not being clear.
> >
> > We are not even doing scoring, just getting the raw TF values. We're
> > doing this in Solr because it can scale well.
> >
> > But with large corpora, retrieving the word counts takes some time, in
> > part because Solr is splitting up the word count by document and
> > generating a large response. We then get the response and just sum it
> > all up. I'm wondering if there's a more direct way.
> > On Oct 24, 2015 4:00 PM, "Upayavira" <uv@odoko.co.uk> wrote:
> >
> > > Can you explain more what you are using TF for? Because it sounds
> > > rather like scoring. You could disable field norms and IDF and
> > > scoring would be mostly TF, no?
> > >
> > > Upayavira
> > >
> > > On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > > > Thanks, let me think about that.
> > > >
> > > > We're using termfreq to get the TF score, but we don't know which
> > > > term we'll need the TF for. So we'd have to do a corpuswide summing
> > > > of termfreq for each potential term across all documents in the
> > > > corpus. It seems like it'd require some development work to compute
> > > > that, and our code would be fragile.
> > > >
> > > > Let me think about that more.
> > > >
> > > > It might make sense to just move to solrcloud, it's the right
> > > > architectural decision anyway.
> > > >
> > > >
> > > > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <uv@odoko.co.uk> wrote:
> > > >
> > > > > If you just want word length, then do work during indexing - index
> > > > > a field for the word length. Then, I believe you can do faceting -
> > > > > e.g. with the json faceting API I believe you can do a sum()
> > > > > calculation on a field rather than the more traditional count.
> > > > >
> > > > > Thinking aloud, there might be an easier way - index a field that
> > > > > is the same for all documents, and facet on it. Instead of counting
> > > > > the number of documents, calculate the sum() of your word count
> > > > > field.
> > > > >
> > > > > I *think* that should work.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > > > Hi Jack,
> > > > > >
> > > > > > I'm just using solr to get word count across a large number of
> > > > > > documents.
> > > > > >
> > > > > > It's somewhat non-standard, because we're ignoring relevance, but
> > > > > > it seems to work well for this use case otherwise.
> > > > > >
> > > > > > My understanding then is:
> > > > > > 1) since termfreq is pre-processed and fetched, there's no good
> > > > > > way to speed it up (except by caching earlier calculations)
> > > > > >
> > > > > > 2) there's no way to have solr sum up all of the termfreqs across
> > > > > > all documents in a search and just return one number for total
> > > > > > termfreqs
> > > > > >
> > > > > >
> > > > > > Are these correct?
> > > > > >
> > > > > > Thanks,
> > > > > > Aki
> > > > > >
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> > > > > > <jack.krupansky@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > That's what a normal query does - Lucene takes all the terms
> > > > > > > used in the query and sums them up for each document in the
> > > > > > > response, producing a single number, the score, for each
> > > > > > > document. That's the way Solr is designed to be used. You still
> > > > > > > haven't elaborated why you are trying to use Solr in a way
> > > > > > > other than it was intended.
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <aki@marketmuse.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Gotcha - that's disheartening.
> > > > > > > >
> > > > > > > > One idea: when I run termfreq, I get all of the termfreqs for
> > > > > > > > each document one-by-one.
> > > > > > > >
> > > > > > > > Is there a way to have Solr sum it up before creating the
> > > > > > > > response, so I only receive one number back?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <uv@odoko.co.uk>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > If you mean using the term frequency function query, then
> > > > > > > > > I'm not sure there's a huge amount you can do to improve
> > > > > > > > > performance.
> > > > > > > > >
> > > > > > > > > The term frequency is a number that is used often, so it is
> > > > > > > > > stored in the index pre-calculated. Perhaps, if your data is
> > > > > > > > > not changing, optimising your index would reduce it to one
> > > > > > > > > segment, and thus might ever so slightly speed the
> > > > > > > > > aggregation of term frequencies, but I doubt it'd make
> > > > > > > > > enough difference to make it worth doing.
> > > > > > > > >
> > > > > > > > > Upayavira
> > > > > > > > >
> > > > > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > > > > Thanks, Jack. I did some more research and found similar
> > > > > > > > > > results.
> > > > > > > > > >
> > > > > > > > > > In our application, we are making multiple (think: 50)
> > > > > > > > > > concurrent requests to calculate term frequency on a set
> > > > > > > > > > of documents in "real-time". The faster that results
> > > > > > > > > > return, the better.
> > > > > > > > > >
> > > > > > > > > > Most of these requests are unique, so cache only helps
> > > > > > > > > > slightly.
> > > > > > > > > >
> > > > > > > > > > This analysis is happening on a single solr instance.
> > > > > > > > > >
> > > > > > > > > > Other than moving to solr cloud and splitting out the
> > > > > > > > > > processing onto multiple servers, do you have any
> > > > > > > > > > suggestions for what might speed up termfreq at query
> > > > > > > > > > time?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Aki
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
> > > > > > > > > > <jack.krupansky@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Term frequency applies only to the indexed terms of a
> > > > > > > > > > > tokenized field. DocValues is really just a copy of the
> > > > > > > > > > > original source text and is not tokenized into terms.
> > > > > > > > > > >
> > > > > > > > > > > Maybe you could explain how exactly you are using term
> > > > > > > > > > > frequency in function queries. More importantly, what is
> > > > > > > > > > > so "heavy" about your usage? Generally, moderate use of a
> > > > > > > > > > > feature is much more advisable than heavy usage, unless
> > > > > > > > > > > you don't care about performance.
> > > > > > > > > > >
> > > > > > > > > > > -- Jack Krupansky
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh
> > > > > > > > > > > <aki@marketmuse.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > In our solr application, we use a Function Query
> > > > > > > > > > > > (termfreq) very heavily.
> > > > > > > > > > > >
> > > > > > > > > > > > Index time and disk space are not important, but we're
> > > > > > > > > > > > looking to improve performance on termfreq at query
> > > > > > > > > > > > time. I've been reading up on docValues. Would this be
> > > > > > > > > > > > a way to improve performance?
> > > > > > > > > > > >
> > > > > > > > > > > > I had read that Lucene uses Field Cache for Function
> > > > > > > > > > > > Queries, so performance may not be affected.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > And, any general suggestions for improving query
> > > > > > > > > > > > performance on Function Queries?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Aki
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
>