lucene-solr-user mailing list archives

From Aki Balogh <...@marketmuse.com>
Subject Re: Does docValues impact termfreq ?
Date Mon, 26 Oct 2015 13:43:19 GMT
Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
emir.arnautovic@sematext.com> wrote:

> If I got it right, you are using term query, use function to get TF as
> score, iterate all documents in results and sum up total number of
> occurrences of specific term in index? Is this only way you use index or
> this is side functionality?
>
> Thanks,
> Emir
>
>
> On 24.10.2015 22:28, Aki Balogh wrote:
>
>> Certainly, yes. I'm just doing a word count, ie how often does a specific
>> term come up in the corpus?
>> On Oct 24, 2015 4:20 PM, "Upayavira" <uv@odoko.co.uk> wrote:
>>
>>> Yes, but what do you want to do with the TF? What problem are you
>>> solving with it? If you are able to share that...
>>>
>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>
>>>> Yes, sorry, I am not being clear.
>>>>
>>>> We are not even doing scoring, just getting the raw TF values. We're doing
>>>> this in solr because it can scale well.
>>>>
>>>> But with large corpora, retrieving the word counts takes some time, in part
>>>> because solr is splitting up word count by document and generating a large
>>>> request. We then get the request and just sum it all up. I'm wondering if
>>>> there's a more direct way.
>>>> On Oct 24, 2015 4:00 PM, "Upayavira" <uv@odoko.co.uk> wrote:
>>>>
>>>>> Can you explain more what you are using TF for? Because it sounds rather
>>>>> like scoring. You could disable field norms and IDF and scoring would be
>>>>> mostly TF, no?
>>>>>
>>>>> Upayavira
>>>>>
>>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>>>
>>>>>> Thanks, let me think about that.
>>>>>>
>>>>>> We're using termfreq to get the TF score, but we don't know which term
>>>>>> we'll need the TF for. So we'd have to do a corpuswide summing of termfreq
>>>>>> for each potential term across all documents in the corpus. It seems like
>>>>>> it'd require some development work to compute that, and our code would be
>>>>>> fragile.
>>>>>>
>>>>>> Let me think about that more.
>>>>>>
>>>>>> It might make sense to just move to solrcloud, it's the right architectural
>>>>>> decision anyway.
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <uv@odoko.co.uk> wrote:
>>>>>>
>>>>>>> If you just want word length, then do work during indexing - index a
>>>>>>> field for the word length. Then, I believe you can do faceting - e.g.
>>>>>>> with the json faceting API I believe you can do a sum() calculation on a
>>>>>>> field rather than the more traditional count.
>>>>>>>
>>>>>>> Thinking aloud, there might be an easier way - index a field that is the
>>>>>>> same for all documents, and facet on it. Instead of counting the number
>>>>>>> of documents, calculate the sum() of your word count field.
>>>>>>>
>>>>>>> I *think* that should work.
>>>>>>>
>>>>>>> Upayavira
>>>>>>>
>>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>>>>>
>>>>>>>> Hi Jack,
>>>>>>>>
>>>>>>>> I'm just using solr to get word count across a large number of documents.
>>>>>>>>
>>>>>>>> It's somewhat non-standard, because we're ignoring relevance, but it seems
>>>>>>>> to work well for this use case otherwise.
>>>>>>>>
>>>>>>>> My understanding then is:
>>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good way to
>>>>>>>> speed it up (except by caching earlier calculations)
>>>>>>>>
>>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across all
>>>>>>>> documents in a search and just return one number for total termfreqs
>>>>>>>>
>>>>>>>> Are these correct?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aki
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
>>>>>>>> <jack.krupansky@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That's what a normal query does - Lucene takes all the terms used in the
>>>>>>>>> query and sums them up for each document in the response, producing a
>>>>>>>>> single number, the score, for each document. That's the way Solr is
>>>>>>>>> designed to be used. You still haven't elaborated why you are trying to use
>>>>>>>>> Solr in a way other than it was intended.
>>>>>>>>>
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <aki@marketmuse.com> wrote:
>>>>>>>>>
>>>>>>>>>> Gotcha - that's disheartening.
>>>>>>>>>>
>>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for each document
>>>>>>>>>> one-by-one.
>>>>>>>>>>
>>>>>>>>>> Is there a way to have solr sum it up before creating the request, so I
>>>>>>>>>> only receive one number in the response?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <uv@odoko.co.uk> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you mean using the term frequency function query, then I'm not sure
>>>>>>>>>>> there's a huge amount you can do to improve performance.
>>>>>>>>>>>
>>>>>>>>>>> The term frequency is a number that is used often, so it is stored in
>>>>>>>>>>> the index pre-calculated. Perhaps, if your data is not changing,
>>>>>>>>>>> optimising your index would reduce it to one segment, and thus might
>>>>>>>>>>> ever so slightly speed the aggregation of term frequencies, but I doubt
>>>>>>>>>>> it'd make enough difference to make it worth doing.
>>>>>>>>>>>
>>>>>>>>>>> Upayavira
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Jack. I did some more research and found similar results.
>>>>>>>>>>>>
>>>>>>>>>>>> In our application, we are making multiple (think: 50) concurrent requests
>>>>>>>>>>>> to calculate term frequency on a set of documents in "real-time". The
>>>>>>>>>>>> faster that results return, the better.
>>>>>>>>>>>>
>>>>>>>>>>>> Most of these requests are unique, so cache only helps slightly.
>>>>>>>>>>>>
>>>>>>>>>>>> This analysis is happening on a single solr instance.
>>>>>>>>>>>>
>>>>>>>>>>>> Other than moving to solr cloud and splitting out the processing onto
>>>>>>>>>>>> multiple servers, do you have any suggestions for what might speed up
>>>>>>>>>>>> termfreq at query time?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Aki
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
>>>>>>>>>>>> <jack.krupansky@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Term frequency applies only to the indexed terms of a tokenized field.
>>>>>>>>>>>>> DocValues is really just a copy of the original source text and is not
>>>>>>>>>>>>> tokenized into terms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe you could explain how exactly you are using term frequency in
>>>>>>>>>>>>> function queries. More importantly, what is so "heavy" about your usage?
>>>>>>>>>>>>> Generally, moderate use of a feature is much more advisable to heavy usage,
>>>>>>>>>>>>> unless you don't care about performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <aki@marketmuse.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq) very heavily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Index time and disk space are not important, but we're looking to improve
>>>>>>>>>>>>>> performance on termfreq at query time.
>>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to improve
>>>>>>>>>>>>>> performance?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function Queries, so
>>>>>>>>>>>>>> performance may not be affected.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, any general suggestions for improving query performance on Function
>>>>>>>>>>>>>> Queries?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Aki
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
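
The workflow discussed in this thread (fetch termfreq per document, then sum on the client) can be sketched roughly as follows. This is only an illustration under assumptions, not anyone's actual code from the thread: the field name `text`, the pseudo-field names `tf` and `total`, the `rows` value, and the helper function names are all hypothetical. The second helper builds a request using Solr's `totaltermfreq` function query (alias `ttf`), which reports a term's count over the whole index rather than over a query's result set, so it would only fit the "whole corpus" case.

```python
from typing import Any, Dict, Iterable


def per_doc_termfreq_params(field: str, term: str, query: str = "*:*",
                            rows: int = 1000) -> Dict[str, Any]:
    """Build /select parameters that return termfreq per document.

    The function query result is exposed as the pseudo-field `tf`
    (a hypothetical name chosen for this sketch). The term is assumed
    to contain no single quotes; no escaping is attempted here.
    """
    return {
        "q": query,
        "fl": f"id,tf:termfreq({field},'{term}')",
        "rows": rows,
        "wt": "json",
    }


def sum_termfreq(docs: Iterable[Dict[str, Any]]) -> int:
    """Client-side sum of the per-document `tf` values, as described in the thread."""
    return sum(doc.get("tf", 0) for doc in docs)


def index_wide_params(field: str, term: str) -> Dict[str, Any]:
    """Build /select parameters that ask for the index-wide count in one value.

    totaltermfreq is evaluated per returned document but has the same value
    for every document, so a single row is enough.
    """
    return {
        "q": "*:*",
        "fl": f"total:totaltermfreq({field},'{term}')",
        "rows": 1,
        "wt": "json",
    }


if __name__ == "__main__":
    # Stand-in for the `response.docs` list of a Solr JSON response.
    fake_docs = [{"id": "a", "tf": 3}, {"id": "b", "tf": 0}, {"id": "c", "tf": 5}]
    print(per_doc_termfreq_params("text", "solr")["fl"])
    print(sum_termfreq(fake_docs))
    print(index_wide_params("text", "solr")["fl"])
```

The dictionaries would be passed as query parameters to a collection's `/select` handler with whatever HTTP client is in use; the sketch deliberately stops short of making a network call.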
