Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <562E2B28.2070907@sematext.com>
References: 
 <CAAbn9SaZWD4OuvDsKpHoNr7_ROghKOA6DNvuMn5ECNzcEGbtBg@mail.gmail.com>
 <CAOxAL62WdJDrtmbEhTr7Pz_AOepHgaWY+fHRyMCTWRi8SSE-YA@mail.gmail.com>
 <CAAbn9SYkPBsM7cHDbE0-u-G3i6_7_Edxyg5HHjG199zf92fWSg@mail.gmail.com>
 <1445699139.1296465.419124889.739758DB@webmail.messagingengine.com>
 <CAAbn9SbL6NpnF0whKkZCH4+pPAN3JE4g0M+-9AsND5iLs6Ytqg@mail.gmail.com>
 <CAOxAL61AW-rhU9jofO86eSU6WUDJ92ovQaW3pSRoonBrtUFEig@mail.gmail.com>
 <CAAbn9SZi95_DGfHnq6neYfBxqjF0=iJZ2JggqDJrPEXeGejHFA@mail.gmail.com>
 <1445709278.1336707.419204305.328373DD@webmail.messagingengine.com>
 <CAAbn9Sa8WUt8oBS70gsn+xcXfS8axUrz-VzB8P=f3LOBbw6gVg@mail.gmail.com>
 <1445716827.1365742.419257273.65A3A7A9@webmail.messagingengine.com>
 <CAAbn9SZ6Nw-87GaHRJFXTwsuRucqwz_Hrc3uq5mvAi66QirLww@mail.gmail.com>
 <1445718027.1370602.419266065.4D50B719@webmail.messagingengine.com>
 <CAAbn9SZLnDiv22ONkE+i32m0Y3V6VtcOX2CM+hdyormqLot5BA@mail.gmail.com>
 <562E2B28.2070907@sematext.com>
From: Aki Balogh <aki@marketmuse.com>
Date: Mon, 26 Oct 2015 09:43:19 -0400
Message-ID: 
 <CAAbn9SYhMEVJO0MhMViNQcXMocwYu7TURimYoOwotFyvpWCxig@mail.gmail.com>
Subject: Re: Does docValues impact termfreq ?
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a113d59aa3da51a05230227f7

--001a113d59aa3da51a05230227f7
Content-Type: text/plain; charset=UTF-8

Hi Emir,

This is correct. This is the only way we use the index.
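
For anyone following the thread, here is a minimal sketch of the flow described in the quoted message below, as I understand it: ask Solr to return termfreq() as a pseudo-field for each matching document, then sum the values on the client. The field name `text`, the term `solr`, the `tf` alias, and both helper names are assumptions made for illustration, and the sample response is invented. Nothing here contacts a real Solr instance, and no query escaping is attempted.

```python
from typing import Any, Dict


def build_termfreq_params(field: str, term: str, rows: int = 1000) -> Dict[str, str]:
    """Build query parameters that return termfreq() as a pseudo-field.

    Hypothetical helper: `field` and `term` are inserted as-is, so special
    query characters are not escaped here.
    """
    return {
        "q": f"{field}:{term}",                      # plain term query
        "fl": f"id,tf:termfreq({field},'{term}')",   # alias termfreq() as "tf"
        "rows": str(rows),
        "wt": "json",
    }


def sum_term_frequencies(response: Dict[str, Any]) -> int:
    """Sum the per-document "tf" values from a parsed Solr JSON response."""
    docs = response.get("response", {}).get("docs", [])
    return sum(int(doc.get("tf", 0)) for doc in docs)


if __name__ == "__main__":
    params = build_termfreq_params("text", "solr")
    print(params["fl"])  # id,tf:termfreq(text,'solr')

    # Invented sample response, shaped like Solr's JSON output.
    sample = {
        "response": {
            "numFound": 3,
            "docs": [
                {"id": "a", "tf": 3},
                {"id": "b", "tf": 1},
                {"id": "c", "tf": 5},
            ],
        }
    }
    print(sum_term_frequencies(sample))  # 9
```

The cost we are seeing comes from the second step: every matching document has to be returned just so the client can add up one number per document.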

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
emir.arnautovic@sematext.com> wrote:

> If I got it right, you run a term query, use a function to get TF as the
> score, iterate over all documents in the results, and sum up the total
> number of occurrences of a specific term in the index? Is this the only
> way you use the index, or is it a side functionality?
>
> Thanks,
> Emir
>
>
> On 24.10.2015 22:28, Aki Balogh wrote:
>
>> Certainly, yes. I'm just doing a word count, i.e., how often does a
>> specific term come up in the corpus?
>>
>> On Oct 24, 2015 4:20 PM, "Upayavira" <uv@odoko.co.uk> wrote:
>>
>>> Yes, but what do you want to do with the TF? What problem are you
>>> solving with it? If you are able to share that...
>>>
>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>
>>>> Yes, sorry, I am not being clear.
>>>>
>>>> We are not even doing scoring, just getting the raw TF values. We're
>>>> doing this in solr because it can scale well.
>>>>
>>>> But with large corpora, retrieving the word counts takes some time, in
>>>> part because solr is splitting up word count by document and generating
>>>> a large response. We then get the response and just sum it all up. I'm
>>>> wondering if there's a more direct way.
>>>>
>>>> On Oct 24, 2015 4:00 PM, "Upayavira" <uv@odoko.co.uk> wrote:
>>>>
>>>>> Can you explain more what you are using TF for? Because it sounds
>>>>> rather like scoring. You could disable field norms and IDF and scoring
>>>>> would be mostly TF, no?
>>>>>
>>>>> Upayavira
>>>>>
>>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>>>
>>>>>> Thanks, let me think about that.
>>>>>>
>>>>>> We're using termfreq to get the TF score, but we don't know which
>>>>>> term we'll need the TF for. So we'd have to do a corpuswide summing
>>>>>> of termfreq for each potential term across all documents in the
>>>>>> corpus. It seems like it'd require some development work to compute
>>>>>> that, and our code would be fragile.
>>>>>>
>>>>>> Let me think about that more.
>>>>>>
>>>>>> It might make sense to just move to solrcloud, it's the right
>>>>>> architectural decision anyway.
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <uv@odoko.co.uk> wrote:
>>>>>>
>>>>>>> If you just want word length, then do work during indexing - index
>>>>>>> a field for the word length. Then, I believe you can do faceting -
>>>>>>> e.g. with the json faceting API I believe you can do a sum()
>>>>>>> calculation on a field rather than the more traditional count.
>>>>>>>
>>>>>>> Thinking aloud, there might be an easier way - index a field that is
>>>>>>> the same for all documents, and facet on it. Instead of counting the
>>>>>>> number of documents, calculate the sum() of your word count field.
>>>>>>>
>>>>>>> I *think* that should work.
>>>>>>>
>>>>>>> Upayavira
>>>>>>>
>>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>>>>>
>>>>>>>> Hi Jack,
>>>>>>>>
>>>>>>>> I'm just using solr to get word count across a large number of
>>>>>>>> documents. It's somewhat non-standard, because we're ignoring
>>>>>>>> relevance, but it seems to work well for this use case otherwise.
>>>>>>>>
>>>>>>>> My understanding then is:
>>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good way
>>>>>>>> to speed it up (except by caching earlier calculations)
>>>>>>>>
>>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across
>>>>>>>> all documents in a search and just return one number for total
>>>>>>>> termfreqs
>>>>>>>>
>>>>>>>> Are these correct?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aki
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
>>>>>>>> <jack.krupansky@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That's what a normal query does - Lucene takes all the terms used
>>>>>>>>> in the query and sums them up for each document in the response,
>>>>>>>>> producing a single number, the score, for each document. That's
>>>>>>>>> the way Solr is designed to be used. You still haven't elaborated
>>>>>>>>> why you are trying to use Solr in a way other than it was
>>>>>>>>> intended.
>>>>>>>>>
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <aki@marketmuse.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Gotcha - that's disheartening.
>>>>>>>>>>
>>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for
>>>>>>>>>> each document one-by-one.
>>>>>>>>>>
>>>>>>>>>> Is there a way to have solr sum it up before creating the
>>>>>>>>>> response, so I only receive one number?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <uv@odoko.co.uk>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you mean using the term frequency function query, then I'm
>>>>>>>>>>> not sure there's a huge amount you can do to improve
>>>>>>>>>>> performance.
>>>>>>>>>>>
>>>>>>>>>>> The term frequency is a number that is used often, so it is
>>>>>>>>>>> stored in the index pre-calculated. Perhaps, if your data is not
>>>>>>>>>>> changing, optimising your index would reduce it to one segment,
>>>>>>>>>>> and thus might ever so slightly speed the aggregation of term
>>>>>>>>>>> frequencies, but I doubt it'd make enough difference to make it
>>>>>>>>>>> worth doing.
>>>>>>>>>>>
>>>>>>>>>>> Upayavira
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Jack. I did some more research and found similar
>>>>>>>>>>>> results.
>>>>>>>>>>>>
>>>>>>>>>>>> In our application, we are making multiple (think: 50)
>>>>>>>>>>>> concurrent requests to calculate term frequency on a set of
>>>>>>>>>>>> documents in "real-time". The faster that results return, the
>>>>>>>>>>>> better.
>>>>>>>>>>>>
>>>>>>>>>>>> Most of these requests are unique, so cache only helps
>>>>>>>>>>>> slightly.
>>>>>>>>>>>>
>>>>>>>>>>>> This analysis is happening on a single solr instance.
>>>>>>>>>>>>
>>>>>>>>>>>> Other than moving to solr cloud and splitting out the
>>>>>>>>>>>> processing onto multiple servers, do you have any suggestions
>>>>>>>>>>>> for what might speed up termfreq at query time?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Aki
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
>>>>>>>>>>>> <jack.krupansky@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Term frequency applies only to the indexed terms of a
>>>>>>>>>>>>> tokenized field. DocValues is really just a copy of the
>>>>>>>>>>>>> original source text and is not tokenized into terms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe you could explain how exactly you are using term
>>>>>>>>>>>>> frequency in function queries. More importantly, what is so
>>>>>>>>>>>>> "heavy" about your usage? Generally, moderate use of a feature
>>>>>>>>>>>>> is much more advisable than heavy usage, unless you don't care
>>>>>>>>>>>>> about performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh
>>>>>>>>>>>>> <aki@marketmuse.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq)
>>>>>>>>>>>>>> very heavily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Index time and disk space are not important, but we're
>>>>>>>>>>>>>> looking to improve performance on termfreq at query time.
>>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to
>>>>>>>>>>>>>> improve performance?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function Queries,
>>>>>>>>>>>>>> so performance may not be affected.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, any general suggestions for improving query performance
>>>>>>>>>>>>>> on Function Queries?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Aki
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>

--001a113d59aa3da51a05230227f7--