From: Aki Balogh
Date: Mon, 26 Oct 2015 09:43:19 -0400
Subject: Re: Does docValues impact termfreq ?
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
In-Reply-To: <562E2B28.2070907@sematext.com>
References: <1445699139.1296465.419124889.739758DB@webmail.messagingengine.com>
 <1445709278.1336707.419204305.328373DD@webmail.messagingengine.com>
 <1445716827.1365742.419257273.65A3A7A9@webmail.messagingengine.com>
 <1445718027.1370602.419266065.4D50B719@webmail.messagingengine.com>
 <562E2B28.2070907@sematext.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a113d59aa3da51a05230227f7

--001a113d59aa3da51a05230227f7
Content-Type: text/plain; charset=UTF-8

Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <emir.arnautovic@sematext.com> wrote:

> If I got it right, you are using term query, use function to get TF as
> score, iterate all documents in results and sum up total number of
> occurrences of specific term in index? Is this only way you use index or
> this is side functionality?
>
> Thanks,
> Emir
>
> On 24.10.2015 22:28, Aki Balogh wrote:
>
>> Certainly, yes. I'm just doing a word count, ie how often does a specific
>> term come up in the corpus?
>>
>> On Oct 24, 2015 4:20 PM, "Upayavira" wrote:
>>
>>> yes, but what do you want to do with the TF? What problem are you
>>> solving with it? If you are able to share that...
>>>
>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>
>>>> Yes, sorry, I am not being clear.
>>>>
>>>> We are not even doing scoring, just getting the raw TF values. We're
>>>> doing this in solr because it can scale well.
>>>>
>>>> But with large corpora, retrieving the word counts takes some time, in
>>>> part because solr is splitting up word count by document and generating
>>>> a large request. We then get the request and just sum it all up. I'm
>>>> wondering if there's a more direct way.
>>>>
>>>> On Oct 24, 2015 4:00 PM, "Upayavira" wrote:
>>>>
>>>>> Can you explain more what you are using TF for? Because it sounds
>>>>> rather like scoring. You could disable field norms and IDF and scoring
>>>>> would be mostly TF, no?
>>>>>
>>>>> Upayavira
>>>>>
>>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>>>
>>>>>> Thanks, let me think about that.
>>>>>>
>>>>>> We're using termfreq to get the TF score, but we don't know which
>>>>>> term we'll need the TF for. So we'd have to do a corpuswide summing
>>>>>> of termfreq for each potential term across all documents in the
>>>>>> corpus. It seems like it'd require some development work to compute
>>>>>> that, and our code would be fragile.
>>>>>>
>>>>>> Let me think about that more.
>>>>>>
>>>>>> It might make sense to just move to solrcloud, it's the right
>>>>>> architectural decision anyway.
>>>>>>
>>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira wrote:
>>>>>>
>>>>>>> If you just want word length, then do work during indexing - index
>>>>>>> a field for the word length. Then, I believe you can do faceting -
>>>>>>> e.g. with the json faceting API I believe you can do a sum()
>>>>>>> calculation on a field rather than the more traditional count.
>>>>>>>
>>>>>>> Thinking aloud, there might be an easier way - index a field that
>>>>>>> is the same for all documents, and facet on it. Instead of counting
>>>>>>> the number of documents, calculate the sum() of your word count
>>>>>>> field.
>>>>>>>
>>>>>>> I *think* that should work.
>>>>>>>
>>>>>>> Upayavira
>>>>>>>
>>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>>>>>
>>>>>>>> Hi Jack,
>>>>>>>>
>>>>>>>> I'm just using solr to get word count across a large number of
>>>>>>>> documents. It's somewhat non-standard, because we're ignoring
>>>>>>>> relevance, but it seems to work well for this use case otherwise.
>>>>>>>>
>>>>>>>> My understanding then is:
>>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good
>>>>>>>> way to speed it up (except by caching earlier calculations)
>>>>>>>>
>>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across
>>>>>>>> all documents in a search and just return one number for total
>>>>>>>> termfreqs
>>>>>>>>
>>>>>>>> Are these correct?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aki
>>>>>>>>
>>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky wrote:
>>>>>>>>
>>>>>>>>> That's what a normal query does - Lucene takes all the terms used
>>>>>>>>> in the query and sums them up for each document in the response,
>>>>>>>>> producing a single number, the score, for each document. That's
>>>>>>>>> the way Solr is designed to be used. You still haven't elaborated
>>>>>>>>> why you are trying to use Solr in a way other than it was
>>>>>>>>> intended.
>>>>>>>>>
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <aki@marketmuse.com> wrote:
>>>>>>>>>
>>>>>>>>>> Gotcha - that's disheartening.
>>>>>>>>>>
>>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for
>>>>>>>>>> each document one-by-one.
>>>>>>>>>>
>>>>>>>>>> Is there a way to have solr sum it up before creating the
>>>>>>>>>> request, so I only receive one number in the response?
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira wrote:
>>>>>>>>>>
>>>>>>>>>>> If you mean using the term frequency function query, then I'm
>>>>>>>>>>> not sure there's a huge amount you can do to improve
>>>>>>>>>>> performance.
>>>>>>>>>>>
>>>>>>>>>>> The term frequency is a number that is used often, so it is
>>>>>>>>>>> stored in the index pre-calculated. Perhaps, if your data is
>>>>>>>>>>> not changing, optimising your index would reduce it to one
>>>>>>>>>>> segment, and thus might ever so slightly speed the aggregation
>>>>>>>>>>> of term frequencies, but I doubt it'd make enough difference to
>>>>>>>>>>> make it worth doing.
>>>>>>>>>>>
>>>>>>>>>>> Upayavira
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Jack. I did some more research and found similar
>>>>>>>>>>>> results.
>>>>>>>>>>>>
>>>>>>>>>>>> In our application, we are making multiple (think: 50)
>>>>>>>>>>>> concurrent requests to calculate term frequency on a set of
>>>>>>>>>>>> documents in "real-time". The faster that results return, the
>>>>>>>>>>>> better.
>>>>>>>>>>>>
>>>>>>>>>>>> Most of these requests are unique, so cache only helps
>>>>>>>>>>>> slightly.
>>>>>>>>>>>>
>>>>>>>>>>>> This analysis is happening on a single solr instance.
>>>>>>>>>>>>
>>>>>>>>>>>> Other than moving to solr cloud and splitting out the
>>>>>>>>>>>> processing onto multiple servers, do you have any suggestions
>>>>>>>>>>>> for what might speed up termfreq at query time?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Aki
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Term frequency applies only to the indexed terms of a
>>>>>>>>>>>>> tokenized field.
>>>>>>>>>>>>> DocValues is really just a copy of the original source text
>>>>>>>>>>>>> and is not tokenized into terms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe you could explain how exactly you are using term
>>>>>>>>>>>>> frequency in function queries. More importantly, what is so
>>>>>>>>>>>>> "heavy" about your usage? Generally, moderate use of a
>>>>>>>>>>>>> feature is much more advisable to heavy usage, unless you
>>>>>>>>>>>>> don't care about performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <aki@marketmuse.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq)
>>>>>>>>>>>>>> very heavily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Index time and disk space are not important, but we're
>>>>>>>>>>>>>> looking to improve performance on termfreq at query time.
>>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to
>>>>>>>>>>>>>> improve performance?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function
>>>>>>>>>>>>>> Queries, so performance may not be affected.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, any general suggestions for improving query
>>>>>>>>>>>>>> performance on Function Queries?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Aki
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/

--001a113d59aa3da51a05230227f7--
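
For reference, here is a minimal sketch of Upayavira's suggestion in the quoted thread: index a per-document count field, then ask Solr for its sum() through the JSON Facet API, so that one number comes back instead of one value per document. The field name word_count_i, the facet label total_count, and both helper functions are hypothetical and not from the thread. The sketch only builds the request body and reads the decoded reply; it assumes the body would be POSTed as JSON to the collection's /query handler, and it does not address the case where the term is not known at index time.

```python
import json


def build_sum_facet_request(query, count_field):
    """Build a JSON request body asking Solr for a single summed value.

    count_field is a hypothetical numeric field holding a precomputed
    per-document count.
    """
    return {
        "query": query,
        "limit": 0,  # ask for no documents, only the aggregate
        "facet": {"total_count": "sum({})".format(count_field)},
    }


def extract_total(response):
    """Read the aggregate from a decoded response; 0 if it is absent."""
    return response.get("facets", {}).get("total_count", 0)


if __name__ == "__main__":
    body = build_sum_facet_request("*:*", "word_count_i")
    print(json.dumps(body))
```

The point of this shape is that the summing happens inside Solr, so the client no longer has to iterate over every matching document and add up the values itself.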