From: Erick Erickson
Date: Tue, 7 Nov 2017 07:42:14 -0800
Subject: Re: Faceting Word Count
To: solr-user

bq: 10k as a max number of rows.

This doesn't matter. In order to facet on the word count, Solr has to
be prepared to facet on all possible docs. For all Solr knows, a
_single_ document may contain every word, so the structure that holds
the counters has to be sized for N buckets, where N is the total
number of distinct words in the entire corpus.

I think you'll really have to find an alternative approach, or somehow
restrict the choices.

Best,
Erick

On Tue, Nov 7, 2017 at 12:26 AM, Wael Kader wrote:

> Hi,
>
> The whole index has 100M documents, but when I add the criteria, it will
> filter the data to maybe 10k as a max number of rows.
> The facet isn't working when the total number of records in the index is
> 100M but it was working at 5M.
>
> I have social media & RSS data in the index and I am trying to get the word
> count for a specific user on specific date intervals.
>
> Regards,
> Wael
>
> On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson wrote:
>
>> _Why_ do you want to get the word counts? Faceting on all of the
>> tokens for 100M docs isn't something Solr is ordinarily used for. As
>> Emir says it'll take a huge amount of memory. You can use one of the
>> function queries (termfreq IIRC) that will give you the count of any
>> individual term you have and will be very fast.
>>
>> But getting all of the word counts in the index is probably not
>> something I'd use Solr for.
>>
>> This may be an XY problem: you're asking how to do something specific
>> (X) without explaining what the problem you're trying to solve is (Y).
>> Perhaps there's another way to accomplish (Y) if we knew more about
>> what it is.
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović wrote:
>> > Hi Wael,
>> > You are faceting on an analyzed field. This results in the field being
>> > uninverted - fieldValueCache being built - on the first call after every
>> > commit. This is both time and memory consuming (you can check in the
>> > admin console stats how much memory it took).
>> > What you need to do is create a multivalued string field (not text),
>> > parse the values (do the analysis steps) on the client side, and store
>> > them like that. This will allow you to enable docValues on that field
>> > and avoid building fieldValueCache.
>> >
>> > HTH,
>> > Emir
>> > --
>> > Monitoring - Log Management - Alerting - Anomaly Detection
>> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>> >> On 6 Nov 2017, at 13:06, Wael Kader wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am using a custom field. Below is the field definition.
>> >> I am using this because I don't want stemming.
>> >>
>> >> positionIncrementGap="100">
>> >>
>> >> mapping="mapping-ISOLatin1Accent.txt"/>
>> >>
>> >> ignoreCase="true"
>> >> words="stopwords.txt"
>> >> enablePositionIncrements="true"
>> >> />
>> >>
>> >> protected="protwords.txt"
>> >> generateWordParts="0"
>> >> generateNumberParts="1"
>> >> catenateWords="1"
>> >> catenateNumbers="1"
>> >> catenateAll="0"
>> >> splitOnCaseChange="1"
>> >> preserveOriginal="1"/>
>> >>
>> >> mapping="mapping-ISOLatin1Accent.txt"/>
>> >>
>> >> synonyms="synonyms.txt"
>> >> ignoreCase="true" expand="true"/>
>> >>
>> >> ignoreCase="true"
>> >> words="stopwords.txt"
>> >> enablePositionIncrements="true"
>> >> />
>> >>
>> >> protected="protwords.txt"
>> >> generateWordParts="0"
>> >> catenateWords="0"
>> >> catenateNumbers="0"
>> >> catenateAll="0"
>> >> splitOnCaseChange="1"
>> >> preserveOriginal="1"/>
>> >>
>> >> Regards,
>> >> Wael
>> >>
>> >> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <emir.arnautovic@sematext.com> wrote:
>> >>
>> >>> Hi Wael,
>> >>> Can you provide your field definition and a sample query?
>> >>>
>> >>> Thanks,
>> >>> Emir
>> >>> --
>> >>> Monitoring - Log Management - Alerting - Anomaly Detection
>> >>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >>>
>> >>>> On 6 Nov 2017, at 08:30, Wael Kader wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> I have an index with around 100 million documents.
>> >>>> I have a multivalued column that I am saving big chunks of text data in.
>> >>>> It has around 20 GB of RAM and 4 CPUs.
>> >>>>
>> >>>> I was doing faceting on it to get a word cloud, and it was taking around
>> >>>> 1 second to retrieve when the data was 5-10 million.
>> >>>> Now I have more data and it's taking minutes to get the results (that is,
>> >>>> if it gets them and Solr doesn't crash). What's the best way to make it
>> >>>> run, or maybe it's not scalable to make it run on my current schema and
>> >>>> design with news articles.
>> >>>>
>> >>>> I am looking to find the best solution for this. Maybe create another
>> >>>> index to split the data while inserting it, or maybe if I change some
>> >>>> settings in SolrConfig or add some RAM, it would perform better.
>> >>>>
>> >>>> --
>> >>>> Regards,
>> >>>> Wael
>> >>>
>> >>
>> >> --
>> >> Regards,
>> >> Wael
>> >
>
> --
> Regards,
> Wael
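
For anyone reading this thread later, here is a rough Python sketch of the two suggestions above: Emir's idea of doing the analysis on the client and storing the tokens in a multivalued string field, and the termfreq function query Erick mentions. Everything named here is an assumption made for illustration: the field names (user_s, text_txt, words_ss), the tiny stopword set, and the simple regex tokenizer are hypothetical stand-ins, not the analysis chain from the thread. The sketch only builds the document and the query parameters; it does not talk to Solr.

```python
import re
from urllib.parse import urlencode

# Hypothetical stopword list; a real setup would load stopwords.txt.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split on non-word characters, drop stopwords, and
    de-duplicate while keeping first-seen order. No stemming is applied."""
    seen = set()
    tokens = []
    for token in re.findall(r"\w+", text.lower()):
        if token in stopwords or token in seen:
            continue
        seen.add(token)
        tokens.append(token)
    return tokens


def build_doc(doc_id, user, text):
    """Build a JSON-style document. words_ss is assumed to be a
    multivalued string field with docValues enabled (hypothetical name)."""
    return {
        "id": doc_id,
        "user_s": user,
        "text_txt": text,
        "words_ss": analyze(text),
    }


def facet_params(user, limit=50):
    """Query string for faceting on the pre-analyzed field, restricted
    to one user and capped at `limit` buckets."""
    return urlencode({
        "q": "*:*",
        "fq": f"user_s:{user}",
        "rows": 0,
        "facet": "true",
        "facet.field": "words_ss",
        "facet.limit": limit,
    })


def termfreq_params(field, term):
    """Query string asking for the per-document frequency of one term
    via the termfreq function query."""
    return urlencode({"q": "*:*", "fl": f"id,termfreq({field},'{term}')"})
```

Repeated words within one document are dropped in analyze() because field facets count matching documents per value, not occurrences. If per-occurrence counts are what the word cloud needs, that would have to be computed some other way.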