Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <CACCK0pxBS=YGd+j1-mTqWQXqvHKSi6La8sRC9wj7-0wjx1KWjg@mail.gmail.com>
References: <CACCK0pxTCB6zt0xET53_uqKe1kr3W9iKhwrJ-6MuB_SYjVgxwQ@mail.gmail.com>
 <3709F6D2-6A01-4180-A78A-473D0E37F7EE@sematext.com> <CACCK0pxr-o-Fioyb-Mjtk+4NLZbQ=knXu8bz=RyMAgQDegAyLg@mail.gmail.com>
 <8441BB60-FDCA-4FDF-931F-43FBC56CB0CC@sematext.com> <CAN4YXvfuSh+auL7pUaEB_4ivZNZjsgEigJ9uVfBehYoKCn1k1Q@mail.gmail.com>
 <CACCK0pxBS=YGd+j1-mTqWQXqvHKSi6La8sRC9wj7-0wjx1KWjg@mail.gmail.com>
From: Erick Erickson <erickerickson@gmail.com>
Date: Tue, 7 Nov 2017 07:42:14 -0800
Message-ID: <CAN4YXveibUcH5wfxx1errFzCvcYn9o=6QKgRqnOwt2O7XMW9zg@mail.gmail.com>
Subject: Re: Faceting Word Count
To: solr-user <solr-user@lucene.apache.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
archived-at: Tue, 07 Nov 2017 15:43:07 -0000

bq: 10k as a max number of rows.

This doesn't matter. In order to facet on the word count, Solr has to
be prepared for any document to match. For all Solr knows, a _single_
document may contain every word, so the structure that holds the
counters has to be sized for N buckets, where N is the total number of
distinct words in the entire corpus, not just in the 10k matching docs.
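
A toy sketch of what I mean, in Python. This is a simplified model and
not Solr's actual implementation; the corpus and function name are made
up for illustration. The point is only that the bucket array is sized
from the whole index before any matching doc is looked at.

```python
def facet_counts(corpus, matching_ids):
    """Toy model of term faceting: one counter per distinct term in the corpus."""
    # The term dictionary covers the whole index, not just the filtered set.
    vocabulary = sorted({tok for doc in corpus for tok in doc.split()})
    ordinal = {term: i for i, term in enumerate(vocabulary)}
    # One bucket per distinct term, allocated before any document is visited.
    buckets = [0] * len(vocabulary)
    # Only the matching docs are counted, but the allocation above is already paid for.
    for doc_id in matching_ids:
        for tok in set(corpus[doc_id].split()):
            buckets[ordinal[tok]] += 1
    return vocabulary, buckets


# Hypothetical three-document index where the filter matches a single doc.
corpus = ["solr facets words", "words words everywhere", "one tiny doc"]
vocabulary, buckets = facet_counts(corpus, matching_ids=[2])
```

Even though only one doc matches, len(buckets) equals the number of
distinct words across all three docs.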

I think you'll really have to find an alternative approach, for instance
by somehow restricting the set of words you count.
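
One way to restrict the choices, sketched in Python under some
assumptions: you already know the list of words you care about, and you
ask for the termfreq function query (mentioned earlier in this thread)
on each of them, then sum the per-document values on the client. The
field name body_text, the user_id filter and the tf_N aliases are
hypothetical, and the terms have to be in their indexed (lowercased)
form.

```python
from urllib.parse import urlencode


def termfreq_params(field, terms, filter_queries, rows=10000):
    """Build a query string asking for termfreq(field,'term') per document."""
    # One aliased pseudo-field per term, e.g. tf_0:termfreq(body_text,'solr').
    # Terms containing quotes would need escaping; that is not handled here.
    pseudo_fields = [f"tf_{i}:termfreq({field},'{t}')" for i, t in enumerate(terms)]
    params = [("q", "*:*"), ("rows", str(rows)),
              ("fl", ",".join(["id"] + pseudo_fields))]
    params += [("fq", fq) for fq in filter_queries]
    return urlencode(params)


def sum_counts(docs, terms):
    """Add up the per-document term frequencies returned for each term."""
    totals = {t: 0 for t in terms}
    for doc in docs:
        for i, t in enumerate(terms):
            totals[t] += doc.get(f"tf_{i}", 0)
    return totals


terms = ["solr", "facet"]
query_string = termfreq_params("body_text", terms, ["user_id:42"])
# Stand-in for the "docs" list of a Solr JSON response (made-up values).
docs = [{"id": "1", "tf_0": 2, "tf_1": 0}, {"id": "2", "tf_0": 1, "tf_1": 3}]
totals = sum_counts(docs, terms)
```

This only touches the docs that pass the filters, so the cost follows
the 10k rows rather than the 100M-doc vocabulary, at the price of
having to pick the words up front.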

Best,
Erick

On Tue, Nov 7, 2017 at 12:26 AM, Wael Kader <wael@softech-lb.com> wrote:
> Hi,
>
> The whole index has 100M but when I add the criteria, it will filter the
> data to maybe 10k as a max number of rows.
> The facet isn't working when the total number of records in the index is
> 100M but it was working at 5M.
>
> I have social media & RSS data in the index and I am trying to get the word
> count for a specific user on specific date intervals.
>
> Regards,
> Wael
>
> On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> _Why_ do you want to get the word counts? Faceting on all of the
>> tokens for 100M docs isn't something Solr is ordinarily used for. As
>> Emir says it'll take a huge amount of memory. You can use one of the
>> function queries (termfreq IIRC) that will give you the count of any
>> individual term you have and will be very fast.
>>
>> But getting all of the word counts in the index is probably not
>> something I'd use Solr for.
>>
>> This may be an XY problem, you're asking how to do something specific
>> (X) without explaining what the problem you're trying to solve is (Y).
>> Perhaps there's another way to accomplish (Y) if we knew more about
>> what it is.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
>> <emir.arnautovic@sematext.com> wrote:
>> > Hi Wael,
>> > You are faceting on an analyzed field. This results in the field being
>> > uninverted - fieldValueCache being built - on the first call after every
>> > commit. This is both time- and memory-consuming (you can check in the
>> > admin console stats how much memory it took).
>> > What you need to do is create a multivalued string field (not text),
>> > parse the values (do the analysis steps) on the client side, and store
>> > them like that. This will allow you to enable docValues on that field
>> > and avoid building fieldValueCache.
>> >
>> > HTH,
>> > Emir
>> > --
>> > Monitoring - Log Management - Alerting - Anomaly Detection
>> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>> >
>> >
>> >> On 6 Nov 2017, at 13:06, Wael Kader <wael@softech-lb.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am using a custom field. Below is the field definition.
>> >> I am using this because I don't want stemming.
>> >>
>> >>
>> >>    <fieldType name="text_no_stem2" class="solr.TextField"
>> >>               positionIncrementGap="100">
>> >>      <analyzer type="index">
>> >>        <charFilter class="solr.MappingCharFilterFactory"
>> >>                    mapping="mapping-ISOLatin1Accent.txt"/>
>> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>
>> >>        <filter class="solr.StopFilterFactory"
>> >>                ignoreCase="true"
>> >>                words="stopwords.txt"
>> >>                enablePositionIncrements="true"
>> >>                />
>> >>        <filter class="solr.WordDelimiterFilterFactory"
>> >>                protected="protwords.txt"
>> >>                generateWordParts="0"
>> >>                generateNumberParts="1"
>> >>                catenateWords="1"
>> >>                catenateNumbers="1"
>> >>                catenateAll="0"
>> >>                splitOnCaseChange="1"
>> >>                preserveOriginal="1"/>
>> >>        <filter class="solr.LowerCaseFilterFactory"/>
>> >>
>> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >>      </analyzer>
>> >>      <analyzer type="query">
>> >>        <charFilter class="solr.MappingCharFilterFactory"
>> >>                    mapping="mapping-ISOLatin1Accent.txt"/>
>> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>        <filter class="solr.SynonymFilterFactory"
>> >>                synonyms="synonyms.txt"
>> >>                ignoreCase="true" expand="true"/>
>> >>        <filter class="solr.StopFilterFactory"
>> >>                ignoreCase="true"
>> >>                words="stopwords.txt"
>> >>                enablePositionIncrements="true"
>> >>                />
>> >> <!--ORIGINAL                generateNumberParts="1"-->
>> >>        <filter class="solr.WordDelimiterFilterFactory"
>> >>                protected="protwords.txt"
>> >>                generateWordParts="0"
>> >>                catenateWords="0"
>> >>                catenateNumbers="0"
>> >>                catenateAll="0"
>> >>                splitOnCaseChange="1"
>> >>                preserveOriginal="1"/>
>> >>        <filter class="solr.LowerCaseFilterFactory"/>
>> >>        <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
>> >>             language="English" protected="protwords.txt"/-->
>> >>        <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
>> >>             word match -->
>> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >>      </analyzer>
>> >>    </fieldType>
>> >>
>> >>
>> >> Regards,
>> >> Wael
>> >>
>> >> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
>> >> emir.arnautovic@sematext.com> wrote:
>> >>
>> >>> Hi Wael,
>> >>> Can you provide your field definition and sample query.
>> >>>
>> >>> Thanks,
>> >>> Emir
>> >>> --
>> >>> Monitoring - Log Management - Alerting - Anomaly Detection
>> >>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >>>
>> >>>
>> >>>
>> >>>> On 6 Nov 2017, at 08:30, Wael Kader <wael@softech-lb.com> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> I have an index with around 100 million documents.
>> >>>> I have a multivalued column that I am saving big chunks of text data
>> >>>> in. The server has around 20 GB of RAM and 4 CPUs.
>> >>>>
>> >>>> I was doing faceting on it to get a word cloud, and it was taking around
>> >>>> 1 second to retrieve when the data was 5-10 million documents.
>> >>>> Now I have more data and it's taking minutes to get the results (that
>> >>>> is, if it gets them and Solr doesn't crash). What's the best way to make
>> >>>> it run? Or maybe it's not scalable to make it run on my current schema
>> >>>> and design with news articles.
>> >>>>
>> >>>> I am looking to find the best solution for this. Maybe create another
>> >>>> index to split the data while inserting it, or maybe if I change some
>> >>>> settings in SolrConfig or add some RAM, it would perform better.
>> >>>>
>> >>>> --
>> >>>> Regards,
>> >>>> Wael
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Wael
>> >
>>
>
>
>
> --
> Regards,
> Wael