lucene-solr-user mailing list archives

From Jeongseok Son <invictu...@gmail.com>
Subject Re: solr-user Digest of: get.100322
Date Tue, 20 May 2014 08:01:58 GMT
Thank you for your reply! I also came across docValues after sending my
email, and your suggestion seems like the best solution for me.

Now I'm configuring schema.xml to use docValues and have a question
about docValuesFormat.

According to this thread
(http://lucene.472066.n3.nabble.com/Trade-offs-in-choosing-DocValuesFormat-td4114758.html),
Solr 4.6 with the default docValuesFormat configuration only holds some
hash structures in memory.

Though it uses only a small amount of memory, I'm worried about memory
usage because I have to store so many documents (32GB RAM, about 5B
docs in total across all cores).
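
As a rough illustration of the concern, the arithmetic below estimates how much heap an uninverted, FieldCache-style array would need for one numeric sort field. It is only a back-of-the-envelope sketch: it assumes one fixed-width value per document (4 bytes for int/float, 8 bytes for long/double) and ignores per-segment and object overhead. The function name and constants are illustrative, and the 5B document count and 30GB heap are the figures quoted in this thread.

```python
def field_cache_bytes(num_docs: int, bytes_per_value: int, num_sort_fields: int = 1) -> int:
    """Estimate heap used by fixed-width per-document arrays (assumption: no overhead)."""
    return num_docs * bytes_per_value * num_sort_fields


GIB = 1024 ** 3
TOTAL_DOCS = 5_000_000_000  # total docs across all cores, as stated in the thread
HEAP_GIB = 30               # JVM heap mentioned in the original message

for label, width in (("int/float", 4), ("long/double", 8)):
    needed_gib = field_cache_bytes(TOTAL_DOCS, width) / GIB
    print(f"{label}: {needed_gib:.1f} GiB for one sort field (heap: {HEAP_GIB} GiB)")
```

Under these assumptions a single 4-byte sort field already needs roughly 18.6 GiB and an 8-byte field roughly 37.3 GiB, which is why keeping the values outside the Java heap matters at this scale.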

Which docValuesFormat is more appropriate in my case (Default or
Disk)? Can I change it later without re-indexing?
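
For reference, the kind of schema.xml entries being asked about can be sketched as follows. This is a hedged illustration, not a recommended configuration: the field name `score` and the type name `tlong_dv_disk` are hypothetical, and it assumes a Solr 4.x setup where `docValues` is set on the field, `docValuesFormat` is set on the field type, and per-field formats require `solr.SchemaCodecFactory` in solrconfig.xml. The Python helpers only render the XML elements so the attributes are easy to see.

```python
import xml.etree.ElementTree as ET


def docvalues_field(name: str, field_type: str) -> str:
    """Render a <field> element with docValues enabled (names are hypothetical)."""
    elem = ET.Element("field", {
        "name": name,
        "type": field_type,
        "indexed": "true",
        "stored": "true",
        "docValues": "true",
    })
    return ET.tostring(elem, encoding="unicode")


def docvalues_field_type(name: str, solr_class: str, fmt: str) -> str:
    """Render a <fieldType> element that selects a docValuesFormat."""
    elem = ET.Element("fieldType", {
        "name": name,
        "class": solr_class,
        "docValuesFormat": fmt,
    })
    return ET.tostring(elem, encoding="unicode")


# Hypothetical numeric sort field backed by a Disk-format field type.
print(docvalues_field_type("tlong_dv_disk", "solr.TrieLongField", "Disk"))
print(docvalues_field("score", "tlong_dv_disk"))
```

The printed elements would go inside the `<types>` and `<fields>` sections of schema.xml respectively; whether the format can be switched later without re-indexing is exactly the open question above and is not answered by this sketch.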

On Sat, May 17, 2014 at 9:45 PM,  <solr-user-help@lucene.apache.org> wrote:
>
> solr-user Digest of: get.100322
>
> Topics (messages 100322 through 100322)
>
> Re: Sorting problem in Solr due to Lucene Field Cache
>         100322 by: Joel Bernstein
>
> ---------- Forwarded message ----------
> From: Joel Bernstein <joelsolr@gmail.com>
> To: solr-user@lucene.apache.org
> Cc:
> Date: Fri, 16 May 2014 17:49:51 -0400
> Subject: Re: Sorting problem in Solr due to Lucene Field Cache
> Take a look at Solr's use of DocValues:
> https://cwiki.apache.org/confluence/display/solr/DocValues.
>
> There are docValues options that use less memory than the FieldCache.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Thu, May 15, 2014 at 6:39 AM, Jeongseok Son <invictusjs@gmail.com> wrote:
>
>> Hello, I'm struggling with large data indexed and searched by Solr.
>>
>> The schema of the documents consists of a date (YYYY-MM-DD), text (tokenized and
>> indexed with Natural Language Toolkit), and several numerical fields.
>>
>> Each document is small, but the number of docs is very large,
>> around 10 million per date. The server has 32GB of memory and
>> I allocated around 30GB to the Solr JVM.
>>
>> My Solr server has to return documents sorted by one of the numerical
>> fields when it is queried with a specific date and text (e.g.
>> q=date:YYYY-MM-DD+text:KEYWORD). The problem is that sorting in Lucene
>> requires a lot of Field Cache and Solr can't handle the Field Cache well. The
>> Field Cache grows as more queries are executed and is never
>> evicted. When the whole memory is filled with Field Cache, the Solr server
>> stops or throws an Out of Memory exception.
>>
>> Solr cannot control the Lucene field cache at all, so I'm having a difficult time
>> solving this problem. I'm considering these three ways to solve it.
>>
>> 1) Add more memory.
>> This can relieve the problem but I don't think it can completely solve it.
>> The memory would still fill up with field cache as the server handles
>> search requests.
>> 2) Separate numerical data from text data
>> I find Solr/Lucene isn't suitable for sorting large numerical data.
>> Therefore I'm thinking of storing the numerical data in another DB (HBase,
>> MongoDB ...), so that the Solr server only does text search.
>> 3) Switching to Elasticsearch
>> According to this page
>> (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html),
>> Elasticsearch can control field cache. I think ES could solve my
>> problem.
>>
>> I'm likely to try the 2nd or 3rd way. Are these appropriate solutions? If you
>> have any better ideas please let me know. I've gone through too many
>> troubles, so it's time to make a decision. I'd like my choices reviewed by
>> other experienced Solr users and developers, and I'd also like to find better
>> solutions.
>> I really appreciate any help you can provide.
>>
>
