lucene-solr-user mailing list archives

From Walter Underwood <wunderw...@netflix.com>
Subject Re: SOLR Performance
Date Tue, 04 Nov 2008 22:10:36 GMT
Funny, that is exactly what Infoseek did back in 1996. A big index that
changed rarely and a small index with real-time changes. Once each week,
merge to make a new big index and start over with the small one.

You also need to handle deletes specially.

wunder
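
A minimal sketch of the query-time side of the big-index/small-index scheme described above, where the new index is favored over the old one and deletes are handled explicitly. Everything here is an assumption made for illustration and not code from anyone in this thread: the function name `merge_results`, the `id` and `score` keys, and the idea of keeping a set of IDs deleted since the last merge are all hypothetical.

```python
from typing import Dict, Iterable, List, Set

Doc = Dict[str, object]


def merge_results(
    old_hits: Iterable[Doc],
    new_hits: Iterable[Doc],
    deleted_ids: Set[str],
    id_field: str = "id",
) -> List[Doc]:
    """Combine hits from a frozen "old" index and a live "new" index.

    Assumptions (hypothetical, for illustration only):
    - every hit is a dict with a unique key under `id_field` and a
      numeric "score";
    - `deleted_ids` holds IDs deleted since the old index was frozen,
      because the frozen index cannot apply those deletes itself.
    """
    merged: Dict[object, Doc] = {}

    # Old index first, skipping anything deleted since it was frozen.
    for doc in old_hits:
        if doc[id_field] not in deleted_ids:
            merged[doc[id_field]] = doc

    # New index second, so a newer version overwrites the old one.
    for doc in new_hits:
        merged[doc[id_field]] = doc

    # Highest score first.
    return sorted(merged.values(), key=lambda d: d["score"], reverse=True)


old = [
    {"id": "a", "score": 1.0, "version": "old"},
    {"id": "b", "score": 0.5, "version": "old"},
    {"id": "c", "score": 0.9, "version": "old"},
]
new = [{"id": "a", "score": 0.7, "version": "new"}]
result = merge_results(old, new, deleted_ids={"c"})
```

Two caveats on this sketch: scores computed by separate indexes are not strictly comparable, since term statistics differ between them, and a document that was updated in the new index but no longer matches the query would still surface from the old index unless its ID is also added to `deleted_ids`.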

On 11/3/08 6:44 PM, "Lance Norskog" <goksron@gmail.com> wrote:

> The logistics of handling giant index files hit us before search
> performance. We switched to a set of indexes running inside one server
> (tomcat) instance with the Multicore+Distributed Search tools, with a frozen
> old index and a new index actively taking updates. The smaller new index
> takes much less time to recover after a commit.
> 
> The DS code does not handle cases where the new and old index have different
> versions of the same document. We wrote a custom distributed search that
> favored the "new" index over the "old".
> 
> Lance
> 
> -----Original Message-----
> From: Mike Klaas [mailto:mike.klaas@gmail.com]
> Sent: Monday, November 03, 2008 4:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR Performance
> 
> If you never execute any queries, a gig should be more than enough.
> 
> Of course, I've never played around with a .8 billion doc corpus on one
> machine.
> 
> -Mike
> 
> On 3-Nov-08, at 2:16 PM, Alok Dhir wrote:
> 
>> in terms of RAM -- how to size that on the indexer?
>> 
>> ---
>> Alok K. Dhir
>> Symplicity Corporation
>> www.symplicity.com
>> (703) 351-0200 x 8080
>> adhir@symplicity.com
>> 
>> On Nov 3, 2008, at 4:07 PM, Walter Underwood wrote:
>> 
>>> The indexing box can be much smaller, especially in terms of CPU.
>>> It just needs one fast thread and enough disk.
>>> 
>>> wunder
>>> 
>>> On 11/3/08 2:58 PM, "Alok Dhir" <adhir@symplicity.com> wrote:
>>> 
>>>> I was afraid of that.  Was hoping not to need another big fat box
>>>> like this one...
>>>> 
>>>> ---
>>>> Alok K. Dhir
>>>> Symplicity Corporation
>>>> www.symplicity.com
>>>> (703) 351-0200 x 8080
>>>> adhir@symplicity.com
>>>> 
>>>> On Nov 3, 2008, at 4:53 PM, Feak, Todd wrote:
>>>> 
>>>>> I believe this is one of the reasons that a master/slave
>>>>> configuration comes in handy. Commits to the Master don't slow down
>>>>> queries on the Slave.
>>>>> 
>>>>> -Todd
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Alok Dhir [mailto:adhir@symplicity.com]
>>>>> Sent: Monday, November 03, 2008 1:47 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: SOLR Performance
>>>>> 
>>>>> We've moved past this issue by reducing date precision -- thanks to
>>>>> all for the help.  Now we're at another problem.
>>>>> 
>>>>> There is relatively constant updating of the index -- new log
>>>>> entries are pumped in from several applications continuously.
>>>>> Obviously, new entries do not appear in searches until after a
>>>>> commit occurs.
>>>>> 
>>>>> The problem is, issuing a commit causes searches to come to a
>>>>> screeching halt for up to 2 minutes.  We're up to around 80M docs.
>>>>> Index size is 27G.  The number of docs will soon be 800M, which
>>>>> doesn't bode well for these "pauses" in search performance.
>>>>> 
>>>>> I'd appreciate any suggestions.
>>>>> 
>>>>> ---
>>>>> Alok K. Dhir
>>>>> Symplicity Corporation
>>>>> www.symplicity.com
>>>>> (703) 351-0200 x 8080
>>>>> adhir@symplicity.com
>>>>> 
>>>>> On Oct 29, 2008, at 4:30 PM, Alok Dhir wrote:
>>>>> 
>>>>>> Hi -- using solr 1.3 -- roughly 11M docs on a 64 gig 8 core
>>>>>> machine.
>>>>>> 
>>>>>> Fairly simple schema -- no large text fields, standard request
>>>>>> handler.  4 small facet fields.
>>>>>> 
>>>>>> The index is an event log -- a primary search/retrieval
>>>>>> requirement is date range queries.
>>>>>> 
>>>>>> A simple query without a date range subquery is ridiculously fast
>>>>>> - 2ms.  The same query with a date range takes up to 30s
>>>>>> (30,000ms).
>>>>>> 
>>>>>> Concrete example, this query just took 18s:
>>>>>> 
>>>>>> instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO
>>>>>> 2008-10-30T03:59:59Z] AND label_facet:"Added to Position"
>>>>>> 
>>>>>> The exact same query without the date range took 2ms.
>>>>>> 
>>>>>> I saw a thread from Apr 2008 which explains that the problem is due
>>>>>> to too much precision on the DateField type, with the range
>>>>>> expansion leading to far too many elements being checked.
>>>>>> The proposed solution appears to be a hack where you index date
>>>>>> fields as strings and hack together date functions to generate
>>>>>> proper queries and format results.
>>>>>> 
>>>>>> Does this remain the recommended solution to this issue?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> ---
>>>>>> Alok K. Dhir
>>>>>> Symplicity Corporation
>>>>>> www.symplicity.com
>>>>>> (703) 351-0200 x 8080
>>>>>> adhir@symplicity.com
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
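
The thread mentions getting past the slow date range queries by reducing date precision, but does not say to what granularity. The sketch below assumes hour-level precision purely for illustration; the helper name `truncate_to_hour` is hypothetical. It rounds timestamps down before they are indexed and compares how many distinct values a day of one-per-second log events produces at each precision, since the range expansion has to check every distinct value in the range.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout used by the date values quoted in this thread.
SOLR_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_hour(ts: datetime) -> str:
    """Round a timezone-aware timestamp down to the hour, in UTC.

    Hour granularity is an assumption for this sketch; the thread does
    not state which precision was actually chosen.
    """
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    ts = ts.astimezone(timezone.utc)
    return ts.replace(minute=0, second=0, microsecond=0).strftime(SOLR_DATE_FORMAT)


# One synthetic log event per second for a full day.
start = datetime(2008, 10, 1, 4, 0, 0, tzinfo=timezone.utc)
events = [start + timedelta(seconds=s) for s in range(24 * 60 * 60)]

# Distinct indexed values at second precision versus hour precision.
full_precision = {e.strftime(SOLR_DATE_FORMAT) for e in events}
hour_precision = {truncate_to_hour(e) for e in events}

print(len(full_precision), len(hour_precision))
```

The trade-off is that range queries can then only be bounded to the chosen granularity, so the precision has to be no coarser than the finest date range users actually need.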

