lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How best to handle a reasonable amount to data (25TB+)
Date Wed, 08 Feb 2012 02:38:07 GMT
I'm confused. 100M x 13 shards = 1.3 billion records, not 1.25 trillion.

But I get it: 1.5 x 10^7 x 12 x 7 = 1.26 x 10^9 = 1.26 billion, or am
I off base again? But yes, at 100M records/shard that would be 13 servers.
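
For anyone checking the arithmetic, here is a minimal sketch in Python. It assumes the figures quoted in this thread (15 million records a month, seven years of retention, 100M documents per shard); the function and constant names are made up for illustration and are not part of Lucene or Solr.

```python
# Back-of-the-envelope shard count, using only the figures quoted in this thread.
# All names here are illustrative; nothing below is a Lucene or Solr API.

RECORDS_PER_MONTH = 15_000_000   # "15 million audit records a month"
MONTHS_PER_YEAR = 12
RETENTION_YEARS = 7              # "seven years of data"
DOCS_PER_SHARD = 100_000_000     # assumed ceiling of 100M documents per shard


def total_records(per_month: int, years: int) -> int:
    """Total documents accumulated over the retention period."""
    return per_month * MONTHS_PER_YEAR * years


def shards_needed(total_docs: int, docs_per_shard: int) -> int:
    """Ceiling division: a partly filled shard still needs a full shard."""
    return (total_docs + docs_per_shard - 1) // docs_per_shard


if __name__ == "__main__":
    total = total_records(RECORDS_PER_MONTH, RETENTION_YEARS)
    print(total)                                             # 1260000000
    print(shards_needed(total, DOCS_PER_SHARD))              # 13
    # The original (mistaken) figure of 1.25 trillion documents:
    print(shards_needed(1_250_000_000_000, DOCS_PER_SHARD))  # 12500
```

This only counts documents. It says nothing about whether 100M docs/shard will actually perform on your data, which is the real question.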

As for whether 100M documents/shard is reasonable... it depends (tm).
There are so many variables that the *only* way to find out is to try
it with *your* data and *your* queries. Otherwise it's just guessing.
Are you faceting? Sorting? Do you have 10 unique terms/field? 10M
unique terms? 10B unique terms? All that stuff goes into the mix to
determine how many documents a shard can hold and still get adequate
performance.

Not to mention the question "what's the hardware"? A MacBook Air with
4G memory? A monster
piece of metal with a bazillion gigs of memory and SSDs?

All that said, and especially with trunk, 100M documents/shard is
quite possible. So is 10M docs/shard. And it's not really the size of
the documents alone that determines the requirements; it's this weird
calculation of how many docs, how many unique terms/doc, and how you're
searching them. I expect your documents are quite small, so that may
help. Some.

Try filling out the spreadsheet here:
http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
and you'll swiftly find out how hard abstract estimations are....

Best
Erick

On Tue, Feb 7, 2012 at 9:07 PM, Peter Miller
<Peter.Miller@objectconsulting.com.au> wrote:
> Oops again! Turns out I got to the right result earlier by the wrong means! I found this
> reference (http://www.dejavutechnologies.com/faq-solr-lucene.html) that states shards can
> be up to 100,000,000 documents. So, I'm back to 13 shards again. Phew!
>
> Now I'm just wondering if Cassandra/Lucandra would be a better option anyway. If Cassandra
> offers some of the same advantages as the OpenStack Swift object store does, then it should
> be the way to go.
>
> Still looking for thoughts...
>
> Thanks, The Captn
>
> -----Original Message-----
> From: Peter Miller [mailto:Peter.Miller@objectconsulting.com.au]
> Sent: Wednesday, 8 February 2012 12:20 PM
> To: java-user@lucene.apache.org
> Subject: RE: How best to handle a reasonable amount to data (25TB+)
>
> Whoops! Very poor basic maths, I should have written it down. I was thinking 13 shards.
> But yes, 13,000 is a bit different. Now I'm in even more need of help.
>
> How is "easy" - 15 million audit records a month, coming from several active systems,
> and a requirement to keep and search across seven years of data.
>
> <Goes off to do more googling>
>
> Thanks a lot,
> The Captn
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wednesday, 8 February 2012 12:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> I'm curious what the nature of your data is such that you have 1.25 trillion documents.
> Even at 100M/shard, you're still talking 12,500 shards. The "laggard" problem will rear
> its ugly head, not to mention the administration of that many machines will be, shall we
> say, non-trivial...
>
> Best
> Erick
>
> On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller <Peter.Miller@objectconsulting.com.au> wrote:
>> Thanks for the response. Actually, I am more concerned with trying to use an Object
>> Store for the indexes. The next concern is the use of a local index versus the sharded ones,
>> but I'm more relaxed about that now after thinking about it. I see that index shards could
>> be up to 100 million documents, so that makes the 1.25 trillion number look reasonable.
>>
>> Any other thoughts?
>>
>> Thanks,
>> The Captn.
>>
>> -----Original Message-----
>> From: ppp c [mailto:peter.c.eric@gmail.com]
>> Sent: Monday, 6 February 2012 5:29 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>>
>> It sounds like this is not an issue with Lucene but with the logic of your app.
>> If you're afraid of too many docs in one index, you can make multiple indexes,
>> search across them, merge the results, and you're done.
>>
>> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <Peter.Miller@objectconsulting.com.au> wrote:
>>
>>> Hi,
>>>
>>> I have a little bit of an unusual set of requirements, and I am
>>> looking for advice. I have researched the archives, and seen some
>>> relevant posts, but they are fairly old and not specifically a match,
>>> so I thought I would give this a try.
>>>
>>> We will eventually have about 50TB raw, non-searchable data and 25TB
>>> of search attributes to handle in Lucene, across about 1.25 trillion
>>> documents. The app is write once, read many. There are many document
>>> types involved that have to be able to be searched separately or
>>> together, with some common attributes, but also unique ones per type.
>>> I plan on using a JCR implementation that uses Lucene under the
>>> covers. The data itself is not searchable, only the attributes. I
>>> plan to hook the JCR repo (ModeShape) up to the OpenStack Object
>>> Storage on commodity hardware, eventually with 5 machines, each with
>>> 24 x 2TB drives. This should allow for redundancy (3 copies),
>>> although I would suppose we would add bigger drives as we go on.
>>>
>>> Since there is such a lot of data to index (not outrageous amounts
>>> for these days, but a bit chunky), I was sort of assuming that the
>>> Lucene indexes would go on the object storage solution too, to handle
>>> availability and other infrastructure issues. Most of the searches
>>> would be date-constrained, so I thought that the indexes could be
>>> sharded by date.
>>>
>>> There would be a local disk index being built near real time on the
>>> JCR hardware that could be regularly merged in with the main indexes
>>> on the object storage, I suppose.
>>>
>>> Does that make sense, and would it work? Sorry, but this is just
>>> theoretical at the moment and I'm not experienced in Lucene, as you
>>> can no doubt tell.
>>>
>>> I came across a piece that was talking about Hadoop and distributed
>>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/,
>>> and I'm now wondering if that would be a superior approach? Or any
>>> other suggestions?
>>>
>>> Many Thanks,
>>> The Captn
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>

