lucene-java-user mailing list archives

From Peter Miller <Peter.Mil...@objectconsulting.com.au>
Subject RE: How best to handle a reasonable amount to data (25TB+)
Date Wed, 08 Feb 2012 02:07:49 GMT
Oops again! Turns out I got to the right result earlier by the wrong means! I found this reference
(http://www.dejavutechnologies.com/faq-solr-lucene.html) that states shards can be up to 100,000,000
documents. So, I'm back to 13 shards again. Phew!

Now I'm just wondering if Cassandra/Lucandra would be a better option anyway. If Cassandra
offers some of the same advantages as the OpenStack Swift object store, then it should be
the way to go.

Still looking for thoughts...

Thanks, The Captn

-----Original Message-----
From: Peter Miller [mailto:Peter.Miller@objectconsulting.com.au] 
Sent: Wednesday, 8 February 2012 12:20 PM
To: java-user@lucene.apache.org
Subject: RE: How best to handle a reasonable amount to data (25TB+)

Whoops! Very poor basic maths, I should have written it down. I was thinking 13 shards. But
yes, 13,000 is a bit different. Now I'm in even more need of help. 

How is "easy" - 15 million audit records a month, coming from several active systems, and
a requirement to keep and search across seven years of data.
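
As a rough check of the sizing, the figures in this thread (15 million records a month, seven years of retention, and the 100-million-documents-per-shard ceiling cited elsewhere in the thread) can be worked through in a few lines. This is only a back-of-envelope sketch in Python; the constant and function names are made up for illustration, and the per-shard figure is a rule of thumb quoted in the thread rather than a hard Lucene limit.

```python
import math

RECORDS_PER_MONTH = 15_000_000      # from the requirement above
RETENTION_YEARS = 7                 # from the requirement above
MAX_DOCS_PER_SHARD = 100_000_000    # rule-of-thumb figure quoted in this thread


def total_documents(records_per_month, years):
    """Documents accumulated over the whole retention period."""
    return records_per_month * 12 * years


def shards_needed(total_docs, max_docs_per_shard):
    """Smallest number of shards that keeps every shard under the ceiling."""
    return math.ceil(total_docs / max_docs_per_shard)


total = total_documents(RECORDS_PER_MONTH, RETENTION_YEARS)
shards = shards_needed(total, MAX_DOCS_PER_SHARD)
print(f"{total:,} documents -> {shards} shards")
```

That works out to about 1.26 billion documents rather than 1.25 trillion, which is the difference between 13 shards and roughly 13,000.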

<Goes off to do more googling>

Thanks a lot,
The Captn

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Wednesday, 8 February 2012 12:39 AM
To: java-user@lucene.apache.org
Subject: Re: How best to handle a reasonable amount to data (25TB+)

I'm curious what the nature of your data is such that you have 1.25 trillion documents. Even
at 100M/shard, you're still talking 12,500 shards. The "laggard" problem will rear its ugly
head, not to mention the administration of that many machines will be, shall we say, non-trivial...
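
To make the "laggard" concern concrete: a scatter-gather query has to wait for its slowest shard, so the chance that some shard is slow grows quickly with the shard count. The following is a minimal sketch in Python, assuming each shard independently has a small probability of responding slowly. The 1% figure and the function name are illustrative assumptions, not measurements.

```python
def prob_any_laggard(num_shards, p_slow):
    """Probability that at least one of num_shards independent shards is slow."""
    return 1.0 - (1.0 - p_slow) ** num_shards


P_SLOW = 0.01  # assumed per-shard chance of a slow response (illustrative only)

for n in (13, 12_500):
    chance = prob_any_laggard(n, P_SLOW)
    print(f"{n:>6} shards: {chance:.1%} of queries wait on a laggard")
```

Under that assumption a query over 13 shards is occasionally held up, while a query over 12,500 shards is held up almost every time.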

Best
Erick

On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller <Peter.Miller@objectconsulting.com.au> wrote:
> Thanks for the response. Actually, I am more concerned with trying to use an Object Store
> for the indexes. The next concern is the use of a local index versus the sharded ones, but
> I'm more relaxed about that now after thinking about it. I see that index shards could be
> up to 100 million documents, so that makes the 1.25 trillion number look reasonable.
>
> Any other thoughts?
>
> Thanks,
> The Captn.
>
> -----Original Message-----
> From: ppp c [mailto:peter.c.eric@gmail.com]
> Sent: Monday, 6 February 2012 5:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> It sounds like this is not a Lucene issue but a matter of your app's logic.
> If you're worried about too many docs in one index, you can make multiple indexes,
> then search across them and merge the results.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <Peter.Miller@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am 
>> looking for advice. I have researched the archives, and seen some 
>> relevant posts, but they are fairly old and not specifically a match, 
>> so I thought I would give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB 
>> of search attributes to handle in Lucene, across about 1.25 trillion 
>> documents. The app is write once, read many. There are many document 
>> types involved that have to be able to be searched separately or 
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCR implementation that uses Lucene under the
>> covers. The data itself is not searchable, only the attributes. I
>> plan to hook the JCR repo (ModeShape) up to the OpenStack Object
>> Storage on commodity hardware
>> eventually with 5 machines, each with 24 x 2TB drives. This should 
>> allow for redundancy (3 copies), although I would suppose we would 
>> add bigger drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts 
>> for these days, but a bit chunky), I was sort of assuming that the 
>> Lucene indexes would go on the object storage solution too, to handle 
>> availability and other infrastructure issues. Most of the searches 
>> would be date-constrained, so I thought that the indexes could be sharded by date.
>>
>> There would be a local disk index being built near real time on the 
>> JCR hardware that could be regularly merged in with the main indexes
>> on the object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just 
>> theoretical at the moment and I'm not experienced in Lucene, as you 
>> can no doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed
>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/,
>> and I'm now wondering if that would be a superior approach? Or any other suggestions?
>>
>> Many Thanks,
>> The Captn
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
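
The two ideas quoted above, sharding the indexes by date and searching several indexes before merging the results, can be sketched together. This is a toy illustration in Python rather than Lucene code: the `SHARDS` dictionary is a hypothetical in-memory stand-in for one index per year, and all function names are invented for the example. It also assumes scores from different shards are directly comparable, which is not guaranteed for independently built indexes.

```python
import heapq
from datetime import date
from itertools import islice

# Hypothetical stand-in for one index per year: each shard holds
# (score, doc_id, doc_date) hits. A real system would query Lucene instead.
SHARDS = {
    2010: [(0.9, "a1", date(2010, 3, 1)), (0.4, "a2", date(2010, 7, 9))],
    2011: [(0.8, "b1", date(2011, 1, 5)), (0.7, "b2", date(2011, 11, 2))],
    2012: [(0.95, "c1", date(2012, 2, 6))],
}


def shards_for_range(start, end):
    """Pick only the yearly shards that can contain dates in [start, end]."""
    return [year for year in sorted(SHARDS) if start.year <= year <= end.year]


def search_shard(year, start, end):
    """Return this shard's hits inside the date range, best score first."""
    hits = [h for h in SHARDS[year] if start <= h[2] <= end]
    return sorted(hits, key=lambda h: h[0], reverse=True)


def search(start, end, top_k):
    """Query each relevant shard, then merge the already-sorted result lists."""
    per_shard = [search_shard(y, start, end) for y in shards_for_range(start, end)]
    merged = heapq.merge(*per_shard, key=lambda h: h[0], reverse=True)
    return [doc_id for _, doc_id, _ in islice(merged, top_k)]


print(search(date(2011, 1, 1), date(2012, 12, 31), top_k=2))
```

Because most searches are date-constrained, the shard selection step means a typical query touches only a few of the shards instead of all of them.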
