lucene-java-user mailing list archives

From Erick Erickson <>
Subject Re: How best to handle a reasonable amount to data (25TB+)
Date Tue, 07 Feb 2012 13:39:20 GMT
I'm curious what the nature of your data is such that you have 1.25
trillion documents. Even at 100M/shard, you're still talking 12,500
shards. The "laggard" problem will rear its ugly head, not to mention
that the administration of that many machines will be, shall we say,
non-trivial...
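
To put rough numbers on the shard count and the "laggard" problem, here is a minimal sketch. The document and shard sizes come from this thread; the per-shard chance of a slow response (0.1%) and the assumption that shards are slow independently of each other are illustrative guesses, not measurements.

```python
# Figures from the thread.
TOTAL_DOCS = 1_250_000_000_000   # 1.25 trillion documents
DOCS_PER_SHARD = 100_000_000     # 100M documents per shard


def shard_count(total_docs: int, docs_per_shard: int) -> int:
    """Number of shards needed, rounded up (integer arithmetic only)."""
    return -(-total_docs // docs_per_shard)


def p_any_slow(shards: int, p_slow_per_shard: float) -> float:
    """Chance that at least one shard is slow on a given query.

    Assumes every shard is slow independently with the same probability,
    which is a simplification.
    """
    return 1.0 - (1.0 - p_slow_per_shard) ** shards


if __name__ == "__main__":
    n = shard_count(TOTAL_DOCS, DOCS_PER_SHARD)
    print("shards needed:", n)
    # 0.001 is an assumed per-shard probability, chosen only for illustration.
    print("P(at least one laggard):", p_any_slow(n, 0.001))
```

Since a query that fans out to every shard is only as fast as its slowest shard, even a small per-shard chance of a slow response becomes a near certainty at this scale.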


On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller
<> wrote:
> Thanks for the response. Actually, I am more concerned with trying to use an Object Store
> for the indexes. The next concern is the use of a local index versus the sharded ones, but
> I'm more relaxed about that now after thinking about it. I see that index shards could be
> up to 100 million documents, so that makes the 1.25 trillion number look reasonable.
> Any other thoughts?
> Thanks,
> The Captn.
> -----Original Message-----
> From: ppp c []
> Sent: Monday, 6 February 2012 5:29 PM
> To:
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
> It sounds like this is not an issue with Lucene but with the logic of your app.
> If you're afraid of having too many docs in one index, you can make multiple indexes,
> then search across them and merge the results.
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <> wrote:
>> Hi,
>> I have a little bit of an unusual set of requirements, and I am
>> looking for advice. I have researched the archives, and seen some
>> relevant posts, but they are fairly old and not specifically a match,
>> so I thought I would give this a try.
>> We will eventually have about 50TB raw, non-searchable data and 25TB
>> of search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document
>> types involved that have to be able to be searched separately or
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCR implementation that uses Lucene under the
>> covers. The data itself is not searchable, only the attributes. I plan
>> to hook the JCR repo (ModeShape) up to the OpenStack Object Storage on
>> commodity hardware
>> eventually with 5 machines, each with 24 x 2TB drives. This should
>> allow for redundancy (3 copies), although I would suppose we would add
>> bigger drives as we go on.
>> Since there is such a lot of data to index (not outrageous amounts for
>> these days, but a bit chunky), I was sort of assuming that the Lucene
>> indexes would go on the object storage solution too, to handle
>> availability and other infrastructure issues. Most of the searches
>> would be date-constrained, so I thought that the indexes could be sharded by date.
>> There would be a local disk index being built near real time on the
>> JCR hardware that could be regularly merged in with the main indexes
>> on the object storage, I suppose.
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you
>> can no doubt tell.
>> I came across a piece that was talking about Hadoop and distributed
>> Solr, and I'm now wondering if that would be a superior approach? Or any
>> other suggestions?
>> Many Thanks,
>> The Captn
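
The two ideas quoted above (indexes sharded by date, and searching several indexes and then merging the results) can be sketched together as follows. This is plain Python with in-memory lists standing in for indexes, not Lucene API; the `Shard` class, `search_shards` function, and the hit tuple layout are hypothetical names made up for illustration. In Lucene itself, a class such as `MultiReader` can present several indexes as one, but how that would behave on object storage at this scale is not something this sketch addresses.

```python
import heapq
from datetime import date


class Shard:
    """Hypothetical stand-in for one date-bounded index shard."""

    def __init__(self, start: date, end: date, hits):
        self.start = start  # first day covered (inclusive)
        self.end = end      # last day covered (inclusive)
        self.hits = hits    # list of (score, doc_id, doc_date) tuples

    def search(self, lo: date, hi: date, k: int):
        """Return this shard's k best-scoring hits dated within [lo, hi]."""
        matching = [h for h in self.hits if lo <= h[2] <= hi]
        return heapq.nlargest(k, matching, key=lambda h: h[0])


def search_shards(shards, lo: date, hi: date, k: int):
    """Query only the shards overlapping [lo, hi], then merge their top-k lists.

    Returns (merged_hits, number_of_shards_queried).
    """
    # A date-constrained query can skip every shard outside its range.
    relevant = [s for s in shards if s.start <= hi and s.end >= lo]
    partials = [s.search(lo, hi, k) for s in relevant]
    # Each shard returns at most k hits, so the merge step stays small.
    merged = heapq.nlargest(
        k, (h for p in partials for h in p), key=lambda h: h[0]
    )
    return merged, len(relevant)
```

The point of sharding by date is visible in the `relevant` filter: a query constrained to a few weeks touches a handful of shards rather than all of them, which also limits exposure to slow shards.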