incubator-blur-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Colton McInroy <>
Subject Re: Possible Limitations?
Date Mon, 21 Oct 2013 01:56:20 GMT

Colton McInroy

  * Director of Security Engineering

(Toll Free) 	
_US_ 	(888)-818-1344 Press 2
_UK_ 	0-800-635-0551 Press 2

My Extension 	101
24/7 Support <>
Email <>

On 10/19/2013 6:38 AM, Aaron McCurry wrote:
> Colton,
> You always have great emails.  :-)
Thank you :)
> I will attempt to add some value beyond Garrett's email.
Thanks, he did a fairly good job at answering a chunk of them.
> Ok, so Lucene's limits are a little different than what you have described
> (I think).  Based on the
> limit is not per index but instead per segment within the index.  Plus by
> default Lucene will not merge segments that are over 5GB in size, so that
> provides a kind of ceiling for the size of segment.  Overall the biggest
> limit you will likely see is the 2 Billion document (Records in our case)
> limit.  I think a Lucene index could exceed that, but the counts are based
> on integers.  Given that I would think that you might want to calculate a
> top end of 500 million per index (I have run 400+ million in a single index
> in the past but keep things around 100 million now) so that you have plenty
> of room before you get into the unknown limits of a single Lucene index.
> At this point I think the biggest question mark in the system from want you
> have described is how many shard servers you may need to have.  I have
> tested Blur up to 200 shard servers in a single cluster and the controllers
> seem to do just fine.  But at some point there will likely need to be some
> work done to expand to 1000's of shard servers within a single cluster.
>   Where is that point, I'm not sure, but I think that there may likely need
> a second layer of controllers to reduce network hotspots.  But that's just
> a guess.
Hmmm... Ok, so to figure out the max, I basically just need to change 
this part of the calculation...

(Integer.MAX_VALUE * luceneTermIndexInterval * shardCount)


(Integer.MAX_VALUE * luceneTermIndexInterval * segmentsWithinIndex * shardCount)

At first, I will be starting off with probably the bare minimum hardware 
we can, but then expand from there. Typically we go with supermicro 1u 
xeon servers, in this situation though I understand hadoop/blur are 
meant to work on commodity hardware. What effect would having good/large 
servers on the same cluster as slower smaller servers? Would it still 
add to the performance, or would it cause some delays for any requests 
possibly hitting that server because it doesn't have the same amount of 
cpu power or memory?

My concern with this is that I have a constantly non-stop stream of new 
data coming in which has to get indexed as soon as possible so that it 
can be queried/displayed using various custom tools. The events per 
second coming in could be 10,000, 100,000, or even higher than that per 
second, perhaps 1,000,000. I currently have done as you suggested, 
separate it into only a few tables, rather than one every hour or 
something like I originally mentioned.
Now if there is this much data coming in non-stop for years, I need to 
try and figure out the problems I would run into. From what I can see, 
table shard counts can only be set once, so if you wanted to grow that 
table, I couldn't go from like 10 shards, to 100. Is this coming that 
will be possible in the future? And, if I have that much data coming in 
constantly, is using a single table ok, ideal, or would it be better to 
separate it up?, in which case, how often should I create new tables, 
and I would need the capability to query multiple tables at a time for 
it to work/make that feature myself/build my own post processor so 
aggregate them until the feature is added. At 10,000 entries (an entry 
consist of 1 row with two records) per second sustained, that's just shy 
of 14 hours until i've hit 500 million rows in a table (7 hours until 
500 million records). When doing 1,000,000 rows per second, that would 
less than just 10 minutes. Although 1,000,000 rows per second is fairly 
high, some day, I expect to hit that. For now I would expect anywhere 
from 10,000 to 100,000 rows per second, which is about 1-10 hours for 
500 million rows.
My data flow will not stop, and I need fast access to it... primarily 
the newest data, but historical data has some importance as well. I am 
in the DDoS mitigation business though, so I take on large DDoS attacks 
successfully and log all the requests to analyze them, which is why this 
is so important. For the short term data, I need to be able to see what 
is going on right now on in any of the logs, then perhaps compare that 
to past requests around that time of year or something.
Ideally using a single table for all information would obviously make it 
easy for coding purposes for me, but given the level of data, what would 
you recommend? 1 table per month with a 100 shard count? 1 table per 
year with 1000 a shard count, or perhaps just one table period with a 
1000 shard count and use that until it fills up, which hopefully isn't 
for a really long time... long enough time that the number of shards can 
be changed or something?

>> If I am correct on all of this, what kind of effect does having a large
>> shard count (1000+)? I notice that tables when creating them seem to take
>> for ever in comparison to the file structure made. I've seen the create
>> process take 10+ second, but in the background I look, and see the file
>> structure created, check the disk space used while the create command is
>> running, and there is no change. I've tried both in hdfs and localfs. I am
>> doing all of this in a virtual set of sessions right now, so it may not be
>> the best testing platform. I'm in the process of requisitioning the
>> hardware though to start building the cluster at our data center. Before I
>> start putting all of the money into it, I want to see what kind of
>> limitations I may be up against if any.
> The delay is likely because all 1000 of indexes are opened for writing and
> their nrt threads are started up and there is a lot of ZooKeeper logic to
> get everything into a known state.  Also since creating a table is not
> something I have optimized for speed, it's more of a get it right process
> instead of get it done fast.
So far, all the tables I have made have only a 10 shard count, so it 
shouldn't be that. Depending upon what you suggest above, I may need 
this process to be a bit more optimized. When dealing with big data that 
is constantly flowing at you, stopping for 10+ seconds to create an 
index can back up to millions of entries that need to be flushed that 
could cause all kinds of problems/delays. Especially if it takes about 
10+ seconds for just a 10 shard count... would that mean it takes 1000+ 
seconds for a 1000 shard count? I haven't tried anything othat than 10 
so far really.
> Aaron
>> It's been a long night for me so far, and still not over, I got meetings
>> to go to soon. Hopefully I havent rabled on too much ;)
>> I have been trying to keep close with the git code so that I can deal with
>> the changes as they come, as well as help where I can, although I still got
>> a lot of code reading left to do on the blur project.
>> --
>> Thanks,
>> Colton McInroy
>>   * Director of Security Engineering
>> Phone
>> (Toll Free)
>> _US_    (888)-818-1344 Press 2
>> _UK_    0-800-635-0551 Press 2
>> My Extension    101
>> 24/7 Support <>
>> Email <>
>> Website

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message