incubator-blur-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Colton McInroy <>
Subject Possible Limitations?
Date Fri, 18 Oct 2013 19:25:37 GMT
Hey Guys,

     I am just wondering what kind of limitations/draw backs I may run into.

     Currently I am adding a row (UUID) with two records (1 and 2) for 
every syslog event. I pool together the mutates and so far it appears to 
be working, although there may be further optimizations I can do. My 
original plan was to make a table with the application and date stamp 
then query across the tables, but i was informed by Aaron that it is 
currently not possible, although that's something that is being looked 
into being added (any idea how long Aaron?). Instead as per Aarons 
suggestion, I made a table with just the application name with the afore 
mentioned row/records. My inquery is about what kind of things I can 
possibly expect or do to avoid future problems... Such as...
Is there a limit on the number of rows in a table?
If I am going to be putting years worth of access logs for all the sites 
I am logging, it's possible that I could have billions/trillions+ of 
rows in a table. If I do have that many of rows, will the queries take 
an extremely long amount of time?
Is this based off of my shard count (shart count per table or number of 
shard servers)?
If I know a table is going to be very large, should I set a really 
larger shard count, or does that have any bearing after a certain number?
It was my understanding that lucene only supports about 274 billion 
unique terms per index, from my experience so far, each shard (directory 
not shard server... kinda confusing between the two when trying to 
explain... any plans on changing the name of the shard server to 
something else?) appears to be an index, so if that is correct, that 
would be 274*shard count which would be about 2.7 trillion for 10 
shards. Is that correct?

If querying across tables is going to be coming sometime soon, should I 
wait and separate the tables with date stamps to keep the indexes 
smaller, or will using a single table with lots of shards mean I can 
plow through the data without a problem?
If my understanding is correct, and the shard count does separate and 
index into the number of shard counts, then where I would do previous 1 
index for each year/month/day/hour setting the shard count to 365*24 
would be the same number of indexes I currently create on the fly for 
one year (actually, I do 365*24*SITECOUNT). If I set the number of 
shards to 1000 would that be a optimal number for well... 
1000*274billion = 274quadrillion entries.
Again, if this is all correct, for that scenario, I can see the following...
274,000,000,000,000 / 365 / 24 / 60 / 60 = 8,712,353 rows per second for 
one year until full
Unless it is per record, which I insert 2 for each row, then it would be....
274,000,000,000,000 / 365 / 24 / 60 / 60 / 2 = 4,356,176 records per 
second for one year until full

4 million events per second is above what I am dealing with right now. 
But this math should help people be able to figure out what settings for 
what kind of scale... again, if this is all correct, I'll need someone 
like Aaron to chime in. I end up with this calculation...

int shardCount = 10;
int recordsPerRow = 2;
int recordsPerSecond = 10000;
int luceneTermIndexInterval = 128; // Default
int yearsUntilFull = (Integer.MAX_VALUE * luceneTermIndexInterval * 
shardCount) / 365 / 24 / 60 / 60 / recordsPerRow / recordsPerSecond; // 
4.36 years until full

I just kinda threw this out there, may not work in java but I am pretty 
sure my math is right (haven't been drinking... too much), I just did it 
out of my head quickly to demonstrate. In that example, it's 4.36 years 
for 10 shards with 10,000 sustained entries per second. Add another 0 to 
the number of shards then it's 100,000 sustained entries per second for 
4.36 years until full.

If I am correct on all of this, what kind of effect does having a large 
shard count (1000+)? I notice that tables when creating them seem to 
take for ever in comparison to the file structure made. I've seen the 
create process take 10+ second, but in the background I look, and see 
the file structure created, check the disk space used while the create 
command is running, and there is no change. I've tried both in hdfs and 
localfs. I am doing all of this in a virtual set of sessions right now, 
so it may not be the best testing platform. I'm in the process of 
requisitioning the hardware though to start building the cluster at our 
data center. Before I start putting all of the money into it, I want to 
see what kind of limitations I may be up against if any.

It's been a long night for me so far, and still not over, I got meetings 
to go to soon. Hopefully I havent rabled on too much ;)

I have been trying to keep close with the git code so that I can deal 
with the changes as they come, as well as help where I can, although I 
still got a lot of code reading left to do on the blur project.

Colton McInroy

  * Director of Security Engineering

(Toll Free) 	
_US_ 	(888)-818-1344 Press 2
_UK_ 	0-800-635-0551 Press 2

My Extension 	101
24/7 Support <>
Email <>

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message