incubator-blur-dev mailing list archives

From: Garrett Barton <garrett.bar...@gmail.com>
Subject: Re: Possible Limitations?
Date: Fri, 18 Oct 2013 20:37:16 GMT
I don't know all the answers but I can answer a few of them:

*If I am going to be putting years' worth of access logs for all the sites I
am logging, it's possible that I could have billions/trillions+ of rows in
a table. If I do have that many rows, will the queries take an extremely
long amount of time?*
I run tens of billions of rows, with an average of ~12 records per row, across
~128 shards on a 16-node (as in 16 shard servers) cluster at the
moment.  Performance is sub-second to a few seconds in the worst case.

*Is this based off of my shard count (shard count per table or number of
shard servers)?*
It's a ratio of the number of shards and their size being served per
shard server.  It's different for everyone's workload, unfortunately, but you
can empirically load up a small cluster with data until the queries start
performing below what you're expecting.  In the end the shard server will use
a thread to hit every shard in a given table to answer a single query, so
if you have set things up to end at, say, ~100 shards per server, then there
are 100 threads off querying.  Somewhere there is a balance between the data
being too broken up, so threads block each other (unless you've got 100-core
machines? :D ), and too much data per shard for a thread to reasonably
consume. Also note that the number of threads available to the shard server
for doing the concurrent shard querying is configurable, and it's going to be
really low out of the box for what you're doing.
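
To make that ratio concrete, here's a rough back-of-the-envelope sketch in
Java; every number in it (cores, pool size, shard counts) is a made-up
placeholder, not a Blur default:

// Sketch only: how far one query fans out on a single shard server.
int totalShards = 100;          // shards in the table (hypothetical)
int shardServers = 10;          // shard servers in the cluster (hypothetical)
int coresPerServer = 8;         // physical cores per shard server (hypothetical)
int queryThreadPool = 16;       // the configurable per-server query thread count (hypothetical value)

int shardsPerServer = totalShards / shardServers;                 // 10
int threadsPerQuery = Math.min(shardsPerServer, queryThreadPool); // 10 threads hit 10 shards
boolean contended = threadsPerQuery > coresPerServer;             // true -> threads start blocking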

*If I know a table is going to be very large, should I set a really large
shard count, or does that have any bearing after a certain number?*
Probably. Unless you plan on rebuilding the table once in a while to grow it,
you will have to start with an artificially large shard count.  Target
where you want to be based on the testing above and your known hardware
growth, say a year out, and go with that. If you start the table with 100
shards and today you have 10 servers with 8 cores apiece, you get ~10
shards per shard server. That means today you'll be able to throw 8 fully
concurrent threads at each shard server (SS, getting tired of typing that),
with 2 doing double duty. Next week when you add 10 more servers you'll
have an underutilized cluster, since you will only be using 100 threads
concurrently in a cluster capable of 160.  But if you had started the
table out with 160 shards, your performance would grow with the
cluster.
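
Put another way, the sizing exercise above is roughly the following
(hypothetical numbers again, just to show the arithmetic):

// Sketch only: size the shard count for where the cluster is headed.
int serversToday = 10;
int coresPerServer = 8;
int serversExpectedNextYear = 20;     // known hardware growth (hypothetical)

int threadCapacityToday = serversToday * coresPerServer;               // 80
int threadCapacityNextYear = serversExpectedNextYear * coresPerServer; // 160

// A table created with 100 shards can never use more than 100 threads per
// query, so on the bigger cluster it leaves 60 "slots" idle; starting at
// 160 shards lets query parallelism grow with the hardware.
int shardCountSizedForGrowth = threadCapacityNextYear;                 // 160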

*It was my understanding that Lucene only supports about 274 billion unique
terms per index. From my experience so far, each shard (directory, not shard
server... kinda confusing between the two when trying to explain... any
plans on changing the name of the shard server to something else?) appears
to be an index, so if that is correct, that would be 274 billion * shard
count, which would be about 2.7 trillion for 10 shards. Is that correct?*
I did not know of that limitation; assuming it's true, then yes, that's
correct for 10 shards.  Now, as far as your record/row count goes, I think
it's really 2.7T / avgRowTermWidth or 2.7T / avgRecordTermWidth for your
actual numbers when calculating a full index.
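
So, roughly, the capacity estimate looks like this (the ~274 billion figure
is the per-index term limit you mentioned; the average term widths are
placeholders you'd measure from your own data):

// Sketch only: total unique-term capacity and what that means in rows/records.
long termsPerShard = 274000000000L;   // ~274B unique terms per Lucene index (per your figure)
int shardCount = 10;
long avgTermsPerRow = 200;            // hypothetical avgRowTermWidth, measure it
long avgTermsPerRecord = 100;         // hypothetical avgRecordTermWidth, measure it

long totalTermCapacity = termsPerShard * shardCount;          // ~2.74 trillion terms
long roughMaxRows = totalTermCapacity / avgTermsPerRow;       // ~13.7 billion rows
long roughMaxRecords = totalTermCapacity / avgTermsPerRecord; // ~27.4 billion records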


As far as large shard counts go, I will be building some tables with several
thousand shards in the coming weeks. I'm pretty sure Aaron is into the low
2,000s already as well.


I hope that tides you over a little till Aaron returns.

~Garrett


On Fri, Oct 18, 2013 at 3:25 PM, Colton McInroy <colton@dosarrest.com> wrote:

> Hey Guys,
>
>     I am just wondering what kind of limitations/drawbacks I may run into.
>
>     Currently I am adding a row (UUID) with two records (1 and 2) for
> every syslog event. I pool together the mutates, and so far it appears to be
> working, although there may be further optimizations I can do. My original
> plan was to make a table with the application and date stamp and then query
> across the tables, but I was informed by Aaron that it is currently not
> possible, although that's something that is being looked into being added
> (any idea how long, Aaron?). Instead, as per Aaron's suggestion, I made a
> table with just the application name with the aforementioned row/records.
> My inquiry is about what kind of things I can possibly expect or do to
> avoid future problems... Such as...
> Is there a limit on the number of rows in a table?
> If I am going to be putting years' worth of access logs for all the sites I
> am logging, it's possible that I could have billions/trillions+ of rows in
> a table. If I do have that many rows, will the queries take an extremely
> long amount of time?
> Is this based off of my shard count (shard count per table or number of
> shard servers)?
> If I know a table is going to be very large, should I set a really large
> shard count, or does that have any bearing after a certain number?
> It was my understanding that Lucene only supports about 274 billion unique
> terms per index. From my experience so far, each shard (directory, not shard
> server... kinda confusing between the two when trying to explain... any
> plans on changing the name of the shard server to something else?) appears
> to be an index, so if that is correct, that would be 274 billion * shard
> count, which would be about 2.7 trillion for 10 shards. Is that correct?
>
> If querying across tables is going to be coming sometime soon, should I
> wait and separate the tables with date stamps to keep the indexes smaller,
> or will using a single table with lots of shards mean I can plow through
> the data without a problem?
> If my understanding is correct, and the shard count does separate an
> index into that number of shards, then where I would previously do 1
> index for each year/month/day/hour, setting the shard count to 365*24 would
> give the same number of indexes I currently create on the fly for one year
> (actually, I do 365*24*SITECOUNT). If I set the number of shards to 1000,
> would that be an optimal number for, well... 1000 * 274 billion = 274
> trillion entries?
> Again, if this is all correct, for that scenario, I can see the
> following...
> 274,000,000,000,000 / 365 / 24 / 60 / 60 = 8,688,482 rows per second for
> one year until full
> Unless it is per record, of which I insert 2 for each row, then it would
> be...
> 274,000,000,000,000 / 365 / 24 / 60 / 60 / 2 = 4,344,241 records per
> second for one year until full
>
> 4 million events per second is above what I am dealing with right now. But
> this math should help people figure out what settings to use for what kind
> of scale... again, if this is all correct, I'll need someone like
> Aaron to chime in. I end up with this calculation...
>
> int shardCount = 10;
> int recordsPerRow = 2;
> int recordsPerSecond = 10000;
> int luceneTermIndexInterval = 128; // Default
> // cast to double so ~274 billion terms per shard doesn't overflow int math
> double yearsUntilFull = ((double) Integer.MAX_VALUE * luceneTermIndexInterval
>     * shardCount) / 365 / 24 / 60 / 60 / recordsPerRow / recordsPerSecond;
> // ~4.36 years until full
>
> I just kinda threw this out there; the Java may not be exact, but I am pretty
> sure my math is right (haven't been drinking... too much), I just did it
> in my head quickly to demonstrate. In that example, it's 4.36 years for
> 10 shards with 10,000 sustained entries per second. Add another 0 to the
> number of shards and it's 100,000 sustained entries per second for 4.36
> years until full.
>
> If I am correct on all of this, what kind of effect does having a large
> shard count (1000+) have? I notice that creating tables seems to take
> forever compared to how quickly the file structure is made. I've seen the
> create process take 10+ seconds, but when I look in the background I see the
> file structure already created, and checking the disk space used while the
> create command is running shows no change. I've tried both HDFS and localfs.
> I am doing all of this in a virtual set of sessions right now, so it may not
> be the best testing platform. I'm in the process of requisitioning the
> hardware, though, to start building the cluster at our data center. Before I
> start putting all of the money into it, I want to see what kind of
> limitations I may be up against, if any.
>
> It's been a long night for me so far, and it's still not over; I've got
> meetings to go to soon. Hopefully I haven't rambled on too much ;)
>
> I have been trying to keep close to the git code so that I can deal with
> the changes as they come, as well as help where I can, although I still have
> a lot of code reading left to do on the Blur project.
>
> --
> Thanks,
> Colton McInroy
>
>  * Director of Security Engineering
>
>
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
>
> My Extension    101
> 24/7 Support    support@dosarrest.com
> Email   colton@dosarrest.com
> Website         http://www.dosarrest.com
>
>
