incubator-blur-dev mailing list archives

From Colton McInroy <>
Subject Re: Creating/Managing tables saying table already exists
Date Wed, 02 Oct 2013 21:52:59 GMT
I have tried all of those systems, and with the amount of data we deal 
with, they don't stand up very well to what we need done. I have already 
designed two log systems, and this will hopefully be my third and final 
implementation, this time a distributed version of my previous work.

It does take time to develop things, but for the amount of time I would 
spend developing with this, it's well worth it. I have tried other free 
systems out there like ElasticSearch and the like, but they all fall 
short. My software, running on a dual Intel Xeon W5580 @ 3.2GHz with 
32 HDDs in hardware RAID (5 or 6, I forget which), can process about 
10,000-20,000 entries per second while putting the server load at about 
2.00-5.00 over 5 minutes and about 50-70% disk utilization. When using 
other systems like ElasticSearch, the server hits a load average of over 
50.00 and 100% disk utilization, which makes most of those solutions 
inadequate for what I am trying to do.

I also need something where I can adjust the log formats on the fly, 
adding new ones, detecting entries that match no known pattern, etc. Splunk is 
the only thing I have seen which can properly handle the amount of log 
data I am dealing with in terms of both indexing and searching, but it 
is just extremely expensive. I have no idea how they can get away with 
such a ridiculous pricing scheme. I have talked to their developers at 
conferences and they agreed the prices are just crazy. For the price of 
just one year's subscription fee, I can and will pay myself to build 
something that will do roughly the same job, as well as buy the hardware 
to start the clusters.
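The "adjust log formats on the fly" idea above can be sketched as an ordered registry of named patterns with an unparsed fallback; this is a minimal illustration, not anyone's actual code, and all names and fields here are invented:

```python
import re

class LogParser:
    """Ordered registry of named regex patterns; unmatched lines fall
    back to a raw 'message' record so no data is dropped."""

    def __init__(self):
        self.patterns = []  # (name, compiled_regex), tried in order

    def add_pattern(self, name, regex):
        # Patterns can be registered at runtime, so a new log format
        # takes effect without restarting the daemon.
        self.patterns.append((name, re.compile(regex)))

    def parse(self, line):
        for name, rx in self.patterns:
            m = rx.match(line)
            if m:
                return {"format": name, **m.groupdict()}
        # No known format: keep the whole line so it can be re-parsed
        # later once a pattern exists for it.
        return {"format": "unknown", "message": line}
```

The fallback record is what makes later re-parsing possible: nothing is thrown away, only left unstructured until a pattern is added.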

Colton McInroy

  * Director of Security Engineering

(Toll Free) 	
_US_ 	(888)-818-1344 Press 2
_UK_ 	0-800-635-0551 Press 2

My Extension 	101
24/7 Support <>
Email <>

On 10/1/2013 6:59 PM, Otis Gospodnetic wrote:
> Hi,
> I'm joining this very informative thread a little late. Splunk is
> pricey.  But why not go for Loggly or something along those lines?
> Building a good, scalable, fast logging system with more than just
> basic features and with a nice UI takes time, and time is priceless.
> If it's of interest, we just released Logsene internally here at
> Sematext ( ) and have Kibana running on
> top of it.  May be something to consider if you want a logging system
> todayish.
> Back to  observer mode...
> Otis
> --
> Solr & ElasticSearch Support --
> Performance Monitoring --
> On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <> wrote:
>> Thanks for the info, appreciate it.
>> Yes I have looked at splunk... for the amount of data I am dealing with,
>> they want $500,000 outright, with a $200,000 yearly subscription. As we
>> scale up, the price goes up as well; their pricing scheme is absolutely
>> nuts. I figure for less than the outright cost, I could build myself a
>> hadoop cluster and a blur cluster, and build the software myself to do
>> just as good a job, if not better, than splunk does. That's why I am
>> working on this project right now. I made something with lucene itself,
>> but to ensure redundancy and scalability, I wanted to go with something
>> distributed. Before I started developing my own project from scratch, I
>> took a look at what was already out there for merging hadoop and lucene,
>> which is what led me to blur. I am hoping that using blur with hadoop
>> will allow me to manage the large amount of log data that I have
>> constantly flowing in.
>> Right now it's all on one server that has over 100 TB of disk space,
>> lots of ram and cores, but as I watch the load of the system right now,
>> I realize that at some point, the single server just isn't going to cut
>> it. At some point the level of data will go above what any single
>> hardware box can do. The data being log entries, my goal is to be able
>> to store/index all log data that comes in real time and make it easily
>> searchable, while using facet data to obtain metrics and being
>> distributed across a redundant infrastructure.
>> My plan is to use this information to correlate events that occur across
>> large timespans, like seeing how traffic levels during this month last
>> year compare to the same month this year, or seeing what servers an IP
>> has accessed over the past year, etc.
>> From what I see so far, I will be building a hadoop cluster to store the
>> data, a blur cluster to process the data, and then making a parser which
>> takes in data in various formats and passes it off to blur. Then I will
>> have a client which handles the search queries against it... which
>> actually brings up another question... If I parse data one way, but then
>> craft a new parser, how well does blur handle changing records?... Like
>> say I do not have a parser that handles a particular log entry, so that
>> line ends up being logged with its contents stored in a single message
>> field. But then later I figure out a way to parse that line into custom
>> fields. Does the mutation system work well when manipulating a lot of
>> records... like say going over a month's, or even a year's, worth of
>> entries matching a certain query?
>> Thanks,
>> Colton McInroy
>> On 9/28/2013 5:29 AM, Garrett Barton wrote:
>>> The going back over a certain time was just a suggestion based on a
>>> guess of your query needs.  Personally I would go for one large index
>>> to begin with; if performance ever became a problem and I could not add
>>> a few more nodes to the cluster, I would then consider splitting the
>>> index.
>>> In the next month or so I will be prototyping 2 very interesting
>>> indexes. One will be a massive, massive full text index that I plan on
>>> bulk loading with MR runs every hour. The other has to be able to load
>>> several TB/hour and I will be trying that on a much shorter MR
>>> schedule, say every 5-10 minutes. I expect both to work fine.
>>> I don't think it's that far of a stretch for you to go to the minute
>>> level like you have today with the MR approach, or hell, try the thrift
>>> api; with enough nodes I bet it would handle that kind of load as well.
>>> Just slightly off topic, have you looked at splunk?  It does what
>>> you're trying to do out of the box.
>>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy
>>> <>wrote:
>>>> I actually didn't have any kind of loading interval; I loaded the new
>>>> event log entries into the index in real time. My code runs as a
>>>> daemon accepting syslog entries, indexing them live as they come in,
>>>> with a flush call every 10,000 entries or 1 minute, whichever comes
>>>> first.
>>>> And I don't want to have any limitation on lookback time. I want to be
>>>> able to look at the history of any site going back years if need be.
>>>> Sucks that there is no multi table reader; that limits what I can do a
>>>> bit.
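The flush rule described above (every 10,000 entries or 1 minute, whichever comes first) can be captured in a small policy object. A sketch, not the poster's actual daemon code; names are illustrative:

```python
import time

class FlushPolicy:
    """Trigger a flush after max_docs additions or max_secs elapsed,
    whichever comes first (10,000 docs / 60 s in the thread above)."""

    def __init__(self, max_docs=10000, max_secs=60, clock=time.monotonic):
        self.max_docs = max_docs
        self.max_secs = max_secs
        self.clock = clock          # injectable for testing
        self.pending = 0
        self.last_flush = clock()

    def record_added(self):
        # Call once per indexed entry; returns True when it is time to flush.
        self.pending += 1
        return self.should_flush()

    def should_flush(self):
        return (self.pending >= self.max_docs or
                self.clock() - self.last_flush >= self.max_secs)

    def flushed(self):
        # Call after the index writer's commit succeeds.
        self.pending = 0
        self.last_flush = self.clock()
```

A timer thread (or the syslog accept loop itself) would also poll `should_flush()` so the time threshold fires even when no new entries arrive.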
>>>> Thanks,
>>>> Colton McInroy
>>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>>> Mapreduce is a bulk entrypoint to loading blur. Much in the same way
>>>>> I bet you have some fancy code to grab up a bunch of log files over
>>>>> some kind of interval and load them into your index, MR replaces that
>>>>> process with an auto scaling (via hardware additions only) high
>>>>> bandwidth load that you could fire off at any interval you want. The
>>>>> MR bulk load writes a new index and merges that into the index
>>>>> already running when it completes. The catch is that it is NOT as
>>>>> efficient as your implementation is in terms of latency into the
>>>>> index. So where your current impl will load a small site's couple of
>>>>> mb real fast, MR might take 30 seconds to a minute to bring that
>>>>> online. Having said that, blur has a realtime api for inserting that
>>>>> has low latency, but you trade in your high bandwidth for it. Might
>>>>> be something you could detect on your front door and decide which way
>>>>> in the data comes.
>>>>> When I was in your shoes, highly optimizing your indexes based on
>>>>> size and load for a single badass machine and doing manual
>>>>> partitioning tricks to keep things snappy was key.  The neat thing
>>>>> about blur is some of that you don't do anymore.  I would call it an
>>>>> early optimization at this point to do anything shorter than say a
>>>>> day or whatever your max lookback time is.
>>>>> (Oh btw, you can't search across tables in blur, forgot to mention
>>>>> that.)
>>>>> Instead of the lots-of-tables route I suggest trying one large one
>>>>> and seeing where that goes. Utilize blur's cache initializing
>>>>> capabilities and load in your site and time columns to keep your
>>>>> logical partitioning columns in the block cache and thus very fast. I
>>>>> bet you will see good performance with this approach. Certainly
>>>>> better than es. Not as fast as raw lucene, but there is always a
>>>>> price to pay for distributing, and so far blur has the lowest
>>>>> overhead I've seen.
>>>>> Hope that helps some.
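Garrett's "front door" suggestion, deciding per batch between the low-latency realtime API and the high-bandwidth bulk load, could look like the sketch below. The thresholds and path names are invented for illustration, not anything Blur prescribes:

```python
def choose_ingest_path(batch_docs, batch_mb,
                       nrt_max_docs=50_000, nrt_max_mb=64):
    """Front-door routing sketch: small, latency-sensitive batches go
    through the realtime (thrift) insert API; large backfills go
    through the bulk MapReduce load. Thresholds are illustrative."""
    if batch_docs <= nrt_max_docs and batch_mb <= nrt_max_mb:
        return "realtime"   # low latency, lower throughput
    return "bulk-mr"        # high throughput, minutes of latency
```

In practice the thresholds would be tuned from observed cluster behavior: the realtime path keeps per-entry latency low, while anything big enough to hurt it gets deferred to the next MR run.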
>>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <> wrote:
>>>>>> Comments inline...
>>>>>> Thanks,
>>>>>> Colton McInroy
>>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>>> I have commented inline below:
>>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <> wrote:
>>>>>>>>     I do have a few questions if you don't mind... I am trying to
>>>>>>>> wrap my head around how this works. In my current implementation
>>>>>>>> of a logging system, I create new indexes for each hour because I
>>>>>>>> have a massive amount of data coming in. I take in live log data
>>>>>>>> from syslog and parse/store it in hourly lucene indexes along with
>>>>>>>> a facet index. I want to turn this into a distributed redundant
>>>>>>>> system, and blur appears to be the way to go. I tried
>>>>>>>> elasticsearch but it is just too slow compared to my current
>>>>>>>> implementation. Given I take in gigs of raw log data an hour, I
>>>>>>>> need something that is robust and able to keep up with the inflow
>>>>>>>> of data.
>>>>>>> Due to the current implementation of building up an index for an
>>>>>>> hour and then making it available, I would use MapReduce for this
>>>>>>> (see using-blur.html#map-reduce in the 0.2.0 docs). That way all
>>>>>>> the shards in a table get a little more data each hour and it's
>>>>>>> very low impact on the running cluster.
>>>>>> Not sure I understand this. I would like data to be accessible as it
>>>>>> comes in, not wait an hour before I can query against it.
>>>>>> I am also not sure where map-reduce comes in here. I thought
>>>>>> mapreduce is something that blur uses internally.
>>>>>>>>     When taking in lots of data constantly, how is it recommended
>>>>>>>> that it be stored? I mentioned above that I create a new index for
>>>>>>>> each hour to keep data separated and quicker to search. If I want
>>>>>>>> to look up a specific time frame, I only have to load the
>>>>>>>> directories timestamped with the hours I want to look at. So
>>>>>>>> instead of having to look at a huge index of like a year's worth
>>>>>>>> of data, I'm looking at a much smaller data set, which results in
>>>>>>>> faster query response times. Should a new table be created for
>>>>>>>> each hour of data? When I typed the create command into the shell,
>>>>>>>> it took about 6 seconds to create a table. If I have to create a
>>>>>>>> table for each application each hour, this could create a lot of
>>>>>>>> lag. Perhaps that is just in my test environment though. Any
>>>>>>>> thoughts on this? I also don't see any examples of how to create
>>>>>>>> tables via code.
>>>>>>> First off, Blur is designed to store very large amounts of data.
>>>>>>> And while it can do NRT updates like Solr and ES, its main focus is
>>>>>>> on ingestion through MapReduce.  Given that, the real limiting
>>>>>>> factor is how much hardware you have.  Let's play out a scenario.
>>>>>>> If you are adding a given amount of data an hour, I would think
>>>>>>> that a good rough ballpark guess is that you will need 10-15% of
>>>>>>> inbound data size as memory to make the search perform well.
>>>>>>> However, as the index sizes increase, this % may decrease over
>>>>>>> time.
>>>>>>> Blur has an off-heap lru cache to make accessing hdfs faster;
>>>>>>> however, if you don't have enough memory the searches (and the
>>>>>>> cluster for that matter) won't fail, they will simply become
>>>>>>> slower.
>>>>>>> So it's really a question of how much hardware you have, and of
>>>>>>> filling a table only to where it still performs well given the
>>>>>>> hardware you have.  You might have to break it into pieces.  But I
>>>>>>> think that hourly is too small.  Daily, Weekly, Monthly, etc.
>>>>>> In my current system (which uses just lucene) that I designed, we
>>>>>> take in mainly web logs and separate them into indexes. Each web
>>>>>> server gets its own index for each hour. Then when I need to query
>>>>>> the data, I use a multi index reader to access the timeframe I need,
>>>>>> allowing me to keep the size of the index down to roughly what I
>>>>>> need to search. If data was stored over a month, and I want to query
>>>>>> data that happened in just a single hour, or a few minutes, it makes
>>>>>> sense to me to keep things optimized. Also, if I wanted to compare
>>>>>> one web server to another, I would just use the multi index reader
>>>>>> to load both indexes. This is all handled by a single server though,
>>>>>> so it is limited by the hardware of the single server. If something
>>>>>> fails, it's a big problem. When trying to query large data sets, it
>>>>>> is again only a single server, so it takes longer than I would like
>>>>>> when the index it's reading is large.
>>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>>> imagining that each "table" is an index. So I would have a table
>>>>>> format like... YYYY_MM_DD_HH_IP. If I do this though, is there a way
>>>>>> to query multiple tables... like a multi table reader or something?
>>>>>> Or am I limited to looking at a single table at a time?
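A sketch of the YYYY_MM_DD_HH_IP naming idea above: with no multi table reader, a query over a time range would enumerate the hourly tables and hit each one separately, merging results client-side. Helper names here are hypothetical:

```python
from datetime import datetime, timedelta

def table_name(server_ip, when):
    # Mirrors the YYYY_MM_DD_HH_IP scheme proposed in the thread.
    return when.strftime("%Y_%m_%d_%H") + "_" + server_ip

def tables_for_range(server_ip, start, end):
    """Enumerate the hourly table names covering [start, end]; each
    would be queried separately and the results merged client-side."""
    names = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        names.append(table_name(server_ip, t))
        t += timedelta(hours=1)
    return names
```

A year-long lookback under this scheme means some 8,760 tables per server, which is exactly why the thread steers toward one large table with time/site columns instead.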
>>>>>> For some web servers that have little traffic, an hour of data may
>>>>>> only have a few MB in it, while others may have like a 5-10GB index.
>>>>>> If I combined the index from a large site with the small sites',
>>>>>> this would make everything slower for queries against the small
>>>>>> sites' index, correct? Or would it all be the same due to how blur
>>>>>> separates indexes into shards? Would it perhaps be better to have an
>>>>>> index for each web server, and configure small sites to have fewer
>>>>>> shards while larger sites have more shards?
>>>>>> We just got a really large, powerful new server to be our log
>>>>>> server, but as I realize that it's a single point of failure, I want
>>>>>> to change the configuration to use a clustered/distributed setup. So
>>>>>> we would start with probably a minimal configuration, and start
>>>>>> adding more shard servers whenever we can afford it or need it.
>>>>>>>>     Do shards contain the index data while the location (hdfs)
>>>>>>>> contains the documents (what lucene referred to them as)? I read
>>>>>>>> that the shard contains the index while the fs contains the
>>>>>>>> data... I just wasn't quite sure what the data was, because when I
>>>>>>>> work with lucene, the index directory contains the data as a
>>>>>>>> document.
>>>>>>> The shard is stored in HDFS, and it is a Lucene index.  We store
>>>>>>> the data inside the Lucene index, so it's basically Lucene all the
>>>>>>> way down to HDFS.
>>>>>> Ok, so basically a controller is a service which connects to all (or
>>>>>> some?) shards for a distributed query; it tells each shard to run a
>>>>>> query against a certain data set, that shard then gets the data
>>>>>> either from memory or from the hadoop cluster, processes it, and
>>>>>> returns the results to the controller, which condenses the results
>>>>>> from all the queried shards into a final result, right?
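The controller/shard flow described above is a scatter-gather loop, which can be sketched as follows; `search_shard` is a hypothetical stand-in for the per-shard RPC, and shards are modeled as plain lists of hits:

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query):
    # Hypothetical per-shard search: in a real cluster this would be an
    # RPC to a shard server, which reads its Lucene index (from HDFS or
    # the block cache) and returns scored hits.
    return [hit for hit in shard if query in hit["text"]]

def controller_search(shards, query, top_n=10):
    """Fan the query out to every shard, then condense the partial
    results into one final ranked list, as the thread describes."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, query), shards)
    merged = [hit for part in partials for hit in part]
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:top_n]
```

The key property is that each shard does its own I/O and scoring in parallel; the controller only merges, so adding shard servers scales the expensive part.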
>>>>>>> Hope this helps.  Let us know if you have more questions.
>>>>>>> Thanks,
>>>>>>> Aaron
>>>>>>>> Thanks,
>>>>>>>> Colton McInroy
