incubator-blur-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: Creating/Managing tables saying table already exists
Date Wed, 02 Oct 2013 01:59:02 GMT

I'm joining this very informative thread a little late. Splunk is
pricey.  But why not go for Loggly or something along those lines?
Building a good, scalable, fast logging system with more than just
basic features and with a nice UI takes time, and time is priceless.
If it's of interest, we just released Logsene internally here at
Sematext ( ) and have Kibana running on
top of it.  May be something to consider if you want a logging system

Back to  observer mode...

Solr & ElasticSearch Support --
Performance Monitoring --

On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <> wrote:
> Thanks for the info, appreciate it.
> Yes I have looked at splunk... for the amount of data I am dealing with,
> they want $500,000 outright, with a $200,000 yearly subscription. As we
> scale up, the price goes up as well, their pricing scheme is absolutely
> nuts. I figure for less than the outright cost, I could build myself a
> hadoop cluster, a blur custer, and build the software myself that does just
> as good if not better than what splunk does. That's why I am working on this
> project right now. I made something with with lucene itself, but to ensure
> redundancy and scalability, I wanted to go with something distributed.
> Before I started developing my own project from scratch, I took a look at
> what was already out there for merging hadoop and lucene, which is what led
> me to blur. I am hoping that using blur with hadoop will allow me to manage
> the large amount of log data that I have constantly flowing in.
> Right now it's all on one server that has over 100tb of disk space, lots of
> ram and cores, but as I watch the load of the system right now, I realize
> that at some point, the single server just isn't going to cut it. At some
> point the level of data will go above what any single hardware box can do.
> Being the data is log entries, my goal is to be able to store/index all log
> data that comes in real time and make it easily searchable while using facet
> data to obtain metrics while being distributed across a redundant
> infrastructure.
> My plan is to use this information to correlate stuff that occurs across
> large timespans. Like seeing what traffic levels last year during this month
> where compared to the month this year. Or seeing what servers an IP has
> accessed over the past year, etc.
> From what I see so far, I will be building a hadoop cluster to store the
> data, a blur cluster to process the data, and then making a parser which
> takes in data with various formats to takes the data and passes it off to
> blur. Then I will have a client which handles the search queries against
> it... which actually brings up another question... If I parse data one way,
> but then craft a new parser, how well does blur handle changing records?...
> Like say I do not have a parser that handles a particular log entry. So that
> line ends up being logged as just a message field with the contents of the
> data stored in the message field. But then later I figure out a way to parse
> that line into custom fields. Does the mutation system work well for then
> when manipulating a lot of records... like say going over a month, or even a
> years worth of entries matching a certain query?
> Thanks,
> Colton McInroy
>  * Director of Security Engineering
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
> My Extension    101
> 24/7 Support <>
> Email <>
> Website
> On 9/28/2013 5:29 AM, Garrett Barton wrote:
>> The going back over a certain time was just a suggestion based on a guess
>> of your query needs.  Personally I would go for one large index to begin
>> with, if performance ever became a problem and I could not add a few more
>> nodes to the cluster, I would then consider splitting the index.
>>   In the next month or so I will be prototyping 2 very interesting
>> indexes.
>>   One will be a massive, massive full text index that I plan on using bulk
>> MR running every hour into.  The other has to be able to load several
>> TB/hour and I will be trying that on a much shorter MR schedule, say every
>> 5-10 minutes. I expect both to work fine.
>>    I don't think its that far of a stretch for you to go to the minute
>> level
>> like you have today with the MR approach, or hell try the thrift api, with
>> enough nodes I bet it would handle that kinda load as well.
>> Just slightly off topic, have you looked at splunk?  Does what your trying
>> to do out of the box.
>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy
>> <>wrote:
>>> I actually didn't have any kind of loading interval, I loaded the new
>>> event log entries into the index in real time. My code runs as a daemon
>>> accepting syslog entries which indexes them live as they come in with a
>>> flush call every 10000 entries or 1 minute, which ever comes first.
>>> And I don't want to have any limitation on lookback time. I want to be
>>> able to look at the history of any site going back years if need be.
>>> Sucks there is no multi table reader, that limits what I can do by a bit.
>>> Thanks,
>>> Colton McInroy
>>>   * Director of Security Engineering
>>> Phone
>>> (Toll Free)
>>> _US_    (888)-818-1344 Press 2
>>> _UK_    0-800-635-0551 Press 2
>>> My Extension    101
>>> 24/7 Support <>
>>> Email <>
>>> Website
>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>> Mapreduce is a bulk entrypoint to loading blur. Much in the same way I
>>>> bet
>>>> you have some fancy code to grab up a bunch of log files, over some kind
>>>> of
>>>> interval and load them into your index,  MR replaces that process with
>>>> an
>>>> auto scaling (via hardware additions only) high bandwidth load that you
>>>> could fire off at any interval you want. The MR bulk load writes a new
>>>> index and merges that into the index already running when it completes.
>>>> The
>>>> catch is that it is NOT as efficient as your implementation is in terms
>>>> of
>>>> latency into the index. So your current impl that will load a small
>>>> sites
>>>> couple of mb real fast, MR might take 30 seconds to a minute to bring
>>>> that
>>>> online. Having said that blur has a realtime api for inserting that has
>>>> low
>>>> latency but you trade in your high bandwidth for it. Might be something
>>>> you
>>>> could detect on your front door and decide which way in the data comes.
>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>> and
>>>> load for a single badass machine and doing manual partitioning tricks to
>>>> keep things snappy was key.  The neat thing about blur is some of that
>>>> you
>>>> don't do anymore.  I would call it an early optimization at this point
>>>> to
>>>> do anything shorter than say a day or whatever your max lookback time
>>>> is.
>>>> (Oh btw you can't search across tables in blur, forgot to mention that.)
>>>> Instead of the lots of tables route I suggest trying one large one and
>>>> seeing where that goes. Utilize blurs cache initializing capabilities
>>>> and
>>>> load in your site and time columns to keep your logical partitioning
>>>> columns in the block cache and thus very fast. I bet you will see good
>>>> performance with this approach. Certainly better than es. Not as fast as
>>>> raw lucene, but there is always a price to pay for distributing and so
>>>> far
>>>> blur has the lowest overhead I've seen.
>>>> Hope that helps some.
>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <> wrote:
>>>>   Coments inline...
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>    * Director of Security Engineering
>>>>> Phone
>>>>> (Toll Free)
>>>>> _US_    (888)-818-1344 Press 2
>>>>> _UK_    0-800-635-0551 Press 2
>>>>> My Extension    101
>>>>> 24/7 Support <>
>>>>> Email <>
>>>>> Website
>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>   I have commented inline below:
>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <
>>>>>>> wrote:
>>>>>>>        I do have a few question if you don't mind... I am still
>>>>>>> trying
>>>>>>> to
>>>>>>> wrap my head around how this works. In my current implementation
>>>>>>> a
>>>>>>> logging system I create new indexes for each hour because I have
>>>>>>> massive
>>>>>>> amount of data coming in. I take in live log data from syslog
>>>>>>> parse/store it in hourly lucene indexes along with a facet index.
>>>>>>> want
>>>>>>> to
>>>>>>> turn this into a distributed redundant system and blur appears
to be
>>>>>>> the
>>>>>>> way to go. I tried elasticsearch but it is just too slow compared
>>>>>>> my
>>>>>>> current implementation. Given I take in gigs of raw log data
an hour,
>>>>>>> I
>>>>>>> need something that is robust and able to keep up with in flow
>>>>>>> data.
>>>>>>>    Due to the current implementation of building up an index
for an
>>>>>>> hour
>>>>>> and
>>>>>> then making available.  I would use MapReduce for this:
>>>>>> html#map-reduce<http://****
>>>>>> docs/0.2.0/using-blur.html#**map-reduce<>
>>>>>> That way all the shards in a table get a little more data each hour
>>>>>> and
>>>>>> it's very low impact on the running cluster.
>>>>>>   Not sure I understand this. I would like data to be accessible
>>>>>> as
>>>>> it
>>>>> comes in, not wait an hour before I can query against it.
>>>>> I am also not sure where map-reduce comes in here. I thought mapreduce
>>>>> is
>>>>> something that blur used internally.
>>>>>         When taking in lots of data constantly, how is it recommended
>>>>>> that it
>>>>>>> be stored? I mentioned above that I create a new index for each
>>>>>>> to
>>>>>>> keep data separated and quicker to search. If I want to look
up a
>>>>>>> specific
>>>>>>> time frame, I only have to load the directories timestamped with
>>>>>>> hours
>>>>>>> I want to look at. So instead of having to look at a huge index
>>>>>>> like a
>>>>>>> years worth of data, i'm looking at a much smaller data set which
>>>>>>> results
>>>>>>> in faster query response times. Should a new table be created
>>>>>>> each
>>>>>>> hour
>>>>>>> of data? When I typed in the create command into the shell, it
>>>>>>> about
>>>>>>> 6 seconds to create a table. If I have to create a table for
>>>>>>> application each hour, this could create a lot of lag. Perhaps
>>>>>>> is
>>>>>>> just
>>>>>>> in my test environment though. Any thoughts on this? I also didn't
>>>>>>> see
>>>>>>> any
>>>>>>> examples of how to create tables via code.
>>>>>>>    First off Blur is designed to store very large amounts of
>>>>>>> And
>>>>>> while
>>>>>> it can do NRT updates like Solr and ES it's main focus in on bulk
>>>>>> ingestion
>>>>>> through MapReduce.  Given that, the real limiting factor is how much
>>>>>> hardware you have.  Let's play out a scenario.  If you are adding
>>>>>> of
>>>>>> data an hour and I would think that a good rough ballpark guess is
>>>>>> that
>>>>>> you
>>>>>> will need 10-15% of inbound data size as memory to make the search
>>>>>> perform
>>>>>> well.  However as the index sizes increase this % may decrease over
>>>>>> time.
>>>>>>     Blur has an off-heap lru cache to make accessing hdfs faster,
>>>>>> however if
>>>>>> you don't have enough memory the searches (and the cluster for that
>>>>>> matter)
>>>>>> won't fail, they will simply become slower.
>>>>>> So it's really a question of how much hardware you have.  If you
>>>>>> filling a table enough to where it does perform well given the cluster
>>>>>> you
>>>>>> have.  You might have to break it into pieces.  But I think that
>>>>>> hourly
>>>>>> is
>>>>>> too small.  Daily, Weekly, Monthly, etc.
>>>>>>   In my current system (which uses just lucene) I designed we take
>>>>> mainly
>>>>> web logs and separate them into indexes. Each web server gets it's own
>>>>> index for each hour. Then when I need to query the data, I use a multi
>>>>> index reader to access the timeframe I need allowing me to keep the
>>>>> size
>>>>> of
>>>>> index down to roughly what I need to search. If data was stored over
>>>>> month, and I want to query data that happened in just a single hour,
>>>>> a
>>>>> few minutes, it makes sense to me to keep things optimized. Also, if
>>>>> wanted to compare one web server to another, I would just use the multi
>>>>> index reader to load both indexes. This is all handled by a single
>>>>> server
>>>>> though, so it is limited by the hardware of the single server. If
>>>>> something
>>>>> fails, it's a big problem. When trying to query large data sets, it's
>>>>> again, only a single server, so it takes longer than I would like if
>>>>> the
>>>>> index it's reading is large.
>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>> imagining
>>>>> that each "table" is an index. So I would have a table format like...
>>>>> YYYY_MM_DD_HH_IP. If I do this though, is there a way to query multiple
>>>>> tables... like a milti table reader or something? or am I limited to
>>>>> looking at a single table at a time?
>>>>> For some web servers that have little traffic, an hour of data may only
>>>>> have a few mb of data in it while other may have like a 5-10gb index.
>>>>> If
>>>>> I
>>>>> combined the index from a large site with the small sites, this should
>>>>> make
>>>>> everything slower for the queries against the small sites index
>>>>> correct?
>>>>> Or
>>>>> would it all be the same due to how blur separates indexes into shards?
>>>>> Would it perhaps be better to have an index for each web server, and
>>>>> configure small sites to have less shards while larger sites have more
>>>>> shards?
>>>>> We just got a new really large powerful server to be our log server,
>>>>> but
>>>>> as I realize that it's a single point of failure, I want to change our
>>>>> configuration to use a clustered/distributed configuration. So we would
>>>>> start with probably a minimal configuration, and start adding more
>>>>> shard
>>>>> servers when ever we can afford it or need it.
>>>>>         Do shards contain the index data while the location (hdfs)
>>>>>> contains
>>>>>>> the documents (what lucene referred to them as)? I read that
>>>>>>> shard
>>>>>>> contains the index while the fs contains the data... I just wasn't
>>>>>>> quiet
>>>>>>> sure what the data was, because when I work with lucene, the
>>>>>>> directory contains the data as a document.
>>>>>>>   The shard is stored in HDFS, and it is a Lucene index.  We
>>>>>>> the
>>>>>> data
>>>>>> inside the Lucene index, so it's basically Lucene all the way down
>>>>>> HDFS.
>>>>>>   Ok, so basically a controller is a service which connects to all
>>>>> some?) shards a distributed query, which tells the shard to run a query
>>>>> against a certain data set, that shard then gets that data set either
>>>>> from
>>>>> memory or from the hadoop cluster, processes it, and returns the result
>>>>> to
>>>>> the controller which condenses the results from all the queried shards
>>>>> into
>>>>> a final result right?
>>>>>   Hope this helps.  Let us know if you have more questions.
>>>>>> Thanks,
>>>>>> Aaron
>>>>>>   Thanks,
>>>>>>> Colton McInroy
>>>>>>>     * Director of Security Engineering
>>>>>>> Phone
>>>>>>> (Toll Free)
>>>>>>> _US_    (888)-818-1344 Press 2
>>>>>>> _UK_    0-800-635-0551 Press 2
>>>>>>> My Extension    101
>>>>>>> 24/7 Support <>
>>>>>>> Email <>
>>>>>>> Website

View raw message