incubator-blur-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Colton McInroy <>
Subject Re: Creating/Managing tables saying table already exists
Date Sat, 28 Sep 2013 09:59:48 GMT
I actually didn't have any kind of loading interval, I loaded the new 
event log entries into the index in real time. My code runs as a daemon 
accepting syslog entries which indexes them live as they come in with a 
flush call every 10000 entries or 1 minute, which ever comes first.

And I don't want to have any limitation on lookback time. I want to be 
able to look at the history of any site going back years if need be.

Sucks there is no multi table reader, that limits what I can do by a bit.

Colton McInroy

  * Director of Security Engineering

(Toll Free) 	
_US_ 	(888)-818-1344 Press 2
_UK_ 	0-800-635-0551 Press 2

My Extension 	101
24/7 Support <>
Email <>

On 9/28/2013 12:48 AM, Garrett Barton wrote:
> Mapreduce is a bulk entrypoint to loading blur. Much in the same way I bet
> you have some fancy code to grab up a bunch of log files, over some kind of
> interval and load them into your index,  MR replaces that process with an
> auto scaling (via hardware additions only) high bandwidth load that you
> could fire off at any interval you want. The MR bulk load writes a new
> index and merges that into the index already running when it completes. The
> catch is that it is NOT as efficient as your implementation is in terms of
> latency into the index. So your current impl that will load a small sites
> couple of mb real fast, MR might take 30 seconds to a minute to bring that
> online. Having said that blur has a realtime api for inserting that has low
> latency but you trade in your high bandwidth for it. Might be something you
> could detect on your front door and decide which way in the data comes.
> When I was in your shoes, highly optimizing your indexes based on size and
> load for a single badass machine and doing manual partitioning tricks to
> keep things snappy was key.  The neat thing about blur is some of that you
> don't do anymore.  I would call it an early optimization at this point to
> do anything shorter than say a day or whatever your max lookback time is.
> (Oh btw you can't search across tables in blur, forgot to mention that.)
> Instead of the lots of tables route I suggest trying one large one and
> seeing where that goes. Utilize blurs cache initializing capabilities and
> load in your site and time columns to keep your logical partitioning
> columns in the block cache and thus very fast. I bet you will see good
> performance with this approach. Certainly better than es. Not as fast as
> raw lucene, but there is always a price to pay for distributing and so far
> blur has the lowest overhead I've seen.
> Hope that helps some.
> On Sep 27, 2013 11:31 PM, "Colton McInroy" <> wrote:
>> Coments inline...
>> Thanks,
>> Colton McInroy
>>   * Director of Security Engineering
>> Phone
>> (Toll Free)
>> _US_    (888)-818-1344 Press 2
>> _UK_    0-800-635-0551 Press 2
>> My Extension    101
>> 24/7 Support <>
>> Email <>
>> Website
>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>> I have commented inline below:
>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <
>>>> wrote:
>>>>       I do have a few question if you don't mind... I am still trying to
>>>> wrap my head around how this works. In my current implementation for a
>>>> logging system I create new indexes for each hour because I have a
>>>> massive
>>>> amount of data coming in. I take in live log data from syslog and
>>>> parse/store it in hourly lucene indexes along with a facet index. I want
>>>> to
>>>> turn this into a distributed redundant system and blur appears to be the
>>>> way to go. I tried elasticsearch but it is just too slow compared to my
>>>> current implementation. Given I take in gigs of raw log data an hour, I
>>>> need something that is robust and able to keep up with in flow of data.
>>>>   Due to the current implementation of building up an index for an hour
>>> and
>>> then making available.  I would use MapReduce for this:
>>> html#map-reduce<>
>>> That way all the shards in a table get a little more data each hour and
>>> it's very low impact on the running cluster.
>> Not sure I understand this. I would like data to be accessible live as it
>> comes in, not wait an hour before I can query against it.
>> I am also not sure where map-reduce comes in here. I thought mapreduce is
>> something that blur used internally.
>>>       When taking in lots of data constantly, how is it recommended that it
>>>> be stored? I mentioned above that I create a new index for each hour to
>>>> keep data separated and quicker to search. If I want to look up a
>>>> specific
>>>> time frame, I only have to load the directories timestamped with the
>>>> hours
>>>> I want to look at. So instead of having to look at a huge index of like a
>>>> years worth of data, i'm looking at a much smaller data set which results
>>>> in faster query response times. Should a new table be created for each
>>>> hour
>>>> of data? When I typed in the create command into the shell, it takes
>>>> about
>>>> 6 seconds to create a table. If I have to create a table for each
>>>> application each hour, this could create a lot of lag. Perhaps this is
>>>> just
>>>> in my test environment though. Any thoughts on this? I also didn't see
>>>> any
>>>> examples of how to create tables via code.
>>>>   First off Blur is designed to store very large amounts of data.  And
>>> while
>>> it can do NRT updates like Solr and ES it's main focus in on bulk
>>> ingestion
>>> through MapReduce.  Given that, the real limiting factor is how much
>>> hardware you have.  Let's play out a scenario.  If you are adding 10GB of
>>> data an hour and I would think that a good rough ballpark guess is that
>>> you
>>> will need 10-15% of inbound data size as memory to make the search perform
>>> well.  However as the index sizes increase this % may decrease over time.
>>>    Blur has an off-heap lru cache to make accessing hdfs faster, however if
>>> you don't have enough memory the searches (and the cluster for that
>>> matter)
>>> won't fail, they will simply become slower.
>>> So it's really a question of how much hardware you have.  If you have
>>> filling a table enough to where it does perform well given the cluster you
>>> have.  You might have to break it into pieces.  But I think that hourly is
>>> too small.  Daily, Weekly, Monthly, etc.
>> In my current system (which uses just lucene) I designed we take in mainly
>> web logs and separate them into indexes. Each web server gets it's own
>> index for each hour. Then when I need to query the data, I use a multi
>> index reader to access the timeframe I need allowing me to keep the size of
>> index down to roughly what I need to search. If data was stored over a
>> month, and I want to query data that happened in just a single hour, or a
>> few minutes, it makes sense to me to keep things optimized. Also, if I
>> wanted to compare one web server to another, I would just use the multi
>> index reader to load both indexes. This is all handled by a single server
>> though, so it is limited by the hardware of the single server. If something
>> fails, it's a big problem. When trying to query large data sets, it's
>> again, only a single server, so it takes longer than I would like if the
>> index it's reading is large.
>> I am not entirely sure how to go about doing this in blur. I'm imagining
>> that each "table" is an index. So I would have a table format like...
>> YYYY_MM_DD_HH_IP. If I do this though, is there a way to query multiple
>> tables... like a milti table reader or something? or am I limited to
>> looking at a single table at a time?
>> For some web servers that have little traffic, an hour of data may only
>> have a few mb of data in it while other may have like a 5-10gb index. If I
>> combined the index from a large site with the small sites, this should make
>> everything slower for the queries against the small sites index correct? Or
>> would it all be the same due to how blur separates indexes into shards?
>> Would it perhaps be better to have an index for each web server, and
>> configure small sites to have less shards while larger sites have more
>> shards?
>> We just got a new really large powerful server to be our log server, but
>> as I realize that it's a single point of failure, I want to change our
>> configuration to use a clustered/distributed configuration. So we would
>> start with probably a minimal configuration, and start adding more shard
>> servers when ever we can afford it or need it.
>>>       Do shards contain the index data while the location (hdfs) contains
>>>> the documents (what lucene referred to them as)? I read that the shard
>>>> contains the index while the fs contains the data... I just wasn't quiet
>>>> sure what the data was, because when I work with lucene, the index
>>>> directory contains the data as a document.
>>> The shard is stored in HDFS, and it is a Lucene index.  We store the data
>>> inside the Lucene index, so it's basically Lucene all the way down to
>>> HDFS.
>> Ok, so basically a controller is a service which connects to all (or
>> some?) shards a distributed query, which tells the shard to run a query
>> against a certain data set, that shard then gets that data set either from
>> memory or from the hadoop cluster, processes it, and returns the result to
>> the controller which condenses the results from all the queried shards into
>> a final result right?
>>> Hope this helps.  Let us know if you have more questions.
>>> Thanks,
>>> Aaron
>>>> Thanks,
>>>> Colton McInroy
>>>>    * Director of Security Engineering
>>>> Phone
>>>> (Toll Free)
>>>> _US_    (888)-818-1344 Press 2
>>>> _UK_    0-800-635-0551 Press 2
>>>> My Extension    101
>>>> 24/7 Support <>
>>>> Email <>
>>>> Website

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message