incubator-blur-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron McCurry <>
Subject Re: Creating/Managing tables saying table already exists
Date Sat, 28 Sep 2013 13:47:33 GMT
On Sat, Sep 28, 2013 at 9:23 AM, Colton McInroy <>wrote:

> So, basically your suggesting I use this undocumented Bulk MapReduce
> method to add all of the data live as it comes in? Do you have an example
> or any information on how I would accomplish this? What I could do is have
> a flush period, where as the logs come in and get parsed, I build them up
> to like 10000 entries or a timed interval, then bulk load them into blur.

I will add some documentation on how to use it and probably an example, but
I would try using the async client (maybe start with the regular client)
first to see if it can keep up.  Just as an FYI I found a bug in 0.2.0
mutateBatch that causes a deadlock.  I will resolve later today, but if you
try it out before 0.2.1 is released (a couple of weeks) you will likely
need to patch the code.  Here's the issue:


> Thanks,
> Colton McInroy
>  * Director of Security Engineering
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
> My Extension    101
> 24/7 Support <>
> Email <>
> Website
> On 9/28/2013 6:06 AM, Aaron McCurry wrote:
>> So there is a method that is not documented that the Bulk MapReduce uses
>> that could fill the gaps between MR and NRT updates.  Let's say that there
>> a table with 100 shards.  In a given table on hdfs the path would look
>> like
>> "/blur/tables/table12345/**shard-000010/<the main index goes here>".
>> Now the way MapReduce works is that it creates a sub directory in the main
>> index:
>> "/blur/tables/table12345/**shard-000010/some_index_name.**tmp/<new data
>> here>"
>> Once the index is ready to be committed the writer is closed for the new
>> index and the suddir is renamed to:
>> "/blur/tables/table12345/**shard-000010/some_index_name.**commit/<new
>> data
>> here>"
>> The act of having an index in the shard directory that ends with ".commit"
>> makes the shard pick up the index and do an index merge through the
>> writer.addDirectory(..) call.  It checks for this every 10 seconds.
>> While this is not a really easy to integrate yet.  I think that if I build
>> an Apache Flume integration it will likely make use of this feature, or at
>> least have an option to use it.
>> As far as searching multiple tables, this has been asked for before so I
>> think that is something that we should add.  It actually shouldn't be that
>> difficult.
>> Aaron
>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <
>> >wrote:
>>  I actually didn't have any kind of loading interval, I loaded the new
>>> event log entries into the index in real time. My code runs as a daemon
>>> accepting syslog entries which indexes them live as they come in with a
>>> flush call every 10000 entries or 1 minute, which ever comes first.
>>> And I don't want to have any limitation on lookback time. I want to be
>>> able to look at the history of any site going back years if need be.
>>> Sucks there is no multi table reader, that limits what I can do by a bit.
>>> Thanks,
>>> Colton McInroy
>>>   * Director of Security Engineering
>>> Phone
>>> (Toll Free)
>>> _US_    (888)-818-1344 Press 2
>>> _UK_    0-800-635-0551 Press 2
>>> My Extension    101
>>> 24/7 Support <>
>>> Email <>
>>> Website
>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>  Mapreduce is a bulk entrypoint to loading blur. Much in the same way I
>>>> bet
>>>> you have some fancy code to grab up a bunch of log files, over some kind
>>>> of
>>>> interval and load them into your index,  MR replaces that process with
>>>> an
>>>> auto scaling (via hardware additions only) high bandwidth load that you
>>>> could fire off at any interval you want. The MR bulk load writes a new
>>>> index and merges that into the index already running when it completes.
>>>> The
>>>> catch is that it is NOT as efficient as your implementation is in terms
>>>> of
>>>> latency into the index. So your current impl that will load a small
>>>> sites
>>>> couple of mb real fast, MR might take 30 seconds to a minute to bring
>>>> that
>>>> online. Having said that blur has a realtime api for inserting that has
>>>> low
>>>> latency but you trade in your high bandwidth for it. Might be something
>>>> you
>>>> could detect on your front door and decide which way in the data comes.
>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>> and
>>>> load for a single badass machine and doing manual partitioning tricks to
>>>> keep things snappy was key.  The neat thing about blur is some of that
>>>> you
>>>> don't do anymore.  I would call it an early optimization at this point
>>>> to
>>>> do anything shorter than say a day or whatever your max lookback time
>>>> is.
>>>> (Oh btw you can't search across tables in blur, forgot to mention that.)
>>>> Instead of the lots of tables route I suggest trying one large one and
>>>> seeing where that goes. Utilize blurs cache initializing capabilities
>>>> and
>>>> load in your site and time columns to keep your logical partitioning
>>>> columns in the block cache and thus very fast. I bet you will see good
>>>> performance with this approach. Certainly better than es. Not as fast as
>>>> raw lucene, but there is always a price to pay for distributing and so
>>>> far
>>>> blur has the lowest overhead I've seen.
>>>> Hope that helps some.
>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <>
>>>> wrote:
>>>>   Coments inline...
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>    * Director of Security Engineering
>>>>> Phone
>>>>> (Toll Free)
>>>>> _US_    (888)-818-1344 Press 2
>>>>> _UK_    0-800-635-0551 Press 2
>>>>> My Extension    101
>>>>> 24/7 Support <>
>>>>> Email <>
>>>>> Website
>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>   I have commented inline below:
>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <
>>>>>>  wrote:
>>>>>>>        I do have a few question if you don't mind... I am still
>>>>>>> trying
>>>>>>> to
>>>>>>> wrap my head around how this works. In my current implementation
>>>>>>> a
>>>>>>> logging system I create new indexes for each hour because I have
>>>>>>> massive
>>>>>>> amount of data coming in. I take in live log data from syslog
>>>>>>> parse/store it in hourly lucene indexes along with a facet index.
>>>>>>> want
>>>>>>> to
>>>>>>> turn this into a distributed redundant system and blur appears
to be
>>>>>>> the
>>>>>>> way to go. I tried elasticsearch but it is just too slow compared
>>>>>>> my
>>>>>>> current implementation. Given I take in gigs of raw log data
>>>>>>> hour, I
>>>>>>> need something that is robust and able to keep up with in flow
>>>>>>> data.
>>>>>>>    Due to the current implementation of building up an index
for an
>>>>>>> hour
>>>>>>>  and
>>>>>> then making available.  I would use MapReduce for this:
>>>>>> <****blur/docs/0.2.0/using-blur.****<**blur/docs/0.2.0/using-blur.**>
>>>>>> >
>>>>>> html#map-reduce<http://**incub****<**>
>>>>>> docs/0.2.0/using-blur.html#****map-reduce<http://incubator.**
>>>>>> >
>>>>>> That way all the shards in a table get a little more data each hour
>>>>>> and
>>>>>> it's very low impact on the running cluster.
>>>>>>   Not sure I understand this. I would like data to be accessible
>>>>>> as
>>>>> it
>>>>> comes in, not wait an hour before I can query against it.
>>>>> I am also not sure where map-reduce comes in here. I thought mapreduce
>>>>> is
>>>>> something that blur used internally.
>>>>>         When taking in lots of data constantly, how is it recommended
>>>>>> that it
>>>>>>  be stored? I mentioned above that I create a new index for each
>>>>>>> to
>>>>>>> keep data separated and quicker to search. If I want to look
up a
>>>>>>> specific
>>>>>>> time frame, I only have to load the directories timestamped with
>>>>>>> hours
>>>>>>> I want to look at. So instead of having to look at a huge index
>>>>>>> like a
>>>>>>> years worth of data, i'm looking at a much smaller data set which
>>>>>>> results
>>>>>>> in faster query response times. Should a new table be created
>>>>>>> each
>>>>>>> hour
>>>>>>> of data? When I typed in the create command into the shell, it
>>>>>>> about
>>>>>>> 6 seconds to create a table. If I have to create a table for
>>>>>>> application each hour, this could create a lot of lag. Perhaps
>>>>>>> is
>>>>>>> just
>>>>>>> in my test environment though. Any thoughts on this? I also didn't
>>>>>>> see
>>>>>>> any
>>>>>>> examples of how to create tables via code.
>>>>>>>    First off Blur is designed to store very large amounts of
>>>>>>>  And
>>>>>>>  while
>>>>>> it can do NRT updates like Solr and ES it's main focus in on bulk
>>>>>> ingestion
>>>>>> through MapReduce.  Given that, the real limiting factor is how much
>>>>>> hardware you have.  Let's play out a scenario.  If you are adding
>>>>>> of
>>>>>> data an hour and I would think that a good rough ballpark guess is
>>>>>> that
>>>>>> you
>>>>>> will need 10-15% of inbound data size as memory to make the search
>>>>>> perform
>>>>>> well.  However as the index sizes increase this % may decrease over
>>>>>> time.
>>>>>>     Blur has an off-heap lru cache to make accessing hdfs faster,
>>>>>> however if
>>>>>> you don't have enough memory the searches (and the cluster for that
>>>>>> matter)
>>>>>> won't fail, they will simply become slower.
>>>>>> So it's really a question of how much hardware you have.  If you
>>>>>> filling a table enough to where it does perform well given the cluster
>>>>>> you
>>>>>> have.  You might have to break it into pieces.  But I think that
>>>>>> hourly
>>>>>> is
>>>>>> too small.  Daily, Weekly, Monthly, etc.
>>>>>>   In my current system (which uses just lucene) I designed we take
>>>>> mainly
>>>>> web logs and separate them into indexes. Each web server gets it's own
>>>>> index for each hour. Then when I need to query the data, I use a multi
>>>>> index reader to access the timeframe I need allowing me to keep the
>>>>> size
>>>>> of
>>>>> index down to roughly what I need to search. If data was stored over
>>>>> month, and I want to query data that happened in just a single hour,
>>>>> or a
>>>>> few minutes, it makes sense to me to keep things optimized. Also, if
>>>>> wanted to compare one web server to another, I would just use the multi
>>>>> index reader to load both indexes. This is all handled by a single
>>>>> server
>>>>> though, so it is limited by the hardware of the single server. If
>>>>> something
>>>>> fails, it's a big problem. When trying to query large data sets, it's
>>>>> again, only a single server, so it takes longer than I would like if
>>>>> the
>>>>> index it's reading is large.
>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>> imagining
>>>>> that each "table" is an index. So I would have a table format like...
>>>>> YYYY_MM_DD_HH_IP. If I do this though, is there a way to query multiple
>>>>> tables... like a milti table reader or something? or am I limited to
>>>>> looking at a single table at a time?
>>>>> For some web servers that have little traffic, an hour of data may only
>>>>> have a few mb of data in it while other may have like a 5-10gb index.
>>>>> If
>>>>> I
>>>>> combined the index from a large site with the small sites, this should
>>>>> make
>>>>> everything slower for the queries against the small sites index
>>>>> correct?
>>>>> Or
>>>>> would it all be the same due to how blur separates indexes into shards?
>>>>> Would it perhaps be better to have an index for each web server, and
>>>>> configure small sites to have less shards while larger sites have more
>>>>> shards?
>>>>> We just got a new really large powerful server to be our log server,
>>>>> but
>>>>> as I realize that it's a single point of failure, I want to change our
>>>>> configuration to use a clustered/distributed configuration. So we would
>>>>> start with probably a minimal configuration, and start adding more
>>>>> shard
>>>>> servers when ever we can afford it or need it.
>>>>>         Do shards contain the index data while the location (hdfs)
>>>>>> contains
>>>>>>  the documents (what lucene referred to them as)? I read that the
>>>>>>> shard
>>>>>>> contains the index while the fs contains the data... I just wasn't
>>>>>>> quiet
>>>>>>> sure what the data was, because when I work with lucene, the
>>>>>>> directory contains the data as a document.
>>>>>>>   The shard is stored in HDFS, and it is a Lucene index.  We
>>>>>>> the
>>>>>> data
>>>>>> inside the Lucene index, so it's basically Lucene all the way down
>>>>>> HDFS.
>>>>>>   Ok, so basically a controller is a service which connects to all
>>>>> some?) shards a distributed query, which tells the shard to run a query
>>>>> against a certain data set, that shard then gets that data set either
>>>>> from
>>>>> memory or from the hadoop cluster, processes it, and returns the result
>>>>> to
>>>>> the controller which condenses the results from all the queried shards
>>>>> into
>>>>> a final result right?
>>>>>   Hope this helps.  Let us know if you have more questions.
>>>>>> Thanks,
>>>>>> Aaron
>>>>>>   Thanks,
>>>>>>> Colton McInroy
>>>>>>>     * Director of Security Engineering
>>>>>>> Phone
>>>>>>> (Toll Free)
>>>>>>> _US_    (888)-818-1344 Press 2
>>>>>>> _UK_    0-800-635-0551 Press 2
>>>>>>> My Extension    101
>>>>>>> 24/7 Support <>
>>>>>>> Email <>
>>>>>>> Website

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message