incubator-blur-dev mailing list archives

From Colton McInroy <>
Subject Re: Creating/Managing tables saying table already exists
Date Sun, 29 Sep 2013 05:39:37 GMT
How do you create a new table via java code? I see that the shell has a 
create command but I do not see anywhere in the docs how to do it via 
the thrift api call.
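For reference, the shell's create command maps onto the same Thrift API; a minimal
sketch of table creation with the 0.2.0 Java client (the connection string, shard
count, and table URI below are made-up examples):

    // Minimal sketch of creating a table through the Blur Thrift client;
    // the connection string, table name, shard count, and URI are examples.
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;

    public class CreateTableExample {
      public static void main(String[] args) throws Exception {
        // 40010 is the default controller port; one or more controllers can be listed.
        Iface client = BlurClient.getClient("controller1:40010");

        TableDescriptor td = new TableDescriptor();
        td.setName("table12345");
        td.setShardCount(16);
        td.setTableUri("hdfs://namenode:9000/blur/tables/table12345");

        client.createTable(td);  // the call behind the shell's "create" command
      }
    }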

Colton McInroy

  * Director of Security Engineering

(Toll Free)  US: (888)-818-1344 Press 2
             UK: 0-800-635-0551 Press 2
My Extension: 101

On 9/28/2013 6:47 AM, Aaron McCurry wrote:
> On Sat, Sep 28, 2013 at 9:23 AM, Colton McInroy <> wrote:
>> So, basically you're suggesting I use this undocumented bulk MapReduce
>> method to add all of the data live as it comes in? Do you have an example
>> or any information on how I would accomplish this? What I could do is have
>> a flush period where, as the logs come in and get parsed, I build them up
>> to like 10,000 entries or a timed interval, then bulk load them into Blur.
> I will add some documentation on how to use it and probably an example, but
> I would try using the async client (maybe start with the regular client)
> first to see if it can keep up.  Just as an FYI, I found a bug in 0.2.0
> mutateBatch that causes a deadlock.  I will resolve it later today, but if you
> try it out before 0.2.1 is released (a couple of weeks) you will likely
> need to patch the code.  Here's the issue:
> Aaron
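A minimal sketch of what a batched insert through the regular Thrift client could
look like, assuming the 0.2.0 RowMutation/RecordMutation structures (the table
name, IDs, and column values are invented):

    // Hedged sketch of a batched mutate; RowMutations can be buffered and the
    // list flushed every N entries or on a timer, as discussed above.
    import java.util.Arrays;
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.*;

    public class MutateBatchExample {
      public static void main(String[] args) throws Exception {
        Blur.Iface client = BlurClient.getClient("controller1:40010");

        Record record = new Record();
        record.setRecordId("rec-0001");
        record.setFamily("logs");
        record.addToColumns(new Column("message", "GET /index.html 200"));

        RecordMutation recMut =
            new RecordMutation(RecordMutationType.REPLACE_ENTIRE_RECORD, record);

        RowMutation rowMut = new RowMutation();
        rowMut.setTable("table12345");
        rowMut.setRowId("row-0001");
        rowMut.setRowMutationType(RowMutationType.REPLACE_ROW);
        rowMut.addToRecordMutations(recMut);

        client.mutateBatch(Arrays.asList(rowMut));
      }
    }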
>> Thanks,
>> Colton McInroy
>> On 9/28/2013 6:06 AM, Aaron McCurry wrote:
>>> So there is a method that is not documented that the bulk MapReduce uses
>>> that could fill the gaps between MR and NRT updates.  Let's say that there
>>> is a table with 100 shards.  In a given table on HDFS the path would look
>>> like:
>>> "/blur/tables/table12345/shard-000010/<the main index goes here>"
>>> Now the way MapReduce works is that it creates a subdirectory in the main
>>> index:
>>> "/blur/tables/table12345/shard-000010/some_index_name.tmp/<new data here>"
>>> Once the index is ready to be committed, the writer is closed for the new
>>> index and the subdir is renamed to:
>>> "/blur/tables/table12345/shard-000010/some_index_name.commit/<new data here>"
>>> The act of having an index in the shard directory that ends with ".commit"
>>> makes the shard pick up the index and do an index merge through the
>>> writer.addDirectory(..) call.  It checks for this every 10 seconds.
>>> While this is not really easy to integrate yet, I think that if I build an
>>> Apache Flume integration it will likely make use of this feature, or at
>>> least have an option to use it.
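A sketch of that handoff in code, assuming the new index has already been built
and its writer closed under the ".tmp" path (the paths reuse the example above;
the rename itself is plain Hadoop FileSystem API):

    // Sketch of the ".tmp" -> ".commit" handoff described above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CommitNewIndex {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path tmp = new Path("/blur/tables/table12345/shard-000010/some_index_name.tmp");
        Path commit = new Path("/blur/tables/table12345/shard-000010/some_index_name.commit");

        // HDFS renames are atomic; once the directory name ends in ".commit",
        // the shard's ~10 second sweep picks it up and merges it in through
        // the writer.addDirectory(..) call mentioned above.
        fs.rename(tmp, commit);
      }
    }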
>>> As far as searching multiple tables, this has been asked for before so I
>>> think that is something that we should add.  It actually shouldn't be that
>>> difficult.
>>> Aaron
>>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <> wrote:
>>>> I actually didn't have any kind of loading interval; I loaded the new
>>>> event log entries into the index in real time. My code runs as a daemon
>>>> accepting syslog entries, which it indexes live as they come in, with a
>>>> flush call every 10,000 entries or 1 minute, whichever comes first.
>>>> And I don't want to have any limitation on lookback time. I want to be
>>>> able to look at the history of any site going back years if need be.
>>>> It sucks that there is no multi-table reader; that limits what I can do
>>>> a bit.
>>>> Thanks,
>>>> Colton McInroy
>>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>>> MapReduce is a bulk entry point to loading Blur. Much in the same way I
>>>>> bet you have some fancy code to grab up a bunch of log files over some
>>>>> kind of interval and load them into your index, MR replaces that process
>>>>> with an auto-scaling (via hardware additions only) high-bandwidth load
>>>>> that you could fire off at any interval you want. The MR bulk load writes
>>>>> a new index and merges it into the index already running when it
>>>>> completes. The catch is that it is NOT as efficient as your
>>>>> implementation in terms of latency into the index. So where your current
>>>>> impl will load a small site's couple of MB real fast, MR might take 30
>>>>> seconds to a minute to bring that online. Having said that, Blur has a
>>>>> realtime API for inserting that has low latency, but you trade in your
>>>>> high bandwidth for it. Might be something you could detect on your front
>>>>> door and decide which way in the data comes.
>>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>>> and load for a single badass machine and doing manual partitioning tricks
>>>>> to keep things snappy was key. The neat thing about Blur is some of that
>>>>> you don't do anymore. I would call it an early optimization at this point
>>>>> to do anything shorter than say a day or whatever your max lookback time
>>>>> is. (Oh, btw, you can't search across tables in Blur; forgot to mention
>>>>> that.)
>>>>> Instead of the lots-of-tables route I suggest trying one large one and
>>>>> seeing where that goes. Utilize Blur's cache-initializing capabilities
>>>>> and load in your site and time columns to keep your logical partitioning
>>>>> columns in the block cache and thus very fast. I bet you will see good
>>>>> performance with this approach. Certainly better than ES. Not as fast as
>>>>> raw Lucene, but there is always a price to pay for distributing, and so
>>>>> far Blur has the lowest overhead I've seen.
>>>>> Hope that helps some.
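One hedged reading of "cache initializing capabilities": the TableDescriptor in
the 0.2.0 Thrift IDL carries a preCacheCols list naming the columns to pre-load
into the block cache (verify the field against your release; the family/column
names "logs.site" and "logs.time" here are invented):

    // Hedged sketch: keep the logical partitioning columns hot in the block
    // cache by naming them in preCacheCols at table-creation time.
    import java.util.Arrays;
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;

    public class PreCacheExample {
      public static void main(String[] args) throws Exception {
        Iface client = BlurClient.getClient("controller1:40010");

        TableDescriptor td = new TableDescriptor();
        td.setName("weblogs");
        td.setShardCount(16);
        td.setTableUri("hdfs://namenode:9000/blur/tables/weblogs");
        td.setPreCacheCols(Arrays.asList("logs.site", "logs.time"));

        client.createTable(td);
      }
    }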
>>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <> wrote:
>>>>>> Comments inline...
>>>>>> Thanks,
>>>>>> Colton McInroy
>>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>>> I have commented inline below:
>>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <> wrote:
>>>>>>>> I do have a few questions if you don't mind... I am trying to wrap my
>>>>>>>> head around how this works. In my current implementation of a logging
>>>>>>>> system, I create new indexes for each hour because I have a massive
>>>>>>>> amount of data coming in. I take in live log data from syslog and
>>>>>>>> parse/store it in hourly Lucene indexes along with a facet index. I
>>>>>>>> want to turn this into a distributed, redundant system, and Blur
>>>>>>>> appears to be the way to go. I tried Elasticsearch but it is just too
>>>>>>>> slow compared to my current implementation. Given I take in gigs of
>>>>>>>> raw log data an hour, I need something that is robust and able to
>>>>>>>> keep up with the inflow of data.
>>>>>>> Due to the current implementation of building up an index for an hour
>>>>>>> and then making it available, I would use MapReduce for this:
>>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>>> That way all the shards in a table get a little more data each time
>>>>>>> and it's very low impact on the running cluster.
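A hedged sketch of that MapReduce path, using the BlurOutputFormat and
CsvBlurMapper classes the linked docs describe (the input path, table name, and
column names are placeholders; check the docs for the exact signatures):

    // Hedged sketch of a bulk load into an existing table via MapReduce.
    import org.apache.blur.mapreduce.lib.BlurOutputFormat;
    import org.apache.blur.mapreduce.lib.CsvBlurMapper;
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class BulkLoadExample {
      public static void main(String[] args) throws Exception {
        Iface client = BlurClient.getClient("controller1:40010");
        TableDescriptor td = client.describe("table12345");  // existing table

        Job job = new Job(new Configuration(), "blur-bulk-load");
        job.setJarByClass(BulkLoadExample.class);
        job.setMapperClass(CsvBlurMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/incoming/logs"));
        CsvBlurMapper.addColumns(job, "logs", "site", "time", "message");

        // Writes new shard indexes that the running table merges in on commit,
        // so the live cluster sees very little impact.
        BlurOutputFormat.setupJob(job, td);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }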
>>>>>> Not sure I understand this. I would like data to be accessible as it
>>>>>> comes in, not wait an hour before I can query against it.
>>>>>> I am also not sure where map-reduce comes in here. I thought MapReduce
>>>>>> is something that Blur used internally.
>>>>>>>> When taking in lots of data constantly, how is it recommended that it
>>>>>>>> be stored? I mentioned above that I create a new index for each hour
>>>>>>>> to keep data separated and quicker to search. If I want to look up a
>>>>>>>> specific time frame, I only have to load the directories timestamped
>>>>>>>> with the hours I want to look at. So instead of having to look at a
>>>>>>>> huge index of like a year's worth of data, I'm looking at a much
>>>>>>>> smaller data set, which results in faster query response times.
>>>>>>>> Should a new table be created for each hour of data? When I typed the
>>>>>>>> create command into the shell, it took about 6 seconds to create a
>>>>>>>> table. If I have to create a table for each application each hour,
>>>>>>>> this could create a lot of lag. Perhaps it is just in my test
>>>>>>>> environment though. Any thoughts on this? I also don't see any
>>>>>>>> examples of how to create tables via code.
>>>>>>> First off, Blur is designed to store very large amounts of data.  And
>>>>>>> while it can do NRT updates like Solr and ES, its main focus is on
>>>>>>> ingestion through MapReduce.  Given that, the real limiting factor is
>>>>>>> how much hardware you have.  Let's play out a scenario.  If you are
>>>>>>> adding a given amount of data an hour, I would think that a good rough
>>>>>>> ballpark guess is that you will need 10-15% of inbound data size as
>>>>>>> memory to make the search perform well.  However, as the index sizes
>>>>>>> increase this % may decrease over time.
>>>>>>> Blur has an off-heap LRU cache to make accessing HDFS faster; however,
>>>>>>> if you don't have enough memory the searches (and the cluster for that
>>>>>>> matter) won't fail, they will simply become slower.
>>>>>>> So it's really a question of how much hardware you have.  If you end
>>>>>>> up filling a table past the point where it performs well given the
>>>>>>> hardware you have, you might have to break it into pieces.  But I
>>>>>>> think that hourly is too small.  Daily, weekly, monthly, etc.
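As a rough worked example of that ballpark (the ingest rate is hypothetical): at
5 GB of inbound data an hour, a 30-day table holds about 5 * 24 * 30 = 3,600 GB,
so the 10-15% rule of thumb suggests on the order of 360-540 GB of memory across
the cluster to keep searches fast.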
>>>>>> In my current system (which uses just Lucene) that I designed, we take
>>>>>> in mainly web logs and separate them into indexes. Each web server gets
>>>>>> its own index for each hour. Then when I need to query the data, I use
>>>>>> a multi-index reader to access the timeframe I need, allowing me to
>>>>>> keep the size of the index down to roughly what I need to search. If
>>>>>> data was stored over a month, and I want to query data that happened in
>>>>>> just a single hour, or a few minutes, it makes sense to me to keep
>>>>>> things optimized. Also, if I wanted to compare one web server to
>>>>>> another, I would just use the multi-index reader to load both indexes.
>>>>>> This is all handled by a single server though, so it is limited by the
>>>>>> hardware of the single server. If something fails, it's a big problem.
>>>>>> When trying to query large data sets, again, it's only a single server,
>>>>>> so it takes longer than I would like when the index it's reading is
>>>>>> large.
>>>>>> I am not entirely sure how to go about doing this in Blur. I'm
>>>>>> imagining that each "table" is an index. So I would have a table format
>>>>>> like... YYYY_MM_DD_HH_IP. If I do this though, is there a way to query
>>>>>> multiple tables... like a multi-table reader or something? Or am I
>>>>>> limited to looking at a single table at a time?
>>>>>> For some web servers that have little traffic, an hour of data may only
>>>>>> have a few MB of data in it, while others may have like a 5-10 GB
>>>>>> index. If I combined the index from a large site with the small sites,
>>>>>> this should make everything slower for the queries against the small
>>>>>> sites' index, correct? Or would it all be the same due to how Blur
>>>>>> separates indexes into shards? Would it perhaps be better to have an
>>>>>> index for each web server, and configure small sites to have fewer
>>>>>> shards while larger sites have more shards?
>>>>>> We just got a really large, powerful new server to be our log server,
>>>>>> but as I realize that it's a single point of failure, I want to change
>>>>>> the configuration to use a clustered/distributed configuration. So we
>>>>>> would start with probably a minimal configuration, and start adding
>>>>>> more shard servers whenever we can afford it or need it.
>>>>>>>> Do shards contain the index data while the location (HDFS) contains
>>>>>>>> the documents (what Lucene referred to them as)? I read that the
>>>>>>>> shard contains the index while the fs contains the data... I just
>>>>>>>> wasn't quite sure what the data was, because when I work with Lucene,
>>>>>>>> the index directory contains the data as a document.
>>>>>>> The shard is stored in HDFS, and it is a Lucene index.  We store the
>>>>>>> data inside the Lucene index, so it's basically Lucene all the way
>>>>>>> down to HDFS.
>>>>>> Ok, so basically a controller is a service which connects to all (or
>>>>>> some?) shards for a distributed query. It tells each shard to run a
>>>>>> query against a certain data set; the shard then gets that data set
>>>>>> either from memory or from the Hadoop cluster, processes it, and
>>>>>> returns the results to the controller, which condenses the results
>>>>>> from all the queried shards into a final result, right?
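That scatter/gather flow is what a single client call exercises; a hedged sketch
of a query through a controller with the 0.2.0 client (the table name and query
string are invented, and the BlurQuery field names shifted in later 0.2.x
releases):

    // Hedged sketch of a distributed query; the controller fans the query out
    // to the shard servers and condenses the per-shard hits into one result.
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.BlurQuery;
    import org.apache.blur.thrift.generated.BlurResults;
    import org.apache.blur.thrift.generated.SimpleQuery;

    public class QueryExample {
      public static void main(String[] args) throws Exception {
        Iface client = BlurClient.getClient("controller1:40010");

        SimpleQuery sq = new SimpleQuery();
        sq.setQueryStr("logs.message:error");  // Lucene-style query syntax

        BlurQuery bq = new BlurQuery();
        bq.setSimpleQuery(sq);
        bq.setFetch(10);  // number of results to fetch

        BlurResults results = client.query("table12345", bq);
        System.out.println("total hits: " + results.getTotalResults());
      }
    }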
>>>>>>> Hope this helps.  Let us know if you have more questions.
>>>>>>> Thanks,
>>>>>>> Aaron
>>>>>>>> Thanks,
>>>>>>>> Colton McInroy
