incubator-blur-dev mailing list archives

From Colton McInroy <col...@dosarrest.com>
Subject Re: Creating/Managing tables saying table already exists
Date Sun, 29 Sep 2013 05:48:01 GMT
Well, I guess the first thing to do is check the list of current tables to
see if a table with the desired name already exists; if it does not, create
a new table, otherwise add records to the existing table.
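
Something like this ought to do it with the 0.2.0 Thrift client (a rough
sketch; the connection string, table name, shard count, and table URI are
made up for illustration):

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;

    public class EnsureTable {
      public static void main(String[] args) throws Exception {
        Iface client = BlurClient.getClient("controller1:40010");
        String tableName = "logs";
        // Create the table only if it is not already in the table list.
        if (!client.tableList().contains(tableName)) {
          TableDescriptor td = new TableDescriptor();
          td.setName(tableName);
          td.setShardCount(16);
          td.setTableUri("hdfs://namenode:8020/blur/tables/" + tableName);
          client.createTable(td);
        }
        // Either way the table exists now and mutations can be sent to it.
      }
    }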

Thanks,
Colton McInroy

  * Director of Security Engineering

Phone (Toll Free):
  US: (888)-818-1344 Press 2
  UK: 0-800-635-0551 Press 2
My Extension: 101
24/7 Support: support@dosarrest.com
Email: colton@dosarrest.com
Website: http://www.dosarrest.com

On 9/28/2013 10:39 PM, Colton McInroy wrote:
> How do you create a new table via Java code? I see that the shell has a
> create command, but I do not see anywhere in the docs how to do it via the
> Thrift API.
>
> Thanks,
> Colton McInroy
>
> On 9/28/2013 6:47 AM, Aaron McCurry wrote:
>> On Sat, Sep 28, 2013 at 9:23 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>
>>> So basically you're suggesting I use this undocumented bulk MapReduce
>>> method to add all of the data live as it comes in? Do you have an
>>> example or any information on how I would accomplish this? What I could
>>> do is have a flush period: as the logs come in and get parsed, I build
>>> them up to, say, 10,000 entries or a timed interval, then bulk load
>>> them into Blur.
>>
>> I will add some documentation on how to use it, and probably an example,
>> but I would try the async client first (maybe start with the regular
>> client) to see if it can keep up.  Just as an FYI, I found a bug in the
>> 0.2.0 mutateBatch that causes a deadlock.  I will resolve it later
>> today, but if you try it out before 0.2.1 is released (a couple of
>> weeks) you will likely need to patch the code.  Here's the issue:
>>
>> https://issues.apache.org/jira/browse/BLUR-245
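>>
>> In the meantime, a minimal sketch of batched ingest with the regular
>> client might look like this (assuming the BlurThriftHelper utility shown
>> in the docs; the table, family, and column names here are made up):
>>
>>     import static org.apache.blur.thrift.util.BlurThriftHelper.*;
>>
>>     import java.util.ArrayList;
>>     import java.util.List;
>>     import org.apache.blur.thrift.BlurClient;
>>     import org.apache.blur.thrift.generated.Blur.Iface;
>>     import org.apache.blur.thrift.generated.RowMutation;
>>
>>     public class BatchedIngest {
>>       public static void main(String[] args) throws Exception {
>>         Iface client = BlurClient.getClient("controller1:40010");
>>         List<RowMutation> buffer = new ArrayList<RowMutation>();
>>         // Buffer mutations as log lines are parsed...
>>         buffer.add(newRowMutation("logs", "row-1",
>>             newRecordMutation("event", "rec-1",
>>                 newColumn("message", "GET /index.html 200"))));
>>         // ...then flush the whole buffer in one round trip.  (This is
>>         // the mutateBatch call with the 0.2.0 deadlock bug above.)
>>         client.mutateBatch(buffer);
>>         buffer.clear();
>>       }
>>     }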
>>
>> Aaron
>>
>>
>>
>>
>>> Thanks,
>>> Colton McInroy
>>>
>>> On 9/28/2013 6:06 AM, Aaron McCurry wrote:
>>>
>>>> So there is a method, not documented, that the bulk MapReduce uses and
>>>> that could fill the gap between MR and NRT updates.  Let's say there is
>>>> a table with 100 shards.  For a given table on HDFS the path would look
>>>> like "/blur/tables/table12345/shard-000010/<the main index goes here>".
>>>>
>>>> Now the way MapReduce works is that it creates a subdirectory in the
>>>> main index:
>>>> "/blur/tables/table12345/shard-000010/some_index_name.tmp/<new data here>"
>>>>
>>>> Once the index is ready to be committed, the writer for the new index
>>>> is closed and the subdirectory is renamed to:
>>>> "/blur/tables/table12345/shard-000010/some_index_name.commit/<new data here>"
>>>>
>>>> The act of having an index in the shard directory that ends with
>>>> ".commit" makes the shard pick up the index and do an index merge
>>>> through the writer.addDirectory(..) call.  It checks for this every 10
>>>> seconds.
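>>>>
>>>> A rough sketch of that hand-off using Hadoop's FileSystem API (the
>>>> paths and index name are made up, and the new index itself would be
>>>> written first with a Lucene IndexWriter):
>>>>
>>>>     import org.apache.hadoop.conf.Configuration;
>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>     import org.apache.hadoop.fs.Path;
>>>>
>>>>     Configuration conf = new Configuration();
>>>>     FileSystem fs = FileSystem.get(conf);
>>>>     // The new index is first written under a ".tmp" subdirectory of
>>>>     // the shard directory...
>>>>     Path tmp = new Path("/blur/tables/table12345/shard-000010/bulk-001.tmp");
>>>>     Path commit = new Path("/blur/tables/table12345/shard-000010/bulk-001.commit");
>>>>     // ...and the rename to ".commit" is what makes the shard server
>>>>     // notice it (on its ~10 second check) and merge it in via
>>>>     // writer.addDirectory(..).
>>>>     fs.rename(tmp, commit);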
>>>>
>>>> While this is not really easy to integrate yet, I think that if I
>>>> build an Apache Flume integration it will likely make use of this
>>>> feature, or at least have an option to use it.
>>>>
>>>>
>>>> As far as searching multiple tables, this has been asked for before,
>>>> so I think that is something that we should add.  It actually
>>>> shouldn't be that difficult.
>>>>
>>>> Aaron
>>>>
>>>>
>>>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>>>>
>>>>> I actually didn't have any kind of loading interval; I loaded the new
>>>>> event log entries into the index in real time. My code runs as a
>>>>> daemon accepting syslog entries, indexing them live as they come in,
>>>>> with a flush call every 10,000 entries or 1 minute, whichever comes
>>>>> first.
>>>>>
>>>>> And I don't want to have any limitation on lookback time. I want to
>>>>> be able to look at the history of any site going back years if need
>>>>> be.
>>>>>
>>>>> Sucks there is no multi-table reader; that limits what I can do a bit.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>
>>>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>>>
>>>>>> MapReduce is a bulk entry point to loading Blur. Much in the same way
>>>>>> I bet you have some fancy code to grab up a bunch of log files over
>>>>>> some kind of interval and load them into your index, MR replaces that
>>>>>> process with an auto-scaling (via hardware additions only),
>>>>>> high-bandwidth load that you could fire off at any interval you want.
>>>>>> The MR bulk load writes a new index and merges it into the running
>>>>>> index when it completes. The catch is that it is NOT as efficient as
>>>>>> your implementation in terms of latency into the index. So where your
>>>>>> current impl will load a small site's couple of MB real fast, MR
>>>>>> might take 30 seconds to a minute to bring that online. Having said
>>>>>> that, Blur has a realtime API for inserting that has low latency, but
>>>>>> you trade in your high bandwidth for it. Might be something you could
>>>>>> detect on your front door and decide which way the data comes in.
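>>>>>>
>>>>>> A rough sketch of such a bulk-load job, along the lines of the 0.2.0
>>>>>> MapReduce docs (the input path, table name, and column family here
>>>>>> are illustrative):
>>>>>>
>>>>>>     import org.apache.blur.mapreduce.lib.BlurOutputFormat;
>>>>>>     import org.apache.blur.mapreduce.lib.CsvBlurMapper;
>>>>>>     import org.apache.blur.thrift.BlurClient;
>>>>>>     import org.apache.blur.thrift.generated.Blur.Iface;
>>>>>>     import org.apache.blur.thrift.generated.TableDescriptor;
>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>     import org.apache.hadoop.fs.Path;
>>>>>>     import org.apache.hadoop.mapreduce.Job;
>>>>>>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>>>     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>>>>>>
>>>>>>     Iface client = BlurClient.getClient("controller1:40010");
>>>>>>     TableDescriptor td = client.describe("logs");
>>>>>>
>>>>>>     Job job = new Job(new Configuration(), "blur bulk load");
>>>>>>     job.setMapperClass(CsvBlurMapper.class);
>>>>>>     job.setInputFormatClass(TextInputFormat.class);
>>>>>>     FileInputFormat.addInputPath(job, new Path("/incoming/logs"));
>>>>>>     CsvBlurMapper.addColumns(job, "event", "message");
>>>>>>     // Writes a new index per shard and hands it to the running
>>>>>>     // table when the job completes.
>>>>>>     BlurOutputFormat.setupJob(job, td);
>>>>>>     job.waitForCompletion(true);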
>>>>>>
>>>>>> When I was in your shoes, highly optimizing your indexes based on
>>>>>> size and load for a single badass machine and doing manual
>>>>>> partitioning tricks to keep things snappy was key.  The neat thing
>>>>>> about Blur is that some of that you don't do anymore.  I would call
>>>>>> it an early optimization at this point to do anything shorter than,
>>>>>> say, a day, or whatever your max lookback time is.  (Oh, btw, you
>>>>>> can't search across tables in Blur; forgot to mention that.)
>>>>>>
>>>>>> Instead of the lots-of-tables route, I suggest trying one large table
>>>>>> and seeing where that goes. Utilize Blur's cache-initializing
>>>>>> capabilities and load in your site and time columns to keep your
>>>>>> logical partitioning columns in the block cache and thus very fast. I
>>>>>> bet you will see good performance with this approach. Certainly
>>>>>> better than ES. Not as fast as raw Lucene, but there is always a
>>>>>> price to pay for distributing, and so far Blur has the lowest
>>>>>> overhead I've seen.
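>>>>>>
>>>>>> For example, a query against one big table that narrows by site and
>>>>>> hour might look like this (a sketch assuming the 0.2.0 query structs;
>>>>>> the table, family, and column names are made up):
>>>>>>
>>>>>>     import org.apache.blur.thrift.generated.BlurQuery;
>>>>>>     import org.apache.blur.thrift.generated.BlurResults;
>>>>>>     import org.apache.blur.thrift.generated.SimpleQuery;
>>>>>>
>>>>>>     BlurQuery blurQuery = new BlurQuery();
>>>>>>     SimpleQuery simpleQuery = new SimpleQuery();
>>>>>>     // The logical partitioning lives in the query terms rather than
>>>>>>     // in per-hour table names.
>>>>>>     simpleQuery.setQueryStr("+event.site:site123 +event.hour:2013092805");
>>>>>>     blurQuery.setSimpleQuery(simpleQuery);
>>>>>>     BlurResults results = client.query("logs", blurQuery);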
>>>>>>
>>>>>> Hope that helps some.
>>>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <colton@dosarrest.com> wrote:
>>>>>>>
>>>>>>> Comments inline...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Colton McInroy
>>>>>>>
>>>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>>>>
>>>>>>>> I have commented inline below:
>>>>>>>>
>>>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>>>>>>>> I do have a few questions, if you don't mind... I am still trying
>>>>>>>>> to wrap my head around how this works. In my current implementation
>>>>>>>>> for a logging system, I create new indexes for each hour because I
>>>>>>>>> have a massive amount of data coming in. I take in live log data
>>>>>>>>> from syslog and parse/store it in hourly Lucene indexes along with
>>>>>>>>> a facet index. I want to turn this into a distributed, redundant
>>>>>>>>> system, and Blur appears to be the way to go. I tried Elasticsearch
>>>>>>>>> but it is just too slow compared to my current implementation.
>>>>>>>>> Given that I take in gigs of raw log data an hour, I need something
>>>>>>>>> that is robust and able to keep up with the inflow of data.
>>>>>>>>>
>>>>>>>> Due to the current implementation of building up an index for an
>>>>>>>> hour and then making it available, I would use MapReduce for this:
>>>>>>>>
>>>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>>>>
>>>>>>>> That way all the shards in a table get a little more data each
>>>>>>>> hour, and it's very low impact on the running cluster.
>>>>>>>>
>>>>>>> Not sure I understand this. I would like data to be accessible live
>>>>>>> as it comes in, not wait an hour before I can query against it. I am
>>>>>>> also not sure where MapReduce comes in here; I thought MapReduce was
>>>>>>> something that Blur used internally.
>>>>>>>
>>>>>>>>> When taking in lots of data constantly, how is it recommended that
>>>>>>>>> it be stored? I mentioned above that I create a new index for each
>>>>>>>>> hour to keep data separated and quicker to search. If I want to
>>>>>>>>> look up a specific time frame, I only have to load the directories
>>>>>>>>> timestamped with the hours I want to look at. So instead of having
>>>>>>>>> to look at a huge index of, like, a year's worth of data, I'm
>>>>>>>>> looking at a much smaller data set, which results in faster query
>>>>>>>>> response times. Should a new table be created for each hour of
>>>>>>>>> data? When I typed the create command into the shell, it took about
>>>>>>>>> 6 seconds to create a table. If I have to create a table for each
>>>>>>>>> application each hour, this could create a lot of lag. Perhaps this
>>>>>>>>> is just my test environment, though. Any thoughts on this? I also
>>>>>>>>> didn't see any examples of how to create tables via code.
>>>>>>>>>
>>>>>>>> First off, Blur is designed to store very large amounts of data.
>>>>>>>> And while it can do NRT updates like Solr and ES, its main focus is
>>>>>>>> on bulk ingestion through MapReduce.  Given that, the real limiting
>>>>>>>> factor is how much hardware you have.  Let's play out a scenario.
>>>>>>>> If you are adding 10GB of data an hour, a good rough ballpark guess
>>>>>>>> is that you will need 10-15% of the inbound data size as memory to
>>>>>>>> make the search perform well (at 10GB an hour that's roughly 240GB a
>>>>>>>> day, so 24-36GB of memory per day of data kept hot).  However, as
>>>>>>>> the index sizes increase this percentage may decrease over time.
>>>>>>>> Blur has an off-heap LRU cache to make accessing HDFS faster;
>>>>>>>> however, if you don't have enough memory the searches (and the
>>>>>>>> cluster for that matter) won't fail, they will simply become slower.
>>>>>>>>
>>>>>>>> So it's really a question of how much hardware you have, and of
>>>>>>>> filling a table only to the point where it still performs well given
>>>>>>>> the cluster you have.  You might have to break it into pieces, but I
>>>>>>>> think that hourly is too small.  Daily, weekly, monthly, etc.
>>>>>>>>
>>>>>>> In my current system (which uses just Lucene) that I designed, we
>>>>>>> take in mainly web logs and separate them into indexes. Each web
>>>>>>> server gets its own index for each hour. Then when I need to query
>>>>>>> the data, I use a multi-index reader to access the timeframe I need,
>>>>>>> allowing me to keep the size of the index down to roughly what I need
>>>>>>> to search. If data was stored over a month, and I want to query data
>>>>>>> that happened in just a single hour, or a few minutes, it makes sense
>>>>>>> to me to keep things optimized. Also, if I wanted to compare one web
>>>>>>> server to another, I would just use the multi-index reader to load
>>>>>>> both indexes. This is all handled by a single server though, so it is
>>>>>>> limited by the hardware of that single server. If something fails,
>>>>>>> it's a big problem. And when trying to query large data sets, it's
>>>>>>> again only a single server, so it takes longer than I would like if
>>>>>>> the index it's reading is large.
>>>>>>>
>>>>>>> I am not entirely sure how to go about doing this in Blur. I'm
>>>>>>> imagining that each "table" is an index, so I would have a table name
>>>>>>> format like YYYY_MM_DD_HH_IP. If I do this though, is there a way to
>>>>>>> query multiple tables, like a multi-table reader or something? Or am
>>>>>>> I limited to looking at a single table at a time?
>>>>>>>
>>>>>>> For some web servers that have little traffic, an hour of data may
>>>>>>> only have a few MB in it, while others may have like a 5-10GB index.
>>>>>>> If I combined the index from a large site with the small sites, this
>>>>>>> should make everything slower for queries against the small sites'
>>>>>>> index, correct? Or would it all be the same due to how Blur separates
>>>>>>> indexes into shards? Would it perhaps be better to have an index for
>>>>>>> each web server, and configure small sites to have fewer shards while
>>>>>>> larger sites have more shards?
>>>>>>>
>>>>>>> We just got a really large, powerful new server to be our log server,
>>>>>>> but since I realize it's a single point of failure, I want to change
>>>>>>> our configuration to a clustered/distributed one. So we would start
>>>>>>> with probably a minimal configuration, and add more shard servers
>>>>>>> whenever we can afford it or need it.
>>>>>>>
>>>>>>>>>
>>>>>>>>> Do shards contain the index data while the location (HDFS) contains
>>>>>>>>> the documents (what Lucene referred to them as)? I read that the
>>>>>>>>> shard contains the index while the fs contains the data... I just
>>>>>>>>> wasn't quite sure what the data was, because when I work with
>>>>>>>>> Lucene, the index directory contains the data as documents.
>>>>>>>>
>>>>>>>> The shard is stored in HDFS, and it is a Lucene index.  We store the
>>>>>>>> data inside the Lucene index, so it's basically Lucene all the way
>>>>>>>> down to HDFS.
>>>>>>>>
>>>>>>> OK, so basically a controller is a service which sends a distributed
>>>>>>> query to all (or some?) of the shards; each shard runs the query
>>>>>>> against a certain data set, which it gets either from memory or from
>>>>>>> the Hadoop cluster, processes it, and returns the result to the
>>>>>>> controller, which condenses the results from all the queried shards
>>>>>>> into a final result. Right?
>>>>>>>
>>>>>>>> Hope this helps.  Let us know if you have more questions.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aaron
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Colton McInroy
>
>

