incubator-blur-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Colton McInroy <col...@dosarrest.com>
Subject Re: Creating/Managing tables saying table already exists
Date Sun, 29 Sep 2013 05:39:37 GMT
How do you create a new table via java code? I see that the shell has a 
create command but I do not see anywhere in the docs how to do it via 
the thrift api call.

Thanks,
Colton McInroy

  * Director of Security Engineering

	
Phone
(Toll Free) 	
_US_ 	(888)-818-1344 Press 2
_UK_ 	0-800-635-0551 Press 2

My Extension 	101
24/7 Support 	support@dosarrest.com <mailto:support@dosarrest.com>
Email 	colton@dosarrest.com <mailto:colton@dosarrest.com>
Website 	http://www.dosarrest.com

On 9/28/2013 6:47 AM, Aaron McCurry wrote:
> On Sat, Sep 28, 2013 at 9:23 AM, Colton McInroy <colton@dosarrest.com>wrote:
>
>> So, basically your suggesting I use this undocumented Bulk MapReduce
>> method to add all of the data live as it comes in? Do you have an example
>> or any information on how I would accomplish this? What I could do is have
>> a flush period, where as the logs come in and get parsed, I build them up
>> to like 10000 entries or a timed interval, then bulk load them into blur.
>
> I will add some documentation on how to use it and probably an example, but
> I would try using the async client (maybe start with the regular client)
> first to see if it can keep up.  Just as an FYI I found a bug in 0.2.0
> mutateBatch that causes a deadlock.  I will resolve later today, but if you
> try it out before 0.2.1 is released (a couple of weeks) you will likely
> need to patch the code.  Here's the issue:
>
> https://issues.apache.org/jira/browse/BLUR-245
>
> Aaron
>
>
>
>
>> Thanks,
>> Colton McInroy
>>
>>   * Director of Security Engineering
>>
>>
>> Phone
>> (Toll Free)
>> _US_    (888)-818-1344 Press 2
>> _UK_    0-800-635-0551 Press 2
>>
>> My Extension    101
>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>> Website         http://www.dosarrest.com
>>
>> On 9/28/2013 6:06 AM, Aaron McCurry wrote:
>>
>>> So there is a method that is not documented that the Bulk MapReduce uses
>>> that could fill the gaps between MR and NRT updates.  Let's say that there
>>> a table with 100 shards.  In a given table on hdfs the path would look
>>> like
>>> "/blur/tables/table12345/**shard-000010/<the main index goes here>".
>>>
>>> Now the way MapReduce works is that it creates a sub directory in the main
>>> index:
>>> "/blur/tables/table12345/**shard-000010/some_index_name.**tmp/<new data
>>> here>"
>>>
>>> Once the index is ready to be committed the writer is closed for the new
>>> index and the suddir is renamed to:
>>> "/blur/tables/table12345/**shard-000010/some_index_name.**commit/<new
>>> data
>>> here>"
>>>
>>> The act of having an index in the shard directory that ends with ".commit"
>>> makes the shard pick up the index and do an index merge through the
>>> writer.addDirectory(..) call.  It checks for this every 10 seconds.
>>>
>>> While this is not a really easy to integrate yet.  I think that if I build
>>> an Apache Flume integration it will likely make use of this feature, or at
>>> least have an option to use it.
>>>
>>>
>>> As far as searching multiple tables, this has been asked for before so I
>>> think that is something that we should add.  It actually shouldn't be that
>>> difficult.
>>>
>>> Aaron
>>>
>>>
>>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <colton@dosarrest.com
>>>> wrote:
>>>   I actually didn't have any kind of loading interval, I loaded the new
>>>> event log entries into the index in real time. My code runs as a daemon
>>>> accepting syslog entries which indexes them live as they come in with a
>>>> flush call every 10000 entries or 1 minute, which ever comes first.
>>>>
>>>> And I don't want to have any limitation on lookback time. I want to be
>>>> able to look at the history of any site going back years if need be.
>>>>
>>>> Sucks there is no multi table reader, that limits what I can do by a bit.
>>>>
>>>>
>>>> Thanks,
>>>> Colton McInroy
>>>>
>>>>    * Director of Security Engineering
>>>>
>>>>
>>>> Phone
>>>> (Toll Free)
>>>> _US_    (888)-818-1344 Press 2
>>>> _UK_    0-800-635-0551 Press 2
>>>>
>>>> My Extension    101
>>>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>>>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>>>> Website         http://www.dosarrest.com
>>>>
>>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>>
>>>>   Mapreduce is a bulk entrypoint to loading blur. Much in the same way I
>>>>> bet
>>>>> you have some fancy code to grab up a bunch of log files, over some kind
>>>>> of
>>>>> interval and load them into your index,  MR replaces that process with
>>>>> an
>>>>> auto scaling (via hardware additions only) high bandwidth load that you
>>>>> could fire off at any interval you want. The MR bulk load writes a new
>>>>> index and merges that into the index already running when it completes.
>>>>> The
>>>>> catch is that it is NOT as efficient as your implementation is in terms
>>>>> of
>>>>> latency into the index. So your current impl that will load a small
>>>>> sites
>>>>> couple of mb real fast, MR might take 30 seconds to a minute to bring
>>>>> that
>>>>> online. Having said that blur has a realtime api for inserting that has
>>>>> low
>>>>> latency but you trade in your high bandwidth for it. Might be something
>>>>> you
>>>>> could detect on your front door and decide which way in the data comes.
>>>>>
>>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>>> and
>>>>> load for a single badass machine and doing manual partitioning tricks
to
>>>>> keep things snappy was key.  The neat thing about blur is some of that
>>>>> you
>>>>> don't do anymore.  I would call it an early optimization at this point
>>>>> to
>>>>> do anything shorter than say a day or whatever your max lookback time
>>>>> is.
>>>>> (Oh btw you can't search across tables in blur, forgot to mention that.)
>>>>>
>>>>> Instead of the lots of tables route I suggest trying one large one and
>>>>> seeing where that goes. Utilize blurs cache initializing capabilities
>>>>> and
>>>>> load in your site and time columns to keep your logical partitioning
>>>>> columns in the block cache and thus very fast. I bet you will see good
>>>>> performance with this approach. Certainly better than es. Not as fast
as
>>>>> raw lucene, but there is always a price to pay for distributing and so
>>>>> far
>>>>> blur has the lowest overhead I've seen.
>>>>>
>>>>> Hope that helps some.
>>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <colton@dosarrest.com>
>>>>> wrote:
>>>>>
>>>>>    Coments inline...
>>>>>
>>>>>> Thanks,
>>>>>> Colton McInroy
>>>>>>
>>>>>>     * Director of Security Engineering
>>>>>>
>>>>>>
>>>>>> Phone
>>>>>> (Toll Free)
>>>>>> _US_    (888)-818-1344 Press 2
>>>>>> _UK_    0-800-635-0551 Press 2
>>>>>>
>>>>>> My Extension    101
>>>>>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>>>>>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>>>>>> Website         http://www.dosarrest.com
>>>>>>
>>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>>
>>>>>>    I have commented inline below:
>>>>>>
>>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <
>>>>>>> colton@dosarrest.com
>>>>>>>
>>>>>>>   wrote:
>>>>>>>>         I do have a few question if you don't mind... I am
still
>>>>>>>> trying
>>>>>>>> to
>>>>>>>> wrap my head around how this works. In my current implementation
for
>>>>>>>> a
>>>>>>>> logging system I create new indexes for each hour because
I have a
>>>>>>>> massive
>>>>>>>> amount of data coming in. I take in live log data from syslog
and
>>>>>>>> parse/store it in hourly lucene indexes along with a facet
index. I
>>>>>>>> want
>>>>>>>> to
>>>>>>>> turn this into a distributed redundant system and blur appears
to be
>>>>>>>> the
>>>>>>>> way to go. I tried elasticsearch but it is just too slow
compared to
>>>>>>>> my
>>>>>>>> current implementation. Given I take in gigs of raw log data
an
>>>>>>>> hour, I
>>>>>>>> need something that is robust and able to keep up with in
flow of
>>>>>>>> data.
>>>>>>>>
>>>>>>>>     Due to the current implementation of building up an index
for an
>>>>>>>> hour
>>>>>>>>
>>>>>>>>   and
>>>>>>> then making available.  I would use MapReduce for this:
>>>>>>>
>>>>>>> http://incubator.apache.org/******blur/docs/0.2.0/using-blur.****<http://incubator.apache.org/****blur/docs/0.2.0/using-blur.**>
>>>>>>> <http://incubator.apache.org/****blur/docs/0.2.0/using-blur.****<http://incubator.apache.org/**blur/docs/0.2.0/using-blur.**>
>>>>>>> html#map-reduce<http://**incub**ator.apache.org/blur/**<http://incubator.apache.org/blur/**>
>>>>>>> docs/0.2.0/using-blur.html#****map-reduce<http://incubator.**
>>>>>>> apache.org/blur/docs/0.2.0/**using-blur.html#map-reduce<http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce>
>>>>>>> That way all the shards in a table get a little more data each
hour
>>>>>>> and
>>>>>>> it's very low impact on the running cluster.
>>>>>>>
>>>>>>>    Not sure I understand this. I would like data to be accessible
live
>>>>>>> as
>>>>>>>
>>>>>> it
>>>>>> comes in, not wait an hour before I can query against it.
>>>>>> I am also not sure where map-reduce comes in here. I thought mapreduce
>>>>>> is
>>>>>> something that blur used internally.
>>>>>>
>>>>>>          When taking in lots of data constantly, how is it recommended
>>>>>>
>>>>>>> that it
>>>>>>>
>>>>>>>   be stored? I mentioned above that I create a new index for
each hour
>>>>>>>> to
>>>>>>>> keep data separated and quicker to search. If I want to look
up a
>>>>>>>> specific
>>>>>>>> time frame, I only have to load the directories timestamped
with the
>>>>>>>> hours
>>>>>>>> I want to look at. So instead of having to look at a huge
index of
>>>>>>>> like a
>>>>>>>> years worth of data, i'm looking at a much smaller data set
which
>>>>>>>> results
>>>>>>>> in faster query response times. Should a new table be created
for
>>>>>>>> each
>>>>>>>> hour
>>>>>>>> of data? When I typed in the create command into the shell,
it takes
>>>>>>>> about
>>>>>>>> 6 seconds to create a table. If I have to create a table
for each
>>>>>>>> application each hour, this could create a lot of lag. Perhaps
this
>>>>>>>> is
>>>>>>>> just
>>>>>>>> in my test environment though. Any thoughts on this? I also
didn't
>>>>>>>> see
>>>>>>>> any
>>>>>>>> examples of how to create tables via code.
>>>>>>>>
>>>>>>>>     First off Blur is designed to store very large amounts
of data.
>>>>>>>>   And
>>>>>>>>
>>>>>>>>   while
>>>>>>> it can do NRT updates like Solr and ES it's main focus in on
bulk
>>>>>>> ingestion
>>>>>>> through MapReduce.  Given that, the real limiting factor is how
much
>>>>>>> hardware you have.  Let's play out a scenario.  If you are adding
10GB
>>>>>>> of
>>>>>>> data an hour and I would think that a good rough ballpark guess
is
>>>>>>> that
>>>>>>> you
>>>>>>> will need 10-15% of inbound data size as memory to make the search
>>>>>>> perform
>>>>>>> well.  However as the index sizes increase this % may decrease
over
>>>>>>> time.
>>>>>>>      Blur has an off-heap lru cache to make accessing hdfs faster,
>>>>>>> however if
>>>>>>> you don't have enough memory the searches (and the cluster for
that
>>>>>>> matter)
>>>>>>> won't fail, they will simply become slower.
>>>>>>>
>>>>>>> So it's really a question of how much hardware you have.  If
you have
>>>>>>> filling a table enough to where it does perform well given the
cluster
>>>>>>> you
>>>>>>> have.  You might have to break it into pieces.  But I think that
>>>>>>> hourly
>>>>>>> is
>>>>>>> too small.  Daily, Weekly, Monthly, etc.
>>>>>>>
>>>>>>>    In my current system (which uses just lucene) I designed we
take in
>>>>>>>
>>>>>> mainly
>>>>>> web logs and separate them into indexes. Each web server gets it's
own
>>>>>> index for each hour. Then when I need to query the data, I use a
multi
>>>>>> index reader to access the timeframe I need allowing me to keep the
>>>>>> size
>>>>>> of
>>>>>> index down to roughly what I need to search. If data was stored over
a
>>>>>> month, and I want to query data that happened in just a single hour,
>>>>>> or a
>>>>>> few minutes, it makes sense to me to keep things optimized. Also,
if I
>>>>>> wanted to compare one web server to another, I would just use the
multi
>>>>>> index reader to load both indexes. This is all handled by a single
>>>>>> server
>>>>>> though, so it is limited by the hardware of the single server. If
>>>>>> something
>>>>>> fails, it's a big problem. When trying to query large data sets,
it's
>>>>>> again, only a single server, so it takes longer than I would like
if
>>>>>> the
>>>>>> index it's reading is large.
>>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>>> imagining
>>>>>> that each "table" is an index. So I would have a table format like...
>>>>>> YYYY_MM_DD_HH_IP. If I do this though, is there a way to query multiple
>>>>>> tables... like a milti table reader or something? or am I limited
to
>>>>>> looking at a single table at a time?
>>>>>> For some web servers that have little traffic, an hour of data may
only
>>>>>> have a few mb of data in it while other may have like a 5-10gb index.
>>>>>> If
>>>>>> I
>>>>>> combined the index from a large site with the small sites, this should
>>>>>> make
>>>>>> everything slower for the queries against the small sites index
>>>>>> correct?
>>>>>> Or
>>>>>> would it all be the same due to how blur separates indexes into shards?
>>>>>> Would it perhaps be better to have an index for each web server,
and
>>>>>> configure small sites to have less shards while larger sites have
more
>>>>>> shards?
>>>>>> We just got a new really large powerful server to be our log server,
>>>>>> but
>>>>>> as I realize that it's a single point of failure, I want to change
our
>>>>>> configuration to use a clustered/distributed configuration. So we
would
>>>>>> start with probably a minimal configuration, and start adding more
>>>>>> shard
>>>>>> servers when ever we can afford it or need it.
>>>>>>
>>>>>>          Do shards contain the index data while the location (hdfs)
>>>>>>
>>>>>>> contains
>>>>>>>
>>>>>>>   the documents (what lucene referred to them as)? I read that
the
>>>>>>>> shard
>>>>>>>> contains the index while the fs contains the data... I just
wasn't
>>>>>>>> quiet
>>>>>>>> sure what the data was, because when I work with lucene,
the index
>>>>>>>> directory contains the data as a document.
>>>>>>>>
>>>>>>>>    The shard is stored in HDFS, and it is a Lucene index.
 We store
>>>>>>>> the
>>>>>>>>
>>>>>>> data
>>>>>>> inside the Lucene index, so it's basically Lucene all the way
down to
>>>>>>> HDFS.
>>>>>>>
>>>>>>>    Ok, so basically a controller is a service which connects
to all (or
>>>>>>>
>>>>>> some?) shards a distributed query, which tells the shard to run a
query
>>>>>> against a certain data set, that shard then gets that data set either
>>>>>> from
>>>>>> memory or from the hadoop cluster, processes it, and returns the
result
>>>>>> to
>>>>>> the controller which condenses the results from all the queried shards
>>>>>> into
>>>>>> a final result right?
>>>>>>
>>>>>>    Hope this helps.  Let us know if you have more questions.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Aaron
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    Thanks,
>>>>>>>
>>>>>>>> Colton McInroy
>>>>>>>>
>>>>>>>>      * Director of Security Engineering
>>>>>>>>
>>>>>>>>
>>>>>>>> Phone
>>>>>>>> (Toll Free)
>>>>>>>> _US_    (888)-818-1344 Press 2
>>>>>>>> _UK_    0-800-635-0551 Press 2
>>>>>>>>
>>>>>>>> My Extension    101
>>>>>>>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>>>>>>>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>>>>>>>> Website         http://www.dosarrest.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message