incubator-blur-dev mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Creating/Managing tables saying table already exists
Date Wed, 02 Oct 2013 01:59:02 GMT
Hi,

I'm joining this very informative thread a little late. Splunk is
pricey.  But why not go for Loggly or something along those lines?
Building a good, scalable, fast logging system with more than just
basic features and with a nice UI takes time, and time is priceless.
If it's of interest, we just released Logsene internally here at
Sematext ( http://sematext.com/logsene/ ) and have Kibana running on
top of it.  May be something to consider if you want a logging system
todayish.

Back to observer mode...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <colton@dosarrest.com> wrote:
> Thanks for the info, appreciate it.
>
> Yes I have looked at splunk... for the amount of data I am dealing with,
> they want $500,000 outright, with a $200,000 yearly subscription. As we
> scale up, the price goes up as well; their pricing scheme is absolutely
> nuts. I figure for less than the outright cost, I could build myself a
> hadoop cluster, a blur cluster, and build the software myself so it does
> just as well as, if not better than, what splunk does. That's why I am
> working on this project right now. I made something with lucene itself,
> but to ensure redundancy and scalability, I wanted to go with something
> distributed.
> Before I started developing my own project from scratch, I took a look at
> what was already out there for merging hadoop and lucene, which is what led
> me to blur. I am hoping that using blur with hadoop will allow me to manage
> the large amount of log data that I have constantly flowing in.
>
> Right now it's all on one server that has over 100TB of disk space and
> lots of RAM and cores, but as I watch the load on the system right now, I
> realize that at some point the single server just isn't going to cut it.
> Eventually the level of data will go above what any single hardware box
> can handle.
>
> Since the data is log entries, my goal is to store/index all log data in
> real time as it comes in, make it easily searchable, and use facet data to
> obtain metrics, all while being distributed across a redundant
> infrastructure.
>
> My plan is to use this information to correlate things that occur across
> large timespans, like comparing traffic levels during this month last year
> against the same month this year, or seeing which servers an IP has
> accessed over the past year, etc.
>
> From what I see so far, I will be building a hadoop cluster to store the
> data, a blur cluster to process the data, and then a parser which takes in
> data in various formats and passes it off to blur. Then I will have a
> client which handles the search queries against it... which actually
> brings up another question... If I parse data one way, but then craft a
> new parser, how well does blur handle changing records? Say I do not have
> a parser that handles a particular log entry, so that line ends up being
> logged with its contents stored in just a raw message field, but later I
> figure out a way to parse that line into custom fields. Does the mutation
> system work well for manipulating a lot of records... like, say, going
> over a month's or even a year's worth of entries matching a certain query?
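A rough sketch of what one such record update might look like through the
thrift API, for anyone following along. This is hedged: the class and method
names (BlurClient, RowMutation, RecordMutation, RecordMutationType, mutate)
are my reading of the 0.2.x thrift bindings, and the table, row, record and
column names are invented for illustration. Re-parsing a month's or year's
worth of matches would mean issuing very many of these (or doing it as a bulk
MapReduce job), so it is worth testing at scale before committing to the
approach.

    import java.util.Arrays;

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.Column;
    import org.apache.blur.thrift.generated.Record;
    import org.apache.blur.thrift.generated.RecordMutation;
    import org.apache.blur.thrift.generated.RecordMutationType;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.blur.thrift.generated.RowMutationType;

    public class ReparseRecord {
      public static void main(String[] args) throws Exception {
        // Connect through a controller; host and port are placeholders.
        Blur.Iface client = BlurClient.getClient("controller1:40010");

        // The record originally held only a raw "message" column; after
        // writing a better parser, replace it with structured fields.
        Record record = new Record();
        record.setRecordId("log-000123");               // hypothetical record id
        record.setFamily("access");                     // hypothetical column family
        record.setColumns(Arrays.asList(
            new Column("clientip", "203.0.113.9"),
            new Column("status", "404"),
            new Column("message", "GET /missing HTTP/1.1 404")));

        RecordMutation recordMutation = new RecordMutation();
        recordMutation.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD);
        recordMutation.setRecord(record);

        RowMutation rowMutation = new RowMutation();
        rowMutation.setTable("logs");                   // hypothetical table
        rowMutation.setRowId("2013-09-28/webserver42"); // hypothetical row id
        rowMutation.setRowMutationType(RowMutationType.UPDATE_ROW);
        rowMutation.setRecordMutations(Arrays.asList(recordMutation));

        client.mutate(rowMutation);
      }
    }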
>
>
> Thanks,
> Colton McInroy
>
>  * Director of Security Engineering
>
>
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
>
> My Extension    101
> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
> Website         http://www.dosarrest.com
>
> On 9/28/2013 5:29 AM, Garrett Barton wrote:
>>
>> The going back over a certain time was just a suggestion based on a guess
>> of your query needs.  Personally I would go for one large index to begin
>> with; if performance ever became a problem and I could not add a few more
>> nodes to the cluster, I would then consider splitting the index.
>>
>> In the next month or so I will be prototyping 2 very interesting indexes.
>> One will be a massive, massive full text index that I plan to load with
>> bulk MR running every hour. The other has to be able to load several
>> TB/hour, and I will be trying that on a much shorter MR schedule, say
>> every 5-10 minutes. I expect both to work fine.
>>
>> I don't think it's that far of a stretch for you to go to the minute
>> level like you have today with the MR approach, or hell, try the thrift
>> api; with enough nodes I bet it would handle that kind of load as well.
>>
>> Just slightly off topic, have you looked at splunk?  It does what you're
>> trying to do out of the box.
>>
>>
>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>
>>> I actually didn't have any kind of loading interval; I loaded the new
>>> event log entries into the index in real time. My code runs as a daemon
>>> accepting syslog entries and indexes them live as they come in, with a
>>> flush call every 10,000 entries or 1 minute, whichever comes first.
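That commit policy, for reference, boils down to something like the sketch
below. This is not the actual daemon code, just a minimal, assumed
reconstruction using plain Lucene and java.util.concurrent; the class name
and the threshold constant are invented.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BatchingIndexer {
      private static final long FLUSH_EVERY_DOCS = 10_000;

      private final IndexWriter writer;
      private final AtomicLong pending = new AtomicLong();
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();

      public BatchingIndexer(IndexWriter writer) {
        this.writer = writer;
        // Time-based flush: once a minute regardless of volume.
        timer.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
      }

      public void add(Document doc) throws Exception {
        writer.addDocument(doc);
        // Count-based flush: every 10,000 entries, whichever comes first.
        if (pending.incrementAndGet() >= FLUSH_EVERY_DOCS) {
          flush();
        }
      }

      private synchronized void flush() {
        try {
          if (pending.getAndSet(0) > 0) {
            writer.commit(); // make recently added entries visible to new readers
          }
        } catch (Exception e) {
          e.printStackTrace(); // real code would log and retry
        }
      }
    }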
>>>
>>> And I don't want to have any limitation on lookback time. I want to be
>>> able to look at the history of any site going back years if need be.
>>>
>>> Sucks there is no multi table reader, that limits what I can do by a bit.
>>>
>>>
>>> Thanks,
>>> Colton McInroy
>>>
>>>   * Director of Security Engineering
>>>
>>>
>>> Phone
>>> (Toll Free)
>>> _US_    (888)-818-1344 Press 2
>>> _UK_    0-800-635-0551 Press 2
>>>
>>> My Extension    101
>>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>>> Website         http://www.dosarrest.com
>>>
>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>
>>>> Mapreduce is a bulk entrypoint to loading blur. Much in the same way I
>>>> bet you have some fancy code to grab up a bunch of log files over some
>>>> kind of interval and load them into your index, MR replaces that process
>>>> with an auto scaling (via hardware additions only) high bandwidth load
>>>> that you could fire off at any interval you want. The MR bulk load
>>>> writes a new index and merges it into the index already running when it
>>>> completes. The catch is that it is NOT as efficient as your
>>>> implementation in terms of latency into the index. So while your current
>>>> impl will load a small site's couple of MB real fast, MR might take 30
>>>> seconds to a minute to bring that online. Having said that, blur has a
>>>> realtime api for inserting that has low latency, but you trade in your
>>>> high bandwidth for it. Might be something you could detect at your front
>>>> door and decide which way the data comes in.
>>>>
>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>> and load for a single badass machine and doing manual partitioning
>>>> tricks to keep things snappy was key.  The neat thing about blur is that
>>>> some of that you don't do anymore.  I would call it an early
>>>> optimization at this point to do anything shorter than say a day, or
>>>> whatever your max lookback time is.  (Oh btw, you can't search across
>>>> tables in blur, forgot to mention that.)
>>>>
>>>> Instead of the lots-of-tables route, I suggest trying one large one and
>>>> seeing where that goes. Utilize blur's cache initializing capabilities
>>>> and load in your site and time columns to keep your logical partitioning
>>>> columns in the block cache and thus very fast. I bet you will see good
>>>> performance with this approach. Certainly better than es. Not as fast as
>>>> raw lucene, but there is always a price to pay for distributing, and so
>>>> far blur has the lowest overhead I've seen.
>>>>
>>>> Hope that helps some.
>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <colton@dosarrest.com> wrote:
>>>>
>>>>> Comments inline...
>>>>>
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>
>>>>>    * Director of Security Engineering
>>>>>
>>>>>
>>>>> Phone
>>>>> (Toll Free)
>>>>> _US_    (888)-818-1344 Press 2
>>>>> _UK_    0-800-635-0551 Press 2
>>>>>
>>>>> My Extension    101
>>>>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>>>>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>>>>> Website         http://www.dosarrest.com
>>>>>
>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>
>>>>>> I have commented inline below:
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>>>>>> I do have a few questions if you don't mind... I am still trying to
>>>>>>> wrap my head around how this works. In my current implementation for
>>>>>>> a logging system I create new indexes for each hour because I have a
>>>>>>> massive amount of data coming in. I take in live log data from syslog
>>>>>>> and parse/store it in hourly lucene indexes along with a facet index.
>>>>>>> I want to turn this into a distributed redundant system and blur
>>>>>>> appears to be the way to go. I tried elasticsearch but it is just too
>>>>>>> slow compared to my current implementation. Given I take in gigs of
>>>>>>> raw log data an hour, I need something that is robust and able to
>>>>>>> keep up with the inflow of data.
>>>>>>>
>>>>>> Given your current implementation of building up an index for an hour
>>>>>> and then making it available, I would use MapReduce for this:
>>>>>>
>>>>>>
>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>>
>>>>>> That way all the shards in a table get a little more data each hour
>>>>>> and
>>>>>> it's very low impact on the running cluster.
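For anyone skimming the archive, the doc linked above boils down to a driver
along these lines. This is a sketch patterned on the 0.2.0 CSV example, not a
tested job: CsvBlurMapper and BlurOutputFormat.setupJob come from the
blur-mapred library as I understand it, and the table name, column family,
column names, and HDFS paths are placeholders.

    import org.apache.blur.mapreduce.lib.BlurOutputFormat;
    import org.apache.blur.mapreduce.lib.CsvBlurMapper;
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class BulkLoadLogs {
      public static void main(String[] args) throws Exception {
        // Look up the existing table so the job writes shards that match it.
        Iface client = BlurClient.getClient("controller1:40010");
        TableDescriptor tableDescriptor = client.describe("logs");

        Job job = new Job(new Configuration(), "blur bulk load: logs");
        job.setJarByClass(BulkLoadLogs.class);
        job.setMapperClass(CsvBlurMapper.class);
        job.setInputFormatClass(TextInputFormat.class);

        // The hour's (or day's) worth of parsed log lines staged in HDFS.
        FileInputFormat.addInputPath(job, new Path("/staging/logs/2013/09/28"));

        // Map CSV fields into a column family; names are illustrative only.
        CsvBlurMapper.addColumns(job, "access", "timestamp", "clientip",
            "status", "message");

        // Write new index segments and hand them to the running shards on commit.
        BlurOutputFormat.setupJob(job, tableDescriptor);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }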
>>>>>>
>>>>> Not sure I understand this. I would like data to be accessible live as
>>>>> it comes in, not wait an hour before I can query against it.
>>>>> I am also not sure where map-reduce comes in here. I thought mapreduce
>>>>> is something that blur used internally.
>>>>>
>>>>>>> When taking in lots of data constantly, how is it recommended that
>>>>>>> it be stored? I mentioned above that I create a new index for each
>>>>>>> hour to keep data separated and quicker to search. If I want to look
>>>>>>> up a specific time frame, I only have to load the directories
>>>>>>> timestamped with the hours I want to look at. So instead of having to
>>>>>>> look at a huge index of like a year's worth of data, I'm looking at a
>>>>>>> much smaller data set, which results in faster query response times.
>>>>>>> Should a new table be created for each hour of data? When I typed the
>>>>>>> create command into the shell, it took about 6 seconds to create a
>>>>>>> table. If I have to create a table for each application each hour,
>>>>>>> this could create a lot of lag. Perhaps this is just in my test
>>>>>>> environment though. Any thoughts on this? I also didn't see any
>>>>>>> examples of how to create tables via code.
>>>>>>>
>>>>>> First off, Blur is designed to store very large amounts of data. And
>>>>>> while it can do NRT updates like Solr and ES, its main focus is on
>>>>>> bulk ingestion through MapReduce.  Given that, the real limiting
>>>>>> factor is how much hardware you have.  Let's play out a scenario.  If
>>>>>> you are adding 10GB of data an hour, a good rough ballpark guess is
>>>>>> that you will need 10-15% of inbound data size as memory to make the
>>>>>> search perform well.  However, as the index sizes increase this % may
>>>>>> decrease over time.  Blur has an off-heap lru cache to make accessing
>>>>>> hdfs faster; however, if you don't have enough memory the searches
>>>>>> (and the cluster for that matter) won't fail, they will simply become
>>>>>> slower.
>>>>>>
>>>>>> So it's really a question of how much hardware you have, and of
>>>>>> filling a table only to the point where it still performs well given
>>>>>> the cluster you have.  You might have to break it into pieces.  But I
>>>>>> think that hourly is too small.  Daily, Weekly, Monthly, etc.
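On the side question above about creating tables from code rather than the
shell, a bare-bones sketch through the thrift API would look roughly like
this. The TableDescriptor field names follow my reading of the 0.2.x thrift
IDL; the table name, shard count, and HDFS URI are placeholders, and a busier
site would simply get a higher shard count.

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.TableDescriptor;

    public class CreateLogTable {
      public static void main(String[] args) throws Exception {
        Blur.Iface client = BlurClient.getClient("controller1:40010");

        TableDescriptor td = new TableDescriptor();
        td.setName("logs_webserver42");     // placeholder table name
        td.setShardCount(16);               // more shards for busier sites
        td.setTableUri("hdfs://namenode:8020/blur/tables/logs_webserver42");

        client.createTable(td);
      }
    }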
>>>>>>
>>>>> In my current system (which uses just lucene and which I designed), we
>>>>> take in mainly web logs and separate them into indexes. Each web server
>>>>> gets its own index for each hour. Then when I need to query the data, I
>>>>> use a multi index reader to access the timeframe I need, allowing me to
>>>>> keep the size of the index down to roughly what I need to search. If
>>>>> data was stored over a month, and I want to query data that happened in
>>>>> just a single hour, or a few minutes, it makes sense to me to keep
>>>>> things optimized. Also, if I wanted to compare one web server to
>>>>> another, I would just use the multi index reader to load both indexes.
>>>>> This is all handled by a single server though, so it is limited by the
>>>>> hardware of the single server. If something fails, it's a big problem.
>>>>> When trying to query large data sets, it's again only a single server,
>>>>> so it takes longer than I would like if the index it's reading is
>>>>> large.
>>>>>
>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>> imagining that each "table" is an index. So I would have a table format
>>>>> like... YYYY_MM_DD_HH_IP. If I do this though, is there a way to query
>>>>> multiple tables... like a multi table reader or something? Or am I
>>>>> limited to looking at a single table at a time?
>>>>>
>>>>> For some web servers that have little traffic, an hour of data may only
>>>>> have a few MB of data in it, while others may have like a 5-10GB index.
>>>>> If I combined the index from a large site with the small sites, this
>>>>> should make everything slower for the queries against the small sites'
>>>>> index, correct? Or would it all be the same due to how blur separates
>>>>> indexes into shards? Would it perhaps be better to have an index for
>>>>> each web server, and configure small sites to have fewer shards while
>>>>> larger sites have more shards?
>>>>>
>>>>> We just got a really large, powerful new server to be our log server,
>>>>> but as I realize that it's a single point of failure, I want to change
>>>>> our configuration to use a clustered/distributed configuration. So we
>>>>> would start with probably a minimal configuration, and start adding
>>>>> more shard servers whenever we can afford it or need it.
>>>>>
>>>>>>> Do shards contain the index data while the location (hdfs) contains
>>>>>>> the documents (what lucene referred to them as)? I read that the
>>>>>>> shard contains the index while the fs contains the data... I just
>>>>>>> wasn't quite sure what the data was, because when I work with lucene,
>>>>>>> the index directory contains the data as a document.
>>>>>>>
>>>>>> The shard is stored in HDFS, and it is a Lucene index.  We store the
>>>>>> data inside the Lucene index, so it's basically Lucene all the way
>>>>>> down to HDFS.
>>>>>>
>>>>> Ok, so basically a controller is a service which connects to all (or
>>>>> some?) shards for a distributed query; it tells each shard to run a
>>>>> query against a certain data set, the shard then gets that data set
>>>>> either from memory or from the hadoop cluster, processes it, and
>>>>> returns the result to the controller, which condenses the results from
>>>>> all the queried shards into a final result, right?
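From the client side, that whole round trip is a single call against a
controller. A hedged sketch: the BlurQuery/Query field names below follow my
reading of the 0.2.x thrift IDL and may differ between versions, and the
table and field names are invented. (If I am reading the IDL right, BlurQuery
also carries a facets list, which would line up with the metrics-from-facets
goal mentioned earlier in the thread.)

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.BlurQuery;
    import org.apache.blur.thrift.generated.BlurResults;
    import org.apache.blur.thrift.generated.Query;

    public class QueryLogs {
      public static void main(String[] args) throws Exception {
        // The client talks only to a controller; the controller fans the
        // query out to the shard servers and merges their results.
        Blur.Iface client = BlurClient.getClient("controller1:40010");

        Query query = new Query();
        query.setQuery("access.clientip:203.0.113.9"); // Lucene-style syntax, made-up field

        BlurQuery blurQuery = new BlurQuery();
        blurQuery.setQuery(query);
        blurQuery.setFetch(10);                        // how many rows to fetch back

        BlurResults results = client.query("logs", blurQuery);
        System.out.println("total hits: " + results.getTotalResults());
      }
    }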
>>>>>
>>>>>> Hope this helps.  Let us know if you have more questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Aaron
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Colton McInroy
>>>>>>>
>>>>>>>     * Director of Security Engineering
>>>>>>>
>>>>>>>
>>>>>>> Phone
>>>>>>> (Toll Free)
>>>>>>> _US_    (888)-818-1344 Press 2
>>>>>>> _UK_    0-800-635-0551 Press 2
>>>>>>>
>>>>>>> My Extension    101
>>>>>>> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
>>>>>>> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
>>>>>>> Website         http://www.dosarrest.com
>>>>>>>
>>>>>>>
>>>>>>>
>
