incubator-blur-dev mailing list archives

From Colton McInroy <col...@dosarrest.com>
Subject Re: Creating/Managing tables saying table already exists
Date Wed, 02 Oct 2013 21:52:59 GMT
I have tried all of those systems, and with the amount of data we deal
with, they don't stand up very well to what we need done. I have already
designed two log systems, and this will hopefully be my third and final
implementation, this time a distributed version of my previous work.

It does take time to develop things, but for the amount of time I would
spend developing this, it's well worth it. I have tried other free
systems out there like ElasticSearch and the like, but they all fall
short. My software, running on a dual Intel Xeon W5580 @ 3.2GHz with
32 HDDs in hardware RAID (5 or 6, I forget which), can process about
10,000-20,000 entries per second while keeping the server load at about
2.00-5.00 over 5 minutes and about 50-70% disk utilization. When using
other systems like ElasticSearch, the server hits a load average of over
50.00 and 100% disk utilization, which makes most of those solutions
inadequate for what I am trying to do.

I also need something that lets me adjust the log formats on the fly:
adding new ones, detecting entries that match no known pattern, etc.
Splunk is the only thing I have seen which can properly handle the
amount of log data I am dealing with, in terms of both indexing and
searching, but it is just extremely expensive. I have no idea how they
can get away with such a ridiculous pricing scheme. I have talked to
their developers at conferences and they agreed the prices are just
crazy. For the price of just one year's subscription fee, I can and will
pay myself to build something that will do roughly the same job, as well
as buy the hardware to start the clusters.
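
To give a concrete idea of what adjusting formats on the fly means here,
below is a minimal sketch of the approach (the class and field names are
illustrative, not from any existing codebase): each format is a regex plus
a list of field names, formats can be registered at runtime, and a line
that matches no pattern is kept whole in a single message field so it can
be re-parsed later once a pattern exists for it.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of a pluggable log parser: formats can be added at runtime,
    // and lines matching no known format fall back to a raw "message"
    // field so nothing is lost.
    public class LogParser {

        static final class Format {
            final Pattern pattern;
            final List<String> fieldNames; // one name per capture group
            Format(String regex, List<String> fieldNames) {
                this.pattern = Pattern.compile(regex);
                this.fieldNames = fieldNames;
            }
        }

        private final List<Format> formats = new CopyOnWriteArrayList<>();

        // Register a new log format while the daemon is running.
        public void addFormat(String regex, List<String> fieldNames) {
            formats.add(new Format(regex, fieldNames));
        }

        public Map<String, String> parse(String line) {
            for (Format f : formats) {
                Matcher m = f.pattern.matcher(line);
                if (m.matches()) {
                    Map<String, String> fields = new LinkedHashMap<>();
                    for (int i = 0; i < f.fieldNames.size(); i++) {
                        fields.put(f.fieldNames.get(i), m.group(i + 1));
                    }
                    return fields;
                }
            }
            // Unknown format: keep the raw line for later re-parsing.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("message", line);
            return fields;
        }
    }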

Thanks,
Colton McInroy

  * Director of Security Engineering

Phone (Toll Free):
  US: (888)-818-1344, Press 2
  UK: 0-800-635-0551, Press 2
My Extension: 101
24/7 Support: support@dosarrest.com
Email: colton@dosarrest.com
Website: http://www.dosarrest.com

On 10/1/2013 6:59 PM, Otis Gospodnetic wrote:
> Hi,
>
> I'm joining this very informative thread a little late. Splunk is
> pricey.  But why not go for Loggly or something along those lines?
> Building a good, scalable, fast logging system with more than just
> basic features and with a nice UI takes time, and time is priceless.
> If it's of interest, we just released Logsene internally here at
> Sematext ( http://sematext.com/logsene/ ) and have Kibana running on
> top of it.  May be something to consider if you want a logging system
> todayish.
>
> Back to observer mode...
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <colton@dosarrest.com> wrote:
>> Thanks for the info, appreciate it.
>>
>> Yes, I have looked at splunk... for the amount of data I am dealing with,
>> they want $500,000 outright, with a $200,000 yearly subscription. As we
>> scale up, the price goes up as well; their pricing scheme is absolutely
>> nuts. I figure that for less than the outright cost, I could build myself
>> a hadoop cluster and a blur cluster, and build software myself that does
>> just as good a job as splunk, if not better. That's why I am working on
>> this project right now. I made something with lucene itself, but to
>> ensure redundancy and scalability, I wanted to go with something
>> distributed. Before I started developing my own project from scratch, I
>> took a look at what was already out there for merging hadoop and lucene,
>> which is what led me to blur. I am hoping that using blur with hadoop
>> will allow me to manage the large amount of log data that I have
>> constantly flowing in.
>>
>> Right now it's all on one server that has over 100tb of disk space and
>> lots of ram and cores, but as I watch the load of the system right now, I
>> realize that at some point the single server just isn't going to cut it.
>> At some point the level of data will go above what any single hardware
>> box can handle.
>>
>> Being that the data is log entries, my goal is to store/index all log
>> data that comes in, in real time, and make it easily searchable, while
>> using facet data to obtain metrics, all on a distributed, redundant
>> infrastructure.
>>
>> My plan is to use this information to correlate things that occur across
>> large timespans, like seeing how traffic levels during this month last
>> year compare to the same month this year, or seeing what servers an IP
>> has accessed over the past year, etc.
>>
>> From what I see so far, I will be building a hadoop cluster to store the
>> data, a blur cluster to process the data, and then a parser which takes
>> in data in various formats and passes it off to blur. Then I will have a
>> client which handles the search queries against it... which actually
>> brings up another question... If I parse data one way, but then craft a
>> new parser, how well does blur handle changing records? Say I do not have
>> a parser that handles a particular log entry, so that line ends up being
>> logged as just a message field with the contents of the line stored in
>> it. But then later I figure out a way to parse that line into custom
>> fields. Does the mutation system work well when manipulating a lot of
>> records, like, say, going over a month's or even a year's worth of
>> entries matching a certain query?
>>
>>
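
For context on what such a re-parse would look like per record, here is a
rough sketch against the Blur 0.2 thrift API (the table, row, record, and
column names are placeholders, and the exact type and method names should
be checked against the version in use):

    import java.util.Arrays;

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.Column;
    import org.apache.blur.thrift.generated.Record;
    import org.apache.blur.thrift.generated.RecordMutation;
    import org.apache.blur.thrift.generated.RecordMutationType;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.blur.thrift.generated.RowMutationType;

    public class ReparseExample {
        public static void main(String[] args) throws Exception {
            Iface client = BlurClient.getClient("controller1:40010");

            // New version of the record: parsed fields instead of the
            // old single "message" field (field names are illustrative).
            Record record = new Record();
            record.setRecordId("recordid-1");
            record.setFamily("logs");
            record.addToColumns(new Column("src_ip", "10.0.0.1"));
            record.addToColumns(new Column("status", "200"));

            RecordMutation recordMutation = new RecordMutation();
            recordMutation.setRecordMutationType(
                RecordMutationType.REPLACE_ENTIRE_RECORD);
            recordMutation.setRecord(record);

            // Update just this record within its row.
            RowMutation mutation = new RowMutation();
            mutation.setTable("logs_2013_10");
            mutation.setRowId("rowid-1");
            mutation.setRowMutationType(RowMutationType.UPDATE_ROW);
            mutation.setRecordMutations(Arrays.asList(recordMutation));

            client.mutate(mutation);
        }
    }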
>> Thanks,
>> Colton McInroy
>>
>>
>> On 9/28/2013 5:29 AM, Garrett Barton wrote:
>>> The going back over a certain time was just a suggestion based on a
>>> guess at your query needs. Personally I would go for one large index to
>>> begin with; if performance ever became a problem and I could not add a
>>> few more nodes to the cluster, I would then consider splitting the index.
>>>
>>> In the next month or so I will be prototyping 2 very interesting
>>> indexes. One will be a massive, massive full text index that I plan on
>>> bulk loading with MR running every hour. The other has to be able to
>>> load several TB/hour, and I will be trying that on a much shorter MR
>>> schedule, say every 5-10 minutes. I expect both to work fine.
>>>
>>> I don't think it's that far of a stretch for you to go to the minute
>>> level like you have today with the MR approach, or, hell, try the thrift
>>> api; with enough nodes I bet it would handle that kind of load as well.
>>>
>>> Just slightly off topic, but have you looked at splunk? It does what
>>> you're trying to do out of the box.
>>>
>>>
>>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy
>>> <colton@dosarrest.com>wrote:
>>>
>>>> I actually didn't have any kind of loading interval; I loaded the new
>>>> event log entries into the index in real time. My code runs as a daemon
>>>> accepting syslog entries, indexing them live as they come in, with a
>>>> flush call every 10,000 entries or 1 minute, whichever comes first.
>>>>
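
For reference, the count-or-time flush policy described above looks
roughly like this with plain Lucene (a simplified sketch using a recent
Lucene API; the writer setup and field handling are illustrative):

    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    // Sketch of an ingest daemon that commits every N entries or every
    // minute, whichever comes first.
    public class SyslogIndexer {
        private static final int FLUSH_EVERY = 10_000;
        private final IndexWriter writer;
        private final AtomicLong sinceLastCommit = new AtomicLong();

        public SyslogIndexer(String path) throws Exception {
            writer = new IndexWriter(FSDirectory.open(Paths.get(path)),
                    new IndexWriterConfig(new StandardAnalyzer()));
            // Time-based flush: commit once a minute regardless of volume.
            ScheduledExecutorService timer =
                    Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(this::commitQuietly, 1, 1,
                    TimeUnit.MINUTES);
        }

        public void index(Map<String, String> fields) throws Exception {
            Document doc = new Document();
            for (Map.Entry<String, String> e : fields.entrySet()) {
                doc.add(new TextField(e.getKey(), e.getValue(),
                        Field.Store.YES));
            }
            writer.addDocument(doc);
            // Count-based flush: commit every 10,000 entries.
            if (sinceLastCommit.incrementAndGet() >= FLUSH_EVERY) {
                commitQuietly();
            }
        }

        private synchronized void commitQuietly() {
            try {
                sinceLastCommit.set(0);
                writer.commit();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }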
>>>> And I don't want to have any limitation on lookback time. I want to be
>>>> able to look at the history of any site going back years if need be.
>>>>
>>>> Sucks that there is no multi-table reader; that limits what I can do a
>>>> bit.
>>>>
>>>>
>>>> Thanks,
>>>> Colton McInroy
>>>>
>>>>
>>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>>
>>>>> Mapreduce is a bulk entrypoint to loading blur. Much in the same way I
>>>>> bet you have some fancy code to grab up a bunch of log files over some
>>>>> kind of interval and load them into your index, MR replaces that
>>>>> process with an auto-scaling (via hardware additions only),
>>>>> high-bandwidth load that you could fire off at any interval you want.
>>>>> The MR bulk load writes a new index and merges it into the index
>>>>> already running when it completes. The catch is that it is NOT as
>>>>> efficient as your implementation in terms of latency into the index.
>>>>> So where your current impl will load a small site's couple of mb real
>>>>> fast, MR might take 30 seconds to a minute to bring that online.
>>>>> Having said that, blur has a realtime api for inserting that has low
>>>>> latency, but you trade in your high bandwidth for it. Might be
>>>>> something you could detect on your front door and decide which way in
>>>>> the data comes.
>>>>>
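To make the bulk entrypoint concrete, a Blur bulk-load job looks roughly
like the following sketch, modeled on the CSV loader in the 0.2.0
MapReduce docs linked later in this thread (the controller address, table
name, column names, and paths are placeholders):

    import org.apache.blur.mapreduce.lib.BlurOutputFormat;
    import org.apache.blur.mapreduce.lib.CsvBlurMapper;
    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class BulkLoad {
        public static void main(String[] args) throws Exception {
            // Ask a controller for the table's descriptor so the job
            // knows where and how the shards are laid out.
            Iface client = BlurClient.getClient("controller1:40010");
            TableDescriptor descriptor = client.describe("logs");

            Job job = Job.getInstance(new Configuration(), "blur bulk load");
            job.setJarByClass(BulkLoad.class);
            job.setMapperClass(CsvBlurMapper.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/incoming/logs"));

            // Columns of the "logs" family, in CSV order (illustrative).
            CsvBlurMapper.addColumns(job, "logs",
                    "timestamp", "src_ip", "message");

            // Write Lucene indexes in MR and merge into the live table.
            BlurOutputFormat.setupJob(job, descriptor);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }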
>>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>>> and load for a single badass machine, and doing manual partitioning
>>>>> tricks to keep things snappy, was key. The neat thing about blur is
>>>>> that some of that you don't do anymore. I would call it an early
>>>>> optimization at this point to do anything shorter than, say, a day or
>>>>> whatever your max lookback time is. (Oh, btw, you can't search across
>>>>> tables in blur, forgot to mention that.)
>>>>>
>>>>> Instead of the lots-of-tables route, I suggest trying one large one
>>>>> and seeing where that goes. Utilize blur's cache-initializing
>>>>> capabilities and load in your site and time columns to keep your
>>>>> logical partitioning columns in the block cache and thus very fast. I
>>>>> bet you will see good performance with this approach. Certainly better
>>>>> than es. Not as fast as raw lucene, but there is always a price to pay
>>>>> for distributing, and so far blur has the lowest overhead I've seen.
>>>>>
>>>>> Hope that helps some.
>>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <colton@dosarrest.com> wrote:
>>>>>
>>>>>> Comments inline...
>>>>>>
>>>>>> Thanks,
>>>>>> Colton McInroy
>>>>>>
>>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>>
>>>>>>    I have commented inline below:
>>>>>>>
>>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>>>>>>> I do have a few questions if you don't mind... I am still trying to
>>>>>>>> wrap my head around how this works. In my current implementation for
>>>>>>>> a logging system, I create new indexes for each hour because I have
>>>>>>>> a massive amount of data coming in. I take in live log data from
>>>>>>>> syslog and parse/store it in hourly lucene indexes along with a
>>>>>>>> facet index. I want to turn this into a distributed, redundant
>>>>>>>> system, and blur appears to be the way to go. I tried elasticsearch
>>>>>>>> but it is just too slow compared to my current implementation. Given
>>>>>>>> I take in gigs of raw log data an hour, I need something that is
>>>>>>>> robust and able to keep up with the inflow of data.
>>>>>>>>
>>>>>>> Due to the current implementation of building up an index for an
>>>>>>> hour and then making it available, I would use MapReduce for this:
>>>>>>>
>>>>>>>
>>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>>>
>>>>>>> That way all the shards in a table get a little more data each hour,
>>>>>>> and it's very low impact on the running cluster.
>>>>>>>
>>>>>> Not sure I understand this. I would like data to be accessible live
>>>>>> as it comes in, not wait an hour before I can query against it. I am
>>>>>> also not sure where map-reduce comes in here. I thought mapreduce was
>>>>>> something that blur used internally.
>>>>>>
>>>>>>>> When taking in lots of data constantly, how is it recommended that
>>>>>>>> it be stored? I mentioned above that I create a new index for each
>>>>>>>> hour to keep data separated and quicker to search. If I want to look
>>>>>>>> up a specific time frame, I only have to load the directories
>>>>>>>> timestamped with the hours I want to look at. So instead of having
>>>>>>>> to look at a huge index of, like, a year's worth of data, I'm
>>>>>>>> looking at a much smaller data set, which results in faster query
>>>>>>>> response times. Should a new table be created for each hour of data?
>>>>>>>> When I typed the create command into the shell, it took about 6
>>>>>>>> seconds to create a table. If I have to create a table for each
>>>>>>>> application each hour, this could create a lot of lag. Perhaps this
>>>>>>>> is just in my test environment though. Any thoughts on this? I also
>>>>>>>> didn't see any examples of how to create tables via code.
>>>>>>>>
>>>>>>> First off, Blur is designed to store very large amounts of data. And
>>>>>>> while it can do NRT updates like Solr and ES, its main focus is on
>>>>>>> bulk ingestion through MapReduce. Given that, the real limiting
>>>>>>> factor is how much hardware you have. Let's play out a scenario. If
>>>>>>> you are adding 10GB of data an hour, I would think that a good rough
>>>>>>> ballpark guess is that you will need 10-15% of the inbound data size
>>>>>>> as memory to make the search perform well. However, as the index
>>>>>>> sizes increase this % may decrease over time. Blur has an off-heap
>>>>>>> lru cache to make accessing hdfs faster; however, if you don't have
>>>>>>> enough memory the searches (and the cluster for that matter) won't
>>>>>>> fail, they will simply become slower.
>>>>>>>
>>>>>>> So it's really a question of how much hardware you have, and whether
>>>>>>> you are filling a table enough to where it does perform well given
>>>>>>> the cluster you have. You might have to break it into pieces. But I
>>>>>>> think that hourly is too small. Daily, Weekly, Monthly, etc.
>>>>>>>
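On the "creating tables via code" question above: with the thrift client
it is only a few lines. A rough sketch (the table name, shard count, and
URI are placeholders; method names are per the 0.2 thrift API as I
understand it, so verify against the version in use):

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.TableDescriptor;

    public class CreateTable {
        public static void main(String[] args) throws Exception {
            Iface client = BlurClient.getClient("controller1:40010");

            TableDescriptor td = new TableDescriptor();
            td.setName("logs_2013_10");   // table name (placeholder)
            td.setShardCount(16);         // more shards for busier sites
            td.setTableUri("hdfs://namenode/blur/tables/logs_2013_10");

            client.createTable(td);
        }
    }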
>>>>>> In my current system (which I designed, using just lucene), we take
>>>>>> in mainly web logs and separate them into indexes. Each web server
>>>>>> gets its own index for each hour. Then when I need to query the data,
>>>>>> I use a multi index reader to access the timeframe I need, allowing me
>>>>>> to keep the size of the index down to roughly what I need to search.
>>>>>> If data was stored over a month and I want to query data that happened
>>>>>> in just a single hour, or a few minutes, it makes sense to me to keep
>>>>>> things optimized. Also, if I wanted to compare one web server to
>>>>>> another, I would just use the multi index reader to load both indexes.
>>>>>> This is all handled by a single server though, so it is limited by the
>>>>>> hardware of the single server. If something fails, it's a big problem.
>>>>>> When trying to query large data sets, it's, again, only a single
>>>>>> server, so it takes longer than I would like if the index it's reading
>>>>>> is large.
>>>>>>
>>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>>> imagining that each "table" is an index. So I would have a table
>>>>>> format like... YYYY_MM_DD_HH_IP. If I do this though, is there a way
>>>>>> to query multiple tables... like a multi-table reader or something? Or
>>>>>> am I limited to looking at a single table at a time?
>>>>>>
>>>>>> For some web servers that have little traffic, an hour of data may
>>>>>> only have a few mb of data in it, while others may have, like, a
>>>>>> 5-10gb index. If I combined the index from a large site with the small
>>>>>> sites, this would make queries against the small sites' index slower,
>>>>>> correct? Or would it all be the same due to how blur separates indexes
>>>>>> into shards? Would it perhaps be better to have an index for each web
>>>>>> server, and configure small sites to have fewer shards while larger
>>>>>> sites have more shards?
>>>>>>
>>>>>> We just got a really large, powerful new server to be our log server,
>>>>>> but as I realize that it's a single point of failure, I want to change
>>>>>> our configuration to a clustered/distributed one. So we would start
>>>>>> with probably a minimal configuration, and start adding more shard
>>>>>> servers whenever we can afford it or need it.
>>>>>>
>>>>>>>> Do shards contain the index data while the location (hdfs) contains
>>>>>>>> the documents (what lucene refers to them as)? I read that the shard
>>>>>>>> contains the index while the fs contains the data... I just wasn't
>>>>>>>> quite sure what the data was, because when I work with lucene, the
>>>>>>>> index directory contains the data as a document.
>>>>>>>>
>>>>>>> The shard is stored in HDFS, and it is a Lucene index. We store the
>>>>>>> data inside the Lucene index, so it's basically Lucene all the way
>>>>>>> down to HDFS.
>>>>>>>
>>>>>> Ok, so basically a controller is a service which connects to all (or
>>>>>> some?) shards and distributes a query: it tells each shard to run a
>>>>>> query against a certain data set, and that shard then gets the data
>>>>>> set either from memory or from the hadoop cluster, processes it, and
>>>>>> returns the result to the controller, which condenses the results from
>>>>>> all the queried shards into a final result. Right?
>>>>>>
>>>>>>> Hope this helps. Let us know if you have more questions.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aaron
>>>>>>>
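
For reference, a query issued through a controller, matching the flow
described above, looks roughly like this (a sketch per the 0.2.0 docs; the
table name and query string are illustrative):

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.BlurQuery;
    import org.apache.blur.thrift.generated.BlurResult;
    import org.apache.blur.thrift.generated.BlurResults;
    import org.apache.blur.thrift.generated.Query;

    public class SearchExample {
        public static void main(String[] args) throws Exception {
            // The controller fans the query out to the shard servers and
            // condenses their results into one response.
            Iface client = BlurClient.getClient("controller1:40010");

            Query query = new Query();
            // Lucene-style query string (field name is illustrative).
            query.setQuery("logs.src_ip:10.0.0.1");

            BlurQuery blurQuery = new BlurQuery();
            blurQuery.setQuery(query);

            BlurResults results = client.query("logs_2013_10", blurQuery);
            System.out.println("total hits: " + results.getTotalResults());
            for (BlurResult result : results.getResults()) {
                System.out.println(result);
            }
        }
    }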
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Colton McInroy

