incubator-blur-dev mailing list archives

From Aaron McCurry <>
Subject Re: Creating/Managing tables saying table already exists
Date Sat, 28 Sep 2013 13:42:43 GMT
On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <> wrote:

> Thanks for the info, appreciate it.
> Yes I have looked at splunk... for the amount of data I am dealing with,
> they want $500,000 outright, with a $200,000 yearly subscription. As we
> scale up, the price goes up as well, their pricing scheme is absolutely
> nuts. I figure for less than the outright cost, I could build myself a
> hadoop cluster, a blur custer, and build the software myself that does just
> as good if not better than what splunk does. That's why I am working on
> this project right now. I made something with with lucene itself, but to
> ensure redundancy and scalability, I wanted to go with something
> distributed. Before I started developing my own project from scratch, I
> took a look at what was already out there for merging hadoop and lucene,
> which is what led me to blur. I am hoping that using blur with hadoop will
> allow me to manage the large amount of log data that I have constantly
> flowing in.

I have thought for a long time that we would need to support multiple
partitioning strategies.  For example, it sounds like you would want to
round-robin data into the shards once every few seconds to minutes.  If we
added a feature to disable the forced hash partitioner, you could load any
row into any shard.  This would allow for an easier loading experience,
but at the cost of fetching a single row when the id is known: a
single-row fetch would become as expensive as a search (which really
isn't that bad in practice).
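To make the trade-off concrete, here is a minimal sketch of hash
partitioning in Python. This is an illustration of the idea, not Blur's
actual implementation; the function name and hashing details are
assumptions:

```python
import hashlib

def shard_for_row(row_id: str, num_shards: int) -> int:
    """Deterministically map a row id to a shard (hash partitioning).

    Because the mapping is a pure function of the id, a single-row
    fetch with a known id only needs to hit one shard.  A round-robin
    loader loses this property: the row could be on any shard, so a
    fetch by id degrades into a search across all shards.
    """
    digest = hashlib.md5(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# The same id always resolves to the same shard:
shard = shard_for_row("row-12345", num_shards=16)
```

Round-robin loading simply skips this mapping and writes wherever is next
in rotation, which is why it is easier to load but harder to fetch from.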

> Right now it's all on one server that has over 100tb of disk space, lots
> of ram and cores, but as I watch the load of the system right now, I
> realize that at some point, the single server just isn't going to cut it.
> At some point the level of data will go above what any single hardware box
> can do.
> Being the data is log entries, my goal is to be able to store/index all
> log data that comes in real time and make it easily searchable while using
> facet data to obtain metrics while being distributed across a redundant
> infrastructure.
> My plan is to use this information to correlate stuff that occurs across
> large timespans. Like seeing what traffic levels last year during this
> month where compared to the month this year. Or seeing what servers an IP
> has accessed over the past year, etc.

Yeah, I believe the date type should be able to handle such a thing.  I
can think of a couple of different implementations based on the use cases.
One would be a date with a millisecond timeUnit for very precise queries.
Blur allows for indexing the same field different ways, so if you needed
to (and I'm not sure at what point you would) you could create a sub
column off the primary date field that indexes the field with a coarser
timeUnit, say minutes or hours.  This would probably help with year-wide
count queries when you are logging a lot of entries per millisecond.  The
sub column field is index-only (it is not stored), and you can have as
many as you want per field.  Just trying to throw out some more options.
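The effect of indexing one date at several time units can be sketched
roughly like this (Python; the field names are hypothetical and Blur's
real column definitions differ):

```python
from datetime import datetime

def time_buckets(ts: datetime) -> dict:
    """Index one timestamp at several granularities.

    The full-precision value supports precise range queries; the
    truncated minute/hour values act like index-only sub columns with
    a coarser timeUnit, keeping the number of distinct terms small
    for year-wide count queries.
    """
    return {
        "log.ts":        ts.strftime("%Y%m%d%H%M%S%f")[:-3],  # millisecond
        "log.ts.minute": ts.strftime("%Y%m%d%H%M"),
        "log.ts.hour":   ts.strftime("%Y%m%d%H"),
    }

buckets = time_buckets(datetime(2013, 9, 28, 13, 42, 43, 123000))
# buckets["log.ts.hour"] -> "2013092813"
```

A year-wide count over the hour bucket touches at most 8,760 distinct
terms regardless of how many entries per millisecond were logged.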

> From what I see so far, I will be building a hadoop cluster to store the
> data, a blur cluster to process the data, and then making a parser which
> takes in data with various formats to takes the data and passes it off to
> blur. Then I will have a client which handles the search queries against
> it... which actually brings up another question... If I parse data one way,
> but then craft a new parser, how well does blur handle changing records?...
> Like say I do not have a parser that handles a particular log entry. So
> that line ends up being logged as just a message field with the contents of
> the data stored in the message field. But then later I figure out a way to
> parse that line into custom fields. Does the mutation system work well for
> then when manipulating a lot of records... like say going over a month, or
> even a years worth of entries matching a certain query?

I would suggest MapReduce for large reprocessing of indexes.  Mutation
could get the work done, but it will likely take a lot longer.  I have
been load testing 0.2.0 on a small/medium cluster and was able to achieve
about 50K mutates per second from a single async client using the mutate
batch API call.  But if you need to reprocess/reparse an entire table, I
think a MapReduce program would be the better fit.
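Throughput numbers like 50K mutates/sec come largely from client-side
batching: each RPC carries many rows, amortizing the round-trip cost.
A generic sketch of the pattern (not the Blur Thrift API; `send_batch`
is a hypothetical callback standing in for the batch-mutate call):

```python
def mutate_in_batches(rows, send_batch, batch_size=1000):
    """Group row mutations into fixed-size batches before sending.

    One network call per batch_size rows instead of one per row is
    what makes high mutate rates from a single client plausible.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)  # flush the final partial batch

sent = []
mutate_in_batches(range(2500), sent.append, batch_size=1000)
# sent now holds 3 batches: 1000, 1000, 500 rows
```

For a full-table reparse, though, MapReduce is still the better tool:
it rebuilds indexes offline rather than streaming mutations through
the live cluster.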

> Thanks,
> Colton McInroy
>  * Director of Security Engineering
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
> My Extension    101
> 24/7 Support <>
> Email <>
> Website
> On 9/28/2013 5:29 AM, Garrett Barton wrote:
>> The going back over a certain time was just a suggestion based on a
>> guess of your query needs.  Personally I would go for one large index
>> to begin with; if performance ever became a problem and I could not add
>> a few more nodes to the cluster, I would then consider splitting the
>> index.
>> In the next month or so I will be prototyping 2 very interesting
>> indexes.  One will be a massive, massive full text index that I plan on
>> bulk loading via MR running every hour.  The other has to be able to
>> load several TB/hour and I will be trying that on a much shorter MR
>> schedule, say every 5-10 minutes.  I expect both to work fine.
>> I don't think it's that far of a stretch for you to get to the minute
>> level like you have today with the MR approach, or hell, try the thrift
>> api; with enough nodes I bet it would handle that kind of load as well.
>> Just slightly off topic, have you looked at splunk?  It does what
>> you're trying to do out of the box.
>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <> wrote:
>>> I actually didn't have any kind of loading interval; I loaded the new
>>> event log entries into the index in real time. My code runs as a
>>> daemon accepting syslog entries, indexing them live as they come in
>>> with a flush call every 10,000 entries or 1 minute, whichever comes
>>> first.
>>> And I don't want to have any limitation on lookback time. I want to be
>>> able to look at the history of any site going back years if need be.
>>> Sucks there is no multi table reader; that limits what I can do a bit.
>>> Thanks,
>>> Colton McInroy
>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>> Mapreduce is a bulk entrypoint to loading blur. Much in the same way
>>>> I bet you have some fancy code to grab up a bunch of log files over
>>>> some kind of interval and load them into your index, MR replaces that
>>>> process with an auto-scaling (via hardware additions only)
>>>> high-bandwidth load that you could fire off at any interval you want.
>>>> The MR bulk load writes a new index and merges it into the index
>>>> already running when it completes. The catch is that it is NOT as
>>>> efficient as your implementation in terms of latency into the index.
>>>> So where your current impl will load a small site's couple of MB real
>>>> fast, MR might take 30 seconds to a minute to bring that online.
>>>> Having said that, blur has a realtime api for inserting that has low
>>>> latency, but you trade in your high bandwidth for it. Might be
>>>> something you could detect on your front door and decide which way in
>>>> the data comes.
>>>> When I was in your shoes, highly optimizing your indexes based on
>>>> size and load for a single badass machine and doing manual
>>>> partitioning tricks to keep things snappy was key. The neat thing
>>>> about blur is some of that you don't do anymore. I would call it an
>>>> early optimization at this point to do anything shorter than say a
>>>> day or whatever your max lookback time is. (Oh btw, you can't search
>>>> across tables in blur, forgot to mention that.)
>>>> Instead of the lots-of-tables route I suggest trying one large one
>>>> and seeing where that goes. Utilize blur's cache-initializing
>>>> capabilities and load in your site and time columns to keep your
>>>> logical partitioning columns in the block cache and thus very fast. I
>>>> bet you will see good performance with this approach. Certainly
>>>> better than es. Not as fast as raw lucene, but there is always a
>>>> price to pay for distributing, and so far blur has the lowest
>>>> overhead I've seen.
>>>> Hope that helps some.
>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <> wrote:
>>>>> Comments inline...
>>>>> Thanks,
>>>>> Colton McInroy
>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>> I have commented inline below:
>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <> wrote:
>>>>>>> I do have a few questions if you don't mind... I am still trying
>>>>>>> to wrap my head around how this works. In my current
>>>>>>> implementation of a logging system, I create new indexes for each
>>>>>>> hour because I have a massive amount of data coming in. I take in
>>>>>>> live log data from syslog and parse/store it in hourly lucene
>>>>>>> indexes along with a facet index. I want to turn this into a
>>>>>>> distributed redundant system, and blur appears to be the way to
>>>>>>> go. I tried elasticsearch but it is just too slow compared to my
>>>>>>> current implementation. Given I take in gigs of raw log data per
>>>>>>> hour, I need something that is robust and able to keep up with the
>>>>>>> inflow of data.
>>>>>> Due to the current implementation of building up an index for an
>>>>>> hour and then making it available, I would use MapReduce for this:
>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>> That way all the shards in a table get a little more data each
>>>>>> hour, and it's very low impact on the running cluster.
>>>>> Not sure I understand this. I would like data to be accessible as
>>>>> it comes in, not wait an hour before I can query against it.
>>>>> I am also not sure where map-reduce comes in here. I thought
>>>>> mapreduce was something that blur used internally.
>>>>>>> When taking in lots of data constantly, how is it recommended
>>>>>>> that it be stored? I mentioned above that I create a new index for
>>>>>>> each hour to keep data separated and quicker to search. If I want
>>>>>>> to look up a specific time frame, I only have to load the
>>>>>>> directories timestamped with the hours I want to look at. So
>>>>>>> instead of having to look at a huge index, like a year's worth of
>>>>>>> data, I'm looking at a much smaller data set, which results in
>>>>>>> faster query response times. Should a new table be created for
>>>>>>> each hour of data? When I typed the create command into the shell,
>>>>>>> it took about 6 seconds to create a table. If I have to create a
>>>>>>> table for each application each hour, this could create a lot of
>>>>>>> lag. Perhaps that is just in my test environment though. Any
>>>>>>> thoughts on this? I also didn't see any examples of how to create
>>>>>>> tables via code.
>>>>>> First off, Blur is designed to store very large amounts of data.
>>>>>> And while it can do NRT updates like Solr and ES, its main focus is
>>>>>> on bulk ingestion through MapReduce.  Given that, the real limiting
>>>>>> factor is how much hardware you have.  Let's play out a scenario.
>>>>>> If you are adding gigs of data an hour, a good rough ballpark guess
>>>>>> is that you will need 10-15% of inbound data size as memory to make
>>>>>> the search perform well.  However, as the index sizes increase this
>>>>>> % may decrease over time.  Blur has an off-heap lru cache to make
>>>>>> accessing hdfs faster; however, if you don't have enough memory the
>>>>>> searches (and the cluster for that matter) won't fail, they will
>>>>>> simply become slower.
>>>>>> So it's really a question of how much hardware you have, and
>>>>>> whether you are filling a table past the point where it performs
>>>>>> well given the cluster you have.  You might have to break it into
>>>>>> pieces.  But I think that hourly is too small.  Daily, Weekly,
>>>>>> Monthly, etc.
>>>>> In the current system I designed (which uses just lucene), we take
>>>>> mainly web logs and separate them into indexes. Each web server gets
>>>>> its own index for each hour. Then when I need to query the data, I
>>>>> use a multi index reader to access the timeframe I need, allowing me
>>>>> to keep the size of the index down to roughly what I need to search.
>>>>> If data was stored over a month and I want to query data that
>>>>> happened in just a single hour, or a few minutes, it makes sense to
>>>>> me to keep things optimized. Also, if I wanted to compare one web
>>>>> server to another, I would just use the multi index reader to load
>>>>> both indexes. This is all handled by a single server though, so it
>>>>> is limited by the hardware of the single server. If something fails,
>>>>> it's a big problem. When trying to query large data sets, it's again
>>>>> only a single server, so it takes longer than I would like if the
>>>>> index it's reading is large.
>>>>> I am not entirely sure how to go about doing this in blur. I'm
>>>>> imagining that each "table" is an index. So I would have a table
>>>>> format like YYYY_MM_DD_HH_IP. If I do this though, is there a way to
>>>>> query multiple tables... like a multi table reader or something? Or
>>>>> am I limited to looking at a single table at a time?
>>>>> For some web servers that have little traffic, an hour of data may
>>>>> only have a few mb in it while others may have like a 5-10gb index.
>>>>> If I combined the index from a large site with the small sites, this
>>>>> should make everything slower for the queries against the small
>>>>> sites' indexes, correct? Or would it all be the same due to how blur
>>>>> separates indexes into shards? Would it perhaps be better to have an
>>>>> index for each web server, and configure small sites to have fewer
>>>>> shards while larger sites have more shards?
>>>>> We just got a really large, powerful new server to be our log
>>>>> server, but as I realize that it's a single point of failure, I want
>>>>> to change our configuration to use a clustered/distributed setup. So
>>>>> we would start with probably a minimal configuration, and add more
>>>>> shard servers whenever we can afford it or need it.
>>>>>>> Do shards contain the index data while the location (hdfs)
>>>>>>> contains the documents (what lucene refers to them as)? I read
>>>>>>> that the shard contains the index while the fs contains the
>>>>>>> data... I just wasn't quite sure what the data was, because when I
>>>>>>> work with lucene, the directory contains the data as a document.
>>>>>> The shard is stored in HDFS, and it is a Lucene index.  We store
>>>>>> the data inside the Lucene index, so it's basically Lucene all the
>>>>>> way down to HDFS.
>>>>> Ok, so basically a controller is a service which connects to all
>>>>> (or some?) shards and sends out a distributed query; each shard runs
>>>>> the query against a certain data set, which it gets either from
>>>>> memory or from the hadoop cluster, processes it, and returns the
>>>>> result to the controller, which condenses the results from all the
>>>>> queried shards into a final result, right?
>>>>>> Hope this helps.  Let us know if you have more questions.
>>>>>> Thanks,
>>>>>> Aaron
>>>>>>> Thanks,
>>>>>>> Colton McInroy
