flume-user mailing list archives

From Ehsan ul Haq <m.ehsan....@gmail.com>
Subject Re: Extract data using regex into HBase
Date Sun, 26 Oct 2014 20:17:21 GMT
Hi,
   Here are my thoughts.

Using a batch approach (inspired by the Lambda architecture)
1. Store your syslog events as they are in HBase, using the sync or async
HBase sink.
2. Write a MapReduce job that does your regular expression extraction and
writes its output to either HDFS or HBase, whichever you need.
3. Run your MapReduce job periodically.
+ Once the syslog data is imported into HBase, you can safely discard it from
the original source.
+ The syslog events are stored as immutable data in an HBase table, so you can
fix your regular expression extraction without losing the raw log events.
- You need to rerun the MapReduce job periodically.
(I assume you want to expose the output as an HBase table.)
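
The extraction in step 2 could be sketched roughly as below. This is plain Java rather than an actual MapReduce job, so it only shows the "several formats, one set of fields" part; the same method could be called from a mapper. The class name, the two log formats, the field names (username, srcip, dstip) and the sample lines are all made up for illustration and are not taken from your data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyslogFieldExtractor {

    // Hypothetical field names; each would map to one HBase column.
    private static final List<String> FIELDS = List.of("username", "srcip", "dstip");

    // Hypothetical log formats, tried in order. Every pattern defines the same
    // named groups, so no per-format "if" logic is needed after a match.
    private static final List<Pattern> PATTERNS = List.of(
        Pattern.compile("user=(?<username>\\S+) src=(?<srcip>\\S+) dst=(?<dstip>\\S+)"),
        Pattern.compile("from (?<srcip>\\S+) to (?<dstip>\\S+) by (?<username>\\S+)")
    );

    // Returns the extracted fields, or an empty map if no format matches.
    static Map<String, String> extract(String line) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                Map<String, String> out = new LinkedHashMap<>();
                for (String field : FIELDS) {
                    out.put(field, m.group(field));
                }
                return out;
            }
        }
        return Map.of();
    }

    public static void main(String[] args) {
        // Made-up sample lines, one per hypothetical format.
        System.out.println(extract("Oct 26 20:17:21 fw1 login user=alice src=10.0.0.5 dst=10.0.0.9"));
        System.out.println(extract("Oct 26 20:18:02 fw2 conn from 10.0.0.7 to 10.0.0.9 by bob"));
    }
}
```

Because the raw events stay in HBase, adding a new format later is just a matter of adding a pattern to the list and rerunning the job.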

Using a realtime approach
You can use the RegexHbaseEventSerializer. For an example of its usage, see
http://stackoverflow.com/questions/12304826/regular-expression-confiuration-in-flumeng
+ Your data is available immediately.
- Errors are hard to fix.
- You can't add more fields to syslog events that have already been
processed. (You would have to run a MapReduce job or reimport all the syslog
events.)
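
For reference, a sink configuration for the realtime approach might look roughly like this. It is only a sketch: the agent, sink and channel names (a1, k1, c1), the table and column family names, and the regex itself are placeholders I made up. As far as I know, RegexHbaseEventSerializer plugs into the sync sink (type hbase); the asynchbase sink expects a different serializer interface.

```properties
# Hypothetical agent/sink/channel names; table and column family are placeholders.
a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = syslog_events
a1.sinks.k1.columnFamily = f
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# Each capture group is written to the column at the same position in colNames.
a1.sinks.k1.serializer.regex = user=(\\S+) src=(\\S+) dst=(\\S+)
a1.sinks.k1.serializer.colNames = username,srcip,dstip
```

Note that this takes a single regex, so it fits best when all events share one format.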

Regards
Ehsan




On Sun, Oct 26, 2014 at 7:50 PM, Alaa Ali <contact.alaa@gmail.com> wrote:

> Hello! I want to receive syslog, parse out the input using regex into
> fields (for example username, source IP, destination IP), and store the
> data in HBase into columns corresponding to those fields. I know how to do
> the syslog source, but how do I go about doing the extraction+storing?
>
> My thoughts:
>
> 1. Can I use a Regex Extractor Interceptor to make my own serializer
> implementation that extracts data into multiple headers in the event? Then
> use the AsyncHBase sink serializer to simply store the header values into
> columns? Can I do that?
>
> 2. Should I pass the data to the AsyncHBase sink unaltered, and implement
> everything in the sink's serializer.
>
> It is worth noting that the input is in different formats, so my regex
> implementation isn't one simple regex and will probably contain a lot of
> ifs to, for example, extract the username because it won't always be in the
> same place in the log. Which approach is best, or is there another
> approach, or am I getting it wrong?
>
> -
> Alaa Ali
>
