flume-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: Flume log4j-appender
Date Wed, 24 Sep 2014 22:11:53 GMT
Why do you need to remove the timestamp from the file names?

As far as creating the files on an hourly basis, it sounds like what
you want is a dataset partitioned by hour. This is very easy to do
using Kite and Flume's Kite integration. You'd create a partitioned
dataset similar to the ratings dataset shown in this example:

https://github.com/kite-sdk/kite/wiki/Hands-on-Kite-Lab-1:-Using-the-Kite-CLI-(Solution)

and then you'd configure Flume to write to the dataset similar to how
we do here:

https://github.com/kite-sdk/kite-examples/blob/snapshot/logging/flume.properties
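
For reference, here's a rough sketch of the two pieces involved. It
assumes your events carry a long "timestamp" field to partition on; the
file names, dataset path, and channel name below are hypothetical, so
double-check the exact CLI flags and sink property names against the
two links above.

A partition strategy JSON (e.g. partitions.json):

[
  {"type": "year",  "source": "timestamp"},
  {"type": "month", "source": "timestamp"},
  {"type": "day",   "source": "timestamp"},
  {"type": "hour",  "source": "timestamp"}
]

Creating the partitioned dataset with the Kite CLI:

kite-dataset create dataset:hdfs:/flume/events \
  --schema event.avsc \
  --partition-by partitions.json

And on the Flume side, pointing the Kite Dataset Sink at it:

agent1.sinks.kite-sink1.type = org.apache.flume.sink.kite.DatasetSink
agent1.sinks.kite-sink1.channel = memory-channel1
agent1.sinks.kite-sink1.kite.repo.uri = repo:hdfs:/flume
agent1.sinks.kite-sink1.kite.dataset.name = events
agent1.sinks.kite-sink1.kite.batchSize = 100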

This will write your data into a directory structure that looks like:

/flume/events/year=2014/month=09/day=24/hour=14

You can then access the dataset using Kite's view API to get just the
last hour of data:

view:hdfs:/flume/events?timestamp=($AN_HOUR_AGO,)

That would get all of the data from an hour ago until now.
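
If you're reading that view from Java, a minimal sketch looks something
like the following (class and variable names are mine, and I'm assuming
the dataset's "timestamp" partition source is epoch milliseconds):

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.View;

public class LastHourOfEvents {
  public static void main(String[] args) {
    // "an hour ago" in epoch millis, filling in $AN_HOUR_AGO
    long anHourAgo = System.currentTimeMillis() - (60L * 60L * 1000L);

    // load a view constrained to timestamps after anHourAgo
    View<GenericRecord> lastHour = Datasets.load(
        "view:hdfs:/flume/events?timestamp=(" + anHourAgo + ",)",
        GenericRecord.class);

    // iterate over just that slice of the dataset
    try (DatasetReader<GenericRecord> reader = lastHour.newReader()) {
      for (GenericRecord event : reader) {
        System.out.println(event);
      }
    }
  }
}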

Feel free to email the Kite user list (cdk-dev@cloudera.org) or me
offline if you want more help with those aspects of it.

If you want to do the same thing with the regular Flume HDFS Sink, you
need to either have a timestamp header in your events or set
hdfs.useLocalTimeStamp to true to use the local timestamp. Then you
need to define your path with a pattern:

agent1.sinks.hdfs-sink1.hdfs.path =
hdfs://192.168.145.127:9000/flume/events/year=%Y/month=%m/day=%d/hour=%H
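
For example, combined with the timestamp handling above, the relevant
sink settings might look roughly like this (values are illustrative;
hdfs.round/roundValue/roundUnit round the timestamp used by the escape
sequences down to the hour so each hour's events land in one directory):

agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.filePrefix = events
agent1.sinks.hdfs-sink1.hdfs.round = true
agent1.sinks.hdfs-sink1.hdfs.roundValue = 1
agent1.sinks.hdfs-sink1.hdfs.roundUnit = hour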

See this link for more details:

http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

-Joey

On Wed, Sep 24, 2014 at 2:31 PM, Hanish Bansal
<hanish.bansal.agarwal@gmail.com> wrote:
> Thanks for the reply!!
>
> Reposting my post since it was skipped.
>
> I have configured flume and my application to use the log4j-appender. As
> described in the log4j-appender configuration, I have configured the agent
> properly (avro source, memory channel and hdfs sink).
>
> I am able to collect logs in hdfs.
>
> hdfs-sink configuration is:
>
> agent1.sinks.hdfs-sink1.type = hdfs
> agent1.sinks.hdfs-sink1.hdfs.path =
> hdfs://192.168.145.127:9000/flume/events/
> agent1.sinks.hdfs-sink1.hdfs.rollInterval=30
> agent1.sinks.hdfs-sink1.hdfs.rollCount=0
> agent1.sinks.hdfs-sink1.hdfs.rollSize=0
> agent1.sinks.hdfs-sink1.hdfs.writeFormat=Text
> agent1.sinks.hdfs-sink1.hdfs.fileType=DataStream
>
> Files in hdfs are created on the basis of the hdfs.filePrefix and
> hdfs.fileSuffix configuration and also include the current timestamp in the
> name.
>
> I have hdfs files in the format below:
> /flume/events/FlumeData.1411543838171
> /flume/events/FlumeData.1411544272696
> ....
> ....
>
> A custom file name can be specified using the hdfs.filePrefix and
> hdfs.fileSuffix configuration.
>
> I want to know: could I remove the timestamp from the file name somehow?
>
> I am thinking of using the flume log4j-appender in the following way; please
> let me know if there is any issue:
>
> 1. A distributed application (say map-reduce) is running on different nodes.
> 2. A Flume agent is running on only one node.
> 3. The application on all nodes will be configured to use the same agent.
> This way all nodes will publish logs to the same avro source.
> 4. The Flume agent will write data to hdfs using the hdfs sink. Files should
> be created on an hourly basis. One file will be created for the logs of all
> nodes from the last 1 hour.
>
> Thanks !!
>
> On 24/09/2014 9:57 pm, "Hari Shreedharan" <hshreedharan@cloudera.com> wrote:
>>
>> You can’t write to the same path from more than one sink - either the
>> directories have to be different or the hdfs.filePrefix has to be different.
>> This is because HDFS actually allows only one writer per file. If multiple
>> JVMs try to write to the same file, HDFS will throw an exception to the
>> other JVMs indicating that the lease has expired.
>>
>> Thanks, Hari
>>
>>
>> On Wed, Sep 24, 2014 at 7:04 AM, Hanish Bansal
>> <hanish.bansal.agarwal@gmail.com> wrote:
>>>
>>> One other way to use it is:
>>>
>>> Run a Flume agent on each node and configure the hdfs sink to use the same
>>> path.
>>>
>>> This way too, the logs of all nodes can be kept in a single file (log
>>> aggregation of all nodes) in hdfs.
>>>
>>> Please let me know if there is any issue in either use case.
>>>
>>>
>>>
>>> On Wed, Sep 24, 2014 at 7:21 PM, Hanish Bansal
>>> <hanish.bansal.agarwal@gmail.com> wrote:
>>>>
>>>> Thanks Joey !!
>>>>
>>>> Great help for me.
>>>>
>>>> I have configured flume and my application to use the log4j-appender. As
>>>> described in the log4j-appender configuration, I have configured the agent
>>>> properly (avro source, memory channel and hdfs sink).
>>>>
>>>> I am able to collect logs in hdfs.
>>>>
>>>> hdfs-sink configuration is:
>>>>
>>>> agent1.sinks.hdfs-sink1.type = hdfs
>>>> agent1.sinks.hdfs-sink1.hdfs.path =
>>>> hdfs://192.168.145.127:9000/flume/events/
>>>> agent1.sinks.hdfs-sink1.hdfs.rollInterval=30
>>>> agent1.sinks.hdfs-sink1.hdfs.rollCount=0
>>>> agent1.sinks.hdfs-sink1.hdfs.rollSize=0
>>>> agent1.sinks.hdfs-sink1.hdfs.writeFormat=Text
>>>> agent1.sinks.hdfs-sink1.hdfs.fileType=DataStream
>>>>
>>>> Files in hdfs are created on the basis of the hdfs.filePrefix and
>>>> hdfs.fileSuffix configuration and also include the current timestamp in
>>>> the name.
>>>>
>>>> I have hdfs files in the format below:
>>>> /flume/events/FlumeData.1411543838171
>>>> /flume/events/FlumeData.1411544272696
>>>> ....
>>>> ....
>>>>
>>>> A custom file name can be specified using the hdfs.filePrefix and
>>>> hdfs.fileSuffix configuration.
>>>>
>>>> I want to know: could I remove the timestamp from the file name somehow?
>>>>
>>>> I am thinking of using the flume log4j-appender in the following way;
>>>> please let me know if there is any issue:
>>>>
>>>> A distributed application (say map-reduce) is running on different nodes.
>>>> A Flume agent is running on only one node.
>>>> The application on all nodes will be configured to use the same agent.
>>>> This way all nodes will publish logs to the same avro source.
>>>> The Flume agent will write data to hdfs. Files should be created on an
>>>> hourly basis. One file will be created for the logs of all nodes from the
>>>> last 1 hour.
>>>>
>>>> Thanks !!
>>>>
>>>> On Tue, Sep 23, 2014 at 9:38 PM, Joey Echeverria <joey@cloudera.com>
>>>> wrote:
>>>>>
>>>>> Inline below.
>>>>>
>>>>> On Mon, Sep 22, 2014 at 7:25 PM, Hanish Bansal
>>>>> <hanish.bansal.agarwal@gmail.com> wrote:
>>>>>>
>>>>>> Thanks for reply !!
>>>>>>
>>>>>> According to the architecture, multiple agents will run (one agent on
>>>>>> each node) and all agents will send log data (as events) to a common
>>>>>> collector, and that collector will write the data to hdfs.
>>>>>>
>>>>>> Here I have some questions:
>>>>>> How will the collector receive the data from the agents and write it to
>>>>>> hdfs?
>>>>>
>>>>>
>>>>> Agents are typically connected by having the AvroSink of one agent
>>>>> connect to the AvroSource of the next agent in the chain.
>>>>>>
>>>>>> Will data coming from different agents get mixed up with each other,
>>>>>> since the collector will also write the data to hdfs in the order it
>>>>>> arrives at the collector?
>>>>>
>>>>> Data will enter the collector's channel in the order it arrives. Flume
>>>>> doesn't do any re-ordering of data.
>>>>>>
>>>>>> Can we use this logging feature for real-time logging? We can accept
>>>>>> this if all logs are in hdfs in the same order they were generated by
>>>>>> the application.
>>>>>
>>>>> It depends on your requirements. If you need events to be sorted by
>>>>> timestamp, then writing directly out of Flume to HDFS is not sufficient.
>>>>> If you want your data in HDFS, then the best bet is to write to a
>>>>> partitioned Dataset and run a batch job to sort partitions. For example,
>>>>> if you partitioned by hour, you'd have an hourly job to sort the last
>>>>> hour's worth of data. Alternatively, you could write the data to HBase,
>>>>> which can sort the data by timestamp during ingest. In either case, Kite
>>>>> would make it easier for you to integrate with the underlying store as
>>>>> it can write to both HDFS and HBase.
>>>>>
>>>>> -Joey
>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Hanish
>>>>>>
>>>>>> On 22/09/2014 10:12 pm, "Joey Echeverria" <joey@cloudera.com> wrote:
>>>>>>>
>>>>>>> Hi Hanish,
>>>>>>>
>>>>>>> The Log4jAppender is designed to connect to a Flume agent running an
>>>>>>> AvroSource. So, you'd configure Flume similar to [1] and then point
>>>>>>> the Log4jAppender to your agent using the log4j properties you linked
>>>>>>> to.
>>>>>>>
>>>>>>> The Log4jAppender will use Avro to inspect the object being logged to
>>>>>>> determine its schema and to serialize it to bytes, which become the
>>>>>>> body of the events sent to Flume. If you're logging Strings, which is
>>>>>>> most common, then the Schema will just be a Schema.String. There are
>>>>>>> two ways that schema information can be passed. You can configure the
>>>>>>> Log4jAppender with a Schema URL that will be sent in the event headers,
>>>>>>> or you can leave that out and a JSON representation of the Schema
>>>>>>> will be sent as a header with each event. The URL is more efficient as
>>>>>>> it avoids sending extra information with each record, but you can
>>>>>>> leave it out to start your testing.
>>>>>>>
>>>>>>> With regards to your second question, the answer is no. Flume does not
>>>>>>> attempt to re-order events, so your logs will appear in arrival order.
>>>>>>> What I would do is write the data to a partitioned directory
>>>>>>> structure and then have a Crunch job that sorts each partition as it
>>>>>>> closes.
>>>>>>>
>>>>>>> You might consider taking a look at the Kite SDK[2], as we have some
>>>>>>> examples that show how to do the logging[3] and can also handle
>>>>>>> getting the data properly partitioned on HDFS.
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> -Joey
>>>>>>>
>>>>>>> [1] http://flume.apache.org/FlumeUserGuide.html#avro-source
>>>>>>> [2] http://kitesdk.org/docs/current/
>>>>>>> [3] https://github.com/kite-sdk/kite-examples/tree/snapshot/logging
>>>>>>>
>>>>>>> On Mon, Sep 22, 2014 at 4:21 AM, Hanish Bansal
>>>>>>> <hanish.bansal.agarwal@gmail.com> wrote:
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I want to use the flume log4j-appender for logging of a map-reduce
>>>>>>> > application which is running on different nodes. My use case is to
>>>>>>> > have the logs from all nodes at a centralized location (say HDFS)
>>>>>>> > with time-based synchronization.
>>>>>>> >
>>>>>>> > As described in the links below, Flume has its own appender which
>>>>>>> > can be used to log an application to HDFS directly from log4j.
>>>>>>> >
>>>>>>> > http://www.addsimplicity.com/adding_simplicity_an_engi/2010/10/sending-logs-down-the-flume.html
>>>>>>> >
>>>>>>> > http://flume.apache.org/FlumeUserGuide.html#log4j-appender
>>>>>>> >
>>>>>>> > Could anyone please tell me if these logs are synchronized on a
>>>>>>> > time basis or not?
>>>>>>> >
>>>>>>> > What I mean by time-based synchronization is: having logs from
>>>>>>> > different nodes in sorted order of time.
>>>>>>> >
>>>>>>> > Also could anyone provide a link that describes how flume
>>>>>>> > log4j-appender
>>>>>>> > internally works?
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Thanks & Regards
>>>>>>> > Hanish Bansal
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Joey Echeverria
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joey Echeverria
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Hanish Bansal
>>>
>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Hanish Bansal
>>
>>
>



-- 
Joey Echeverria
