flume-commits mailing list archives

From mpe...@apache.org
Subject svn commit: r1498792 [2/3] - in /flume/site/trunk: ./ content/sphinx/ content/sphinx/releases/
Date Tue, 02 Jul 2013 06:06:19 GMT
Modified: flume/site/trunk/content/sphinx/FlumeUserGuide.rst
URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeUserGuide.rst?rev=1498792&r1=1498791&r2=1498792&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/FlumeUserGuide.rst (original)
+++ flume/site/trunk/content/sphinx/FlumeUserGuide.rst Tue Jul  2 06:06:18 2013
@@ -15,7 +15,7 @@
 
 
 ======================================
-Flume 1.3.0 User Guide
+Flume 1.4.0 User Guide
 ======================================
 
 Introduction
@@ -28,16 +28,32 @@ Apache Flume is a distributed, reliable,
 collecting, aggregating and moving large amounts of log data from many
 different sources to a centralized data store.
 
+Apache Flume is not restricted to log data aggregation.
+Since data sources are customizable, Flume can be used to transport massive quantities
+of event data, including but not limited to network traffic data, social-media-generated data,
+email messages, and data from almost any other source.
+
 Apache Flume is a top level project at the Apache Software Foundation.
+
 There are currently two release code lines available, versions 0.9.x and 1.x.
-This documentation applies to the 1.x codeline.
-Please click here for
+
+Documentation for the 0.9.x track is available at 
 `the Flume 0.9.x User Guide <http://archive.cloudera.com/cdh/3/flume/UserGuide/>`_.
 
+This documentation applies to the 1.4.x track.
+
+New and existing users are encouraged to use the 1.x releases to
+take advantage of the performance improvements and configuration flexibility
+available in the latest architecture.
+
+
 System Requirements
 -------------------
 
-TBD
+#. Java Runtime Environment - Java 1.6 or later (Java 1.7 Recommended)
+#. Memory - Sufficient memory for configurations used by sources, channels or sinks
+#. Disk Space - Sufficient disk space for configurations used by channels or sinks
+#. Directory Permissions - Read/Write permissions for directories used by agent
 
 Architecture
 ------------
@@ -58,7 +74,10 @@ A Flume source consumes events delivered
 server. The external source sends events to Flume in a format that is
 recognized by the target Flume source. For example, an Avro Flume source can be
 used to receive Avro events from Avro clients or other Flume agents in the flow
-that send events from an Avro sink. When a Flume source receives an event, it
+that send events from an Avro sink. A similar flow can be defined using
+a Thrift Flume Source to receive events from a Thrift Sink, a Flume
+Thrift RPC Client, or Thrift clients written in any language generated from
+the Flume thrift protocol. When a Flume source receives an event, it
 stores it into one or more channels. The channel is a passive store that keeps
 the event until it's consumed by a Flume sink. The file channel is one example
 -- it is backed by the local filesystem. The sink removes the event
@@ -183,12 +202,12 @@ This configuration lets a user generate 
 
 This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel
 that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the
-various components, then describes their types and configuration parameters. A given configuration file might define 
+various components, then describes their types and configuration parameters. A given configuration file might define
 several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.
 
 Given this configuration file, we can start Flume as follows::
 
-  $ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
+  $ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
 
 Note that in a full deployment we would typically include one more option: ``--conf=<conf-dir>``.
 The ``<conf-dir>`` directory would include a shell script *flume-env.sh* and potentially a log4j properties file.
@@ -215,6 +234,51 @@ The original Flume terminal will output 
 
 Congratulations - you've successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.
 
+Installing third-party plugins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flume has a fully plugin-based architecture. While Flume ships with many
+out-of-the-box sources, channels, sinks, serializers, and the like, many
+implementations exist which ship separately from Flume.
+
+While it has always been possible to include custom Flume components by
+adding their jars to the FLUME_CLASSPATH variable in the flume-env.sh file,
+Flume now supports a special directory called ``plugins.d`` which automatically
+picks up plugins that are packaged in a specific format. This allows for easier
+management of plugin packaging issues as well as simpler debugging and
+troubleshooting of several classes of issues, especially library dependency
+conflicts.
+
+The plugins.d directory
+'''''''''''''''''''''''
+
+The ``plugins.d`` directory is located at ``$FLUME_HOME/plugins.d``. At startup
+time, the ``flume-ng`` start script looks in the ``plugins.d`` directory for
+plugins that conform to the below format and includes them in proper paths when
+starting up ``java``.
+
+Directory layout for plugins
+''''''''''''''''''''''''''''
+
+Each plugin (subdirectory) within ``plugins.d`` can have up to three
+sub-directories:
+
+#. lib - the plugin's jar(s)
+#. libext - the plugin's dependency jar(s)
+#. native - any required native libraries, such as ``.so`` files
+
+Example of two plugins within the plugins.d directory:
+
+.. code-block:: none
+
+  plugins.d/
+  plugins.d/custom-source-1/
+  plugins.d/custom-source-1/lib/my-source.jar
+  plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
+  plugins.d/custom-source-2/
+  plugins.d/custom-source-2/lib/custom.jar
+  plugins.d/custom-source-2/native/gettext.so
+
 Data ingestion
 --------------
 
@@ -247,6 +311,7 @@ Flume supports the following mechanisms 
 types, such as:
 
 #. Avro
+#. Thrift
 #. Syslog
 #. Netcat
 
@@ -266,7 +331,7 @@ Consolidation
 
 A very common scenario in log collection is a large number of log producing
 clients sending data to a few consumer agents that are attached to the storage
-subsystem. For examples, logs collected from hundreds of web servers sent to a
+subsystem. For example, logs collected from hundreds of web servers sent to a
 dozen agents that write to an HDFS cluster.
 
 .. figure:: images/UserGuide_image02.png
@@ -274,9 +339,10 @@ dozen of agents that write to HDFS clust
    :alt: A fan-in flow using Avro RPC to consolidate events in one place
 
 This can be achieved in Flume by configuring a number of first tier agents with
-an avro sink, all pointing to an avro source of single agent. This source on
-the second tier agent consolidates the received events into a single channel
-which is consumed by a sink to its final destination.
+an avro sink, all pointing to an avro source of a single agent (again, you could
+use the thrift sources/sinks/clients in such a scenario). This source
+on the second tier agent consolidates the received events into a single
+channel which is consumed by a sink to its final destination.
 
 Multiplexing the flow
 ---------------------
@@ -311,7 +377,7 @@ Defining the flow
 To define the flow within a single agent, you need to link the sources and
 sinks via a channel. You need to list the sources, sinks and channels for the
 given agent, and then point the source and sink to a channel. A source instance
-can specify multiple channels, but a sink instance can only specify on channel.
+can specify multiple channels, but a sink instance can only specify one channel.
 The format is as follows:
 
 .. code-block:: properties
@@ -327,7 +393,7 @@ The format is as follows:
   # set channel for sink
   <Agent>.sinks.<Sink>.channel = <Channel1>
 
-For example an agent named agent_foo is reading data from an external avro client and sending
+For example, an agent named agent_foo is reading data from an external avro client and sending
 it to HDFS via a memory channel. The config file weblog.config could look like:
 
 .. code-block:: properties
@@ -436,9 +502,9 @@ config to do that:
 Configuring a multi agent flow
 ------------------------------
 
-To setup a multi-tier flow, you need to have an avro sink of first hop pointing
-to avro source of the next hop. This will result in the first Flume agent
-forwarding events to the next Flume agent. For example, if you are
+To set up a multi-tier flow, you need to have an avro/thrift sink of the first hop
+pointing to the avro/thrift source of the next hop. This will result in the first
+Flume agent forwarding events to the next Flume agent. For example, if you are
 periodically sending files (1 file per event) using avro client to a local
 Flume agent, then this local agent can forward it to another agent that has the
 mounted for storage.
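As a minimal sketch of the thrift variant of such a two-hop flow: the first
agent's Thrift sink points at the host and port on which the second agent's
Thrift source listens. The agent names (``edge``, ``collector``), component
names, host name, and port are hypothetical, and the sink's ``hostname`` and
``port`` properties are assumed to mirror those of the Avro sink. Channel
definitions and the remaining source and sink settings are omitted.

.. code-block:: properties

  # First hop: hypothetical agent "edge" forwards events over Thrift
  edge.sinks = k1
  edge.channels = c1
  edge.sinks.k1.type = thrift
  edge.sinks.k1.channel = c1
  edge.sinks.k1.hostname = collector.example.com
  edge.sinks.k1.port = 4141

  # Next hop: hypothetical agent "collector" receives those events
  collector.sources = r1
  collector.channels = c1
  collector.sources.r1.type = thrift
  collector.sources.r1.channels = c1
  collector.sources.r1.bind = 0.0.0.0
  collector.sources.r1.port = 4141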
@@ -495,15 +561,15 @@ from the external appserver source event
 Fan out flow
 ------------
 
-As discussed in previous section, Flume support fanning out the flow from one
+As discussed in previous section, Flume supports fanning out the flow from one
 source to multiple channels. There are two modes of fan out, replicating and
-multiplexing. In the replicating flow the event is sent to all the configured
+multiplexing. In the replicating flow, the event is sent to all the configured
 channels. In case of multiplexing, the event is sent to only a subset of
 qualifying channels. To fan out the flow, one needs to specify a list of
 channels for a source and the policy for fanning it out. This is done by
 adding a channel "selector" that can be replicating or multiplexing. Then
 further specify the selection rules if it's a multiplexer. If you don't specify
-an selector, then by default it's replicating:
+a selector, then by default it's replicating:
 
 .. code-block:: properties
 
@@ -540,8 +606,7 @@ configured as default:
 
   <Agent>.sources.<Source1>.selector.default = <Channel2>
 
-The mapping allows overlapping the channels for each value. The default must be
-set for a multiplexing select which can also contain any number of channels.
+The mapping allows overlapping the channels for each value.
 
 The following example has a single flow that is multiplexed to two paths. The
 agent named agent_foo has a single avro source and two channels linked to two sinks:
@@ -607,7 +672,9 @@ Note that if a header does not have any 
 be written to the default channels and will be attempted to be written to the
 optional channels for that header. Specifying optional channels will still cause
 the event to be written to the default channels, if no required channels are
-specified.
+specified. If no channels are designated as default and there are no required
+channels, the selector will attempt to write the events to the optional channels.
+Any failures are simply ignored in that case.
 
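The optional-only case described above might look like the following sketch.
The agent, source, and channel names and the ``State`` header are hypothetical,
and the ``selector.optional.<value>`` key is assumed to be the syntax used for
optional channels. Because neither ``selector.mapping.CA`` nor
``selector.default`` is set, events whose ``State`` header is ``CA`` would only
be attempted on the optional channels, and write failures there would be
ignored.

.. code-block:: properties

  agent_foo.sources = src-1
  agent_foo.channels = mem-channel-1 file-channel-2
  agent_foo.sources.src-1.channels = mem-channel-1 file-channel-2

  # Multiplex on a hypothetical "State" header
  agent_foo.sources.src-1.selector.type = multiplexing
  agent_foo.sources.src-1.selector.header = State

  # No required mapping and no default: only optional channels for "CA"
  agent_foo.sources.src-1.selector.optional.CA = mem-channel-1 file-channel-2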
 
 Flume Sources
@@ -617,23 +684,28 @@ Avro Source
 ~~~~~~~~~~~
 
 Listens on Avro port and receives events from external Avro client streams.
-When paired with the built-in AvroSink on another (previous hop) Flume agent,
+When paired with the built-in Avro Sink on another (previous hop) Flume agent,
 it can create tiered collection topologies.
 Required properties are in **bold**.
 
-==============  ===========  ===================================================
-Property Name   Default      Description
-==============  ===========  ===================================================
-**channels**    --
-**type**        --           The component type name, needs to be ``avro``
-**bind**        --           hostname or IP address to listen on
-**port**        --           Port # to bind to
-threads         --           Maximum number of worker threads to spawn
+==================   ===========  ===================================================
+Property Name        Default      Description
+==================   ===========  ===================================================
+**channels**         --
+**type**             --           The component type name, needs to be ``avro``
+**bind**             --           hostname or IP address to listen on
+**port**             --           Port # to bind to
+threads              --           Maximum number of worker threads to spawn
 selector.type
 selector.*
-interceptors    --           Space separated list of interceptors
+interceptors         --           Space-separated list of interceptors
 interceptors.*
-==============  ===========  ===================================================
+compression-type     none         This can be "none" or "deflate". The compression-type must match the compression-type of the matching Avro Sink
+ssl                  false        Set this to true to enable SSL encryption. You must also specify a "keystore" and a "keystore-password".
+keystore             --           This is the path to a Java keystore file. Required for SSL.
+keystore-password    --           The password for the Java keystore. Required for SSL.
+keystore-type        JKS          The type of the Java keystore. This can be "JKS" or "PKCS12".
+==================   ===========  ===================================================
 
 Example for agent named a1:
 
@@ -646,6 +718,39 @@ Example for agent named a1:
   a1.sources.r1.bind = 0.0.0.0
   a1.sources.r1.port = 4141
 
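The SSL and compression properties listed in the table above could be combined
as in the following sketch. The keystore path and password are placeholders,
not real values; the peer sending to this source would need to use the same
``compression-type``.

.. code-block:: properties

  a1.sources = r1
  a1.channels = c1
  a1.sources.r1.type = avro
  a1.sources.r1.channels = c1
  a1.sources.r1.bind = 0.0.0.0
  a1.sources.r1.port = 4141

  # Compress traffic; must match the sending side
  a1.sources.r1.compression-type = deflate

  # Enable SSL (placeholder keystore path and password)
  a1.sources.r1.ssl = true
  a1.sources.r1.keystore = /path/to/keystore.jks
  a1.sources.r1.keystore-password = changeit
  a1.sources.r1.keystore-type = JKS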
+Thrift Source
+~~~~~~~~~~~~~
+
+Listens on Thrift port and receives events from external Thrift client streams.
+When paired with the built-in Thrift Sink on another (previous hop) Flume agent,
+it can create tiered collection topologies.
+Required properties are in **bold**.
+
+==================   ===========  ===================================================
+Property Name        Default      Description
+==================   ===========  ===================================================
+**channels**         --
+**type**             --           The component type name, needs to be ``thrift``
+**bind**             --           hostname or IP address to listen on
+**port**             --           Port # to bind to
+threads              --           Maximum number of worker threads to spawn
+selector.type
+selector.*
+interceptors         --           Space-separated list of interceptors
+interceptors.*
+==================   ===========  ===================================================
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.sources = r1
+  a1.channels = c1
+  a1.sources.r1.type = thrift
+  a1.sources.r1.channels = c1
+  a1.sources.r1.bind = 0.0.0.0
+  a1.sources.r1.port = 4141
+
 Exec Source
 ~~~~~~~~~~~
 
@@ -665,13 +770,14 @@ Property Name    Default      Descriptio
 **channels**     --
 **type**         --           The component type name, needs to be ``exec``
 **command**      --           The command to execute
+shell            --           A shell invocation used to run the command.  e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
 restartThrottle  10000        Amount of time (in millis) to wait before attempting a restart
 restart          false        Whether the executed cmd should be restarted if it dies
 logStdErr        false        Whether the command's stderr should be logged
 batchSize        20           The max number of lines to read and send to the channel at a time
 selector.type    replicating  replicating or multiplexing
 selector.*                    Depends on the selector.type value
-interceptors     --           Space separated list of interceptors
+interceptors     --           Space-separated list of interceptors
 interceptors.*
 ===============  ===========  ==============================================================
 
@@ -708,56 +814,211 @@ Example for agent named a1:
   a1.sources.r1.command = tail -F /var/log/secure
   a1.sources.r1.channels = c1
 
+The 'shell' config is used to invoke the 'command' through a command shell (such as Bash
+or Powershell). The 'command' is passed as an argument to 'shell' for execution. This
+allows the 'command' to use features from the shell such as wildcards, back ticks, pipes,
+loops, conditionals etc. In the absence of the 'shell' config, the 'command' will be
+invoked directly. Common values for 'shell': '/bin/sh -c', '/bin/ksh -c',
+'cmd /c', 'powershell -Command', etc.
+
+.. code-block:: properties
+
+  agent_foo.sources.tailsource-1.type = exec
+  agent_foo.sources.tailsource-1.shell = /bin/bash -c
+  agent_foo.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
+
+JMS Source
+~~~~~~~~~~~
+
+JMS Source reads messages from a JMS destination such as a queue or topic. Being a JMS
+application, it should work with any JMS provider but has only been tested with ActiveMQ.
+The JMS source provides a configurable batch size, message selector, user/pass, and
+message-to-Flume-event converter. Note that the vendor-provided JMS jars should be included in the
+Flume classpath using the plugins.d directory (preferred), --classpath on the command line, or
+the FLUME_CLASSPATH variable in flume-env.sh.
+
+Required properties are in **bold**.
+
+=========================   ===========  ==============================================================
+Property Name               Default      Description
+=========================   ===========  ==============================================================
+**channels**                --
+**type**                    --           The component type name, needs to be ``jms``
+**initialContextFactory**   --           Initial Context Factory, e.g. org.apache.activemq.jndi.ActiveMQInitialContextFactory
+**connectionFactory**       --           The JNDI name the connection factory should appear as
+**providerURL**             --           The JMS provider URL
+**destinationName**         --           Destination name
+**destinationType**         --           Destination type (queue or topic)
+messageSelector             --           Message selector to use when creating the consumer
+userName                    --           Username for the destination/provider
+passwordFile                --           File containing the password for the destination/provider
+batchSize                   100          Number of messages to consume in one batch
+converter.type              DEFAULT      Class to use to convert messages to flume events. See below.
+converter.*                 --           Converter properties.
+converter.charset           UTF-8        Default converter only. Charset to use when converting JMS TextMessages to byte arrays.
+=========================   ===========  ==============================================================
+
+
+Converter
+'''''''''''
+The JMS source allows pluggable converters, though it's likely the default converter will work
+for most purposes. The default converter is able to convert Bytes, Text, and Object messages
+to FlumeEvents. In all cases, the properties in the message are added as headers to the
+FlumeEvent.
+
+BytesMessage:
+  Bytes of message are copied to body of the FlumeEvent. Cannot convert more than 2GB
+  of data per message.
+
+TextMessage:
+  Text of message is converted to a byte array and copied to the body of the
+  FlumeEvent. The default converter uses UTF-8 by default but this is configurable.
+
+ObjectMessage:
+  Object is written out to a ByteArrayOutputStream wrapped in an ObjectOutputStream and
+  the resulting array is copied to the body of the FlumeEvent.
+
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.sources = r1
+  a1.channels = c1
+  a1.sources.r1.type = jms
+  a1.sources.r1.channels = c1
+  a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
+  a1.sources.r1.connectionFactory = GenericConnectionFactory
+  a1.sources.r1.providerURL = tcp://mqserver:61616
+  a1.sources.r1.destinationName = BUSINESS_DATA
+  a1.sources.r1.destinationType = QUEUE
+
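The optional properties from the table above could be layered onto that example
as in the following sketch. The selector expression, user name, password file
path, and charset are hypothetical values chosen for illustration only.

.. code-block:: properties

  a1.sources.r1.type = jms
  a1.sources.r1.channels = c1

  # Only consume messages matching a JMS selector (hypothetical expression)
  a1.sources.r1.messageSelector = JMSPriority > 5

  # Credentials: the password is read from a file (hypothetical path)
  a1.sources.r1.userName = flume_user
  a1.sources.r1.passwordFile = /etc/flume/jms.password

  # Smaller batches than the default of 100
  a1.sources.r1.batchSize = 50

  # Default converter only: charset used for TextMessages
  a1.sources.r1.converter.charset = ISO-8859-1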
 Spooling Directory Source
 ~~~~~~~~~~~~~~~~~~~~~~~~~
-This source lets you ingest data by dropping files in a spooling directory on
-disk. **Unlike other asynchronous sources, this source
-avoids data loss even if Flume is restarted or fails.**
-Flume will watch the directory for new files and read then ingest them
-as they appear. After a given file has been fully read into the channel,
-it is renamed to indicate completion. This allows a cleaner process to remove
-completed files periodically. Note, however,
-that events may be duplicated if failures occur, consistent with the semantics
-offered by other Flume components. The channel optionally inserts the full path of
-the origin file into a header field of each event. This source buffers file data
-in memory during reads; be sure to set the `bufferMaxLineLength` option to a number
-greater than the longest line you expect to see in your input data.
-
-.. warning:: This channel expects that only immutable, uniquely named files
-             are dropped in the spooling directory. If duplicate names are
-             used, or files are modified while being read, the source will
-             fail with an error message. For some use cases this may require
-             adding unique identifiers (such as a timestamp) to log file names
-             when they are copied into the spooling directory.
+This source lets you ingest data by placing files to be ingested into a
+"spooling" directory on disk.
+This source will watch the specified directory for new files, and will parse
+events out of new files as they appear.
+The event parsing logic is pluggable.
+After a given file has been fully read
+into the channel, it is renamed to indicate completion (or optionally deleted).
+
+Unlike the Exec source, this source is reliable and will not miss data, even if
+Flume is restarted or killed. In exchange for this reliability, only immutable,
+uniquely-named files must be dropped into the spooling directory. Flume tries
+to detect violations of these requirements and will fail loudly when it finds one:
+
+#. If a file is written to after being placed into the spooling directory,
+   Flume will print an error to its log file and stop processing.
+#. If a file name is reused at a later time, Flume will print an error to its
+   log file and stop processing.
+
+To avoid the above issues, it may be useful to add a unique identifier
+(such as a timestamp) to log file names when they are moved into the spooling
+directory.
+
+Despite the reliability guarantees of this source, there are still
+cases in which events may be duplicated if certain downstream failures occur.
+This is consistent with the guarantees offered by other Flume components.
 
 ====================  ==============  ==========================================================
 Property Name         Default         Description
 ====================  ==============  ==========================================================
 **channels**          --
-**type**              --              The component type name, needs to be ``spooldir``
-**spoolDir**          --              The directory where log files will be spooled
+**type**              --              The component type name, needs to be ``spooldir``.
+**spoolDir**          --              The directory from which to read files.
 fileSuffix            .COMPLETED      Suffix to append to completely ingested files
+deletePolicy          never           When to delete completed files: ``never`` or ``immediate``
 fileHeader            false           Whether to add a header storing the filename
 fileHeaderKey         file            Header key to use when appending filename to header
-batchSize             10              Granularity at which to batch transfer to the channel
-bufferMaxLines        100             Maximum number of lines the commit buffer can hold
-bufferMaxLineLength   5000            Maximum length of a line in the commit buffer
+ignorePattern         ^$              Regular expression specifying which files to ignore (skip)
+trackerDir            .flumespool     Directory to store metadata related to processing of files.
+                                      If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
+batchSize             100             Granularity at which to batch transfer to the channel
+inputCharset          UTF-8           Character set used by deserializers that treat the input file as text.
+deserializer          ``LINE``        Specify the deserializer used to parse the file into events.
+                                      Defaults to parsing each line as an event. The class specified must implement
+                                      ``EventDeserializer.Builder``.
+deserializer.*                        Varies per event deserializer.
+bufferMaxLines        --              (Obsolete) This option is now ignored.
+bufferMaxLineLength   5000            (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
 selector.type         replicating     replicating or multiplexing
 selector.*                            Depends on the selector.type value
-interceptors          --              Space separated list of interceptors
+interceptors          --              Space-separated list of interceptors
 interceptors.*
 ====================  ==============  ==========================================================
 
-Example for agent named a1:
+Example for an agent named agent-1:
 
 .. code-block:: properties
 
-  a1.sources = r1
-  a1.channels = c1
-  a1.sources.r1.type = spooldir
-  a1.sources.r1.spoolDir = /var/log/apache/flumeSpool
-  a1.sources.r1.fileHeader = true
-  a1.sources.r1.channels = c1
+  agent-1.channels = ch-1
+  agent-1.sources = src-1
+
+  agent-1.sources.src-1.type = spooldir
+  agent-1.sources.src-1.channels = ch-1
+  agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
+  agent-1.sources.src-1.fileHeader = true
+
+Event Deserializers
+'''''''''''''''''''
+
+The following event deserializers ship with Flume.
+
+LINE
+^^^^
+
+This deserializer generates one event per line of text input.
+
+==============================  ==============  ==========================================================
+Property Name                   Default         Description
+==============================  ==============  ==========================================================
+deserializer.maxLineLength      2048            Maximum number of characters to include in a single event.
+                                                If a line exceeds this length, it is truncated, and the
+                                                remaining characters on the line will appear in a
+                                                subsequent event.
+deserializer.outputCharset      UTF-8           Charset to use for encoding events put into the channel.
+==============================  ==============  ==========================================================
+
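A spooling directory source using the ``LINE`` deserializer might be tuned as
in the following sketch, which uses only properties from the tables above. The
ignore pattern and the line length are illustrative values, not
recommendations.

.. code-block:: properties

  agent-1.sources.src-1.type = spooldir
  agent-1.sources.src-1.channels = ch-1
  agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool

  # Delete files once fully ingested instead of renaming them
  agent-1.sources.src-1.deletePolicy = immediate

  # Skip files still being staged (hypothetical ".tmp" naming convention)
  agent-1.sources.src-1.ignorePattern = ^.*[.]tmp$

  # One event per line, with longer lines allowed before splitting
  agent-1.sources.src-1.inputCharset = UTF-8
  agent-1.sources.src-1.deserializer = LINE
  agent-1.sources.src-1.deserializer.maxLineLength = 8192
  agent-1.sources.src-1.deserializer.outputCharset = UTF-8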
+AVRO
+^^^^
+
+This deserializer is able to read an Avro container file, and it generates
+one event per Avro record in the file.
+Each event is annotated with a header that indicates the schema used.
+The body of the event is the binary Avro record data, not
+including the schema or the rest of the container file elements.
+
+Note that if the spool directory source must retry putting one of these events
+onto a channel (for example, because the channel is full), then it will reset
+and retry from the most recent Avro container file sync point. To reduce
+potential event duplication in such a failure scenario, write sync markers more
+frequently in your Avro input files.
+
+==============================  ==============  ======================================================================
+Property Name                   Default         Description
+==============================  ==============  ======================================================================
+deserializer.schemaType         HASH            How the schema is represented. By default, or when the value ``HASH``
+                                                is specified, the Avro schema is hashed and
+                                                the hash is stored in every event in the event header
+                                                "flume.avro.schema.hash". If ``LITERAL`` is specified, the JSON-encoded
+                                                schema itself is stored in every event in the event header
+                                                "flume.avro.schema.literal". Using ``LITERAL`` mode is relatively
+                                                inefficient compared to ``HASH`` mode.
+==============================  ==============  ======================================================================
+
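As a short sketch, selecting the ``AVRO`` deserializer with the literal schema
header could look like this. The spool directory path is hypothetical.

.. code-block:: properties

  agent-1.sources.src-1.type = spooldir
  agent-1.sources.src-1.channels = ch-1
  agent-1.sources.src-1.spoolDir = /var/spool/flume/avro

  # One event per Avro record; store the JSON schema in each event header
  agent-1.sources.src-1.deserializer = AVRO
  agent-1.sources.src-1.deserializer.schemaType = LITERAL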
+BlobDeserializer
+^^^^^^^^^^^^^^^^
+
+This deserializer reads a Binary Large Object (BLOB) per event, typically one BLOB per file, for example a PDF or JPG file. Note that this approach is not suitable for very large objects because the entire BLOB is buffered in RAM.
+
+==========================  ==================  =======================================================================
+Property Name               Default             Description
+==========================  ==================  =======================================================================
+**deserializer**            --                  The FQCN of this class: ``org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder``
+deserializer.maxBlobLength  100000000           The maximum number of bytes to read and buffer for a given request
+==========================  ==================  =======================================================================
+
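A sketch of a spooling directory source that ingests whole files as BLOBs,
using the class name and property from the table above. The spool directory
path and the size limit are hypothetical.

.. code-block:: properties

  agent-1.sources.src-1.type = spooldir
  agent-1.sources.src-1.channels = ch-1
  agent-1.sources.src-1.spoolDir = /var/spool/flume/blobs

  # One event per file; cap buffered bytes at roughly 10 MB
  agent-1.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
  agent-1.sources.src-1.deserializer.maxBlobLength = 10000000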
 
 NetCat Source
 ~~~~~~~~~~~~~
@@ -781,7 +1042,7 @@ max-line-length  512          Max line l
 ack-every-event  true         Respond with an "OK" for every event received
 selector.type    replicating  replicating or multiplexing
 selector.*                    Depends on the selector.type value
-interceptors     --           Space separated list of interceptors
+interceptors     --           Space-separated list of interceptors
 interceptors.*
 ===============  ===========  ===========================================
 
@@ -810,7 +1071,7 @@ Property Name   Default      Description
 **type**        --           The component type name, needs to be ``seq``
 selector.type                replicating or multiplexing
 selector.*      replicating  Depends on the selector.type value
-interceptors    --           Space separated list of interceptors
+interceptors    --           Space-separated list of interceptors
 interceptors.*
 batchSize       1
 ==============  ===========  ========================================
@@ -848,7 +1109,7 @@ Property Name    Default      Descriptio
 eventSize        2500         Maximum size of a single event line, in bytes
 selector.type                 replicating or multiplexing
 selector.*       replicating  Depends on the selector.type value
-interceptors     --           Space separated list of interceptors
+interceptors     --           Space-separated list of interceptors
 interceptors.*
 ==============   ===========  ==============================================
 
@@ -890,7 +1151,7 @@ readBufferSize        1024              
 numProcessors         (auto-detected)   Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable.
 selector.type         replicating       replicating, multiplexing, or custom
 selector.*            --                Depends on the ``selector.type`` value
-interceptors          --                Space separated list of interceptors.
+interceptors          --                Space-separated list of interceptors.
 interceptors.*
 ====================  ================  ==============================================
 
@@ -918,7 +1179,7 @@ Property Name   Default      Description
 **port**        --           Port # to bind to
 selector.type                replicating or multiplexing
 selector.*      replicating  Depends on the selector.type value
-interceptors    --           Space separated list of interceptors
+interceptors    --           Space-separated list of interceptors
 interceptors.*
 ==============  ===========  ==============================================
 
@@ -940,9 +1201,9 @@ A source which accepts Flume Events by H
 for experimentation only. HTTP requests are converted into flume events by
 a pluggable "handler" which must implement the HTTPSourceHandler interface.
 This handler takes a HttpServletRequest and returns a list of
-flume events. All events handler from one Http request are committed to the channel
+flume events. All events handled from one Http request are committed to the channel
 in one transaction, thus allowing for increased efficiency on channels like
-the file channel. If the handler throws an exception this source will
+the file channel. If the handler throws an exception, this source will
 return a HTTP status of 400. If the channel is full, or the source is unable to
 append events to the channel, the source will return a HTTP 503 - Temporarily
 unavailable status.
@@ -950,18 +1211,19 @@ unavailable status.
 All events sent in one post request are considered to be one batch and
 inserted into the channel in one transaction.
 
-==============  ===========================================  ====================================================================
-Property Name   Default                                      Description
-==============  ===========================================  ====================================================================
-**type**                                                     The FQCN of this class:  ``org.apache.flume.source.http.HTTPSource``
-**port**        --                                           The port the source should bind to.
-handler         ``org.apache.flume.http.JSONHandler``        The FQCN of the handler class.
-handler.*       --                                           Config parameters for the handler
-selector.type   replicating                                  replicating or multiplexing
-selector.*                                                   Depends on the selector.type value
-interceptors    --                                           Space separated list of interceptors
+==============  ============================================  ====================================================================
+Property Name   Default                                       Description
+==============  ============================================  ====================================================================
+**type**                                                      The component type name, needs to be ``http``
+**port**        --                                            The port the source should bind to.
+bind            0.0.0.0                                       The hostname or IP address to listen on
+handler         ``org.apache.flume.source.http.JSONHandler``  The FQCN of the handler class.
+handler.*       --                                            Config parameters for the handler
+selector.type   replicating                                   replicating or multiplexing
+selector.*                                                    Depends on the selector.type value
+interceptors    --                                            Space-separated list of interceptors
 interceptors.*
-=================================================================================================================================
+==============  ============================================  ====================================================================
 
 For example, a http source for agent named a1:
 
@@ -969,7 +1231,7 @@ For example, a http source for agent nam
 
   a1.sources = r1
   a1.channels = c1
-  a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
+  a1.sources.r1.type = http
   a1.sources.r1.port = 5140
   a1.sources.r1.channels = c1
   a1.sources.r1.handler = org.example.rest.RestHandler
@@ -1006,7 +1268,7 @@ To set the charset, the request must hav
 ``application/json; charset=UTF-8`` (replace UTF-8 with UTF-16 or UTF-32 as
 required).
 
-One way to create an event in the format expected by this handler, is to
+One way to create an event in the format expected by this handler is to
 use JSONEvent provided in the Flume SDK and use Google Gson to create the JSON
 string using the Gson#toJson(Object, Type)
 method. The type token to pass as the 2nd argument of this method
@@ -1016,6 +1278,17 @@ for list of events can be created by:
 
   Type type = new TypeToken<List<JSONEvent>>() {}.getType();
 
+BlobHandler
+'''''''''''
+By default HTTPSource splits JSON input into Flume events. As an alternative, BlobHandler is a handler for HTTPSource that returns an event that contains the request parameters as well as the Binary Large Object (BLOB) uploaded with this request, for example a PDF or JPG file. Note that this approach is not suitable for very large objects because it buffers the entire BLOB in RAM.
+
+=====================  ==================  ============================================================================
+Property Name          Default             Description
+=====================  ==================  ============================================================================
+**handler**            --                  The FQCN of this class: ``org.apache.flume.sink.solr.morphline.BlobHandler``
+handler.maxBlobLength  100000000           The maximum number of bytes to read and buffer for a given request
+=====================  ==================  ============================================================================
+
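+For illustration, a minimal sketch of an http source for agent a1 that uses BlobHandler. The agent, source and channel names, the port and the ``maxBlobLength`` value are assumptions chosen for this example:
+
+.. code-block:: properties
+
+  a1.sources = r1
+  a1.channels = c1
+  a1.sources.r1.type = http
+  a1.sources.r1.port = 5140
+  a1.sources.r1.channels = c1
+  a1.sources.r1.handler = org.apache.flume.sink.solr.morphline.BlobHandler
+  # Assumed limit for this sketch: buffer at most 50000000 bytes per request
+  a1.sources.r1.handler.maxBlobLength = 50000000
+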
 Legacy Sources
 ~~~~~~~~~~~~~~
 
@@ -1050,7 +1323,7 @@ Property Name   Default      Description
 **port**        --           The port # to listen on
 selector.type                replicating or multiplexing
 selector.*      replicating  Depends on the selector.type value
-interceptors    --           Space separated list of interceptors
+interceptors    --           Space-separated list of interceptors
 interceptors.*
 ==============  ===========  ========================================================================================
 
@@ -1077,7 +1350,7 @@ Property Name   Default      Description
 **port**        --           The port # to listen on
 selector.type                replicating or multiplexing
 selector.*      replicating  Depends on the selector.type value
-interceptors    --           Space separated list of interceptors
+interceptors    --           Space-separated list of interceptors
 interceptors.*
 ==============  ===========  ======================================================================================
 
@@ -1106,7 +1379,7 @@ Property Name   Default      Description
 **type**        --           The component type name, needs to be your FQCN
 selector.type                ``replicating`` or ``multiplexing``
 selector.*      replicating  Depends on the selector.type value
-interceptors    --           Space separated list of interceptors
+interceptors    --           Space-separated list of interceptors
 interceptors.*
 ==============  ===========  ==============================================
 
@@ -1201,7 +1474,7 @@ complete files in the directory.
 Required properties are in **bold**.
 
 .. note:: For all of the time related escape sequences, a header with the key
-          "timestamp" must exist among the headers of the event. One way to add
+          "timestamp" must exist among the headers of the event (unless ``hdfs.useLocalTimeStamp`` is set to ``true``). One way to add
           this automatically is to use the TimestampInterceptor.
 
 ======================  ============  ======================================================================
@@ -1212,6 +1485,8 @@ Name                    Default       De
 **hdfs.path**           --            HDFS directory path (eg hdfs://namenode/flume/webdata/)
 hdfs.filePrefix         FlumeData     Name prefixed to files created by Flume in hdfs directory
 hdfs.fileSuffix         --            Suffix to append to file (eg ``.avro`` - *NOTE: period is not automatically added*)
+hdfs.inUsePrefix        --            Prefix that is used for temporary files that Flume actively writes into
+hdfs.inUseSuffix        ``.tmp``      Suffix that is used for temporary files that Flume actively writes into
 hdfs.rollInterval       30            Number of seconds to wait before rolling current file
                                       (0 = never roll based on time interval)
 hdfs.rollSize           1024          File size to trigger roll, in bytes (0: never roll based on file size)
@@ -1220,12 +1495,13 @@ hdfs.rollCount          10            Nu
 hdfs.idleTimeout        0             Timeout after which inactive files get closed
                                       (0 = disable automatic closing of idle files)
 hdfs.batchSize          100           number of events written to file before it is flushed to HDFS
-hdfs.codeC              --            Compression codec. one of following : gzip, bzip2, lzo, snappy
+hdfs.codeC              --            Compression codec. One of the following: gzip, bzip2, lzo, lzop, snappy
 hdfs.fileType           SequenceFile  File format: currently ``SequenceFile``, ``DataStream`` or ``CompressedStream``
                                       (1)DataStream will not compress output file and please don't set codeC
                                       (2)CompressedStream requires set hdfs.codeC with an available codeC
 hdfs.maxOpenFiles       5000          Allow only this number of open files. If this number is exceeded, the oldest file is closed.
-hdfs.writeFormat        --            "Text" or "Writable"
+hdfs.minBlockReplicas   --            Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
+hdfs.writeFormat        --            Format for sequence file records. One of "Text" or "Writable" (the default).
 hdfs.callTimeout        10000         Number of milliseconds allowed for HDFS operations, such as open, write, flush, close.
                                       This number should be increased if many HDFS timeout operations are occurring.
 hdfs.threadsPoolSize    10            Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
@@ -1237,6 +1513,7 @@ hdfs.round              false         Sh
 hdfs.roundValue         1             Rounded down to the highest multiple of this (in the unit configured using ``hdfs.roundUnit``), less than current time.
 hdfs.roundUnit          second        The unit of the round down value - ``second``, ``minute`` or ``hour``.
 hdfs.timeZone           Local Time    Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
+hdfs.useLocalTimeStamp  false         Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
 serializer              ``TEXT``      Other possible options include ``avro_event`` or the
                                       fully-qualified class name of an implementation of the
                                       ``EventSerializer.Builder`` interface.
@@ -1248,7 +1525,7 @@ Example for agent named a1:
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = hdfs
   a1.sinks.k1.channel = c1
   a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
@@ -1279,7 +1556,7 @@ Example for agent named a1:
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = logger
   a1.sinks.k1.channel = c1
 
@@ -1292,30 +1569,70 @@ hostname / port pair. The events are tak
 batches of the configured batch size.
 Required properties are in **bold**.
 
-===============  =======  ==============================================
-Property Name    Default  Description
-===============  =======  ==============================================
-**channel**      --
-**type**         --       The component type name, needs to be ``avro``.
-**hostname**     --       The hostname or IP address to bind to.
-**port**         --       The port # to listen on.
-batch-size       100      number of event to batch together for send.
-connect-timeout  20000    Amount of time (ms) to allow for the first (handshake) request.
-request-timeout  20000    Amount of time (ms) to allow for requests after the first.
-
-===============  =======  ==============================================
+==========================   =======  ==============================================
+Property Name                Default  Description
+==========================   =======  ==============================================
+**channel**                  --
+**type**                     --       The component type name, needs to be ``avro``.
+**hostname**                 --       The hostname or IP address to bind to.
+**port**                     --       The port # to listen on.
+batch-size                   100      Number of events to batch together per send.
+connect-timeout              20000    Amount of time (ms) to allow for the first (handshake) request.
+request-timeout              20000    Amount of time (ms) to allow for requests after the first.
+reset-connection-interval    none     Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent.
+compression-type             none     This can be "none" or "deflate".  The compression-type must match the compression-type of the matching AvroSource
+compression-level            6        The level of compression applied to events. 0 = no compression; 1-9 = compression, where a higher number means more compression
+ssl                          false    Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a "truststore", "truststore-password", "truststore-type", and specify whether to "trust-all-certs".
+trust-all-certs              false    If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and "listen in" on the encrypted connection.
+truststore                   --       The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically "jssecacerts" or "cacerts" in the Oracle JRE) will be used.
+truststore-password          --       The password for the specified truststore.
+truststore-type              JKS      The type of the Java truststore. This can be "JKS" or other supported Java truststore type.
+==========================   =======  ==============================================
 
 Example for agent named a1:
 
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = avro
   a1.sinks.k1.channel = c1
   a1.sinks.k1.hostname = 10.10.10.10
   a1.sinks.k1.port = 4545
 
+Thrift Sink
+~~~~~~~~~~~
+
+This sink forms one half of Flume's tiered collection support. Flume events
+sent to this sink are turned into Thrift events and sent to the configured
+hostname / port pair. The events are taken from the configured Channel in
+batches of the configured batch size.
+Required properties are in **bold**.
+
+==========================   =======  ==============================================
+Property Name                Default  Description
+==========================   =======  ==============================================
+**channel**                  --
+**type**                     --       The component type name, needs to be ``thrift``.
+**hostname**                 --       The hostname or IP address to bind to.
+**port**                     --       The port # to listen on.
+batch-size                   100      Number of events to batch together per send.
+connect-timeout              20000    Amount of time (ms) to allow for the first (handshake) request.
+request-timeout              20000    Amount of time (ms) to allow for requests after the first.
+connection-reset-interval    none     Amount of time (s) before the connection to the next hop is reset. This will force the Thrift Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent.
+==========================   =======  ==============================================
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.channels = c1
+  a1.sinks = k1
+  a1.sinks.k1.type = thrift
+  a1.sinks.k1.channel = c1
+  a1.sinks.k1.hostname = 10.10.10.10
+  a1.sinks.k1.port = 4545
+
 IRC Sink
 ~~~~~~~~
 
@@ -1346,7 +1663,7 @@ Example for agent named a1:
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = irc
   a1.sinks.k1.channel = c1
   a1.sinks.k1.hostname = irc.yourdomain.com
@@ -1375,7 +1692,7 @@ Example for agent named a1:
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = file_roll
   a1.sinks.k1.channel = c1
   a1.sinks.k1.sink.directory = /var/log/flume
@@ -1399,7 +1716,7 @@ Example for agent named a1:
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = null
   a1.sinks.k1.channel = c1
 
@@ -1416,36 +1733,47 @@ HBase puts and/or increments. These puts
 to HBase. This sink provides the same consistency guarantees as HBase,
 which is currently row-wise atomicity. In the event of Hbase failing to
 write certain events, the sink will replay all events in that transaction.
-For convenience two serializers are provided with flume. The
+
+The HBaseSink supports writing data to secure HBase. To write to secure HBase, the user
+the agent is running as must have write permissions to the table the sink is configured
+to write to. The principal and keytab to use to authenticate against the KDC can be specified
+in the configuration. The hbase-site.xml in the Flume agent's classpath
+must have authentication set to ``kerberos`` (For details on how to do this, please refer to
+HBase documentation).
+
+For convenience, two serializers are provided with Flume. The
 SimpleHbaseEventSerializer (org.apache.flume.sink.hbase.SimpleHbaseEventSerializer)
 writes the event body
-as is to HBase, and optionally increments a column in Hbase. This is primarily
+as-is to HBase, and optionally increments a column in Hbase. This is primarily
 an example implementation. The RegexHbaseEventSerializer
 (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks the event body
 based on the given regex and writes each part into different columns.
 
 The type is the FQCN: org.apache.flume.sink.hbase.HBaseSink.
+
 Required properties are in **bold**.
 
-================  ======================================================  ========================================================================
-Property Name     Default                                                 Description
-================  ======================================================  ========================================================================
-**channel**       --
-**type**          --                                                      The component type name, needs to be ``org.apache.flume.sink.hbase.HBaseSink``
-**table**         --                                                      The name of the table in Hbase to write to.
-**columnFamily**  --                                                      The column family in Hbase to write to.
-batchSize         100                                                     Number of events to be written per txn.
-serializer        org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
-serializer.*      --                                                      Properties to be passed to the serializer.
-================  ======================================================  ========================================================================
+==================  ======================================================  ==============================================================================
+Property Name       Default                                                 Description
+==================  ======================================================  ==============================================================================
+**channel**         --
+**type**            --                                                      The component type name, needs to be ``hbase``
+**table**           --                                                      The name of the table in Hbase to write to.
+**columnFamily**    --                                                      The column family in Hbase to write to.
+batchSize           100                                                     Number of events to be written per txn.
+serializer          org.apache.flume.sink.hbase.SimpleHbaseEventSerializer  Default increment column = "iCol", payload column = "pCol".
+serializer.*        --                                                      Properties to be passed to the serializer.
+kerberosPrincipal   --                                                      Kerberos user principal for accessing secure HBase
+kerberosKeytab      --                                                      Kerberos keytab for accessing secure HBase
+==================  ======================================================  ==============================================================================
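+
+To write to secure HBase, the two Kerberos properties can be added to the sink configuration. A sketch for a sink k1 of agent a1, in which the principal and the keytab path are placeholders:
+
+.. code-block:: properties
+
+  # Placeholder principal and keytab location
+  a1.sinks.k1.kerberosPrincipal = flume/host.example.com@EXAMPLE.COM
+  a1.sinks.k1.kerberosKeytab = /etc/flume/conf/flume.keytab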
 
 Example for agent named a1:
 
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
-  a1.sinks.k1.type = org.apache.flume.sink.hbase.HBaseSink
+  a1.sinks = k1
+  a1.sinks.k1.type = hbase
   a1.sinks.k1.table = foo_table
   a1.sinks.k1.columnFamily = bar_cf
   a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
@@ -1461,7 +1789,6 @@ HBase puts and/or increments. These puts
 to HBase. This sink provides the same consistency guarantees as HBase,
 which is currently row-wise atomicity. In the event of Hbase failing to
 write certain events, the sink will replay all events in that transaction.
-This sink is still experimental.
 The type is the FQCN: org.apache.flume.sink.hbase.AsyncHBaseSink.
 Required properties are in **bold**.
 
@@ -1469,62 +1796,138 @@ Required properties are in **bold**.
 Property Name     Default                                                       Description
 ================  ============================================================  ====================================================================================
 **channel**       --
-**type**          --                                                            The component type name, needs to be ``org.apache.flume.sink.hbase.AsyncHBaseSink``
+**type**          --                                                            The component type name, needs to be ``asynchbase``
 **table**         --                                                            The name of the table in Hbase to write to.
+zookeeperQuorum   --                                                            The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml
+znodeParent       /hbase                                                        The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml
 **columnFamily**  --                                                            The column family in Hbase to write to.
 batchSize         100                                                           Number of events to be written per txn.
-timeout           --                                                            The length of time (in milliseconds) the sink waits for acks from hbase for
-                                                                                all events in a transaction. If no timeout is specified, the sink will wait forever.
+timeout           60000                                                         The length of time (in milliseconds) the sink waits for acks from hbase for
+                                                                                all events in a transaction.
 serializer        org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
 serializer.*      --                                                            Properties to be passed to the serializer.
 ================  ============================================================  ====================================================================================
 
+Note that this sink takes the Zookeeper Quorum and parent znode information in
+the configuration. Both values may be specified in the Flume configuration
+file, using the ``zookeeperQuorum`` and ``znodeParent`` properties listed
+above.
+
+If these are not provided in the configuration, then the sink
+will read this information from the first hbase-site.xml file in the classpath.
+
 Example for agent named a1:
 
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
-  a1.sinks.k1.type = org.apache.flume.sink.hbase.AsyncHBaseSink
+  a1.sinks = k1
+  a1.sinks.k1.type = asynchbase
   a1.sinks.k1.table = foo_table
   a1.sinks.k1.columnFamily = bar_cf
   a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
   a1.sinks.k1.channel = c1
 
+MorphlineSolrSink
+~~~~~~~~~~~~~~~~~
+
+This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications.
+
+This sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to Search applications.
+
+The ETL functionality is customizable using a `morphline configuration file <http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that defines a chain of transformation commands that pipe event records from one command to another. 
+
+Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline command is a bit like a Flume Interceptor. Morphlines can be embedded into Hadoop components such as Flume.
+
+Commands to parse and transform a set of standard data formats such as log files, Avro, CSV, Text, HTML, XML, PDF, Word, Excel, etc. are provided out of the box, and additional custom commands and parsers for additional data formats can be added as morphline plugins. Any kind of data format can be indexed and any Solr documents for any kind of Solr schema can be generated, and any custom ETL logic can be registered and executed.
+
+Morphlines manipulate continuous streams of records. The data model can be described as follows: A record is a set of named fields where each field has an ordered list of one or more values. A value can be any Java Object. That is, a record is essentially a hash table where each hash table entry contains a String key and a list of Java Objects as values. (The implementation uses Guava's ``ArrayListMultimap``, which is a ``ListMultimap``). Note that a field can have multiple values and any two records need not use common field names. 
+
+This sink copies the body of the Flume event into the ``_attachment_body`` field of the morphline record, and copies the headers of the Flume event into record fields of the same name. The commands can then act on this data.
+
+Routing to a SolrCloud cluster is supported to improve scalability. Indexing load can be spread across a large number of MorphlineSolrSinks for improved scalability. Indexing load can be replicated across multiple MorphlineSolrSinks for high availability, for example using Flume features such as Load balancing Sink Processor. MorphlineInterceptor can also help to implement dynamic routing to multiple Solr collections (e.g. for multi-tenancy).
+
+The morphline and solr jars required for your environment must be placed in the lib directory of the Apache Flume installation. 
+
+The type is the FQCN: org.apache.flume.sink.solr.morphline.MorphlineSolrSink
+
+Required properties are in **bold**.
+
+===================  =======================================================================  ========================
+Property Name        Default                                                                  Description
+===================  =======================================================================  ========================
+**channel**          --
+**type**             --                                                                       The component type name, needs to be ``org.apache.flume.sink.solr.morphline.MorphlineSolrSink``
+**morphlineFile**    --                                                                       The relative or absolute path on the local file system to the morphline configuration file. Example: ``/etc/flume-ng/conf/morphline.conf``
+morphlineId          null                                                                     Optional name used to identify a morphline if there are multiple morphlines in a morphline config file
+batchSize            1000                                                                     The maximum number of events to take per flume transaction.
+batchDurationMillis  1000                                                                     The maximum duration per flume transaction (ms). The transaction commits after this duration or when batchSize is exceeded, whichever comes first.
+handlerClass         org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl                The FQCN of a class implementing org.apache.flume.sink.solr.morphline.MorphlineHandler
+===================  =======================================================================  ========================
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.channels = c1
+  a1.sinks = k1
+  a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
+  a1.sinks.k1.channel = c1
+  a1.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf
+  # a1.sinks.k1.morphlineId = morphline1
+  # a1.sinks.k1.batchSize = 1000
+  # a1.sinks.k1.batchDurationMillis = 1000
+
 ElasticSearchSink
-'''''''''''''''''
+~~~~~~~~~~~~~~~~~
+
+This sink writes data to an elasticsearch cluster. By default, events will be written so that the `Kibana <http://kibana.org>`_ graphical interface
+can display them - just as if `logstash <https://logstash.net>`_ wrote them. 
+
+The elasticsearch and lucene-core jars required for your environment must be placed in the lib directory of the Apache Flume installation. 
+Elasticsearch requires that the major version of the client JAR match that of the server and that both are running the same minor version
+of the JVM. SerializationExceptions will appear if this is incorrect. To 
+select the required version first determine the version of elasticsearch and the JVM version the target cluster is running. Then select an elasticsearch client
+library which matches the major version. A 0.19.x client can talk to a 0.19.x cluster; 0.20.x can talk to 0.20.x and 0.90.x can talk to 0.90.x. Once the
+elasticsearch version has been determined then read the pom.xml file to determine the correct lucene-core JAR version to use. The Flume agent
+which is running the ElasticSearchSink should also match the JVM the target cluster is running down to the minor version.
+
+Events will be written to a new index every day. The name will be <indexName>-yyyy-MM-dd where <indexName> is the indexName parameter. The sink
+will start writing to a new index at midnight UTC.
+
+Events are serialized for elasticsearch by the ElasticSearchLogStashEventSerializer by default. This behaviour can be
+overridden with the serializer parameter. This parameter accepts implementations of org.apache.flume.sink.elasticsearch.ElasticSearchEventSerializer
+or org.apache.flume.sink.elasticsearch.ElasticSearchIndexRequestBuilderFactory. Implementing ElasticSearchEventSerializer is deprecated in favour of
+the more powerful ElasticSearchIndexRequestBuilderFactory.
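+
+A minimal sketch of overriding the serializer, assuming the agent ``a1`` and sink ``k1`` names used in the examples in this guide.
+``ElasticSearchDynamicSerializer`` is used here only as an illustration and is assumed to be available in the same package; any
+implementation of the two interfaces above could be named instead. With ``indexName`` set to ``foo_index``, events would be written
+to daily indices named ``foo_index-yyyy-MM-dd``:
+
+.. code-block:: properties
+
+  # Illustrative values only; adjust names and classes to your environment
+  a1.sinks.k1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
+  a1.sinks.k1.indexName = foo_index
+  a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer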
 
-This sink writes data to ElasticSearch. A class implementing
-ElasticSearchEventSerializer which is specified by the configuration is used to convert the events into
-XContentBuilder which detail the fields and mappings which will be indexed. These are then then written
-to ElasticSearch. The sink will generate an index per day allowing easier management instead of dealing with
-a single large index
 The type is the FQCN: org.apache.flume.sink.elasticsearch.ElasticSearchSink
+
 Required properties are in **bold**.
 
-================  ==================================================================  =======================================================================================================
-Property Name     Default                                                             Description
-================  ==================================================================  =======================================================================================================
+================  ======================================================================== =======================================================================================================
+Property Name     Default                                                                  Description
+================  ======================================================================== =======================================================================================================
 **channel**       --
-**type**          --                                                                  The component type name, needs to be ``org.apache.flume.sink.elasticsearch.ElasticSearchSink``
-**hostNames**     --                                                                  Comma separated list of hostname:port, if the port is not present the default port '9300' will be used
-indexName         flume                                                               The name of the index which the date will be appended to. Example 'flume' -> 'flume-yyyy-MM-dd'
-indexType         logs                                                                The type to index the document to, defaults to 'log'
-clusterName       elasticsearch                                                       Name of the ElasticSearch cluster to connect to
-batchSize         100                                                                 Number of events to be written per txn.
-ttl               --                                                                  TTL in days, when set will cause the expired documents to be deleted automatically,
-                                                                                      if not set documents will never be automatically deleted
-serializer        org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
-serializer.*      --                                                                  Properties to be passed to the serializer.
-================  ==================================================================  =======================================================================================================
+**type**          --                                                                       The component type name, needs to be ``org.apache.flume.sink.elasticsearch.ElasticSearchSink``
+**hostNames**     --                                                                       Comma separated list of hostname:port, if the port is not present the default port '9300' will be used
+indexName         flume                                                                    The name of the index which the date will be appended to. Example 'flume' -> 'flume-yyyy-MM-dd'
+indexType         logs                                                                     The type to index the document to, defaults to 'log'
+clusterName       elasticsearch                                                            Name of the ElasticSearch cluster to connect to
+batchSize         100                                                                      Number of events to be written per txn.
+ttl               --                                                                       TTL in days, when set will cause the expired documents to be deleted automatically,
+                                                                                           if not set documents will never be automatically deleted
+serializer        org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of
+                                                                                           either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred.
+serializer.*      --                                                                       Properties to be passed to the serializer.
+================  ======================================================================== =======================================================================================================
 
 Example for agent named a1:
 
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
-  a1.sinks.k1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
+  a1.sinks = k1
+  a1.sinks.k1.type = elasticsearch
   a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
   a1.sinks.k1.indexName = foo_index
   a1.sinks.k1.indexType = bar_type
@@ -1554,7 +1957,7 @@ Example for agent named a1:
 .. code-block:: properties
 
   a1.channels = c1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = org.example.MySink
   a1.sinks.k1.channel = c1
 
@@ -1567,19 +1970,33 @@ Source adds the events and Sink removes 
 Memory Channel
 ~~~~~~~~~~~~~~
 
-The events are stored in a an in-memory queue with configurable max size. It's
-ideal for flow that needs higher throughput and prepared to lose the staged
+The events are stored in an in-memory queue with configurable max size. It's
+ideal for flows that need higher throughput and are prepared to lose the staged
 data in the event of agent failure.
 Required properties are in **bold**.
 
-===================  =======  ==============================================================
-Property Name        Default  Description
-===================  =======  ==============================================================
-**type**             --       The component type name, needs to be ``memory``
-capacity             100      The max number of events stored in the channel
-transactionCapacity  100      The max number of events stored in the channel per transaction
-keep-alive           3        Timeout in seconds for adding or removing an event
-===================  =======  ==============================================================
+============================  ================  ===============================================================================
+Property Name                 Default           Description
+============================  ================  ===============================================================================
+**type**                      --                The component type name, needs to be ``memory``
+capacity                      100               The maximum number of events stored in the channel
+transactionCapacity           100               The maximum number of events the channel will take from a source or give to a
+                                                sink per transaction
+keep-alive                    3                 Timeout in seconds for adding or removing an event
+byteCapacityBufferPercentage  20                Defines the percent of buffer between byteCapacity and the estimated total size
+                                                of all events in the channel, to account for data in headers. See below.
+byteCapacity                  see description   Maximum total **bytes** of memory allowed as a sum of all events in this channel.
+                                                The implementation only counts the Event ``body``, which is the reason for
+                                                providing the ``byteCapacityBufferPercentage`` configuration parameter as well.
+                                                Defaults to a computed value equal to 80% of the maximum memory available to
+                                                the JVM (i.e. 80% of the -Xmx value passed on the command line).
+                                                Note that if you have multiple memory channels on a single JVM, and they happen
+                                                to hold the same physical events (i.e. if you are using a replicating channel
+                                                selector from a single source) then those event sizes may be double-counted for
+                                                channel byteCapacity purposes.
+                                                Setting this value to ``0`` will cause this value to fall back to a hard
+                                                internal limit of about 200 GB.
+============================  ================  ===============================================================================
 
 Example for agent named a1:
 
@@ -1587,14 +2004,18 @@ Example for agent named a1:
 
   a1.channels = c1
   a1.channels.c1.type = memory
-  a1.channels.c1.capacity = 1000
+  a1.channels.c1.capacity = 10000
+  a1.channels.c1.transactionCapacity = 10000
+  a1.channels.c1.byteCapacityBufferPercentage = 20
+  a1.channels.c1.byteCapacity = 800000
+  
 
 JDBC Channel
 ~~~~~~~~~~~~
 
 The events are stored in a persistent storage that's backed by a database.
 The JDBC channel currently supports embedded Derby. This is a durable channel
-that's ideal for the flows where recoverability is important.
+that's ideal for flows where recoverability is important.
 Required properties are in **bold**.
 
 ==========================  ====================================  =================================================
@@ -1625,31 +2046,6 @@ Example for agent named a1:
   a1.channels = c1
   a1.channels.c1.type = jdbc
 
-Recoverable Memory Channel
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-
-.. warning:: The Recoverable Memory Channel has been deprecated
-             in favor of the FileChannel. FileChannel is durable channel
-             and performs better than the Recoverable Memory Channel.
-
-Required properties are in **bold**.
-
-======================  ===============================================  =========================================================================
-Property Name           Default                                          Description
-======================  ===============================================  =========================================================================
-**type**                --                                               The component type name, needs to be
-                                                                         ``org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel``
-wal.dataDir             ${user.home}/.flume/recoverable-memory-channel
-wal.rollSize            (0x04000000)                                     Max size (in bytes) of a single file before we roll
-wal.minRetentionPeriod  300000                                           Min amount of time (in millis) to keep a log
-wal.workerInterval      60000                                            How often (in millis) the background worker checks for old logs
-wal.maxLogsSize         (0x20000000)                                     Total amt (in bytes) of logs to keep, excluding the current log
-capacity                100
-transactionCapacity     100
-keep-alive              3
-======================  ===============================================  =========================================================================
-
 
 File Channel
 ~~~~~~~~~~~~
@@ -1661,11 +2057,13 @@ Property Name         Default           
 ================================================  ================================  ========================================================
 **type**                                          --                                The component type name, needs to be ``file``.
 checkpointDir                                     ~/.flume/file-channel/checkpoint  The directory where checkpoint file will be stored
+useDualCheckpoints                                false                             Back up the checkpoint. If this is set to ``true``, ``backupCheckpointDir`` **must** be set
+backupCheckpointDir                               --                                The directory where the checkpoint is backed up to. This directory **must not** be the same as the data directories or the checkpoint directory
 dataDirs                                          ~/.flume/file-channel/data        The directory where log files will be stored
 transactionCapacity                               1000                              The maximum size of transaction supported by the channel
 checkpointInterval                                30000                             Amount of time (in millis) between checkpoints
 maxFileSize                                       2146435071                        Max size (in bytes) of a single log file
-minimumRequiredSpace                              524288000                         Minimum Required free space (in bytes)
+minimumRequiredSpace                              524288000                         Minimum required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
 capacity                                          1000000                           Maximum capacity of the channel
 keep-alive                                        3                                 Amount of time (in sec) to wait for a put operation
 write-timeout                                     3                                 Amount of time (in sec) to wait for a write operation
@@ -1714,14 +2112,14 @@ Generating a key with a password seperat
    -keysize 128 -validity 9000 -keystore test.keystore \
    -storetype jceks -storepass keyStorePassword
 
-Generating a key with the password the same as the key store password:      
+Generating a key with the password the same as the key store password:
 
 .. code-block:: bash
 
   keytool -genseckey -alias key-1 -keyalg AES -keysize 128 -validity 9000 \
     -keystore src/test/resources/test.keystore -storetype jceks \
     -storepass keyStorePassword
-      
+
 
 .. code-block:: properties
 
@@ -1744,7 +2142,7 @@ Let's say you have aged key-0 out and ne
   a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
   a1.channels.c1.encryption.keyProvider.keys = key-0 key-1
 
-The same scenerio as above, however key-0 has it's own password:
+The same scenario as above, except that key-0 has its own password:
 
 .. code-block:: properties
 
@@ -1806,11 +2204,12 @@ Replicating Channel Selector (default)
 
 Required properties are in **bold**.
 
-=============  ===========  ================================================
-Property Name  Default      Description
-=============  ===========  ================================================
-selector.type  replicating  The component type name, needs to be ``replicating``
-=============  ===========  ================================================
+==================  ===========  ====================================================
+Property Name       Default      Description
+==================  ===========  ====================================================
+selector.type       replicating  The component type name, needs to be ``replicating``
+selector.optional   --           Set of channels to be marked as ``optional``
+==================  ===========  ====================================================
 
 Example for agent named a1 and its source called r1:
 
@@ -1820,6 +2219,12 @@ Example for agent named a1 and it's sour
   a1.channels = c1 c2 c3
   a1.sources.r1.selector.type = replicating
   a1.sources.r1.channels = c1 c2 c3
+  a1.sources.r1.selector.optional = c3
+
+In the above configuration, c3 is an optional channel. Failure to write to c3 is
+simply ignored. Since c1 and c2 are not marked optional, failure to write to
+those channels will cause the transaction to fail.
+
 
 Multiplexing Channel Selector
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1861,7 +2266,7 @@ Property Name  Default  Description
 selector.type  --       The component type name, needs to be your FQCN
 =============  =======  ==============================================
 
-Example for agent named a1 and it's source called r1:
+Example for agent named a1 and its source called r1:
 
 .. code-block:: properties
 
@@ -1882,7 +2287,7 @@ Required properties are in **bold**.
 ===================  ===========  =================================================================================
 Property Name        Default      Description
 ===================  ===========  =================================================================================
-**sinks**            --           Space separated list of sinks that are participating in the group
+**sinks**            --           Space-separated list of sinks that are participating in the group
 **processor.type**   ``default``  The component type name, needs to be ``default``, ``failover`` or ``load_balance``
 ===================  ===========  =================================================================================
 
@@ -1909,14 +2314,14 @@ Failover Sink Processor
 Failover Sink Processor maintains a prioritized list of sinks, guaranteeing
 that so long as one is available events will be processed (delivered).
 
-The fail over mechanism works by relegating failed sinks to a pool where
+The failover mechanism works by relegating failed sinks to a pool where
 they are assigned a cool down period, increasing with sequential failures
-before they are retried. Once a sink successfully sends an event it is
+before they are retried. Once a sink successfully sends an event, it is
 restored to the live pool.
 
 To configure, set a sink group's processor to ``failover`` and set
 priorities for all individual sinks. All specified priorities must
-be unique. Furthermore, upper limit to fail over time can be set
+be unique. Furthermore, an upper limit on the failover time can be set
 (in milliseconds) using the ``maxpenalty`` property.
 
 Required properties are in **bold**.
@@ -1924,7 +2329,7 @@ Required properties are in **bold**.
 =================================  ===========  ===================================================================================
 Property Name                      Default      Description
 =================================  ===========  ===================================================================================
-**sinks**                          --           Space separated list of sinks that are participating in the group
+**sinks**                          --           Space-separated list of sinks that are participating in the group
 **processor.type**                 ``default``  The component type name, needs to be ``failover``
 **processor.priority.<sinkName>**  --             <sinkName> must be one of the sink instances associated with the current sink group
 processor.maxpenalty               30000        Upper limit on the cool down period for a failed sink (in millis)
@@ -1965,22 +2370,23 @@ If ``backoff`` is enabled, the sink proc
 sinks that fail, removing them from selection for a given timeout. When the
 timeout ends, if the sink is still unresponsive, the timeout is increased
 exponentially to avoid potentially getting stuck in long waits on unresponsive
-sinks.
+sinks. With backoff disabled, the round-robin selector passes a failed sink's
+load to the next sink in line, so the load is not evenly balanced.
 
 
 
 Required properties are in **bold**.
 
-====================================  ===============  ==========================================================================
-Property Name                         Default          Description
-====================================  ===============  ==========================================================================
-**processor.sinks**                   --               Space separated list of sinks that are participating in the group
-**processor.type**                    ``default``      The component type name, needs to be ``load_balance``
-processor.backoff                     true             Should failed sinks be backed off exponentially.
-processor.selector                    ``round_robin``  Selection mechanism. Must be either ``round_robin``, ``random``
-                                                       or FQCN of custom class that inherits from ``AbstractSinkSelector``
-processor.selector.maxBackoffMillis   30000            used by backoff selectors to limit exponential backoff in miliseconds
-====================================  ===============  ==========================================================================
+=============================  ===============  ==========================================================================
+Property Name                  Default          Description
+=============================  ===============  ==========================================================================
+**processor.sinks**            --               Space-separated list of sinks that are participating in the group
+**processor.type**             ``default``      The component type name, needs to be ``load_balance``
+processor.backoff              false            Should failed sinks be backed off exponentially.
+processor.selector             ``round_robin``  Selection mechanism. Must be either ``round_robin``, ``random``
+                                                or FQCN of custom class that inherits from ``AbstractSinkSelector``
+processor.selector.maxTimeOut  30000            Used by backoff selectors to limit exponential backoff (in milliseconds)
+=============================  ===============  ==========================================================================
 
 Example for agent named a1:
 
@@ -2023,7 +2429,7 @@ Example for agent named a1:
 
 .. code-block:: properties
 
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.sinks.k1.type = file_roll
   a1.sinks.k1.channel = c1
   a1.sinks.k1.sink.directory = /var/log/flume
@@ -2073,7 +2479,7 @@ are named components, here is an example
 .. code-block:: properties
 
   a1.sources = r1
-  a1.sinks = k1 
+  a1.sinks = k1
   a1.channels = c1
   a1.sources.r1.interceptors = i1 i2
   a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
@@ -2087,7 +2493,7 @@ Note that the interceptor builders are p
 configurable and can be passed configuration values just like they are passed to any other configurable component.
 In the above example, events are passed to the HostInterceptor first and the events returned by the HostInterceptor
 are then passed along to the TimestampInterceptor. You can specify either the fully qualified class name (FQCN)
-or the alias ``timestamp``. If you have multiple collectors writing to the same HDFS path then you could also use
+or the alias ``timestamp``. If you have multiple collectors writing to the same HDFS path, then you could also use
 the HostInterceptor.
 
 Timestamp Interceptor
@@ -2170,6 +2576,50 @@ Example for agent named a1:
   a1.sources.r1.interceptors.i1.key = datacenter
   a1.sources.r1.interceptors.i1.value = NEW_YORK
 
+UUID Interceptor
+~~~~~~~~~~~~~~~~
+
+This interceptor sets a universally unique identifier on all events that are intercepted. An example UUID is ``b5755073-77a9-43c1-8fad-b7a586fc1b97``, which represents a 128-bit value.
+

[... 277 lines stripped ...]

