flume-commits mailing list archives

From hshreedha...@apache.org
Subject svn commit: r1596704 - in /flume/site/trunk/content/sphinx: FlumeDeveloperGuide.rst FlumeUserGuide.rst download.rst index.rst
Date Wed, 21 May 2014 22:34:24 GMT
Author: hshreedharan
Date: Wed May 21 22:34:24 2014
New Revision: 1596704

URL: http://svn.apache.org/r1596704
Log:
Flume 1.5.0 release


Modified:
    flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst
    flume/site/trunk/content/sphinx/FlumeUserGuide.rst
    flume/site/trunk/content/sphinx/download.rst
    flume/site/trunk/content/sphinx/index.rst

Modified: flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst
URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst (original)
+++ flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst Wed May 21 22:34:24 2014
@@ -15,7 +15,7 @@
 
 
 ======================================
-Flume 1.4.0 Developer Guide
+Flume 1.5.0 Developer Guide
 ======================================
 
 Introduction
@@ -166,7 +166,7 @@ RPC clients - Avro and Thrift
 As of Flume 1.4.0, Avro is the default RPC protocol.  The
 ``NettyAvroRpcClient`` and ``ThriftRpcClient`` implement the ``RpcClient``
 interface. The client needs to create this object with the host and port of
-the target Flume agent, and canthen use the ``RpcClient`` to send data into
+the target Flume agent, and can then use the ``RpcClient`` to send data into
 the agent. The following example shows how to use the Flume Client SDK API
 within a user's data-generating application:
 

Modified: flume/site/trunk/content/sphinx/FlumeUserGuide.rst
URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeUserGuide.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/FlumeUserGuide.rst (original)
+++ flume/site/trunk/content/sphinx/FlumeUserGuide.rst Wed May 21 22:34:24 2014
@@ -15,7 +15,7 @@
 
 
 ======================================
-Flume 1.4.0 User Guide
+Flume 1.5.0 User Guide
 ======================================
 
 Introduction
@@ -128,7 +128,7 @@ Setting up an agent
 -------------------
 
 Flume agent configuration is stored in a local configuration file.  This is a
-text file which has a format follows the Java properties file format.
+text file that follows the Java properties file format.
 Configurations for one or more agents can be specified in the same
 configuration file. The configuration file includes properties of each source,
 sink and channel in an agent and how they are wired together to form data
@@ -705,6 +705,8 @@ ssl                  false        Set th
 keystore             --           This is the path to a Java keystore file. Required for SSL.
 keystore-password    --           The password for the Java keystore. Required for SSL.
 keystore-type        JKS          The type of the Java keystore. This can be "JKS" or "PKCS12".
+ipFilter             false        Set this to true to enable IP filtering for Netty
+ipFilter.rules       --           Define N Netty IP filter pattern rules with this config.
 ==================   ===========  ===================================================
 
 Example for agent named a1:
@@ -718,6 +720,21 @@ Example for agent named a1:
   a1.sources.r1.bind = 0.0.0.0
   a1.sources.r1.port = 4141
 
+Example of ipFilter.rules
+
+ipFilter.rules defines N Netty IP filter rules separated by commas. Each rule must be in this format:
+
+<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>
+
+that is, allow/deny:ip/name:pattern
+
+example: ipFilter.rules=allow:ip:127.*,allow:name:localhost,deny:ip:*
+
+Note that the first rule to match is applied, as the examples below show for a client on localhost:
+
+"allow:name:localhost,deny:ip:*" will allow the client on localhost but deny clients from any other IP.
+"deny:name:localhost,allow:ip:*" will deny the client on localhost but allow clients from any other IP.
+
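+A minimal sketch (reusing the Avro source ``r1`` of agent ``a1`` from the example above) that only
+accepts connections from localhost:
+
+.. code-block:: properties
+
+  a1.sources.r1.ipFilter = true
+  a1.sources.r1.ipFilter.rules = allow:name:localhost,deny:ip:*
+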
 Thrift Source
 ~~~~~~~~~~~~~
 
@@ -929,13 +946,29 @@ Property Name         Default         De
 **spoolDir**          --              The directory from which to read files from.
 fileSuffix            .COMPLETED      Suffix to append to completely ingested files
 deletePolicy          never           When to delete completed files: ``never`` or ``immediate``
-fileHeader            false           Whether to add a header storing the filename
-fileHeaderKey         file            Header key to use when appending filename to header
+fileHeader            false           Whether to add a header storing the absolute path filename.
+fileHeaderKey         file            Header key to use when appending absolute path filename to event header.
+basenameHeader        false           Whether to add a header storing the basename of the file.
+basenameHeaderKey     basename        Header key to use when appending basename of file to event header.
 ignorePattern         ^$              Regular expression specifying which files to ignore (skip)
 trackerDir            .flumespool     Directory to store metadata related to processing of files.
                                       If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
+consumeOrder          oldest          The order in which files in the spooling directory will be consumed: ``oldest``,
+                                      ``youngest`` or ``random``. In case of ``oldest`` and ``youngest``, the last modified
+                                      time of the files will be used to compare the files. In case of a tie, the file
+                                      with the smallest lexicographical order will be consumed first. In case of ``random``, any
+                                      file will be picked randomly. When using ``oldest`` and ``youngest``, the whole
+                                      directory will be scanned to pick the oldest/youngest file, which might be slow if there
+                                      are a large number of files, while using ``random`` may cause old files to be consumed
+                                      very late if new files keep arriving in the spooling directory.
+maxBackoff            4000            The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
 batchSize             100             Granularity at which to batch transfer to the channel
 inputCharset          UTF-8           Character set used by deserializers that treat the input file as text.
+decodeErrorPolicy     ``FAIL``        What to do when we see a non-decodable character in the input file.
+                                      ``FAIL``: Throw an exception and fail to parse the file.
+                                      ``REPLACE``: Replace the unparseable character with the "replacement character" char,
+                                      typically Unicode U+FFFD.
+                                      ``IGNORE``: Drop the unparseable character sequence.
 deserializer          ``LINE``        Specify the deserializer used to parse the file into events.
                                       Defaults to parsing each line as an event. The class specified must implement
                                       ``EventDeserializer.Builder``.
@@ -960,6 +993,47 @@ Example for an agent named agent-1:
   agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
   agent-1.sources.src-1.fileHeader = true
 
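+A possible sketch (extending the example above; values are illustrative) of the new spooling directory
+source options:
+
+.. code-block:: properties
+
+  agent-1.sources.src-1.basenameHeader = true
+  agent-1.sources.src-1.consumeOrder = youngest
+  agent-1.sources.src-1.decodeErrorPolicy = REPLACE
+  agent-1.sources.src-1.maxBackoff = 8000
+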
+Twitter 1% firehose Source (experimental)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+  This source is highly experimental and may change between minor versions of Flume.
+  Use at your own risk.
+
+Experimental source that connects via the Streaming API to the 1% sample Twitter
+firehose, continuously downloads tweets, converts them to Avro format and
+sends Avro events to a downstream Flume sink. Requires the consumer and
+access tokens and secrets of a Twitter developer account.
+Required properties are in **bold**.
+
+====================== ===========  ===================================================
+Property Name          Default      Description
+====================== ===========  ===================================================
+**channels**           --
+**type**               --           The component type name, needs to be ``org.apache.flume.source.twitter.TwitterSource``
+**consumerKey**        --           OAuth consumer key
+**consumerSecret**     --           OAuth consumer secret
+**accessToken**        --           OAuth access token
+**accessTokenSecret**  --           OAuth token secret
+maxBatchSize           1000         Maximum number of Twitter messages to put in a single batch
+maxBatchDurationMillis 1000         Maximum number of milliseconds to wait before closing a batch
+====================== ===========  ===================================================
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.sources = r1
+  a1.channels = c1
+  a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource
+  a1.sources.r1.channels = c1
+  a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY
+  a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
+  a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN
+  a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
+  a1.sources.r1.maxBatchSize = 10
+  a1.sources.r1.maxBatchDurationMillis = 200
+
 Event Deserializers
 '''''''''''''''''''
 
@@ -1107,6 +1181,8 @@ Property Name    Default      Descriptio
 **host**         --           Host name or IP address to bind to
 **port**         --           Port # to bind to
 eventSize        2500         Maximum size of a single event line, in bytes
+keepFields       false        Setting this to true will preserve the Priority,
+                              Timestamp and Hostname in the body of the event.
 selector.type                 replicating or multiplexing
 selector.*       replicating  Depends on the selector.type value
 interceptors     --           Space-separated list of interceptors
@@ -1143,6 +1219,8 @@ Property Name         Default           
 **host**              --                Host name or IP address to bind to.
 **ports**             --                Space-separated list (one or more) of ports to bind to.
 eventSize             2500              Maximum size of a single event line, in bytes.
+keepFields            false             Setting this to true will preserve the
+                                        Priority, Timestamp and Hostname in the body of the event.
 portHeader            --                If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port.
 charset.default       UTF-8             Default character set used while parsing syslog events into strings.
 charset.port.<port>   --                Character set is configurable on a per-port basis.
@@ -1177,6 +1255,8 @@ Property Name   Default      Description
 **type**        --           The component type name, needs to be ``syslogudp``
 **host**        --           Host name or IP address to bind to
 **port**        --           Port # to bind to
+keepFields      false        Setting this to true will preserve the Priority,
+                             Timestamp and Hostname in the body of the event.
 selector.type                replicating or multiplexing
 selector.*      replicating  Depends on the selector.type value
 interceptors    --           Space-separated list of interceptors
@@ -1223,6 +1303,9 @@ selector.type   replicating             
 selector.*                                                    Depends on the selector.type value
 interceptors    --                                            Space-separated list of interceptors
 interceptors.*
+enableSSL       false                                         Set the property to true to enable SSL
+keystore                                                      Location of the keystore including the keystore file name
+keystorePassword                                              Keystore password
 ==================================================================================================================================
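+
+A minimal sketch (hypothetical keystore path and password) of enabling SSL on an HTTP source for agent a1:
+
+.. code-block:: properties
+
+  a1.sources.r1.enableSSL = true
+  a1.sources.r1.keystore = /path/to/http-source.keystore
+  a1.sources.r1.keystorePassword = secret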
 
 For example, a http source for agent named a1:
@@ -1397,7 +1480,7 @@ Scribe Source
 
 Scribe is another type of ingest system. To adopt existing Scribe ingest system,
 Flume should use ScribeSource based on Thrift with compatible transfering protocol.
-The deployment of Scribe please following guide from Facebook.
+For deployment of Scribe please follow the guide from Facebook.
 Required properties are in **bold**.
 
 ==============  ===========  ==============================================
@@ -1514,6 +1597,13 @@ hdfs.roundValue         1             Ro
 hdfs.roundUnit          second        The unit of the round down value - ``second``, ``minute`` or ``hour``.
 hdfs.timeZone           Local Time    Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
 hdfs.useLocalTimeStamp  false         Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
+hdfs.closeTries         0             Number of times the sink must try to close a file. If set to 1, this sink will not re-try a failed close
+                                      (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension.
+                                      If set to 0, the sink will try to close the file until the file is eventually closed
+                                      (there is no limit on the number of times it would try).
+hdfs.retryInterval      180           Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode,
+                                      so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not
+                                      attempt to close the file if the first attempt fails, and may leave the file open or with a ".tmp" extension.
 serializer              ``TEXT``      Other possible options include ``avro_event`` or the
                                       fully-qualified class name of an implementation of the
                                       ``EventSerializer.Builder`` interface.
@@ -1569,25 +1659,26 @@ hostname / port pair. The events are tak
 batches of the configured batch size.
 Required properties are in **bold**.
 
-==========================   =======  ==============================================
+==========================   =====================================================  ===========================================================================================
 Property Name                Default  Description
-==========================   =======  ==============================================
+==========================   =====================================================  ===========================================================================================
 **channel**                  --
-**type**                     --       The component type name, needs to be ``avro``.
-**hostname**                 --       The hostname or IP address to bind to.
-**port**                     --       The port # to listen on.
-batch-size                   100      number of event to batch together for send.
-connect-timeout              20000    Amount of time (ms) to allow for the first (handshake) request.
-request-timeout              20000    Amount of time (ms) to allow for requests after the first.
-reset-connection-interval    none     Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when news hosts are added without having to restart the agent.
-compression-type             none     This can be "none" or "deflate".  The compression-type must match the compression-type of matching AvroSource
-compression-level            6        The level of compression to compress event. 0 = no compression and 1-9 is compression.  The higher the number the more compression
-ssl                          false    Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a "truststore", "truststore-password", "truststore-type", and specify whether to "trust-all-certs".
-trust-all-certs              false    If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and "listen in" on the encrypted connection.
-truststore                   --       The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically "jssecacerts" or "cacerts" in the Oracle JRE) will be used.
-truststore-password          --       The password for the specified truststore.
-truststore-type              JKS      The type of the Java truststore. This can be "JKS" or other supported Java truststore type.
-==========================   =======  ==============================================
+**type**                     --                                                     The component type name, needs to be ``avro``.
+**hostname**                 --                                                     The hostname or IP address to bind to.
+**port**                     --                                                     The port # to listen on.
+batch-size                   100                                                    Number of events to batch together for sending.
+connect-timeout              20000                                                  Amount of time (ms) to allow for the first (handshake) request.
+request-timeout              20000                                                  Amount of time (ms) to allow for requests after the first.
+reset-connection-interval    none                                                   Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent.
+compression-type             none                                                   This can be "none" or "deflate".  The compression-type must match the compression-type of matching AvroSource
+compression-level            6                                                      The level of compression to compress events. 0 = no compression and 1-9 is compression.  The higher the number, the more compression
+ssl                          false                                                  Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a "truststore", "truststore-password", "truststore-type", and specify whether to "trust-all-certs".
+trust-all-certs              false                                                  If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and "listen in" on the encrypted connection.
+truststore                   --                                                     The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically "jssecacerts" or "cacerts" in the Oracle JRE) will be used.
+truststore-password          --                                                     The password for the specified truststore.
+truststore-type              JKS                                                    The type of the Java truststore. This can be "JKS" or other supported Java truststore type.
+maxIoWorkers                 2 * the number of available processors in the machine  The maximum number of I/O worker threads. This is configured on the NettyAvroRpcClient NioClientSocketChannelFactory.
+==========================   =====================================================  ===========================================================================================
 
 Example for agent named a1:
 
@@ -1760,7 +1851,11 @@ Property Name       Default             
 **type**            --                                                      The component type name, needs to be ``hbase``
 **table**           --                                                      The name of the table in Hbase to write to.
 **columnFamily**    --                                                      The column family in Hbase to write to.
+zookeeperQuorum     --                                                      The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml
+znodeParent         /hbase                                                  The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml
 batchSize           100                                                     Number of events to be written per txn.
+coalesceIncrements  false                                                   Should the sink coalesce multiple increments to a cell per batch. This might give
+                                                                            better performance if there are multiple increments to a limited number of cells.
 serializer          org.apache.flume.sink.hbase.SimpleHbaseEventSerializer  Default increment column = "iCol", payload column = "pCol".
 serializer.*        --                                                      Properties to be passed to the serializer.
 kerberosPrincipal   --                                                      Kerberos user principal for accessing secure HBase
@@ -1783,30 +1878,32 @@ AsyncHBaseSink
 ''''''''''''''
 
 This sink writes data to HBase using an asynchronous model. A class implementing
-AsyncHbaseEventSerializer
-which is specified by the configuration is used to convert the events into
+AsyncHbaseEventSerializer which is specified by the configuration is used to convert the events into
 HBase puts and/or increments. These puts and increments are then written
-to HBase. This sink provides the same consistency guarantees as HBase,
+to HBase. This sink uses the `Asynchbase API <https://github.com/OpenTSDB/asynchbase>`_ to write to
+HBase. This sink provides the same consistency guarantees as HBase,
 which is currently row-wise atomicity. In the event of Hbase failing to
 write certain events, the sink will replay all events in that transaction.
 The type is the FQCN: org.apache.flume.sink.hbase.AsyncHBaseSink.
 Required properties are in **bold**.
 
-================  ============================================================  ====================================================================================
-Property Name     Default                                                       Description
-================  ============================================================  ====================================================================================
-**channel**       --
-**type**          --                                                            The component type name, needs to be ``asynchbase``
-**table**         --                                                            The name of the table in Hbase to write to.
-zookeeperQuorum   --                                                            The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml
-znodeParent       /hbase                                                        The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml
-**columnFamily**  --                                                            The column family in Hbase to write to.
-batchSize         100                                                           Number of events to be written per txn.
-timeout           60000                                                         The length of time (in milliseconds) the sink waits for acks from hbase for
-                                                                                all events in a transaction.
-serializer        org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
-serializer.*      --                                                            Properties to be passed to the serializer.
-================  ============================================================  ====================================================================================
+===================  ============================================================  ====================================================================================
+Property Name        Default                                                       Description
+===================  ============================================================  ====================================================================================
+**channel**          --
+**type**             --                                                            The component type name, needs to be ``asynchbase``
+**table**            --                                                            The name of the table in Hbase to write to.
+zookeeperQuorum      --                                                            The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml
+znodeParent          /hbase                                                        The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml
+**columnFamily**     --                                                            The column family in Hbase to write to.
+batchSize            100                                                           Number of events to be written per txn.
+coalesceIncrements   false                                                         Should the sink coalesce multiple increments to a cell per batch. This might give
+                                                                                   better performance if there are multiple increments to a limited number of cells.
+timeout              60000                                                         The length of time (in milliseconds) the sink waits for acks from hbase for
+                                                                                   all events in a transaction.
+serializer           org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
+serializer.*         --                                                            Properties to be passed to the serializer.
+===================  ============================================================  ====================================================================================
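+
+A possible sketch (table name, column family and quorum hosts are illustrative) for an agent named a1:
+
+.. code-block:: properties
+
+  a1.sinks = k1
+  a1.sinks.k1.type = asynchbase
+  a1.sinks.k1.table = flume_events
+  a1.sinks.k1.columnFamily = cf
+  a1.sinks.k1.zookeeperQuorum = zk1,zk2,zk3
+  a1.sinks.k1.coalesceIncrements = true
+  a1.sinks.k1.channel = c1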
 
 Note that this sink takes the Zookeeper Quorum and parent znode information in
 the configuration. Zookeeper Quorum and parent node configuration may be
@@ -1835,7 +1932,7 @@ This sink extracts data from Flume event
 
 This sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to Search applications.
 
-The ETL functionality is customizable using a `morphline configuration file <http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that defines a chain of transformation commands that pipe event records from one command to another. 
+The ETL functionality is customizable using a `morphline configuration file <http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html>`_ that defines a chain of transformation commands that pipe event records from one command to another. 
 
 Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline command is a bit like a Flume Interceptor. Morphlines can be embedded into Hadoop components such as Flume.
 
@@ -1915,7 +2012,10 @@ indexType         logs                  
 clusterName       elasticsearch                                                            Name of the ElasticSearch cluster to connect to
 batchSize         100                                                                      Number of events to be written per txn.
 ttl               --                                                                       TTL in days, when set will cause the expired documents to be deleted automatically,
-                                                                                           if not set documents will never be automatically deleted
+                                                                                           if not set documents will never be automatically deleted. TTL is accepted both in the earlier form of
+                                                                                           integer only e.g. a1.sinks.k1.ttl = 5 and also with a qualifier ms (millisecond), s (second), m (minute),
+                                                                                           h (hour), d (day) and w (week). Example a1.sinks.k1.ttl = 5d will set TTL to 5 days. Follow
+                                                                                           http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for more information.
 serializer        org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of
                                                                                            either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred.
 serializer.*      --                                                                       Properties to be passed to the serializer.
@@ -1933,10 +2033,50 @@ Example for agent named a1:
   a1.sinks.k1.indexType = bar_type
   a1.sinks.k1.clusterName = foobar_cluster
   a1.sinks.k1.batchSize = 500
-  a1.sinks.k1.ttl = 5
+  a1.sinks.k1.ttl = 5d
   a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
   a1.sinks.k1.channel = c1
 
+Kite Dataset Sink (experimental)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+  This source is experimental and may change between minor versions of Flume.
+  Use at your own risk.
+
+Experimental sink that writes events to a `Kite Dataset <http://kitesdk.org/docs/current/kite-data/guide.html>`_.
+This sink will deserialize the body of each incoming event and store the
+resulting record in a Kite Dataset. It determines target Dataset by opening a
+repository URI, ``kite.repo.uri``, and loading a Dataset by name,
+``kite.dataset.name``.
+
+The only supported serialization is avro, and the record schema must be passed
+in the event headers, using either ``flume.avro.schema.literal`` with the JSON
+schema representation or ``flume.avro.schema.url`` with a URL where the schema
+may be found (``hdfs:/...`` URIs are supported). This is compatible with the
+Log4jAppender flume client and the spooling directory source's Avro
+deserializer using ``deserializer.schemaType = LITERAL``.
+
+Note 1: The ``flume.avro.schema.hash`` header is **not supported**.
+Note 2: In some cases, file rolling may occur slightly after the roll interval
+has been exceeded. However, this delay will not exceed 5 seconds. In most
+cases, the delay is negligible.
+
+=======================  =======  ===========================================================
+Property Name            Default  Description
+=======================  =======  ===========================================================
+**channel**              --
+**type**                 --       Must be org.apache.flume.sink.kite.DatasetSink
+**kite.repo.uri**        --       URI of the repository to open
+**kite.dataset.name**    --       Name of the Dataset where records will be written
+kite.batchSize           100      Number of records to process in each batch
+kite.rollInterval        30       Maximum wait time (seconds) before data files are released
+auth.kerberosPrincipal   --       Kerberos user principal for secure authentication to HDFS
+auth.kerberosKeytab      --       Kerberos keytab location (local FS) for the principal
+auth.proxyUser           --       The effective user for HDFS actions, if different from
+                                  the kerberos principal
+=======================  =======  ===========================================================
+
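+A possible sketch (the repository URI and dataset name are illustrative placeholders) for an agent named a1:
+
+.. code-block:: properties
+
+  a1.sinks = k1
+  a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
+  a1.sinks.k1.kite.repo.uri = repo:hdfs://namenode:8020/data
+  a1.sinks.k1.kite.dataset.name = events
+  a1.sinks.k1.kite.batchSize = 100
+  a1.sinks.k1.channel = c1
+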
 Custom Sink
 ~~~~~~~~~~~
 
@@ -2059,15 +2199,13 @@ Property Name         Default           
 checkpointDir                                     ~/.flume/file-channel/checkpoint  The directory where checkpoint file will be stored
 useDualCheckpoints                                false                             Backup the checkpoint. If this is set to ``true``, ``backupCheckpointDir`` **must** be set
 backupCheckpointDir                               --                                The directory where the checkpoint is backed up to. This directory **must not** be the same as the data directories or the checkpoint directory
-dataDirs                                          ~/.flume/file-channel/data        The directory where log files will be stored
-transactionCapacity                               1000                              The maximum size of transaction supported by the channel
+dataDirs                                          ~/.flume/file-channel/data        Comma-separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance
+transactionCapacity                               10000                             The maximum size of transaction supported by the channel
 checkpointInterval                                30000                             Amount of time (in millis) between checkpoints
 maxFileSize                                       2146435071                        Max size (in bytes) of a single log file
 minimumRequiredSpace                              524288000                         Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
 capacity                                          1000000                           Maximum capacity of the channel
 keep-alive                                        3                                 Amount of time (in sec) to wait for a put operation
-write-timeout                                     3                                 Amount of time (in sec) to wait for a write operation
-checkpoint-timeout                                600                               Expert: Amount of time (in sec) to wait for a checkpoint
 use-log-replay-v1                                 false                             Expert: Use old replay logic
 use-fast-replay                                   false                             Expert: Replay without using queue
 encryption.activeKey                              --                                Key name used to encrypt new data
@@ -2155,6 +2293,80 @@ The same scenerio as above, however key-
   a1.channels.c1.encryption.keyProvider.keys.key-0.passwordFile = /path/to/key-0.password
 
 
+Spillable Memory Channel
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The events are stored in an in-memory queue and on disk. The in-memory queue serves as the primary store and the disk as overflow.
+The disk store is managed using an embedded File channel. When the in-memory queue is full, additional incoming events are stored in
+the file channel. This channel is ideal for flows that need the high throughput of the memory channel during normal operation, but at the
+same time need the larger capacity of the file channel for better tolerance of intermittent sink-side outages or drops in drain rates.
+The throughput will drop to approximately file channel speeds during such abnormal situations. In case of an agent crash or restart,
+only the events stored on disk are recovered when the agent comes online. **This channel is currently experimental and 
+not recommended for use in production.**
+
+Required properties are in **bold**. Please refer to file channel for additional required properties.
+
+============================  ================  =============================================================================================
+Property Name                 Default           Description
+============================  ================  =============================================================================================
+**type**                      --                The component type name, needs to be ``SPILLABLEMEMORY``
+memoryCapacity                10000             Maximum number of events stored in memory queue. To disable use of in-memory queue, set this to zero.
+overflowCapacity              100000000         Maximum number of events stored in overflow disk (i.e. File channel). To disable use of overflow, set this to zero.
+overflowTimeout               3                 The number of seconds to wait before enabling disk overflow when memory fills up.
+byteCapacityBufferPercentage  20                Defines the percent of buffer between byteCapacity and the estimated total size
+                                                of all events in the channel, to account for data in headers. See below.
+byteCapacity                  see description   Maximum **bytes** of memory allowed as a sum of all events in the memory queue.
+                                                The implementation only counts the Event ``body``, which is the reason for
+                                                providing the ``byteCapacityBufferPercentage`` configuration parameter as well.
+                                                Defaults to a computed value equal to 80% of the maximum memory available to
+                                                the JVM (i.e. 80% of the -Xmx value passed on the command line).
+                                                Note that if you have multiple memory channels on a single JVM, and they happen
+                                                to hold the same physical events (i.e. if you are using a replicating channel
+                                                selector from a single source) then those event sizes may be double-counted for
+                                                channel byteCapacity purposes.
+                                                Setting this value to ``0`` will cause this value to fall back to a hard
+                                                internal limit of about 200 GB.
+avgEventSize                  500               Estimated average size of events, in bytes, going into the channel
+<file channel properties>     see file channel  Any file channel property with the exception of 'keep-alive' and 'capacity' can be used.
+                                                The keep-alive of file channel is managed by Spillable Memory Channel. Use 'overflowCapacity'
+                                                to set the File channel's capacity.
+============================  ================  =============================================================================================
+
+The in-memory queue is considered full if either the memoryCapacity or byteCapacity limit is reached.
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.channels = c1
+  a1.channels.c1.type = SPILLABLEMEMORY
+  a1.channels.c1.memoryCapacity = 10000
+  a1.channels.c1.overflowCapacity = 1000000
+  a1.channels.c1.byteCapacity = 800000
+  a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
+  a1.channels.c1.dataDirs = /mnt/flume/data
+
+To disable the use of the in-memory queue and function like a file channel:
+
+.. code-block:: properties
+
+  a1.channels = c1
+  a1.channels.c1.type = SPILLABLEMEMORY
+  a1.channels.c1.memoryCapacity = 0
+  a1.channels.c1.overflowCapacity = 1000000
+  a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
+  a1.channels.c1.dataDirs = /mnt/flume/data
+
+
+To disable the use of overflow disk and function purely as an in-memory channel:
+
+.. code-block:: properties
+
+  a1.channels = c1
+  a1.channels.c1.type = SPILLABLEMEMORY
+  a1.channels.c1.memoryCapacity = 100000
+
+
 Pseudo Transaction Channel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -2595,7 +2807,7 @@ prefix            ""       The prefix st
 Morphline Interceptor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-This interceptor filters the events through a `morphline configuration file <http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that defines a chain of transformation commands that pipe records from one command to another.
+This interceptor filters the events through a `morphline configuration file <http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html>`_ that defines a chain of transformation commands that pipe records from one command to another.
 For example the morphline can ignore certain events or alter or insert certain event headers via regular expression based pattern matching, or it can auto-detect and set a MIME type via Apache Tika on events that are intercepted. For example, this kind of packet sniffing can be used for content based dynamic routing in a Flume topology.
 MorphlineInterceptor can also help to implement dynamic routing to multiple Apache Solr collections (e.g. for multi-tenancy).
 
@@ -2671,11 +2883,11 @@ If the Flume event body contained ``1:2:
 
 .. code-block:: properties
 
-  agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
-  agent.sources.r1.interceptors.i1.serializers = s1 s2 s3
-  agent.sources.r1.interceptors.i1.serializers.s1.name = one
-  agent.sources.r1.interceptors.i1.serializers.s2.name = two
-  agent.sources.r1.interceptors.i1.serializers.s3.name = three
+  a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
+  a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
+  a1.sources.r1.interceptors.i1.serializers.s1.name = one
+  a1.sources.r1.interceptors.i1.serializers.s2.name = two
+  a1.sources.r1.interceptors.i1.serializers.s3.name = three
 
 The extracted event will contain the same body but the following headers will have been added ``one=>1, two=>2, three=>3``
 
@@ -2686,11 +2898,11 @@ If the Flume event body contained ``2012
 
 .. code-block:: properties
 
-  agent.sources.r1.interceptors.i1.regex = ^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d)
-  agent.sources.r1.interceptors.i1.serializers = s1
-  agent.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
-  agent.sources.r1.interceptors.i1.serializers.s1.name = timestamp
-  agent.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm
+  a1.sources.r1.interceptors.i1.regex = ^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d)
+  a1.sources.r1.interceptors.i1.serializers = s1
+  a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
+  a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
+  a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm
 
 the extracted event will contain the same body but the following headers will have been added ``timestamp=>1350611220000``
 
@@ -2731,21 +2943,21 @@ Log4J Appender
 
 Appends Log4j events to a flume agent's avro source. A client using this
 appender must have the flume-ng-sdk in the classpath (eg,
-flume-ng-sdk-1.4.0.jar).
+flume-ng-sdk-1.5.0.jar).
 Required properties are in **bold**.
 
-=====================  =======  ==============================================================
+=====================  =======  ==================================================================================
 Property Name          Default  Description
-=====================  =======  ==============================================================
+=====================  =======  ==================================================================================
 **Hostname**           --       The hostname on which a remote Flume agent is running with an
                                 avro source.
 **Port**               --       The port at which the remote Flume agent's avro source is
                                 listening.
 UnsafeMode             false    If true, the appender will not throw exceptions on failure to
                                 send the events.
-AvroReflectionEnabled  false    Use Avro Reflection to serialize Log4j events.
+AvroReflectionEnabled  false    Use Avro Reflection to serialize Log4j events. (Do not use when users log strings)
 AvroSchemaUrl          --       A URL from which the Avro schema can be retrieved.
-=====================  =======  ==============================================================
+=====================  =======  ==================================================================================
 
 Sample log4j.properties file:
 
@@ -2795,7 +3007,7 @@ Load Balancing Log4J Appender
 
 Appends Log4j events to a list of flume agent's avro source. A client using this
 appender must have the flume-ng-sdk in the classpath (eg,
-flume-ng-sdk-1.4.0.jar). This appender supports a round-robin and random
+flume-ng-sdk-1.5.0.jar). This appender supports a round-robin and random
 scheme for performing the load balancing. It also supports a configurable backoff
 timeout so that down agents are removed temporarily from the set of hosts
 Required properties are in **bold**.
@@ -2883,9 +3095,9 @@ and can be specified in the flume-env.sh
 Property Name            Default  Description
 =======================  =======  =====================================================================================
 **type**                 --       The component type name, has to be ``ganglia``
-**hosts**                --       Comma-separated list of ``hostname:port``
-pollInterval             60       Time, in seconds, between consecutive reporting to ganglia server
-isGanglia3               false    Ganglia server version is 3. By default, Flume sends in ganglia 3.1 format
+**hosts**                --       Comma-separated list of ``hostname:port`` of Ganglia servers
+pollFrequency            60       Time, in seconds, between consecutive reporting to Ganglia server
+isGanglia3               false    Ganglia server version is 3. By default, Flume sends in Ganglia 3.1 format
 =======================  =======  =====================================================================================
 
 We can start Flume with Ganglia support as follows::
@@ -2936,7 +3148,7 @@ Property Name            Default  Descri
 port                     41414    The port to start the server on.
 =======================  =======  =====================================================================================
 
-We can start Flume with Ganglia support as follows::
+We can start Flume with JSON Reporting support as follows::
 
   $ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
 

Modified: flume/site/trunk/content/sphinx/download.rst
URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/download.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/download.rst (original)
+++ flume/site/trunk/content/sphinx/download.rst Wed May 21 22:34:24 2014
@@ -12,8 +12,8 @@ originals on the main distribution serve
    :header: "", "Mirrors", "Checksum", "Signature"
    :widths: 25, 25, 25, 25
 
-   "Apache Flume binary (tar.gz)",  `apache-flume-1.4.0-bin.tar.gz <http://www.apache.org/dyn/closer.cgi/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz>`_, `apache-flume-1.4.0-bin.tar.gz.md5 <http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz.md5>`_, `apache-flume-1.4.0-bin.tar.gz.asc <http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz.asc>`_
-   "Apache Flume source (tar.gz)",  `apache-flume-1.4.0-src.tar.gz <http://www.apache.org/dyn/closer.cgi/flume/1.4.0/apache-flume-1.4.0-src.tar.gz>`_, `apache-flume-1.4.0-src.tar.gz.md5 <http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-src.tar.gz.md5>`_, `apache-flume-1.4.0-src.tar.gz.asc <http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-src.tar.gz.asc>`_
+   "Apache Flume binary (tar.gz)",  `apache-flume-1.5.0-bin.tar.gz <http://www.apache.org/dyn/closer.cgi/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz>`_, `apache-flume-1.5.0-bin.tar.gz.md5 <http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz.md5>`_, `apache-flume-1.5.0-bin.tar.gz.asc <http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz.asc>`_
+   "Apache Flume source (tar.gz)",  `apache-flume-1.5.0-src.tar.gz <http://www.apache.org/dyn/closer.cgi/flume/1.5.0/apache-flume-1.5.0-src.tar.gz>`_, `apache-flume-1.5.0-src.tar.gz.md5 <http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-src.tar.gz.md5>`_, `apache-flume-1.5.0-src.tar.gz.asc <http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-src.tar.gz.asc>`_
 
 It is essential that you verify the integrity of the downloaded files using the PGP or MD5 signatures. Please read
 `Verifying Apache HTTP Server Releases <http://httpd.apache.org/dev/verification.html>`_ for more information on
@@ -25,9 +25,9 @@ as well as the asc signature file for th
 Then verify the signatures using::
 
     % gpg --import KEYS
-    % gpg --verify apache-flume-1.4.0-src.tar.gz.asc
+    % gpg --verify apache-flume-1.5.0-src.tar.gz.asc
 
-Apache Flume 1.4.0 is signed by Mike Percy 66F2054B
+Apache Flume 1.5.0 is signed by Hari Shreedharan 77FFC9AB
 
 Alternatively, you can verify the MD5 or SHA1 signatures of the files. A program called md5, md5sum, or shasum is included in many
 Unix distributions for this purpose.

Modified: flume/site/trunk/content/sphinx/index.rst
URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/index.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/index.rst (original)
+++ flume/site/trunk/content/sphinx/index.rst Wed May 21 22:34:24 2014
@@ -33,6 +33,36 @@ application.
 
 .. raw:: html
 
+   <h3>May 20, 2014 - Apache Flume 1.5.0 Released</h3>
+
+The Apache Flume team is pleased to announce the release of Flume 1.5.0.
+
+Flume is a distributed, reliable, and available service for efficiently
+collecting, aggregating, and moving large amounts of streaming event data.
+
+Version 1.5.0 is the fifth Flume release as an Apache top-level project.
+Flume 1.5.0 is stable, production-ready software, and is backwards-compatible
+with previous versions of the Flume 1.x codeline.
+
+Several months of active development went into this release: 123 patches were committed since 1.4.0, representing many features, enhancements, and bug fixes. While the full change log can be found on the 1.5.0 release page (link below), here are a few new feature highlights:
+
+* New in-memory channel that can spill to disk
+* A new dataset sink that uses the Kite API to write data to HDFS and HBase
+* Support for Elastic Search HTTP API in Elastic Search Sink
+* Much faster replay in the File Channel.
+
+The full change log and documentation are available on the
+`Flume 1.5.0 release page <releases/1.5.0.html>`__.
+
+This release can be downloaded from the Flume `Download <download.html>`__ page.
+
+Your contributions, feedback, help and support make Flume better!
+For more information on how to report problems or contribute,
+please visit our `Get Involved <getinvolved.html>`__ page.
+
+The Apache Flume Team
+
+
    <h3>July 2, 2013 - Apache Flume 1.4.0 Released</h3>
 
 The Apache Flume team is pleased to announce the release of Flume 1.4.0.


