Subject: svn commit: r1596704 - in /flume/site/trunk/content/sphinx: FlumeDeveloperGuide.rst FlumeUserGuide.rst download.rst index.rst
Date: Wed, 21 May 2014 22:34:24 -0000
To: commits@flume.apache.org
From: hshreedharan@apache.org
Message-Id: <20140521223424.AB4892388868@eris.apache.org>

Author: hshreedharan
Date: Wed May 21 22:34:24 2014
New Revision: 1596704

URL: http://svn.apache.org/r1596704
Log: Flume 1.5.0 release

Modified:
    flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst
    flume/site/trunk/content/sphinx/FlumeUserGuide.rst
    flume/site/trunk/content/sphinx/download.rst
    flume/site/trunk/content/sphinx/index.rst

Modified: flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst
URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst (original)
+++ flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst Wed May 21 22:34:24 2014
@@ -15,7 +15,7 @@ ====================================== -Flume 1.4.0 Developer Guide +Flume 1.5.0 Developer Guide ====================================== Introduction
@@ -166,7 +166,7 @@ RPC clients - Avro and Thrift As of Flume 1.4.0, Avro is the default RPC protocol. The ``NettyAvroRpcClient`` and ``ThriftRpcClient`` implement the ``RpcClient`` interface. The client needs to create this object with the host and port of -the target Flume agent, and canthen use the ``RpcClient`` to send data into +the target Flume agent, and can then use the ``RpcClient`` to send data into the agent.
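A minimal sketch of that pattern, assuming the SDK's public ``org.apache.flume.api.RpcClientFactory`` and ``org.apache.flume.event.EventBuilder`` helpers and a placeholder host/port (an illustration, not the guide's own listing):

.. code-block:: java

    import java.nio.charset.Charset;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class MyFlumeRpcExample {
      public static void main(String[] args) throws EventDeliveryException {
        // Connect to the Avro source of the target agent; host and port are placeholders.
        RpcClient client = RpcClientFactory.getDefaultInstance("example.com", 41414);
        try {
          // Wrap a string payload in a Flume event and send it to the agent.
          Event event = EventBuilder.withBody("hello flume", Charset.forName("UTF-8"));
          client.append(event);
        } finally {
          // Always release the connection when done.
          client.close();
        }
      }
    }

The same ``RpcClient`` interface is used whether the underlying transport is Avro or Thrift.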
The following example shows how to use the Flume Client SDK API within a user's data-generating application: Modified: flume/site/trunk/content/sphinx/FlumeUserGuide.rst URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeUserGuide.rst?rev=1596704&r1=1596703&r2=1596704&view=diff ============================================================================== --- flume/site/trunk/content/sphinx/FlumeUserGuide.rst (original) +++ flume/site/trunk/content/sphinx/FlumeUserGuide.rst Wed May 21 22:34:24 2014 @@ -15,7 +15,7 @@ ====================================== -Flume 1.4.0 User Guide +Flume 1.5.0 User Guide ====================================== Introduction @@ -128,7 +128,7 @@ Setting up an agent ------------------- Flume agent configuration is stored in a local configuration file. This is a -text file which has a format follows the Java properties file format. +text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data @@ -705,6 +705,8 @@ ssl false Set th keystore -- This is the path to a Java keystore file. Required for SSL. keystore-password -- The password for the Java keystore. Required for SSL. keystore-type JKS The type of the Java keystore. This can be "JKS" or "PKCS12". +ipFilter false Set this to true to enable ipFiltering for netty +ipFilter.rules -- Define N netty ipFilter pattern rules with this config. ================== =========== =================================================== Example for agent named a1: @@ -718,6 +720,21 @@ Example for agent named a1: a1.sources.r1.bind = 0.0.0.0 a1.sources.r1.port = 4141 +Example of ipFilter.rules + +ipFilter.rules defines N netty ipFilter pattern rules separated by a comma. A pattern rule must be in this format: + +<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern> +or +allow/deny:ip/name:pattern + +example: ipFilter.rules=allow:ip:127.*,allow:name:localhost,deny:ip:* + +Note that the first rule to match will apply, as the examples below show for a client on the localhost. + +This will allow the client on localhost and deny clients from any other ip: "allow:name:localhost,deny:ip:*" +This will deny the client on localhost and allow clients from any other ip: "deny:name:localhost,allow:ip:*" + Thrift Source ~~~~~~~~~~~~~ @@ -929,13 +946,29 @@ Property Name Default De **spoolDir** -- The directory from which to read files from. fileSuffix .COMPLETED Suffix to append to completely ingested files deletePolicy never When to delete completed files: ``never`` or ``immediate`` -fileHeader false Whether to add a header storing the filename -fileHeaderKey file Header key to use when appending filename to header +fileHeader false Whether to add a header storing the absolute path filename. +fileHeaderKey file Header key to use when appending absolute path filename to event header. +basenameHeader false Whether to add a header storing the basename of the file. +basenameHeaderKey basename Header key to use when appending basename of file to event header. ignorePattern ^$ Regular expression specifying which files to ignore (skip) trackerDir .flumespool Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir. +consumeOrder oldest In which order files in the spooling directory will be consumed ``oldest``, + ``youngest`` and ``random``.
In case of ``oldest`` and ``youngest``, the last modified + time of the files will be used to compare the files. In case of a tie, the file + with smallest lexicographical order will be consumed first. In case of ``random`` any + file will be picked randomly. When using ``oldest`` and ``youngest`` the whole + directory will be scanned to pick the oldest/youngest file, which might be slow if there + are a large number of files, while using ``random`` may cause old files to be consumed + very late if new files keep coming in the spooling directory. +maxBackoff 4000 The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter. batchSize 100 Granularity at which to batch transfer to the channel inputCharset UTF-8 Character set used by deserializers that treat the input file as text. +decodeErrorPolicy ``FAIL`` What to do when we see a non-decodable character in the input file. + ``FAIL``: Throw an exception and fail to parse the file. + ``REPLACE``: Replace the unparseable character with the "replacement character" char, + typically Unicode U+FFFD. + ``IGNORE``: Drop the unparseable character sequence. deserializer ``LINE`` Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement ``EventDeserializer.Builder``. @@ -960,6 +993,47 @@ Example for an agent named agent-1: agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool agent-1.sources.src-1.fileHeader = true +Twitter 1% firehose Source (experimental) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: + This source is highly experimental and may change between minor versions of Flume. + Use at your own risk. + +Experimental source that connects via Streaming API to the 1% sample twitter +firehose, continuously downloads tweets, converts them to Avro format and +sends Avro events to a downstream Flume sink. Requires the consumer and +access tokens and secrets of a Twitter developer account. +Required properties are in **bold**. + +====================== =========== =================================================== +Property Name Default Description +====================== =========== =================================================== +**channels** -- +**type** -- The component type name, needs to be ``org.apache.flume.source.twitter.TwitterSource`` +**consumerKey** -- OAuth consumer key +**consumerSecret** -- OAuth consumer secret +**accessToken** -- OAuth access token +**accessTokenSecret** -- OAuth token secret +maxBatchSize 1000 Maximum number of twitter messages to put in a single batch +maxBatchDurationMillis 1000 Maximum number of milliseconds to wait before closing a batch +====================== =========== =================================================== + +Example for agent named a1: + +..
code-block:: properties + + a1.sources = r1 + a1.channels = c1 + a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource + a1.sources.r1.channels = c1 + a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY + a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET + a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN + a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET + a1.sources.r1.maxBatchSize = 10 + a1.sources.r1.maxBatchDurationMillis = 200 + Event Deserializers ''''''''''''''''''' @@ -1107,6 +1181,8 @@ Property Name Default Descriptio **host** -- Host name or IP address to bind to **port** -- Port # to bind to eventSize 2500 Maximum size of a single event line, in bytes +keepFields false Setting this to true will preserve the Priority, + Timestamp and Hostname in the body of the event. selector.type replicating or multiplexing selector.* replicating Depends on the selector.type value interceptors -- Space-separated list of interceptors @@ -1143,6 +1219,8 @@ Property Name Default **host** -- Host name or IP address to bind to. **ports** -- Space-separated list (one or more) of ports to bind to. eventSize 2500 Maximum size of a single event line, in bytes. +keepFields false Setting this to true will preserve the + Priority, Timestamp and Hostname in the body of the event. portHeader -- If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port. charset.default UTF-8 Default character set used while parsing syslog events into strings. charset.port. -- Character set is configurable on a per-port basis. @@ -1177,6 +1255,8 @@ Property Name Default Description **type** -- The component type name, needs to be ``syslogudp`` **host** -- Host name or IP address to bind to **port** -- Port # to bind to +keepFields false Setting this to true will preserve the Priority, + Timestamp and Hostname in the body of the event. selector.type replicating or multiplexing selector.* replicating Depends on the selector.type value interceptors -- Space-separated list of interceptors @@ -1223,6 +1303,9 @@ selector.type replicating selector.* Depends on the selector.type value interceptors -- Space-separated list of interceptors interceptors.* +enableSSL false Set the property to true to enable SSL +keystore Location of the keystore including keystore file name +keystorePassword Keystore password ================================================================================================================================== For example, a http source for agent named a1: @@ -1397,7 +1480,7 @@ Scribe Source Scribe is another type of ingest system. To adopt existing Scribe ingest system, Flume should use ScribeSource based on Thrift with compatible transfering protocol. -The deployment of Scribe please following guide from Facebook. +For deployment of Scribe please follow the guide from Facebook. Required properties are in **bold**. ============== =========== ============================================== @@ -1514,6 +1597,13 @@ hdfs.roundValue 1 Ro hdfs.roundUnit second The unit of the round down value - ``second``, ``minute`` or ``hour``. hdfs.timeZone Local Time Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. hdfs.useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
+hdfs.closeTries 0 Number of times the sink must try to close a file. If set to 1, this sink will not re-try a failed close + (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. + If set to 0, the sink will try to close the file until the file is eventually closed + (there is no limit on the number of times it would try). +hdfs.retryInterval 180 Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, + so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not + attempt to close the file if the first attempt fails, and may leave the file open or with a ".tmp" extension. serializer ``TEXT`` Other possible options include ``avro_event`` or the fully-qualified class name of an implementation of the ``EventSerializer.Builder`` interface. @@ -1569,25 +1659,26 @@ hostname / port pair. The events are tak batches of the configured batch size. Required properties are in **bold**. -========================== ======= ============================================== +========================== ===================================================== =========================================================================================== Property Name Default Description -========================== ======= ============================================== +========================== ===================================================== =========================================================================================== **channel** -- -**type** -- The component type name, needs to be ``avro``. -**hostname** -- The hostname or IP address to bind to. -**port** -- The port # to listen on. -batch-size 100 number of event to batch together for send. -connect-timeout 20000 Amount of time (ms) to allow for the first (handshake) request. -request-timeout 20000 Amount of time (ms) to allow for requests after the first. -reset-connection-interval none Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when news hosts are added without having to restart the agent. -compression-type none This can be "none" or "deflate". The compression-type must match the compression-type of matching AvroSource -compression-level 6 The level of compression to compress event. 0 = no compression and 1-9 is compression. The higher the number the more compression -ssl false Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a "truststore", "truststore-password", "truststore-type", and specify whether to "trust-all-certs". -trust-all-certs false If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and "listen in" on the encrypted connection. -truststore -- The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically "jssecacerts" or "cacerts" in the Oracle JRE) will be used. -truststore-password -- The password for the specified truststore. 
-truststore-type JKS The type of the Java truststore. This can be "JKS" or other supported Java truststore type. -========================== ======= ============================================== +**type** -- The component type name, needs to be ``avro``. +**hostname** -- The hostname or IP address to bind to. +**port** -- The port # to listen on. +batch-size 100 number of events to batch together for send. +connect-timeout 20000 Amount of time (ms) to allow for the first (handshake) request. +request-timeout 20000 Amount of time (ms) to allow for requests after the first. +reset-connection-interval none Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent. +compression-type none This can be "none" or "deflate". The compression-type must match the compression-type of matching AvroSource +compression-level 6 The level of compression to compress events. 0 = no compression and 1-9 is compression. The higher the number the more compression +ssl false Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a "truststore", "truststore-password", "truststore-type", and specify whether to "trust-all-certs". +trust-all-certs false If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and "listen in" on the encrypted connection. +truststore -- The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically "jssecacerts" or "cacerts" in the Oracle JRE) will be used. +truststore-password -- The password for the specified truststore. +truststore-type JKS The type of the Java truststore. This can be "JKS" or other supported Java truststore type. +maxIoWorkers 2 * the number of available processors in the machine The maximum number of I/O worker threads. This is configured on the NettyAvroRpcClient NioClientSocketChannelFactory. +========================== ===================================================== =========================================================================================== Example for agent named a1: @@ -1760,7 +1851,11 @@ Property Name Default **type** -- The component type name, needs to be ``hbase`` **table** -- The name of the table in Hbase to write to. **columnFamily** -- The column family in Hbase to write to. +zookeeperQuorum -- The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml +znodeParent /hbase The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml batchSize 100 Number of events to be written per txn. +coalesceIncrements false Should the sink coalesce multiple increments to a cell per batch. This might give + better performance if there are multiple increments to a limited number of cells. serializer org.apache.flume.sink.hbase.SimpleHbaseEventSerializer Default increment column = "iCol", payload column = "pCol". serializer.* -- Properties to be passed to the serializer.
kerberosPrincipal -- Kerberos user principal for accessing secure HBase @@ -1783,30 +1878,32 @@ AsyncHBaseSink '''''''''''''' This sink writes data to HBase using an asynchronous model. A class implementing -AsyncHbaseEventSerializer -which is specified by the configuration is used to convert the events into +AsyncHbaseEventSerializer which is specified by the configuration is used to convert the events into HBase puts and/or increments. These puts and increments are then written -to HBase. This sink provides the same consistency guarantees as HBase, +to HBase. This sink uses the `Asynchbase API `_ to write to +HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event of Hbase failing to write certain events, the sink will replay all events in that transaction. The type is the FQCN: org.apache.flume.sink.hbase.AsyncHBaseSink. Required properties are in **bold**. -================ ============================================================ ==================================================================================== -Property Name Default Description -================ ============================================================ ==================================================================================== -**channel** -- -**type** -- The component type name, needs to be ``asynchbase`` -**table** -- The name of the table in Hbase to write to. -zookeeperQuorum -- The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml -znodeParent /hbase The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml -**columnFamily** -- The column family in Hbase to write to. -batchSize 100 Number of events to be written per txn. -timeout 60000 The length of time (in milliseconds) the sink waits for acks from hbase for - all events in a transaction. -serializer org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer -serializer.* -- Properties to be passed to the serializer. -================ ============================================================ ==================================================================================== +=================== ============================================================ ==================================================================================== +Property Name Default Description +=================== ============================================================ ==================================================================================== +**channel** -- +**type** -- The component type name, needs to be ``asynchbase`` +**table** -- The name of the table in Hbase to write to. +zookeeperQuorum -- The quorum spec. This is the value for the property ``hbase.zookeeper.quorum`` in hbase-site.xml +znodeParent /hbase The base path for the znode for the -ROOT- region. Value of ``zookeeper.znode.parent`` in hbase-site.xml +**columnFamily** -- The column family in Hbase to write to. +batchSize 100 Number of events to be written per txn. +coalesceIncrements false Should the sink coalesce multiple increments to a cell per batch. This might give + better performance if there are multiple increments to a limited number of cells. +timeout 60000 The length of time (in milliseconds) the sink waits for acks from hbase for + all events in a transaction. +serializer org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer +serializer.* -- Properties to be passed to the serializer. 
+=================== ============================================================ ==================================================================================== Note that this sink takes the Zookeeper Quorum and parent znode information in the configuration. Zookeeper Quorum and parent node configuration may be @@ -1835,7 +1932,7 @@ This sink extracts data from Flume event This sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to Search applications. -The ETL functionality is customizable using a `morphline configuration file `_ that defines a chain of transformation commands that pipe event records from one command to another. +The ETL functionality is customizable using a `morphline configuration file `_ that defines a chain of transformation commands that pipe event records from one command to another. Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline command is a bit like a Flume Interceptor. Morphlines can be embedded into Hadoop components such as Flume. @@ -1915,7 +2012,10 @@ indexType logs clusterName elasticsearch Name of the ElasticSearch cluster to connect to batchSize 100 Number of events to be written per txn. ttl -- TTL in days, when set will cause the expired documents to be deleted automatically, - if not set documents will never be automatically deleted + if not set documents will never be automatically deleted. TTL is accepted both in the earlier form of + integer only e.g. a1.sinks.k1.ttl = 5 and also with a qualifier ms (millisecond), s (second), m (minute), + h (hour), d (day) and w (week). Example a1.sinks.k1.ttl = 5d will set TTL to 5 days. Follow + http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for more information. serializer org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred. serializer.* -- Properties to be passed to the serializer. @@ -1933,10 +2033,50 @@ Example for agent named a1: a1.sinks.k1.indexType = bar_type a1.sinks.k1.clusterName = foobar_cluster a1.sinks.k1.batchSize = 500 - a1.sinks.k1.ttl = 5 + a1.sinks.k1.ttl = 5d a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer a1.sinks.k1.channel = c1 +Kite Dataset Sink (experimental) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: + This source is experimental and may change between minor versions of Flume. + Use at your own risk. + +Experimental sink that writes events to a `Kite Dataset `_. +This sink will deserialize the body of each incoming event and store the +resulting record in a Kite Dataset. It determines target Dataset by opening a +repository URI, ``kite.repo.uri``, and loading a Dataset by name, +``kite.dataset.name``. + +The only supported serialization is avro, and the record schema must be passed +in the event headers, using either ``flume.avro.schema.literal`` with the JSON +schema representation or ``flume.avro.schema.url`` with a URL where the schema +may be found (``hdfs:/...`` URIs are supported). 
This is compatible with the +Log4jAppender flume client and the spooling directory source's Avro +deserializer using ``deserializer.schemaType = LITERAL``. + +Note 1: The ``flume.avro.schema.hash`` header is **not supported**. +Note 2: In some cases, file rolling may occur slightly after the roll interval +has been exceeded. However, this delay will not exceed 5 seconds. In most +cases, the delay is negligible. + +======================= ======= =========================================================== +Property Name Default Description +======================= ======= =========================================================== +**channel** -- +**type** -- Must be org.apache.flume.sink.kite.DatasetSink +**kite.repo.uri** -- URI of the repository to open +**kite.dataset.name** -- Name of the Dataset where records will be written +kite.batchSize 100 Number of records to process in each batch +kite.rollInterval 30 Maximum wait time (seconds) before data files are released +auth.kerberosPrincipal -- Kerberos user principal for secure authentication to HDFS +auth.kerberosKeytab -- Kerberos keytab location (local FS) for the principal +auth.proxyUser -- The effective user for HDFS actions, if different from + the kerberos principal +======================= ======= =========================================================== + Custom Sink ~~~~~~~~~~~ @@ -2059,15 +2199,13 @@ Property Name Default checkpointDir ~/.flume/file-channel/checkpoint The directory where checkpoint file will be stored useDualCheckpoints false Backup the checkpoint. If this is set to ``true``, ``backupCheckpointDir`` **must** be set backupCheckpointDir -- The directory where the checkpoint is backed up to. This directory **must not** be the same as the data directories or the checkpoint directory -dataDirs ~/.flume/file-channel/data The directory where log files will be stored -transactionCapacity 1000 The maximum size of transaction supported by the channel +dataDirs ~/.flume/file-channel/data Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance +transactionCapacity 10000 The maximum size of transaction supported by the channel checkpointInterval 30000 Amount of time (in millis) between checkpoints maxFileSize 2146435071 Max size (in bytes) of a single log file minimumRequiredSpace 524288000 Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value capacity 1000000 Maximum capacity of the channel keep-alive 3 Amount of time (in sec) to wait for a put operation -write-timeout 3 Amount of time (in sec) to wait for a write operation -checkpoint-timeout 600 Expert: Amount of time (in sec) to wait for a checkpoint use-log-replay-v1 false Expert: Use old replay logic use-fast-replay false Expert: Replay without using queue encryption.activeKey -- Key name used to encrypt new data @@ -2155,6 +2293,80 @@ The same scenerio as above, however key- a1.channels.c1.encryption.keyProvider.keys.key-0.passwordFile = /path/to/key-0.password +Spillable Memory Channel +~~~~~~~~~~~~~~~~~~~~~~~~ + +The events are stored in an in-memory queue and on disk. The in-memory queue serves as the primary store and the disk as overflow. +The disk store is managed using an embedded File channel. When the in-memory queue is full, additional incoming events are stored in +the file channel.
This channel is ideal for flows that need high throughput of memory channel during normal operation, but at the +same time need the larger capacity of the file channel for better tolerance of intermittent sink side outages or drop in drain rates. +The throughput will reduce approximately to file channel speeds during such abnormal situations. In case of an agent crash or restart, +only the events stored on disk are recovered when the agent comes online. **This channel is currently experimental and +not recommended for use in production.** + +Required properties are in **bold**. Please refer to file channel for additional required properties. + +============================ ================ ============================================================================================= +Property Name Default Description +============================ ================ ============================================================================================= +**type** -- The component type name, needs to be ``SPILLABLEMEMORY`` +memoryCapacity 10000 Maximum number of events stored in memory queue. To disable use of in-memory queue, set this to zero. +overflowCapacity 100000000 Maximum number of events stored in overflow disk (i.e File channel). To disable use of overflow, set this to zero. +overflowTimeout 3 The number of seconds to wait before enabling disk overflow when memory fills up. +byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size + of all events in the channel, to account for data in headers. See below. +byteCapacity see description Maximum **bytes** of memory allowed as a sum of all events in the memory queue. + The implementation only counts the Event ``body``, which is the reason for + providing the ``byteCapacityBufferPercentage`` configuration parameter as well. + Defaults to a computed value equal to 80% of the maximum memory available to + the JVM (i.e. 80% of the -Xmx value passed on the command line). + Note that if you have multiple memory channels on a single JVM, and they happen + to hold the same physical events (i.e. if you are using a replicating channel + selector from a single source) then those event sizes may be double-counted for + channel byteCapacity purposes. + Setting this value to ``0`` will cause this value to fall back to a hard + internal limit of about 200 GB. +avgEventSize 500 Estimated average size of events, in bytes, going into the channel + see file channel Any file channel property with the exception of 'keep-alive' and 'capacity' can be used. + The keep-alive of file channel is managed by Spillable Memory Channel. Use 'overflowCapacity' + to set the File channel's capacity. +============================ ================ ============================================================================================= + +In-memory queue is considered full if either memoryCapacity or byteCapacity limit is reached. + +Example for agent named a1: + +.. code-block:: properties + + a1.channels = c1 + a1.channels.c1.type = SPILLABLEMEMORY + a1.channels.c1.memoryCapacity = 10000 + a1.channels.c1.overflowCapacity = 1000000 + a1.channels.c1.byteCapacity = 800000 + a1.channels.c1.checkpointDir = /mnt/flume/checkpoint + a1.channels.c1.dataDirs = /mnt/flume/data + +To disable the use of the in-memory queue and function like a file channel: + +.. 
code-block:: properties + + a1.channels = c1 + a1.channels.c1.type = SPILLABLEMEMORY + a1.channels.c1.memoryCapacity = 0 + a1.channels.c1.overflowCapacity = 1000000 + a1.channels.c1.checkpointDir = /mnt/flume/checkpoint + a1.channels.c1.dataDirs = /mnt/flume/data + + +To disable the use of overflow disk and function purely as a in-memory channel: + +.. code-block:: properties + + a1.channels = c1 + a1.channels.c1.type = SPILLABLEMEMORY + a1.channels.c1.memoryCapacity = 100000 + + Pseudo Transaction Channel ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -2595,7 +2807,7 @@ prefix "" The prefix st Morphline Interceptor ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This interceptor filters the events through a `morphline configuration file `_ that defines a chain of transformation commands that pipe records from one command to another. +This interceptor filters the events through a `morphline configuration file `_ that defines a chain of transformation commands that pipe records from one command to another. For example the morphline can ignore certain events or alter or insert certain event headers via regular expression based pattern matching, or it can auto-detect and set a MIME type via Apache Tika on events that are intercepted. For example, this kind of packet sniffing can be used for content based dynamic routing in a Flume topology. MorphlineInterceptor can also help to implement dynamic routing to multiple Apache Solr collections (e.g. for multi-tenancy). @@ -2671,11 +2883,11 @@ If the Flume event body contained ``1:2: .. code-block:: properties - agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d) - agent.sources.r1.interceptors.i1.serializers = s1 s2 s3 - agent.sources.r1.interceptors.i1.serializers.s1.name = one - agent.sources.r1.interceptors.i1.serializers.s2.name = two - agent.sources.r1.interceptors.i1.serializers.s3.name = three + a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d) + a1.sources.r1.interceptors.i1.serializers = s1 s2 s3 + a1.sources.r1.interceptors.i1.serializers.s1.name = one + a1.sources.r1.interceptors.i1.serializers.s2.name = two + a1.sources.r1.interceptors.i1.serializers.s3.name = three The extracted event will contain the same body but the following headers will have been added ``one=>1, two=>2, three=>3`` @@ -2686,11 +2898,11 @@ If the Flume event body contained ``2012 .. code-block:: properties - agent.sources.r1.interceptors.i1.regex = ^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d) - agent.sources.r1.interceptors.i1.serializers = s1 - agent.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer - agent.sources.r1.interceptors.i1.serializers.s1.name = timestamp - agent.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm + a1.sources.r1.interceptors.i1.regex = ^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d) + a1.sources.r1.interceptors.i1.serializers = s1 + a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer + a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp + a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm the extracted event will contain the same body but the following headers will have been added ``timestamp=>1350611220000`` @@ -2731,21 +2943,21 @@ Log4J Appender Appends Log4j events to a flume agent's avro source. A client using this appender must have the flume-ng-sdk in the classpath (eg, -flume-ng-sdk-1.4.0.jar). +flume-ng-sdk-1.5.0.jar). 
Required properties are in **bold**. -===================== ======= ============================================================== +===================== ======= ================================================================================== Property Name Default Description -===================== ======= ============================================================== +===================== ======= ================================================================================== **Hostname** -- The hostname on which a remote Flume agent is running with an avro source. **Port** -- The port at which the remote Flume agent's avro source is listening. UnsafeMode false If true, the appender will not throw exceptions on failure to send the events. -AvroReflectionEnabled false Use Avro Reflection to serialize Log4j events. +AvroReflectionEnabled false Use Avro Reflection to serialize Log4j events. (Do not use when users log strings) AvroSchemaUrl -- A URL from which the Avro schema can be retrieved. -===================== ======= ============================================================== +===================== ======= ================================================================================== Sample log4j.properties file: @@ -2795,7 +3007,7 @@ Load Balancing Log4J Appender Appends Log4j events to a list of flume agent's avro source. A client using this appender must have the flume-ng-sdk in the classpath (eg, -flume-ng-sdk-1.4.0.jar). This appender supports a round-robin and random +flume-ng-sdk-1.5.0.jar). This appender supports a round-robin and random scheme for performing the load balancing. It also supports a configurable backoff timeout so that down agents are removed temporarily from the set of hosts Required properties are in **bold**. @@ -2883,9 +3095,9 @@ and can be specified in the flume-env.sh Property Name Default Description ======================= ======= ===================================================================================== **type** -- The component type name, has to be ``ganglia`` -**hosts** -- Comma-separated list of ``hostname:port`` -pollInterval 60 Time, in seconds, between consecutive reporting to ganglia server -isGanglia3 false Ganglia server version is 3. By default, Flume sends in ganglia 3.1 format +**hosts** -- Comma-separated list of ``hostname:port`` of Ganglia servers +pollFrequency 60 Time, in seconds, between consecutive reporting to Ganglia server +isGanglia3 false Ganglia server version is 3. By default, Flume sends in Ganglia 3.1 format ======================= ======= ===================================================================================== We can start Flume with Ganglia support as follows:: @@ -2936,7 +3148,7 @@ Property Name Default Descri port 41414 The port to start the server on. 
======================= ======= ===================================================================================== -We can start Flume with Ganglia support as follows:: +We can start Flume with JSON Reporting support as follows:: $ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545 Modified: flume/site/trunk/content/sphinx/download.rst URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/download.rst?rev=1596704&r1=1596703&r2=1596704&view=diff ============================================================================== --- flume/site/trunk/content/sphinx/download.rst (original) +++ flume/site/trunk/content/sphinx/download.rst Wed May 21 22:34:24 2014 @@ -12,8 +12,8 @@ originals on the main distribution serve :header: "", "Mirrors", "Checksum", "Signature" :widths: 25, 25, 25, 25 - "Apache Flume binary (tar.gz)", `apache-flume-1.4.0-bin.tar.gz `_, `apache-flume-1.4.0-bin.tar.gz.md5 `_, `apache-flume-1.4.0-bin.tar.gz.asc `_ - "Apache Flume source (tar.gz)", `apache-flume-1.4.0-src.tar.gz `_, `apache-flume-1.4.0-src.tar.gz.md5 `_, `apache-flume-1.4.0-src.tar.gz.asc `_ + "Apache Flume binary (tar.gz)", `apache-flume-1.5.0-bin.tar.gz `_, `apache-flume-1.5.0-bin.tar.gz.md5 `_, `apache-flume-1.5.0-bin.tar.gz.asc `_ + "Apache Flume source (tar.gz)", `apache-flume-1.5.0-src.tar.gz `_, `apache-flume-1.5.0-src.tar.gz.md5 `_, `apache-flume-1.5.0-src.tar.gz.asc `_ It is essential that you verify the integrity of the downloaded files using the PGP or MD5 signatures. Please read `Verifying Apache HTTP Server Releases `_ for more information on @@ -25,9 +25,9 @@ as well as the asc signature file for th Then verify the signatures using:: % gpg --import KEYS - % gpg --verify apache-flume-1.4.0-src.tar.gz.asc + % gpg --verify apache-flume-1.5.0-src.tar.gz.asc -Apache Flume 1.4.0 is signed by Mike Percy 66F2054B +Apache Flume 1.5.0 is signed by Hari Shreedharan 77FFC9AB Alternatively, you can verify the MD5 or SHA1 signatures of the files. A program called md5, md5sum, or shasum is included in many Unix distributions for this purpose. Modified: flume/site/trunk/content/sphinx/index.rst URL: http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/index.rst?rev=1596704&r1=1596703&r2=1596704&view=diff ============================================================================== --- flume/site/trunk/content/sphinx/index.rst (original) +++ flume/site/trunk/content/sphinx/index.rst Wed May 21 22:34:24 2014 @@ -33,6 +33,36 @@ application. .. raw:: html +

May 20, 2014 - Apache Flume 1.5.0 Released

+ +The Apache Flume team is pleased to announce the release of Flume 1.5.0. + +Flume is a distributed, reliable, and available service for efficiently +collecting, aggregating, and moving large amounts of streaming event data. + +Version 1.5.0 is the fifth Flume release as an Apache top-level project. +Flume 1.5.0 is stable, production-ready software, and is backwards-compatible +with previous versions of the Flume 1.x codeline. + +Several months of active development went into this release: 123 patches were committed since 1.4.0, representing many features, enhancements, and bug fixes. While the full change log can be found on the 1.5.0 release page (link below), here are a few new feature highlights: + +* New in-memory channel that can spill to disk +* A new dataset sink that uses the Kite API to write data to HDFS and HBase +* Support for Elastic Search HTTP API in Elastic Search Sink +* Much faster replay in the File Channel. + +The full change log and documentation are available on the +`Flume 1.5.0 release page `__. + +This release can be downloaded from the Flume `Download `__ page. + +Your contributions, feedback, help and support make Flume better! +For more information on how to report problems or contribute, +please visit our `Get Involved `__ page. + +The Apache Flume Team + +

July 2, 2013 - Apache Flume 1.4.0 Released

The Apache Flume team is pleased to announce the release of Flume 1.4.0.