druid-commits mailing list archives

From: GitBox <...@apache.org>
Subject: [GitHub] fjy closed pull request #6373: Docs for ingestion stat reports and new parse exception handling
Date: Tue, 25 Sep 2018 00:45:07 GMT
fjy closed pull request #6373: Docs for ingestion stat reports and new parse exception handling
URL: https://github.com/apache/incubator-druid/pull/6373
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/docs/content/development/extensions-core/kafka-ingestion.md b/docs/content/development/extensions-core/kafka-ingestion.md
index 7b17c46a270..ebc240a2d16 100644
--- a/docs/content/development/extensions-core/kafka-ingestion.md
+++ b/docs/content/development/extensions-core/kafka-ingestion.md
@@ -122,7 +122,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |`intermediatePersistPeriod`|ISO8601 Period|The period that determines the rate at which intermediate persists occur.|no (default == PT10M)|
 |`maxPendingPersists`|Integer|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|no (default == 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)|
 |`indexSpec`|Object|Tune how data is indexed, see 'IndexSpec' below for more details.|no|
-|`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
+|`reportParseExceptions`|Boolean|DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|no (default == false)|
 |`handoffConditionTimeout`|Long|Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.|no (default == 0)|
 |`resetOffsetAutomatically`|Boolean|Whether to reset the consumer offset if the next offset that it is trying to fetch is less than the earliest available offset for that particular partition. The consumer offset will be reset to either the earliest or latest offset depending on `useEarliestOffset` property of `KafkaSupervisorIOConfig` (see below). This situation typically occurs when messages in Kafka are no longer available for consumption and therefore won't be ingested into Druid. If set to false then ingestion for that particular partition will halt and manual intervention is required to correct the situation, please see `Reset Supervisor` API below.|no (default == false)|
 |`workerThreads`|Integer|The number of threads that will be used by the supervisor for asynchronous operations.|no (default == min(10, taskCount))|
@@ -133,6 +133,9 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |`offsetFetchPeriod`|ISO8601 Period|How often the supervisor queries Kafka and the indexing tasks to fetch current offsets and calculate lag.|no (default == PT30S, min == PT5S)|
 |`segmentWriteOutMediumFactory`|String|Segment write-out medium to use when creating segments. See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../../configuration/index.html#segmentwriteoutmediumfactory) for explanation and available options.|no (not specified by default, the value from `druid.peon.defaultSegmentWriteOutMediumFactory` is used)|
 |`intermediateHandoffPeriod`|ISO8601 Period|How often the tasks should hand off segments. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == P2147483647D)|
+|`logParseExceptions`|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no (default == false)|
+|`maxParseExceptions`|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|no (default == unlimited)|
+|`maxSavedParseExceptions`|Integer|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. `maxSavedParseExceptions` limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/reports.html). Overridden if `reportParseExceptions` is set.|no (default == 0)|
 
 #### IndexSpec
 
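For context, a Kafka supervisor `tuningConfig` that opts into the new parse exception handling might look roughly like the following sketch (field values are illustrative, not recommendations):

```json
{
  "type": "kafka",
  "maxRowsInMemory": 100000,
  "logParseExceptions": true,
  "maxParseExceptions": 100,
  "maxSavedParseExceptions": 10
}
```

With settings like these, a task would log each parse error, tolerate up to 100 parse exceptions before failing, and keep the 10 most recent exceptions for the completion report.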
diff --git a/docs/content/ingestion/hadoop.md b/docs/content/ingestion/hadoop.md
index 507cfdb3113..68d2464e474 100644
--- a/docs/content/ingestion/hadoop.md
+++ b/docs/content/ingestion/hadoop.md
@@ -166,7 +166,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |leaveIntermediate|Boolean|Leave behind intermediate files (for debugging) in the workingPath when a job completes, whether it passes or fails.|no (default == false)|
 |cleanupOnFailure|Boolean|Clean up intermediate files when a job fails (unless leaveIntermediate is on).|no (default == true)|
 |overwriteFiles|Boolean|Override existing files found during indexing.|no (default == false)|
-|ignoreInvalidRows|Boolean|Ignore rows found to have problems.|no (default == false)|
+|ignoreInvalidRows|Boolean|DEPRECATED. Ignore rows found to have problems. If false, any exception encountered during parsing will be thrown and will halt ingestion; if true, unparseable rows and fields will be skipped. If `maxParseExceptions` is defined, this property is ignored.|no (default == false)|
 |combineText|Boolean|Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files.|no (default == false)|
 |useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if possible.|no (default == false)|
 |jobProperties|Object|A map of properties to add to the Hadoop job configuration, see below for details.|no (default == null)|
@@ -174,6 +174,8 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |numBackgroundPersistThreads|Integer|The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and cpu usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1.|no (default == 0)|
 |forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs. Experimental feature intended for use with the [Kafka indexing service extension](../development/extensions-core/kafka-ingestion.html).|no (default = false)|
 |useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default = false)|
+|logParseExceptions|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no (default == false)|
+|maxParseExceptions|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overrides `ignoreInvalidRows` if `maxParseExceptions` is defined.|no (default == unlimited)|
 
 ### jobProperties field of TuningConfig
 
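As a rough sketch of the Hadoop case (values illustrative), a `tuningConfig` that defines `maxParseExceptions` makes the deprecated `ignoreInvalidRows` flag irrelevant:

```json
{
  "type": "hadoop",
  "ignoreInvalidRows": true,
  "logParseExceptions": true,
  "maxParseExceptions": 50
}
```

Here the job disregards `ignoreInvalidRows`, logs each parsing problem, and fails once more than 50 parse exceptions have occurred.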
diff --git a/docs/content/ingestion/native_tasks.md b/docs/content/ingestion/native_tasks.md
index a39c2cddc0c..3d27823c233 100644
--- a/docs/content/ingestion/native_tasks.md
+++ b/docs/content/ingestion/native_tasks.md
@@ -479,9 +479,12 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
 |forceExtendableShardSpecs|Forces use of extendable shardSpecs. Experimental feature intended for use with the [Kafka indexing service extension](../development/extensions-core/kafka-ingestion.html).|false|no|
 |forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.html#roll-up-modes). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. This flag cannot be used with either `appendToExisting` of IOConfig or `forceExtendableShardSpecs`. For more details, see the below __Segment pushing modes__ section.|false|no|
-|reportParseExceptions|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|false|no|
+|reportParseExceptions|DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|false|no|
 |pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no|
 |segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../configuration/index.html#segmentwriteoutmediumfactory) for explanation and available options.|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory` is used|no|
+|logParseExceptions|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|false|no|
+|maxParseExceptions|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|unlimited|no|
+|maxSavedParseExceptions|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. `maxSavedParseExceptions` limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../ingestion/reports.html). Overridden if `reportParseExceptions` is set.|0|no|
 
 #### IndexSpec
 
@@ -528,3 +531,5 @@ continues to ingest remaining data.
 
 To enable bulk pushing mode, `forceGuaranteedRollup` should be set in the TuningConfig. Note that this option cannot
 be used with either `forceExtendableShardSpecs` of TuningConfig or `appendToExisting` of IOConfig.
+
+
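To make the deprecation concrete, a native index task `tuningConfig` like the hypothetical one below would not behave as its explicit limits suggest: because `reportParseExceptions` is true, `maxParseExceptions` is effectively forced to 0 and `maxSavedParseExceptions` is capped at 1.

```json
{
  "type": "index",
  "reportParseExceptions": true,
  "maxParseExceptions": 100,
  "maxSavedParseExceptions": 10
}
```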
diff --git a/docs/content/ingestion/reports.md b/docs/content/ingestion/reports.md
new file mode 100644
index 00000000000..82cb7612efd
--- /dev/null
+++ b/docs/content/ingestion/reports.md
@@ -0,0 +1,131 @@
+---
+layout: doc_page
+---
+# Ingestion Reports
+
+## Completion Report
+
+After a task completes, a report containing information about the number of rows ingested and any parse exceptions that occurred is available at:
+
+```
+http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
+```
+
+This reporting feature is supported by the non-parallel native batch tasks, the Hadoop batch task, and tasks created by the Kafka Indexing Service. Realtime tasks created by Tranquility do not provide completion reports.
+
+An example output is shown below, along with a description of the fields:
+
+```json
+{
+  "ingestionStatsAndErrors": {
+    "taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
+    "payload": {
+      "ingestionState": "COMPLETED",
+      "unparseableEvents": {},
+      "rowStats": {
+        "determinePartitions": {
+          "processed": 0,
+          "processedWithError": 0,
+          "thrownAway": 0,
+          "unparseable": 0
+        },
+        "buildSegments": {
+          "processed": 5390324,
+          "processedWithError": 0,
+          "thrownAway": 0,
+          "unparseable": 0
+        }
+      },
+      "errorMsg": null
+    },
+    "type": "ingestionStatsAndErrors"
+  }
+}
+```
+
+The `ingestionStatsAndErrors` report provides information about row counts and errors. 
+
+The `ingestionState` shows what step of ingestion the task reached. Possible states include:
+* `NOT_STARTED`: The task has not begun reading any rows
+* `DETERMINE_PARTITIONS`: The task is processing rows to determine partitioning
+* `BUILD_SEGMENTS`: The task is processing rows to construct segments
+* `COMPLETED`: The task has finished its work.
+
+Only batch tasks have the `DETERMINE_PARTITIONS` phase. Realtime tasks such as those created by the Kafka Indexing Service do not have a `DETERMINE_PARTITIONS` phase.
+
+`unparseableEvents` contains lists of exception messages that were caused by unparseable inputs. This can help with identifying problematic input rows. There will be one list each for the `DETERMINE_PARTITIONS` and `BUILD_SEGMENTS` phases. Note that the Hadoop batch task does not support saving of unparseable events.
+
+The `rowStats` map contains information about row counts. There is one entry for each ingestion phase. The definitions of the different row counts are shown below:
+* `processed`: Number of rows successfully ingested without parsing errors
+* `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs when input rows have a parseable structure but invalid types for columns, such as passing in a non-numeric String value for a numeric column.
+* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with [Transform Specs](../ingestion/transform-spec.html).
+* `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser.
+
+The `errorMsg` field shows a message describing the error that caused a task to fail. It will be null if the task was successful.
+
+## Live Reports
+
+### Row stats
+
+The non-parallel [Native Batch Task](../ingestion/native_tasks.html), the Hadoop batch task, and the tasks created by the Kafka Indexing Service support retrieval of row stats while the task is running.
+
+The live report can be accessed with a GET to the following URL on a peon running a task:
+
+```
+http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats
+```
+
+An example report is shown below. The `movingAverages` section contains 1 minute, 5 minute, and 15 minute moving averages of increases to the four row counters, which have the same definitions as those in the completion report. The `totals` section shows the current totals.
+
+```
+{
+  "movingAverages": {
+    "buildSegments": {
+      "5m": {
+        "processed": 3.392158326408501,
+        "unparseable": 0,
+        "thrownAway": 0,
+        "processedWithError": 0
+      },
+      "15m": {
+        "processed": 1.736165476881023,
+        "unparseable": 0,
+        "thrownAway": 0,
+        "processedWithError": 0
+      },
+      "1m": {
+        "processed": 4.206417693750045,
+        "unparseable": 0,
+        "thrownAway": 0,
+        "processedWithError": 0
+      }
+    }
+  },
+  "totals": {
+    "buildSegments": {
+      "processed": 1994,
+      "processedWithError": 0,
+      "thrownAway": 0,
+      "unparseable": 0
+    }
+  }
+}
+```
+
+Note that this is only supported by the non-parallel [Native Batch Task](../ingestion/native_tasks.html), the Hadoop batch task, and the tasks created by the Kafka Indexing Service.
+
+For the Kafka Indexing Service, a GET to the following Overlord API will retrieve live row stat reports from each task being managed by the supervisor and provide a combined report.
+
+```
+http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats
+```
+
+### Unparseable Events
+
+Current lists of unparseable events can be retrieved from a running task with a GET to the following peon API:
+
+```
+http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents
+```
+
+Note that this is only supported by the non-parallel [Native Batch Task](../ingestion/native_tasks.html) and the tasks created by the Kafka Indexing Service.
\ No newline at end of file
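Tying the new reports.md page back to its field definitions: for a task that fails on parse exceptions, the completion report carries the diagnostic detail described above. A hypothetical, abbreviated example (task id, counts, and messages are made up) might look like:

```json
{
  "ingestionStatsAndErrors": {
    "taskId": "index_example_2018-09-24T18:24:23.920Z",
    "payload": {
      "ingestionState": "BUILD_SEGMENTS",
      "unparseableEvents": {
        "buildSegments": [
          "Unable to parse row [this is not valid JSON]"
        ]
      },
      "rowStats": {
        "buildSegments": {
          "processed": 1024,
          "processedWithError": 3,
          "thrownAway": 0,
          "unparseable": 2
        }
      },
      "errorMsg": "Max parse exceptions exceeded"
    },
    "type": "ingestionStatsAndErrors"
  }
}
```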
diff --git a/docs/content/operations/api-reference.md b/docs/content/operations/api-reference.md
index 4eed0e1bb08..2a58de09874 100644
--- a/docs/content/operations/api-reference.md
+++ b/docs/content/operations/api-reference.md
@@ -368,6 +368,10 @@ Retrieve the status of a task.
 
 Retrieve information about the segments of a task.
 
+* `/druid/indexer/v1/task/{taskId}/reports`
+
+Retrieve a [task completion report](../ingestion/reports.html) for a task. Only works for completed tasks.
+
 #### POST
 
 * `/druid/indexer/v1/task` 
@@ -388,7 +392,16 @@ The MiddleManager does not have any API endpoints beyond the [common endpoints](
 
 ## Peon
 
-The Peon does not have any API endpoints beyond the [common endpoints](#common).
+#### GET
+
+* `/druid/worker/v1/chat/{taskId}/rowStats`
+
+Retrieve a live row stats report from a peon. See [task reports](../ingestion/reports.html) for more details.
+
+* `/druid/worker/v1/chat/{taskId}/unparseableEvents`
+
+Retrieve an unparseable events report from a peon. See [task reports](../ingestion/reports.html) for more details.
+
 
 ## Broker
 
diff --git a/docs/content/toc.md b/docs/content/toc.md
index 43d2022d78f..058b1c72de3 100644
--- a/docs/content/toc.md
+++ b/docs/content/toc.md
@@ -46,6 +46,7 @@ layout: toc
   * [Updating Existing Data](/docs/VERSION/ingestion/update-existing-data.html)
   * [Deleting Data](/docs/VERSION/ingestion/delete-data.html)
   * [Task Locking & Priority](/docs/VERSION/ingestion/locking-and-priority.html)
+  * [Task Reports](/docs/VERSION/ingestion/reports.html)
   * [FAQ](/docs/VERSION/ingestion/faq.html)
   * [Misc. Tasks](/docs/VERSION/ingestion/misc-tasks.html)
 


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services


