hadoop-common-dev mailing list archives

From "arkady borkovsky (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2278) Streaming: better control over input splits
Date Fri, 30 Nov 2007 22:26:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547312 ]

arkady borkovsky commented on HADOOP-2278:

Some split conditions are hard to catch with a fixed pattern.

A very important split condition is "key switch" -- (does this need to be filed as a separate issue?)

Quite often, the mapper input is grouped by key and the mapper is actually a reducer.  Therefore,
it expects that all the values for a given key go to the same task.
Currently, the split can happen between any two records, so "key runs" are usually broken at
split boundaries.

The workaround is to use an infinite split size -- which results in poor granularity.
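
To illustrate the "key switch" condition, here is a minimal Python sketch, not anything Streaming provides. It assumes records are lines whose key is the text before the first tab, and that the input is already grouped by key. The names key_of and split_at_key_switch are hypothetical, and split size is counted in records rather than bytes for simplicity.

```python
from itertools import groupby


def key_of(line):
    # Assumption: the key is everything before the first tab.
    return line.split("\t", 1)[0]


def split_at_key_switch(lines, target_size):
    """Group lines into splits of roughly target_size records,
    closing a split only where the key changes."""
    splits = []
    current = []
    for _, run in groupby(lines, key=key_of):
        # Close the current split before this key run if it is already full.
        if current and len(current) >= target_size:
            splits.append(current)
            current = []
        # A key run is never divided, so a split may exceed target_size.
        current.extend(run)
    if current:
        splits.append(current)
    return splits
```

A split here can grow past target_size when a key run is long, which is the price of keeping each run in one task while still getting finer granularity than a single split per input.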

> Streaming: better control over input splits
> -------------------------------------------
>                 Key: HADOOP-2278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2278
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: arkady borkovsky
> In streaming, the map command usually expects to receive its input uninterpreted -- just as it is stored in DFS.
> However, the split (the beginning and the end of the portion of data that goes to a single map task) is often important and is not "any line break".
> Often the input consists of multi-line documents -- e.g. in XML.
> There should be a way to specify a pattern that separates logical records.
> Existing "Streaming XML record reader" kind of provides this functionality.  However, it is accepted that "Streaming XML" is a hack and needs to be replaced.
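
As a rough illustration of what "a pattern that separates logical records" could mean, the following Python sketch cuts a text into records that each begin where a regular expression matches. It is only an assumption about the intended behavior, not an existing Streaming option; split_records and the <doc> tag in the usage note are hypothetical.

```python
import re


def split_records(text, start_pattern):
    """Yield logical records, each beginning at a match of start_pattern.

    Any text before the first match is dropped.
    """
    starts = [m.start() for m in re.finditer(start_pattern, text)]
    # Each record runs from one match to the next (or to the end of the text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        yield text[begin:end]
```

For example, calling split_records(text, r"<doc>") on a file of multi-line XML documents would yield one record per <doc> element, so a split boundary could be placed between records instead of at an arbitrary line break.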

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
