Date: Sat, 23 Apr 2016 12:35:22 +0200
From: rpereira <rpereira@xs4all.nl>
To: user@hadoop.apache.org
Subject: Hadoop Streaming in combination with HDFS

Hi,

I have a text file that I am processing through Hadoop Streaming. I placed the file on HDFS. My data transform process is a set of awk and sed commands that creates a table structure.

I can choose the number of mappers. With one mapper the output is correct, but as soon as I use more than one mapper the input gets split up, and that splitting is done on end-of-line boundaries. I would like the splits to fall just before the text markers instead: a text block must not be split up, as that would mean loss of information, yet I would still like to be able to use more than one mapper.

Example:
============================================
Current situation :

text mark 1
some data
...
some data
text mark 2
some data
----------------split-----------------
...
some data
text mark 3
some data
...
some data

============================================
Correct situation :

text mark 1
some data
...
some data
----------------split here -------------
text mark 2
some data
...
some data
----------------or split here ----------
text mark 3
some data
...
some data
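For context, the job is launched roughly as below (a minimal sketch only; the streaming jar path, HDFS paths, script name and mapper count are placeholders, not the actual values):

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapreduce.job.maps=4 \
      -input  /user/rene/input/data.txt \
      -output /user/rene/output \
      -mapper transform.sh \
      -file   transform.sh

  # transform.sh wraps the awk/sed pipeline, e.g.
  #   awk '...' | sed '...'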
I would not like to do preprocessing before placing the file on HDFS to solve this issue. I want to work straight from the HDFS file system and stay flexible in the number of mapper processes applied. Is there any way to have the splitting done outside the text blocks, so that each text block stays complete?

Kind Regards
Rene

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org