From: Francesco Tamberi
To: core-user@hadoop.apache.org
Date: Wed, 09 Jul 2008 12:26:43 +0200
Subject: Custom InputFormat/OutputFormat

Hi all,

I want to use Hadoop for some streaming text processing on text documents like:

    text text text ...

These use an XML-like notation but are not real XML files. I have to work on the text enclosed between tags, so I implemented an InputFormat (extending FileInputFormat) with a RecordReader that returns the file position as the key and the needed text as the value. This is the next() method, and I'm pretty sure it works as expected:

    /** Read a text block. */
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        if (pos >= end)
            return false;

        key.set(pos);    // key is position
        buffer.reset();
        long bytesRead = readBlock(startTag, endTag);    // put needed text in buffer
        if (bytesRead == 0)
            return false;

        pos += bytesRead;
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
    }

But when I test it, using "cat" as the mapper and TextOutputFormat as the OutputFormat, I get one key/value pair per line. For every text block, the first tuple has the file position as key and the text as value; the remaining tuples have text as key and no value, i.e.:

    file_pos / first_line
    second_line /
    third_line /
    ...

Where am I going wrong?

Thank you in advance,
Francesco
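
One possible explanation for the output described above, offered as a sketch rather than a diagnosis: Hadoop Streaming talks to the mapper process over stdin/stdout one line at a time, and by default treats the part of each output line before the first tab as the key and the rest as the value (a line with no tab becomes a key with an empty value). If the value is a multi-line text block, a pass-through mapper such as "cat" would emit several lines, and each would be parsed as its own record. The standalone Java below only simulates that behavior; the class and method names (StreamingLineSketch, toMapperInput, parseMapperOutput) are hypothetical and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class; it imitates a line protocol and uses no Hadoop code. */
final class StreamingLineSketch {

    /** Assumed mapper input for one record: key, TAB, value, newline. */
    static String toMapperInput(long key, String value) {
        return key + "\t" + value + "\n";
    }

    /** Re-split mapper stdout: one record per line, key ends at the first TAB. */
    static List<String[]> parseMapperOutput(String stdout) {
        List<String[]> records = new ArrayList<>();
        for (String line : stdout.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                // No TAB on this line: the whole line is the key, the value is empty.
                records.add(new String[] {line, ""});
            } else {
                records.add(new String[] {line.substring(0, tab), line.substring(tab + 1)});
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // One record whose value spans three lines, like a block between tags.
        String block = "first_line\nsecond_line\nthird_line";
        // A "cat" mapper copies stdin to stdout unchanged.
        String echoed = toMapperInput(0L, block);
        for (String[] kv : parseMapperOutput(echoed)) {
            System.out.println(kv[0] + " / " + kv[1]);
        }
    }
}
```

If this is indeed the cause, one way around it might be to strip or escape the newlines inside each block before the value is handed to the streaming mapper, so that every record stays on a single line.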