From: Mohammad Tariq
Date: Mon, 6 Aug 2012 21:00:24 +0530
Subject: Handling files with unclear boundaries
To: mapreduce-user@hadoop.apache.org

Hello list,

I need some guidance on how to handle files that have no proper delimiters or record boundaries. I am trying to process a set of files in a format that is completely new to me (SAS XPT files) through MapReduce. The one thing that is always fixed is that I have to read exactly 107 bytes at a time. Is it possible to use this length as the record boundary when creating splits? If so, which InputFormat would be appropriate?

Many thanks.

Regards,
Mohammad Tariq
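
P.S. To make the question concrete, the split arithmetic could look roughly like the sketch below. It is plain Java with no Hadoop classes. The class name FixedLengthSplits, the method alignedSplits, and the 400-byte target split size are made up for illustration. It also assumes the file is nothing but back-to-back 107-byte records starting at offset 0, which may not hold for real XPT files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a file into byte ranges that always end on a
// record boundary, so no fixed-length record straddles two splits.
class FixedLengthSplits {
    static final int RECORD_LENGTH = 107; // the fixed read size from the question

    // Returns [start, end) byte ranges; each range holds a whole number of records.
    static List<long[]> alignedSplits(long fileLength, long targetSplitSize, int recordLength) {
        if (recordLength <= 0 || targetSplitSize < recordLength) {
            throw new IllegalArgumentException("split size must hold at least one record");
        }
        // Round the target split size down to a multiple of the record length.
        long splitBytes = (targetSplitSize / recordLength) * recordLength;
        // Assumption: a trailing partial record, if any, is ignored.
        long usable = (fileLength / recordLength) * recordLength;

        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < usable; start += splitBytes) {
            splits.add(new long[] {start, Math.min(start + splitBytes, usable)});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 1070-byte file (10 records) with a 400-byte target split size
        // prints: 0-321, 321-642, 642-963, 963-1070 (one range per line).
        for (long[] s : alignedSplits(1070L, 400L, RECORD_LENGTH)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

A custom InputFormat would presumably have to return ranges like these from getSplits(), with a record reader that then consumes 107 bytes per record inside each range.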