From: Mike Sukmanowsky
To: dev@hive.apache.org
Date: Fri, 7 Oct 2011 20:52:09 -0400
Subject: Custom InputFormat for Multiline Input File Hive/Hadoop

Hi all,

Sending this to core-user@hadoop.apache.org and dev@hive.apache.org.

I'm trying to process Omniture's data log files with Hadoop/Hive. The file format is tab-delimited and pretty simple for the most part, but it allows multiple newlines and tabs within a field, each escaped by a backslash (\\n and \\t). As a result, I've opted to write my own InputFormat that handles the escaped newlines and converts the escaped tabs to spaces, so that Hive's split on tabs produces the right fields.

I've found a fairly good reference for doing this with the newer InputFormat API at http://blog.rguha.net/?p=293, but unfortunately my version of Hive (0.7.0) still uses the old InputFormat API. I haven't been able to find many tutorials on writing a custom InputFormat with the older API, so I'm looking for some guidance as to what may be wrong with the following two classes:

https://gist.github.com/3141e9d27d4e07f5f9ed
https://gist.github.com/79fdab227950a0776616

The SELECT statements within Hive currently return nothing, and my other variations returned nothing but NULL values.

This issue is also posted on StackOverflow at http://stackoverflow.com/questions/7692994/custom-inputformat-with-hive.

If there's a resource someone can point me to, that'd also be great.

Many thanks in advance,
Mike
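
For reference, the record-joining behaviour described above can be sketched in plain Java, independent of the Hadoop API. This is only an illustration under stated assumptions, not the code from the gists: the class name EscapedRecordSketch and its methods are hypothetical, a physical line ending in a backslash is assumed to continue onto the next line, a backslash directly before a tab is assumed to mark an escaped tab, and escaped backslashes are not handled. The same loop could plausibly sit inside the next() method of an old-API RecordReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: joins physical lines into logical records.
class EscapedRecordSketch {

    // Reads one logical record, or returns null at end of input.
    // Assumption: a line ending in '\' is continued on the next line.
    static String readRecord(BufferedReader in) throws IOException {
        String line = in.readLine();
        if (line == null) {
            return null;
        }
        StringBuilder record = new StringBuilder();
        while (line.endsWith("\\")) {
            // Drop the escaping backslash and replace the newline with a space.
            record.append(line, 0, line.length() - 1).append(' ');
            line = in.readLine();
            if (line == null) {
                line = ""; // input ended right after an escaped newline
            }
        }
        record.append(line);
        // Assumption: backslash + tab is an escaped tab; turn it into a space
        // so a later split on '\t' only sees real field separators.
        return record.toString().replace("\\\t", " ");
    }

    // Convenience wrapper: returns every logical record in the given text.
    static List<String> readAll(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = readRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Two logical records; the first spans two physical lines.
        String sample = "a\tb\\\nc\td\\\te\nx\ty\n";
        for (String record : readAll(sample)) {
            System.out.println(record.split("\t").length + " fields: " + record);
        }
    }
}
```

With the sample in main, the first record is split into three fields (a, "b c", "d e") and the second into two. Keeping this logic separate from the InputFormat plumbing makes it easier to check whether empty or NULL query results come from the record reader or from the table definition.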