Subject: Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@hive.apache.org, safdar.kureishy@gmail.com
Cc: user@nutch.apache.org
Date: Sat, 5 May 2012 17:44:49 -0400

This is one of the things about Hive: the key is not easily available. You are going to need an InputFormat that creates a new value which contains both the key and the value. Like this: (key, value) -> new MyKeyValue(key, value).

On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy wrote:
> Hi,
>
> I have attached a SequenceFile with the following format:
>
> <url, CrawlDatum>
>
> (CrawlDatum is a custom Java type that contains several fields that would
> be flattened into several columns by the SerDe).
>
> In other words, what I would like to do is to expose this URL+CrawlDatum
> data via a Hive external table, with the following columns:
>
> || url || status || fetchtime || fetchinterval || modifiedtime || retries ||
> score || metadata ||
>
> So, I was hoping that after defining a custom SerDe, I would just have to
> define the Hive table as follows:
>
> CREATE EXTERNAL TABLE crawldb
> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
> modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> STORED AS SEQUENCEFILE
> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>
> For example, a sample record should look like the following through a Hive
> table:
>
> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 ||
> 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>
> I would like this to be possible without having to duplicate/flatten the
> data through a separate transformation. Initially, I thought my custom
> SerDe could have the following definition for deserialize():
>
>         @Override
>         public Object deserialize(Writable obj) throws SerDeException {
>             ...
>         }
>
> But the problem is that the input argument obj above is only the VALUE
> portion of a SequenceFile record. There seems to be a limitation in the
> way Hive reads SequenceFiles. Specifically, for each row in a sequence
> file, the KEY is ignored and only the VALUE is used by Hive. This can be
> seen in the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
> method, shown below, which ignores the KEY when iterating over a
> RecordReader:
>
>   /**
>    * Get the next row. The fetch context is modified appropriately.
>    **/
>   public InspectableObject getNextRow() throws IOException {
>     try {
>       while (true) {
>         if (currRecReader == null) {
>           currRecReader = getRecordReader();
>           if (currRecReader == null) {
>             return null;
>           }
>         }
>
>         boolean ret = currRecReader.next(key, value);
>         if (ret) {
>           if (this.currPart == null) {
>             Object obj = serde.deserialize(value);
>             return new InspectableObject(obj, serde.getObjectInspector());
>           } else {
>             rowWithPart[0] = serde.deserialize(value);
>             return new InspectableObject(rowWithPart, rowObjectInspector);
>           }
>         } else {
>           currRecReader.close();
>           currRecReader = null;
>         }
>       }
>     } catch (Exception e) {
>       throw new IOException(e);
>     }
>   }
>
> As you can see, the "key" variable is ignored and never returned. The
> problem is that in the Nutch crawldb SequenceFile, the KEY is the URL, and
> I need it to be displayed in the Hive table along with the fields of
> CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
> that comes after the key on each record... which is not sufficient.
>
> One hack could be to write a CustomSequenceFileRecordReader.java that
> returns the offset in the sequence file as the KEY, and an aggregation of
> the (Key+Value) as the VALUE.
> For that, perhaps I need to hack the code
> below from SequenceFileRecordReader, which will get really very messy:
>
>   protected synchronized boolean next(K key)
>     throws IOException {
>     if (!more) return false;
>     long pos = in.getPosition();
>     boolean remaining = (in.next(key) != null);
>     if (pos >= end && in.syncSeen()) {
>       more = false;
>     } else {
>       more = remaining;
>     }
>     return more;
>   }
>
> This would require me to write a CustomSequenceFileRecordReader and a
> CustomSequenceFileInputFormat and then some custom SerDe, and probably
> make several other changes as well. Is it possible to just get away with
> writing a custom SerDe and some pre-existing reader that includes the key
> when invoking SerDe.deserialize()? Unless I'm missing something, why does
> Hive have this limitation when accessing SequenceFiles? I would imagine
> that the key of a SequenceFile record would be just as important as the
> value... so why is it left out by the FetchOperator::getNextRow() method?
>
> If this is the unfortunate reality with reading sequence files in Nutch,
> is there another Hive storage format I should use that works around this
> limitation? Such as "create external table ..... STORED AS
> CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> CustomHiveSequenceFileInputFormat, how do I register it with Hive and use
> it in the Hive "STORED AS" definition?
>
> Any help or pointers would be greatly appreciated. I hope I'm mistaken
> about the limitation above, and if not, hopefully there is an easy way to
> resolve this through a custom SerDe alone.
>
> Warm regards,
> Safdar
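The wrapper approach suggested at the top of the thread — folding the SequenceFile key into the value before it reaches the SerDe — can be sketched as follows. This is a minimal, self-contained illustration only: in a real implementation the combine step would live in a custom RecordReader (e.g. one wrapping org.apache.hadoop.mapred.SequenceFileRecordReader) and the split step in the SerDe's deserialize(); the class `KeyValueWrapper`, its methods, and the tab separator are all hypothetical names chosen for this sketch.

```java
// Hypothetical sketch of the "wrap key+value into the value" idea from the
// thread: since Hive's FetchOperator only ever hands the VALUE to the SerDe,
// the RecordReader folds the key (the URL) into the value, and the SerDe
// splits it back apart.
public class KeyValueWrapper {

    // RecordReader side: build the composite value handed to Hive.
    // A tab separator is used purely for illustration; a real
    // implementation might emit a custom Writable instead.
    public static String combine(String key, String value) {
        return key + "\t" + value;
    }

    // SerDe side: recover the original (key, value) pair from the
    // composite value, so the URL can become a column.
    public static String[] split(String combined) {
        int tab = combined.indexOf('\t');
        return new String[] {
            combined.substring(0, tab),
            combined.substring(tab + 1)
        };
    }
}
```

With this shape, the SerDe's deserialize() would call split() on the incoming value and map the first element to the `url` column and parse the remainder into the CrawlDatum columns.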
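On the last question — registering a custom InputFormat — Hive's DDL lets you name the format classes explicitly instead of using a `STORED AS SEQUENCEFILE` shorthand, after adding the jar to the session. A hedged sketch, where `com.example.CustomHiveSequenceFileInputFormat` is a placeholder for whatever class you actually write (the column list is abbreviated from the table definition above):

```sql
ADD JAR /path/to/custom-inputformat.jar;

CREATE EXTERNAL TABLE crawldb (url STRING, status STRING, ...)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS
  INPUTFORMAT 'com.example.CustomHiveSequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current';
```

The OUTPUTFORMAT clause is required by the syntax even for a read-only external table; the stock Hive sequence-file output format shown here is a common choice in that case.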