Subject: Re: Sequence Files with data inside key
From: Edward Capriolo
To: "hive-user@hadoop.apache.org"
Date: Tue, 13 Apr 2010 20:15:41 -0400

I was looking at the code, and it looks like Hive uses an ignore-key output
format. So rather than trying to swap values in the input format, just write
an ignore-value output format.

On Tuesday, April 13, 2010, Edward Capriolo wrote:
>
>
> On Fri, Apr 2, 2010 at 9:34 PM, Zheng Shao wrote:
>
> The easiest way is to write a SequenceFileInputFormat that returns a
> RecordReader that has key in the value and value in the key.
>
> Zheng
>
> On Fri, Apr 2, 2010 at 2:16 PM, Edward Capriolo wrote:
>> I have some sequence files in which all our data is in the key.
>>
>> http://osdir.com/ml/hive-user-hadoop-apache/2009-10/msg00027.html
>>
>> Has anyone tackled the above issue?
>>
>>
>
>
>
> --
> Yours,
> Zheng
>
>
> I am attempting to do this for sequence files. Unfortunately, I have to
> copy much of the SequenceFile record reader, since its reader field (in)
> has private access.
> ----------------------------------------
> public class SequenceKeyOnlyInputFormat<K, V> extends SequenceFileInputFormat<K, V> {
>
>     public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
>         reporter.setStatus(split.toString());
>         return new SequenceKeyOnlyRecordReader<K, V>(job, (FileSplit) split);
>     }
>
> }
> --------------------------------------------
> @SuppressWarnings({ "unchecked", "deprecation" })
> public class SequenceKeyOnlyRecordReader<K, V>
> implements RecordReader<K, V> {
>
>     private SequenceFile.Reader in;
>     private long start;
>     private long end;
>     private boolean more = true;
>     protected Configuration conf;
>
>
>     public SequenceKeyOnlyRecordReader(Configuration conf, FileSplit split) throws IOException {
>         Path path = split.getPath();
>         FileSystem fs = path.getFileSystem(conf);
>         this.in = new SequenceFile.Reader(fs, path, conf);
>         this.end = split.getStart() + split.getLength();
>         this.conf = conf;
>
>         if (split.getStart() > in.getPosition()) in.sync(split.getStart()); // sync to start
>
>         this.start = in.getPosition();
>         more = start < end;
>     }
>
>     /**
>      * The class of key that must be passed to {@link #next(Object, Object)}.
>      */
>     public Class getKeyClass() {
>         return in.getKeyClass();
>     }
>
>     /**
>      * The class of value that must be passed to {@link #next(Object, Object)}.
>      */
>     public Class getValueClass() {
>         return in.getKeyClass();
>     }
>
>     public K createKey() {
>         return (K) ReflectionUtils.newInstance(getKeyClass(), conf);
>     }
>
>     public V createValue() {
>         return (V) ReflectionUtils.newInstance(getKeyClass(), conf);
>     }
>
>     public synchronized boolean next(K key, V value) throws IOException {
>         if (!more) return false;
>         long pos = in.getPosition();
>
>         boolean remaining = in.next(key);
>         if (remaining) {
>             getCurrentValue(value);
>         }
>         if (pos >= end && in.syncSeen()) {
>             more = false;
>         } else {
>             more = remaining;
>         }
>         return more;
>     }
>
>     protected synchronized boolean next(K key) throws IOException {
>         if (!more) return false;
>         long pos = in.getPosition();
>         boolean remaining = in.next(key);
>         if (pos >= end && in.syncSeen()) {
>             more = false;
>         } else {
>             more = remaining;
>         }
>         return more;
>     }
>
>     protected synchronized void getCurrentValue(V value) throws IOException {
>         in.getCurrentValue(value);
>         //in.next(value);
>     }
>
>     /**
>      * Return the progress within the input split
>      *
>      * @return 0.0 to 1.0 of the input byte range
>      */
>     public float getProgress() throws IOException {
>         if (end == start) {
>             return 0.0f;
>         } else {
>             return Math.min(1.0f, (in.getPosition() - start) / (float) (end - start));
>         }
>     }
>
>     public synchronized long getPos() throws IOException {
>         return in.getPosition();
>     }
>
>     protected synchronized void seek(long pos) throws IOException {
>         in.seek(pos);
>     }
>
>     public synchronized void close() throws IOException {
>         in.close();
>     }
>
> }
>
> seems like:
>
>     protected synchronized void getCurrentValue(V value) throws IOException {
>         in.getCurrentValue(value);
>     }
>
> ^ Returns nulls
>
>     protected synchronized void getCurrentValue(V value) throws IOException {
>         in.next(value);
>     }
>
> ^ returns every other row.
>
> Do you have any idea what I am doing wrong? Will contrib it hopefully if I can get this going correctly.
>
> Thanks,
> Edward
>
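One possible reading of the two symptoms in the quoted question, offered as a sketch rather than a confirmed diagnosis: in.getCurrentValue(value) reads the record's stored value, which is presumably empty when all the data lives in the key, while in.next(value) reads the following record's key and so consumes a second record on every call. The self-contained model below does not use Hadoop at all. FakeReader, readCallingNextTwice, and readCopyingKey are hypothetical stand-ins invented for illustration, with StringBuilder playing the role of a reusable Writable.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyAsValueSketch {

    /**
     * Hypothetical stand-in for a sequence-file reader (NOT the Hadoop API):
     * each call to next() consumes one record and loads its key into the holder.
     */
    static final class FakeReader {
        private final String[] keys;
        private int pos = 0;

        FakeReader(String... keys) {
            this.keys = keys;
        }

        /** Loads the next record's key into holder; returns false at end of input. */
        boolean next(StringBuilder holder) {
            if (pos >= keys.length) {
                return false;
            }
            holder.setLength(0);
            holder.append(keys[pos++]);
            return true;
        }
    }

    /** Models the in.next(value) variant: the second call consumes another record. */
    static List<String> readCallingNextTwice(FakeReader in) {
        List<String> rows = new ArrayList<>();
        StringBuilder key = new StringBuilder();
        StringBuilder value = new StringBuilder();
        while (in.next(key)) {
            in.next(value);             // advances the reader a second time
            rows.add(value.toString()); // the consumer only looks at the value
        }
        return rows;
    }

    /** Models a possible fix: read the key once, then copy it into the value holder. */
    static List<String> readCopyingKey(FakeReader in) {
        List<String> rows = new ArrayList<>();
        StringBuilder key = new StringBuilder();
        StringBuilder value = new StringBuilder();
        while (in.next(key)) {
            value.setLength(0);
            value.append(key);          // copy instead of reading again
            rows.add(value.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(readCallingNextTwice(new FakeReader("a", "b", "c", "d"))); // prints [b, d]
        System.out.println(readCopyingKey(new FakeReader("a", "b", "c", "d")));       // prints [a, b, c, d]
    }
}
```

If this reading is right, the equivalent change in the real reader would be to call in.next(key) once per record and then copy the key's contents into the value object, for example with Hadoop's ReflectionUtils.copy(conf, key, value), assuming that helper is available in the Hadoop version in use.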