Date: Tue, 22 Jul 2014 18:16:31 -0400
Subject: Re: Skippin those gost darn 0 byte diles
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@hadoop.apache.org

Here is the stack trace...

Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:267)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
        at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
        at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
        ... 15 more

On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo wrote:
> Currently using:
>
>     <dependency>
>         <groupId>org.apache.hadoop</groupId>
>         <artifactId>hadoop-hdfs</artifactId>
>         <version>2.3.0</version>
>     </dependency>
>
> I have this piece of code:
>
> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>     CompressionType.BLOCK, codec);
>
> Then I have a piece of code like this...
>
> public static final long SYNC_EVERY_LINES = 1000;
> if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0){
>     meta.getWriter().sync();
> }
>
> And I commonly see:
>
> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls /user/beacon/2014072117
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
>
> Found 12 items
> -rw-r--r--   3 service-igor supergroup   1065682 2014-07-21 17:50 /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
> -rw-r--r--   3 service-igor supergroup   1029041 2014-07-21 17:40 /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
> -rw-r--r--   3 service-igor supergroup   1002096 2014-07-21 17:10 /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
> -rw-r--r--   3 service-igor supergroup   1028450 2014-07-21 17:30 /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
> -rw-r--r--   3 service-igor supergroup         0 2014-07-21 17:50 /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
> -rw-r--r--   3 service-igor supergroup   1084873 2014-07-21 17:30 /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
> -rw-r--r--   3 service-igor supergroup   1043108 2014-07-21 17:20 /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
> -rw-r--r--   3 service-igor supergroup    986866 2014-07-21 17:10 /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
> -rw-r--r--   3 service-igor supergroup         0 2014-07-21 17:50 /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
> -rw-r--r--   3 service-igor supergroup   1040931 2014-07-21 17:50 /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
> -rw-r--r--   3 service-igor supergroup   1012137 2014-07-21 17:40 /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
> -rw-r--r--   3 service-igor supergroup   1028467 2014-07-21 17:20 /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>
> Sometimes, even though they show as 0 bytes, you can read data from them.
> Sometimes it blows up with a stack trace I have lost.
>
> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux wrote:
>
>> I looked at the source out of curiosity. For the latest version (2.4), the
>> header is flushed during writer creation; of course, the key/value classes
>> are provided then. By 0 bytes, do you really mean even without the header?
>> Or 0 bytes of payload?
>>
>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux wrote:
>>
>>> The header is expected to contain the full names of the key class and
>>> value class, so if those are only known at the first record (?) the file
>>> indeed cannot respect its own format.
>>>
>>> I haven't tried it, but LazyOutputFormat should solve your problem.
>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>
>>> Regards,
>>> Bertrand Dechoux
>>>
>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo wrote:
>>>
>>>> I have two processes: one writes sequence files directly to HDFS, and
>>>> the other is a Hive table that reads those files.
>>>>
>>>> All works well, except that I am only flushing the files periodically.
>>>> The SequenceFile input format gets angry when it encounters 0-byte seq
>>>> files.
>>>>
>>>> I was considering a flush and sync on the first record write. I was also
>>>> thinking I should be able to hack the sequence file input format to skip
>>>> 0-byte files rather than throw the exception from readFully() that it
>>>> sometimes does.
>>>>
>>>> Anyone ever tackled this?
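The trace above bottoms out in DataInputStream.readByte, which is the first call WritableUtils.readVLong makes; on a zero-byte (or truncated) stream that first read immediately hits EOF. A minimal sketch in plain Java, with no Hadoop dependency, reproduces exactly that failure mode:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class EofDemo {
    // Returns true when reading a single byte from the given buffer throws
    // EOFException -- the same first step WritableUtils.readVLong performs.
    static boolean throwsEofOnRead(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readByte();
            return false;
        } catch (EOFException e) {
            return true;  // stream was empty or truncated
        } catch (IOException e) {
            return false; // not reachable for an in-memory stream
        }
    }

    public static void main(String[] args) {
        // A zero-byte "file": the very first readByte fails.
        System.out.println(throwsEofOnRead(new byte[0]));   // true
        // Any payload at all lets the first read succeed.
        System.out.println(throwsEofOnRead(new byte[]{1})); // false
    }
}
```

This is why the reader cannot even get as far as a format check: with no header bytes flushed, there is nothing to parse before EOF.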
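The skip-0-byte-files idea from the thread boils down to filtering inputs by size before handing them to the record reader (in a real job this would be a org.apache.hadoop.fs.PathFilter registered via FileInputFormat.setInputPathFilter, checking FileStatus.getLen()). A minimal sketch of the filtering rule using plain java.nio instead of the Hadoop API, with made-up file names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipEmptyFiles {
    // Keep only paths whose on-disk size is greater than zero, mirroring
    // a PathFilter that rejects 0-byte sequence files before the reader
    // ever opens them.
    static List<Path> nonEmpty(List<Path> paths) throws IOException {
        List<Path> out = new ArrayList<>();
        for (Path p : paths) {
            if (Files.size(p) > 0) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-ins for the hourly beacon files.
        Path dir = Files.createTempDirectory("beacon");
        Path full = Files.write(dir.resolve("full.seq"), new byte[]{1, 2, 3});
        Path empty = Files.createFile(dir.resolve("empty.seq"));
        // Only the non-empty file survives the filter.
        System.out.println(nonEmpty(Arrays.asList(full, empty)));
    }
}
```

Note the caveat raised in the thread: a file reported as 0 bytes by `-ls` may still hold readable data until the writer closes or syncs, so a length-based filter trades the EOFException for possibly skipping a file that is mid-write.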