From: Vincent Wang
Date: Thu, 19 Jan 2017 10:12:13 +0000
Subject: Re: Practical example reading an hadoop hdfs file using gearpump
To: user@gearpump.incubator.apache.org

Hi John,

If it's just for a proof of concept, I think that should be the easiest way. For the limitations you mentioned:

1. You can set the SeqFileStreamProducer's parallelism to one when creating the processor.
2. You can make a minor change in SeqFileStreamProducer to disable its repeatedly reading the same file.
3. Likewise, you can set the SeqFileStreamProcessor's parallelism to one.

So the DAG could be: SeqFileStreamProducer -> Flow (parsing, validating and transform) -> SeqFileStreamProcessor

Thanks,
Huafeng

John Caicedo wrote on Thursday, 19 January 2017 at 11:32:

> Thank you,
>
> Here is my case:
>
> I need to read an HDFS file, do some transformations and validation on
> the byte streams, and finally generate another HDFS file as output, so an
> HDFS source and sink will help a lot, as will the flows for defining
> validation and transformation. The program should stop when it has reached
> the end of the file. I am exploring tools I can leverage to build my own
> tool for ingesting files into Hadoop in a data-lake system. I have
> experience programming at the database level, but I am new to Scala and
> can learn it by example.
>
> Reading the example, I found it will duplicate the input file constantly
> (limitation #2 below), but in my case I need it to stop, as I mentioned
> before.
>
> ====================================
>
> First of all, this example is a simple case and here are some limitations
> you should know:
>
> 1. The example only accepts one sequence file, not a directory, and the
>    output file format is also a sequence file.
> 2. The example will duplicate the input file constantly, so if the
>    example runs for a long time, the output files will be large.
> 3. Each SeqFileStreamProcessor will generate an output file.
>
> ====================================
>
> Also, about
>
> org.apache.gearpump.streaming.examples.sol.SOL
>
> if I want to write something operational, do I need to have the SOL?
>
> "SOL is a throughput test. It will create multiple layers, and then do
> random shuffling between these layers.
>
> SOLProducer -> SOLProcessor -> SOLProcessor -> ..."
>
> ./target/pack/bin/gear app -jar ./examples/target/$SCALA_VERSION_MAJOR/gearpump-examples-assembly-$VERSION.jar org.apache.gearpump.streaming.examples.sol.SOL -input $INPUT_FILE_PATH -output $OUTPUT_DIRECTORY
>
> Here is the DAG I want to build:
>
> HDFS -> Source -> Flow (parsing, validating and transform) -> Sink -> HDFS
>
> Is it doable to adapt the available example with minor changes to achieve
> the DAG I need?
>
> Thanks,
> John
>
> ------------------------------
> From: Karol Brejna
> Sent: Wednesday, 18 January 2017, 03:03 a.m.
> To: user@gearpump.incubator.apache.org
> Subject: Re: Practical example reading an hadoop hdfs file using gearpump
>
> As Huafeng mentioned, we have a simple example.
> There is also an HDFS sink, if you'd be interested:
>
> https://github.com/apache/incubator-gearpump/blob/master/external/hadoopfs/README.md
>
> Could you share your use case?
>
> Regards,
> Karol
>
> On Wed, Jan 18, 2017 at 5:29 AM, Vincent Wang wrote:
>
>> Hi John,
>>
>> What's your use case? We have a very simple example under
>> examples/streaming/fsio, in which each Source task reads from the same
>> sequence file on HDFS.
>>
>> Thanks,
>> Huafeng
>>
>> John Caicedo wrote on Wednesday, 18 January 2017 at 11:26:
>>
>>> Hi guys,
>>>
>>> I am new to the group and I am interested in knowing how to read HDFS
>>> files using Gearpump, so I can define transformation DAGs from them.
>>>
>>> Basically I need a practical example that allows me to define a Source
>>> reading an HDFS file. Can someone provide a practical example or any
>>> guidance on how to define a source for reading a Hadoop HDFS file?
>>>
>>> Thanks,
>>> John
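[Editor's note] The pipeline shape discussed in this thread (source -> flow with parsing, validating and transform -> sink, reading the input once and stopping at end-of-file) can be sketched in plain Scala without any Gearpump dependency. This is an illustration of the data flow only, not Gearpump API code; the stage names parse, validate and transform and the comma-separated record format are hypothetical stand-ins.

```scala
import java.io.PrintWriter
import scala.io.Source

object ReadOncePipeline {
  // Hypothetical flow stages: split a record, reject records with empty
  // fields, and apply a simple transformation.
  def parse(line: String): Array[String] = line.split(",").map(_.trim)
  def validate(fields: Array[String]): Boolean = fields.forall(_.nonEmpty)
  def transform(fields: Array[String]): String =
    fields.map(_.toUpperCase).mkString("|")

  // Source -> Flow -> Sink: the input file is read exactly once and the
  // program stops at end-of-file, unlike the fsio example's repeated reads.
  def run(in: String, out: String): Unit = {
    val src = Source.fromFile(in)                         // source
    val records = try src.getLines().toList finally src.close()
    val output = records.map(parse).filter(validate).map(transform) // flow
    val writer = new PrintWriter(out)                     // sink
    try output.foreach(writer.println) finally writer.close()
  }
}
```

In a real Gearpump job each of the three stages would become a processor (with parallelism one, per the advice above), and the local-file source and sink would be replaced by their HDFS counterparts.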