gearpump-user mailing list archives

From John Caicedo <>
Subject Re: Practical example reading an hadoop hdfs file using gearpump
Date Thu, 19 Jan 2017 03:32:27 GMT
Thank you,

here is my case:

I need to read an HDFS file, apply some transformations and validation to the byte stream,
and finally generate another HDFS file as output. A source and a sink for HDFS would help a lot,
as would flows for defining the validation and transformation. The program should stop
when it reaches the end of the file. I am exploring tools I can leverage to build my own tool
for ingesting files into Hadoop in a data lake system. I have experience programming at the database
level, but I am new to Scala and can learn it by example.

From the limitations quoted below (item 2), the example duplicates the input file constantly,
but in my case I need it to stop, as I mentioned above.


First of all, this example is a simple case and here are some limitations you should know:

  1.  The example only accepts one sequence file, not a directory, and the output file format
      is also sequence file.

  2.  The example will duplicate the input file constantly, so if the example runs for a long
      time, the output files will be large.

  3.  Each SeqFileStreamProcessor will generate an output file.


Also, about SOL: if I want to write something operational, do I need to use SOL?

"SOL is a throughput test. It will create multiple layers, and then do random shuffling between
these layers.

SOLProducer -> SOLProcessor -> SOLProcessor -> ..."

./target/pack/bin/gear app -jar ./examples/target/$SCALA_VERSION_MAJOR/gearpump-examples-assembly-$VERSION.jar \
  org.apache.gearpump.streaming.examples.sol.SOL -input $INPUT_FILE_PATH -output $OUTPUT_DIRECTORY

Here is the DAG I want to build:

HDFS -> Source -> Flow (parsing, validating and transform) -> Sink  -> HDFS

Is it feasible to adapt the available example, with minor changes, to achieve the DAG I need?
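
A rough sketch of the bounded shape described above (source -> flow -> sink, stopping at end of input), in plain Java with only the standard library. It does not use Gearpump's Source or Sink API. The class `BoundedPipeline`, the methods `run` and `parseValidateTransform`, and the two-field comma-separated record format are all hypothetical, chosen only for illustration. In-memory streams stand in for HDFS; since Hadoop's `FileSystem.open` and `FileSystem.create` return `java.io` stream subclasses, the same `run` method could in principle be handed HDFS streams, though that wiring is an assumption and is not shown.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;

public class BoundedPipeline {

    // Hypothetical "flow" step: parse, validate and transform one record.
    // Assumed record format: exactly two comma-separated fields.
    // Returns Optional.empty() for records that fail validation.
    static Optional<String> parseValidateTransform(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        String[] fields = trimmed.split(",");
        if (fields.length != 2) {
            return Optional.empty();
        }
        String name = fields[0].trim().toUpperCase(Locale.ROOT);
        return Optional.of(name + "," + fields[1].trim());
    }

    // Source -> flow -> sink. Reads until end of input, then closes both
    // streams and returns the number of records written.
    static long run(InputStream source, OutputStream sink) throws IOException {
        long written = 0;
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(source, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(sink, StandardCharsets.UTF_8))) {
            String line;
            // readLine() returns null at end of input, which ends the job.
            while ((line = reader.readLine()) != null) {
                Optional<String> out = parseValidateTransform(line);
                if (out.isPresent()) {
                    writer.write(out.get());
                    writer.write('\n');
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the HDFS input and output files.
        byte[] input = "alice,1\nbad-record\nbob,2\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        long written = run(new ByteArrayInputStream(input), output);
        System.out.println(written + " records written");
        System.out.print(output.toString("UTF-8"));
    }
}
```

With the sample input in `main`, the invalid line `bad-record` is dropped and two records are written (`ALICE,1` and `BOB,2`). The loop ends when the input is exhausted, which is the stop-at-end-of-file behaviour asked for above; the fsio example's constant re-reading would need an equivalent end condition.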



From: Karol Brejna <>
Sent: Wednesday, January 18, 2017 03:03 a.m.
Subject: Re: Practical example reading an hadoop hdfs file using gearpump

As Huafeng mentioned, we have a simple example.
There is also an HDFS Sink if you'd be interested:

incubator-gearpump - Mirror of Apache Gearpump (Incubating)

Could you share your use case?


On Wed, Jan 18, 2017 at 5:29 AM, Vincent Wang <> wrote:
> Hi John,
>   What's your use case? We have a very simple example under
> examples/streaming/fsio, in which each Source task reads from the same
> sequence file on HDFS.
> Thanks,
> Huafeng
> On Wed, Jan 18, 2017 at 11:26 AM, John Caicedo <> wrote:
>> Hi guys,
>> I am new to the group and I am interested in knowing how to read HDFS
>> files using Gearpump, so I can define transformation DAGs from them.
>> Basically, I need a practical example that allows me to define a Source
>> reading an HDFS file. Can someone provide a practical example or any
>> guidance on how to define a source for reading a Hadoop HDFS file?
>> Thanks
>> John
