gearpump-user mailing list archives

From Vincent Wang <fvunic...@gmail.com>
Subject Re: Practical example reading an hadoop hdfs file using gearpump
Date Thu, 19 Jan 2017 10:12:13 GMT
Hi John,

  If it's just for a proof of concept, I think that should be the easiest
way. For the limitations you mentioned:

  1. You can set the parallelism of *SeqFileStreamProducer* to one when
creating the processor.
  2. You can make a minor change in *SeqFileStreamProducer* to stop it from
repeatedly reading the same file (see the sketch after this list).
  3. Likewise, you can set the parallelism of *SeqFileStreamProcessor* to
one.
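
  For point 2, the change could look roughly like the sketch below. This is
only an illustration based on the read loop in the fsio example, not the
exact source; the names (reader, key, value, the "continue" message) are
assumptions from memory, so please check the example code:

  // Inside SeqFileStreamProducer: emit the next record and schedule
  // another read; at EOF, stop instead of reopening the input file
  // (the reopen is what duplicates the input).
  override def onNext(msg: Message): Unit = {
    if (reader.next(key, value)) {
      taskContext.output(Message(key + "++" + value))
      self ! Message("continue")   // keep reading
    } else {
      reader.close()               // end of file reached: stop here
    }
  }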

 So I think the DAG could be: *SeqFileStreamProducer -> Flow (parsing,
validating and transforming) -> SeqFileStreamProcessor*.
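
 Wired up with the low-level API, it could look like the sketch below. This
assumes the Processor/Graph style used by the fsio example; *TransformTask*
is a hypothetical task class you would write for the parsing, validating
and transforming step, and import paths may differ between Gearpump
versions:

  import org.apache.gearpump.cluster.UserConfig
  import org.apache.gearpump.partitioner.HashPartitioner
  import org.apache.gearpump.streaming.{Processor, StreamApplication}
  import org.apache.gearpump.util.Graph
  import org.apache.gearpump.util.Graph.Node

  // Parallelism one everywhere, per points 1 and 3 above.
  val produce   = Processor[SeqFileStreamProducer](1)
  val transform = Processor[TransformTask](1)   // hypothetical flow task
  val process   = Processor[SeqFileStreamProcessor](1)

  val partitioner = new HashPartitioner
  val app = StreamApplication("SeqFileTransform",
    Graph(produce ~ partitioner ~> transform ~ partitioner ~> process),
    UserConfig.empty)

 You would then submit `app` with the client context, the same way the
example's main does.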

Thanks,
Huafeng


John Caicedo <jcaicedo99@hotmail.com>于2017年1月19日周四 上午11:32写道:

> Thank you,
>
>
> here is my case,
>
>
> I need to read an HDFS file, do some transformations and validation on the
> byte streams, and finally generate another HDFS file as output, so an HDFS
> source and sink will help a lot, as will the flows for defining validation
> and transformation. The program should stop when it has reached the end of
> the file. I am exploring tools I can leverage to build my own tool for
> ingesting files into Hadoop in a data lake system. I have experience
> programming at the database level, but I am new to Scala and learn best by
> example.
>
>
> Reading the example, I found it will duplicate the input file constantly
> (limitation #2 below), but in my case I need it to stop, as I mentioned
> before.
>
>
> ===================================
>
>
> First of all, this example is a simple case and here are some limitations
> you should know:
>
>    1. The example only accepts one sequence file, not a directory, and
>    the output file format is also a sequence file.
>    2. The example will duplicate the input file constantly, so if the
>    example runs for a long time, the output files will be large.
>    3. Each SeqFileStreamProcessor will generate an output file.
>
> ==================================
>
>
> Also about
>
> org.apache.gearpump.streaming.examples.sol.SOL
>
>
> if I want to write something operational, do I need to use SOL?
>
>
> "SOL is a throughput test. It will create multiple layers, and then do
> random shuffling between these layers.
>
> SOLProducer -> SOLProcessor -> SOLProcessor -> ..."
>
>
> ./target/pack/bin/gear app -jar ./examples/target/$SCALA_VERSION_MAJOR/gearpump-examples-assembly-$VERSION.jar
> org.apache.gearpump.streaming.examples.sol.SOL -input $INPUT_FILE_PATH -output $OUTPUT_DIRECTORY
>
>
>
> Here is the DAG I want to build:
>
> HDFS -> Source -> Flow (parsing, validating and transforming) -> Sink -> HDFS
>
> Is it doable to adapt the available example with minor changes to achieve
> the DAG I need?
>
> Thanks
>
> John
>
>
> ------------------------------
> *From:* Karol Brejna <karolbrejna@apache.org>
> *Sent:* Wednesday, January 18, 2017 3:03 AM
> *To:* user@gearpump.incubator.apache.org
> *Subject:* Re: Practical example reading an hadoop hdfs file using gearpump
>
> As Huafeng mentioned, we have a simple example.
> There is also an HDFS Sink if you're interested:
>
> https://github.com/apache/incubator-gearpump/blob/master/external/hadoopfs/README.md
>
>
> Could you share your use case?
>
> Regards,
> Karol
>
> On Wed, Jan 18, 2017 at 5:29 AM, Vincent Wang <fvunicorn@gmail.com> wrote:
> > Hi John,
> >
> >   What's your use case?  We have a very simple example under
> > examples/streaming/fsio, in which each Source task will read from the same
> > sequence file on HDFS.
> >
> > Thanks,
> > Huafeng
> >
> > John Caicedo <jcaicedo99@hotmail.com>于2017年1月18日周三 上午11:26写道:
> >>
> >> Hi guys,
> >>
> >>
> >> I am new to the group and interested in knowing how to read HDFS
> >> files using Gearpump, so I can define transformation DAGs from them.
> >>
> >>
> >> Basically, I need a practical example that allows me to define a Source
> >> reading an HDFS file. Can someone provide a practical example or any
> >> guidance on how to define a source for reading a Hadoop HDFS file?
> >>
> >>
> >> Thanks
> >>
> >>
> >> John
> >>
> >>
> >>
> >
>
