beam-dev mailing list archives

From Shannon Duncan <joseph.dun...@liveramp.com>
Subject Re: [Python] Read Hadoop Sequence File?
Date Tue, 02 Jul 2019 12:20:20 GMT
It would be great if it were available for both Java and Python.

On Tue, Jul 2, 2019, 3:57 AM Ismaël Mejía <iemejia@gmail.com> wrote:

> (Adding dev@ and Solomon Duskis to the discussion)
>
> I was not aware of these, thanks for sharing, David. It would definitely
> be a great addition if we could have those donated as an extension on
> the Beam side. We could even evolve them in the future to be more
> FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>
>
>
> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <dmvk@apache.org> wrote:
> >
> > Hi, you can use SequenceFileSink and SequenceFileSource from the
> > Bigtable client. They work nicely with FileIO.
> >
> >
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
> >
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
> >
> > It would be really cool to move these into Beam, but it's up to
> > Googlers to decide whether they want to donate them.
> >
> > D.
> >
> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <joseph.duncan@liveramp.com> wrote:
> >>
> >> It's not outside the realm of possibility. For now I've created an
> >> intermediary step: a Hadoop job that converts from sequence file to
> >> text file.
> >>
> >> Looking into better options.
> >>
> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamikara@google.com> wrote:
> >>>
> >>> The Java SDK has a HadoopInputFormatIO, with which you should be able
> >>> to read Sequence files:
> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
> >>> I don't think there's a direct alternative to this for Python.
> >>>
> >>> Is it possible to write to a well-known format such as Avro instead of
> >>> a Hadoop-specific format? That would allow you to read from both
> >>> Dataproc/Hadoop and the Beam Python SDK.
> >>>
> >>> Thanks,
> >>> Cham
> >>>
> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <joseph.duncan@liveramp.com> wrote:
> >>>>
> >>>> A missing source/sink is a pretty big hole when looking at
> >>>> transitioning from Dataproc to Dataflow, using GCS as the storage
> >>>> buffer instead of a traditional HDFS.
> >>>>
> >>>> From what I've been able to tell from the source code and
> >>>> documentation, Java is able to do this but Python is not?
> >>>>
> >>>> Thanks,
> >>>> Shannon
> >>>>
> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <chamikara@google.com> wrote:
> >>>>>
> >>>>> I don't think we have a source/sink for reading Hadoop sequence
> >>>>> files. Your best bet currently will probably be to use the
> >>>>> FileSystem abstraction to open the file from a ParDo and read
> >>>>> directly from it using a library that can read sequence files.
> >>>>>
> >>>>> Thanks,
> >>>>> Cham
> >>>>>
> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <joseph.duncan@liveramp.com> wrote:
> >>>>>>
> >>>>>> I want to read a Hadoop Sequence/Map file stored on Google Cloud
> >>>>>> Storage, at a path like "gs://bucket/link/SequenceFile-*", using
> >>>>>> the Python SDK.
> >>>>>>
> >>>>>> I cannot locate any good adapters for this, and the one Hadoop
> >>>>>> Filesystem reader seems to read only from an "hdfs://" URL.
> >>>>>>
> >>>>>> I want to use Dataflow and GCS exclusively to start mixing Beam
> >>>>>> pipelines in with our current Hadoop pipelines.
> >>>>>>
> >>>>>> Is this a feature that is supported or will be supported in the
> >>>>>> future?
> >>>>>> Does anyone have any good suggestions for this that are performant?
> >>>>>>
> >>>>>> I'd also like to be able to write back out to a SequenceFile if
> >>>>>> possible.
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
>
