beam-user mailing list archives

From Alexey Romanenko <aromanenko....@gmail.com>
Subject Multiple file systems configuration
Date Tue, 20 Aug 2019 14:56:48 GMT
Hi all,

I’m looking for a working solution for cases where different file system configurations (HDFS, S3, GCS) are needed, or even required, in the same pipeline, and where the IO is based on Beam FileSystems (FileIO, TextIO, etc.).
For example (see the sketch after this list):
- reading data from one HDFS cluster and writing the results into another one, which requires a different configuration;
- reading objects from one S3 bucket and writing into another one, using different credentials and/or regions for that;
- or even a heterogeneous case, where we need to read data from HDFS and write the results into S3, or vice versa.
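
To make the heterogeneous case concrete, here is a minimal sketch with the Beam Java SDK (the cluster address, bucket name, and paths are made up). Both transforms resolve their file system from the URI scheme through the same FileSystems registry:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HdfsToS3 {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Both transforms go through the FileSystems registry: the file system
    // is picked by the URI scheme ("hdfs" vs "s3"). The paths below are
    // hypothetical. Afaik, each scheme gets a single configuration taken
    // from PipelineOptions, which is exactly the limitation described above.
    p.apply("ReadFromHdfs", TextIO.read().from("hdfs://cluster-a/input/*"))
        .apply("WriteToS3", TextIO.write().to("s3://output-bucket/results"));

    p.run().waitUntilFinish();
  }
}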

Usually, in other IOs, we can do this easily through specific methods on Read and Write, like “withConfiguration()”, “withCredentialsProvider()”, etc., but FileSystems-based IO can be configured only through PipelineOptions, afaik (see the sketch below). There was a thread about this a while ago [1] where Lukasz Cwik said that it’s feasible by using different schemes but, unfortunately, I haven’t managed to make it work on my side (neither for HDFS nor for S3).
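
For reference, a sketch of the PipelineOptions-based configuration I mean, assuming the hadoop-file-system and amazon-web-services modules are on the classpath (the cluster address and region values are made up):

import java.util.Collections;
import org.apache.beam.sdk.io.aws.options.S3Options;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class FileSystemsConfig {
  public static void main(String[] args) {
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

    // A single Hadoop configuration is registered for the "hdfs" scheme
    // (the cluster address is hypothetical); I see no obvious way to address
    // a second cluster with different settings in the same pipeline.
    Configuration hdfsConf = new Configuration();
    hdfsConf.set("fs.defaultFS", "hdfs://cluster-a");
    options.setHdfsConfiguration(Collections.singletonList(hdfsConf));

    // Likewise, one region (and one credentials provider) for the "s3"
    // scheme; the region value here is just an example.
    options.as(S3Options.class).setAwsRegion("eu-west-1");
  }
}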

So, any additional input or working solutions would be very welcome if someone has any. In the long term, I’d like to document this in detail since, I guess, this use case is in quite high demand.

[1] https://lists.apache.org/thread.html/bb5f98c4154cc72d097ce5b404ff0b3bcb52b7360b0834af7116883b@%3Cdev.beam.apache.org%3E


