beam-user mailing list archives

From Dmitry Demeshchuk <dmi...@postmates.com>
Subject Re: Docs/guidelines on writing filesystem sources and sinks
Date Fri, 07 Jul 2017 01:56:46 GMT
Hi Stephen,

Thanks for the detailed reply!

Some comments inline.

On Thu, Jul 6, 2017 at 5:21 PM, Stephen Sisk <sisk@google.com> wrote:

> Hi Dmitry,
>
> I'm excited to hear that you'd like to do this work. If you haven't
> already, I'd first suggest that you open a JIRA issue to make sure other
> folks know you're working on this.
>

Will do tomorrow, thanks for the suggestion. The code is currently not a
part of Beam, but I'd be more than happy to push it upstream.


>
> I was involved in working on the recent java HDFS file system
> implementation, so I'll try and share what I know - I suspect knowledge
> about this is scattered around a bit, so hopefully others will chime in as
> well.
>
> > 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
> I don't know of any guides for writing IOs. I believe folks should be
> helpful here on the mailing list for specific questions, but there aren't
> that many that are experts in file system implementations. It's not
> expected to be a frequent task, so no one has tried to document it (it also
> means your contribution will have a wide impact!) If you wanted to write up
> your notes from the process, it'd likely be highly helpful to others.
>
> https://issues.apache.org/jira/browse/BEAM-2005 documents the work that
> we did to add the java Hadoop FileSystem implementation, so that might be a
> good guide - it has links to PRs, you can find out about design questions
> that came up there, etc. The Hadoop FileSystem is relatively new, so
> reviewing its commit history may be very informative.
>

I'll check it out, thanks! The main reason I'm looking for more concrete
guidelines is that a lot of the internal filesystem-related mechanisms are
not obvious at all: for example, the fact that a temporary file is created
first and then moved to its final location. Some of the methods in my
implementation are suboptimal or are no-ops because they didn't seem
immediately useful, but given how complex the higher-level usage of
FileSystem subclasses is, I'm likely making some mistakes right now.
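
To make the temp-file-then-move pattern concrete, here is a minimal sketch
against the local filesystem only. It is not Beam's code: write_atomically
and the ".temp-" prefix are hypothetical names chosen for illustration, and
the sketch assumes the temporary file can live in the same directory as the
final one.

```python
import os
import tempfile


def write_atomically(final_path, chunks):
    """Write chunks to a temp file next to final_path, then move it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Assumption: keeping the temp file in the destination directory makes
    # the final move a same-volume rename rather than a cross-device copy.
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".temp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        # Only a fully written file ever appears under final_path.
        os.replace(temp_path, final_path)
    except BaseException:
        # Remove the partial temp file if writing or moving failed.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
    return final_path
```

On an object store such as S3, which has no native rename, the "move" step
would typically have to be a copy followed by a delete, which is one reason
the rename-related methods deserve attention in a new filesystem.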


>
> > 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.
>
> I don't know of any. If you put together a test plan, we'd be happy to
> discuss it. The tests for the java Hadoop FileSystem represent the current
> thinking, but could likely be expanded on.
>

I can try thinking of something but, on second thought, different
filesystems have different characteristics and guarantees, so the same
tests that pass for HDFS may not necessarily pass for S3 (due to its
eventual consistency), and I'm sure Google Storage and the local filesystem
will also have their own quirks. My hope was that some kind of plan
already existed, but it looks like that's not the case, and now I can see
why.

I'll try to reflect on this idea and see if I can pull together a doc with
at least some basic acceptance tests and ways to apply them to the existing
filesystems. Will start a new thread if/when I end up doing that.
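
As a rough idea of what such basic acceptance tests could look like, here is
a sketch of a reusable unittest mixin. Everything in it is an assumption made
for illustration: LocalFS and its five methods are a hypothetical minimal
interface, not Beam's FileSystem API, and the three checks are only the most
obvious ones.

```python
import os
import shutil
import tempfile
import unittest


class LocalFS:
    """Hypothetical minimal filesystem, used only to exercise the checks."""

    def write(self, path, data):
        with open(path, "wb") as handle:
            handle.write(data)

    def read(self, path):
        with open(path, "rb") as handle:
            return handle.read()

    def exists(self, path):
        return os.path.exists(path)

    def rename(self, source, destination):
        os.replace(source, destination)

    def delete(self, path):
        os.remove(path)


class FileSystemAcceptanceMixin:
    """Shared checks; subclasses provide self.fs and self.path(name)."""

    def test_write_then_read(self):
        path = self.path("a.txt")
        self.fs.write(path, b"hello")
        self.assertEqual(self.fs.read(path), b"hello")

    def test_rename_moves_content(self):
        source, destination = self.path("src"), self.path("dst")
        self.fs.write(source, b"data")
        self.fs.rename(source, destination)
        self.assertFalse(self.fs.exists(source))
        self.assertEqual(self.fs.read(destination), b"data")

    def test_delete_removes_file(self):
        path = self.path("gone.txt")
        self.fs.write(path, b"x")
        self.fs.delete(path)
        self.assertFalse(self.fs.exists(path))


class LocalFSTest(FileSystemAcceptanceMixin, unittest.TestCase):
    """Binds the shared checks to the local filesystem."""

    def setUp(self):
        self.root = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.root)
        self.fs = LocalFS()

    def path(self, name):
        return os.path.join(self.root, name)
```

Each filesystem would get its own small subclass like LocalFSTest. A store
with weaker guarantees (for example, eventual consistency) might need to
override or skip individual checks rather than share them unchanged.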


>
> > 3. Are there any established ideas of how to pass AWS credentials to
> Beam for making the S3 filesystem actually work?
>
> Looks like you already found the past discussions of this on the mailing
> list, that was what I would refer you to.
>
> > I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem,
> We had a similar problem with the hadoop configuration object - inside of
> the hadoop filesystem registrar, we read the pipeline options to see if
> there is configuration info there, as well as some default hadoop
> configuration file locations. See
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
>

Thanks, that's actually the ideal approach for me! I wasn't sure if
pipeline options were accessible from inside transformations, but looks
like they are. This makes a really good case for supporting the entire AWS
stack conveniently by providing some extra pipeline option, like
"aws_config" or something.
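
A rough sketch of the shape this could take, using plain argparse rather than
Beam's own PipelineOptions class so that it runs on its own. The
"--aws_config" flag, its JSON format, the "region" and "profile" keys, and
the S3FileSystemSketch class are all hypothetical; nothing here is an
existing Beam option or API.

```python
import argparse
import json


def parse_aws_options(argv):
    """Pull the hypothetical --aws_config flag out of pipeline-style arguments."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--aws_config",
        default=None,
        help="JSON object with AWS settings (hypothetical flag).",
    )
    # Flags meant for other components (runner, project, ...) are ignored.
    known, _unknown = parser.parse_known_args(argv)
    return json.loads(known.aws_config) if known.aws_config else {}


class S3FileSystemSketch:
    """Stand-in for a filesystem that takes its settings from global options."""

    def __init__(self, argv):
        config = parse_aws_options(argv)
        # Missing keys stay None so the underlying client can fall back to
        # its usual lookup, such as environment variables.
        self.region = config.get("region")
        self.profile = config.get("profile")
```

The point of the design is that ReadFromText('s3://...') keeps its existing
signature: the filesystem looks up its own settings from the global options
instead of receiving them as an extra constructor argument.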


>
> The python folks will have to comment if that's the type of solution they
> want you to use though.
>
> I hope this helps!
>
> Stephen
>
>
> On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <dmitry@postmates.com>
> wrote:
>
>> I also stumbled upon a problem that I can't really pass additional
>> configuration to a filesystem, e.g.
>>
>> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
>> aws_config=AWSConfig())
>>
>> because the ReadFromText class relies on PTransform's constructor, which
>> has a pre-defined set of arguments.
>>
>> This is probably becoming a cross-topic for the dev list (have I added it
>> in the right way?)
>>
>> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <dmitry@postmates.com>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'm working on an S3 filesystem for the Python SDK, which already works
>>> in case of a happy path for both reading and writing, but I feel like there
>>> are quite a few edge cases that I'm likely missing.
>>>
>>> So far, my approach has been: "look at the generic FileSystem
>>> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
>>> to copy their approach as much as possible, at least for getting to the
>>> proof of concept".
>>>
>>> That said, I'd like to know a few things:
>>>
>>> 1. Are there any official or non-official guidelines or docs on writing
>>> filesystems? Even Java-specific ones may be really useful.
>>>
>>> 2. Are there any existing generic test suites that every filesystem is
>>> supposed to pass? Again, even if they exist only in Java world, I'd still
>>> be down for trying to adopt them in Python SDK too.
>>>
>>> 3. Are there any established ideas of how to pass AWS credentials to
>>> Beam for making the S3 filesystem actually work? I currently rely on the
>>> existing environment variables, which boto just picks up, but sounds like
>>> setting them up in runners like Dataflow or Spark would be troublesome.
>>> I've seen this discussion a couple times in the list, but couldn't tell if
>>> any closure was found. My personal preference would be having AWS settings
>>> passed in some global context (pipeline options, perhaps?), but there may
>>> be exceptions to that (say, people want to use different credentials for
>>> different AWS operations).
>>>
>>> Thanks!
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>>
>>
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>


-- 
Best regards,
Dmitry Demeshchuk.
