beam-dev mailing list archives

From Robert Bradshaw <rober...@google.com>
Subject Re: [GSoC 2020 Proposal] BEAM-6807: Implement an Azure blobstore filesystem for Python SDK
Date Mon, 30 Mar 2020 16:58:02 GMT
On Fri, Mar 27, 2020 at 10:39 PM Badrul Chowdhury <
badrulchowdhury17@gmail.com> wrote:

> Udi Meiri suggested an alternative: use a custom scheme (azfs://) to
> differentiate Azure URIs.
>

+1, this sounds like the best solution.
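
As a rough illustration of why a dedicated scheme helps, here is a minimal
sketch of scheme-based lookup. It is not Beam's actual code: the registry
`FILESYSTEMS_BY_SCHEME`, the helper `pick_filesystem`, and the string labels
are hypothetical, and the pattern is only written in the spirit of the one
quoted below. With azfs://, the lookup stays a plain scheme match and does not
need to inspect the storage domain at all:

```python
import re

# Scheme pattern in the spirit of the one discussed in this thread
# (an assumption for illustration, not copied from Beam).
URI_SCHEME_PATTERN = re.compile(r'(?P<scheme>[a-zA-Z][-a-zA-Z0-9+.]*)://.*')

# Hypothetical registry: scheme -> label of the filesystem that handles it.
FILESYSTEMS_BY_SCHEME = {
    'gs': 'GCSFileSystem',
    's3': 'S3FileSystem',
    'hdfs': 'HadoopFileSystem',
    'azfs': 'AzureBlobFileSystem',  # hypothetical name for the proposed filesystem
}


def pick_filesystem(path):
    """Return the label of the filesystem registered for the path's scheme."""
    match = URI_SCHEME_PATTERN.match(path.strip())
    if match is None:
        # No scheme at all: treat the path as a local file.
        return 'LocalFileSystem'
    scheme = match.group('scheme')
    if scheme not in FILESYSTEMS_BY_SCHEME:
        raise ValueError('No filesystem registered for scheme %r' % scheme)
    return FILESYSTEMS_BY_SCHEME[scheme]
```

Under these assumptions, `pick_filesystem('azfs://account/container/blob')`
resolves the same way whether the account uses the default domain or a custom
one, because only the scheme is examined.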


> Although this will lead to more user "overhead", we can support both
> default and custom domain URIs this way.
>
> I have updated the proposal accordingly.
>
> Thanks,
> Badrul
>
> On Fri, Mar 27, 2020 at 2:21 PM Badrul Chowdhury <
> badrulchowdhury17@gmail.com> wrote:
>
>> Thanks for surfacing the issue. I have to admit, I didn't put much
>> thought into it at the time. I think we can get away with something as
>> simple as changing the regex pattern:
>>
>>   URI_SCHEMA_PATTERN = re.compile(
>>       '(?P<scheme>([a-zA-Z][-a-zA-Z0-9+.]*)://.*)|(core.windows.net)')
>>
>> [image: image.png]
>>
>> As you can see, the "scheme" group is not as clean as I would like it to
>> be: ideally, the regex should capture only "core.windows.net". But it's
>> a matter of tweaking the regex, which, given time, should be doable.
>>
>> A bigger concern, however, is not being able to read from custom domains.
>> A custom domain will not have "core.windows.net" in the URL, making it
>> difficult to detect the Azure filesystem reliably. We can support default
>> domain names for version 1 and add support for custom domains after further
>> discussions. What do you think?
>>
>> Thanks,
>> Badrul
>>
>> On Fri, Mar 27, 2020 at 11:19 AM Pablo Estrada <pabloem@google.com>
>> wrote:
>>
>>> I am a little concerned about the lack of scheme. The scheme is an
>>> important part of how filesystems work in Beam. If you look at [1],
>>> you'll see that we figure out the file system to use based on the scheme.
>>>
>>> This allows us to do ReadFromText('gs://my_gcs_bucket/my_file') or
>>> ReadFromText('s3://...') or ReadFromText('hdfs://...') without the user
>>> being concerned about where the files are stored.
>>>
>>> I have looked around a bit, and you're right that Azure blob does not
>>> seem to be relying on any form of scheme. We'll need to think about how to
>>> make this work...
>>>
>>> cc: +Chamikara Jayalath <chamikara@google.com>
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L66-L119
>>>
>>> On Fri, Mar 27, 2020 at 10:27 AM Badrul Chowdhury <
>>> badrulchowdhury17@gmail.com> wrote:
>>>
>>>> Hi Pablo,
>>>>
>>>> Thanks for reviewing the proposal. I have replied to your comment about
>>>> the return value of scheme() for Azure Blob Store, please let me know what
>>>> you think.
>>>>
>>>>
>>>> Thanks,
>>>> Badrul
>>>>
>>>> On Tue, Mar 24, 2020 at 1:59 PM Badrul Chowdhury <
>>>> badrulchowdhury17@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would love to hear your thoughts on my proposal for adding Python
>>>>> SDK support for Azure Blob Store I/O:
>>>>> https://docs.google.com/document/d/173e_gnDclwavqobiNjwxRlo9D1xjaZat98g6Yax0kGQ/edit?usp=sharing
>>>>>
>>>>> Stay safe!
>>>>>
>>>>> Thanks,
>>>>> Badrul
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Cheers,
>>>> Badrul
>>>>
>>>
>>
>> --
>>
>> Cheers,
>> Badrul
>>
>
>
> --
>
> Cheers,
> Badrul
>
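
For reference, the concern about the "scheme" group in the quoted regex can be
reproduced with a short sketch. It assumes the pattern is applied with
`.match()`; the helper `scheme_group` and the example URLs are hypothetical.
Because the named group wraps the whole `scheme://.*` alternative, it captures
the entire URI instead of just the scheme, and a custom-domain URL looks like
any other https URL:

```python
import re

# Pattern as proposed in the quoted message, with the mail-archive link
# markup removed.
PROPOSED_PATTERN = re.compile(
    '(?P<scheme>([a-zA-Z][-a-zA-Z0-9+.]*)://.*)|(core.windows.net)')


def scheme_group(path):
    """Return what the named 'scheme' group captures, or None if no match."""
    match = PROPOSED_PATTERN.match(path)
    return match.group('scheme') if match else None


# Hypothetical example URLs: one default-domain and one custom-domain.
default_domain = 'https://account.blob.core.windows.net/container/blob'
custom_domain = 'https://data.example.com/container/blob'

# Both calls return the full URL, so the group cannot tell them apart.
default_result = scheme_group(default_domain)
custom_result = scheme_group(custom_domain)
```

A distinct scheme such as azfs:// avoids this ambiguity, since the filesystem
is identified before the domain is ever looked at.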
