spark-user mailing list archives

From Michal Čizmazia <mici...@gmail.com>
Subject Re: WAL on S3
Date Wed, 23 Sep 2015 00:15:07 GMT
My understanding of pluggable WAL was that it eliminates the need for
having a Hadoop-compatible file system [1].

What is the use of pluggable WAL when it can only be used together with
checkpointing, which still requires a Hadoop-compatible file system?

[1]: https://issues.apache.org/jira/browse/SPARK-7056
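
For context, the setup in question could be sketched roughly as below. This is only an illustration: the three property names are the Spark Streaming WAL settings as I understand them, while the class name `com.example.CustomWriteAheadLog` and the checkpoint path are hypothetical placeholders. The point is that the WAL classes can be swapped, but a checkpoint directory on a Hadoop-compatible file system still has to be supplied.

```python
# Sketch only: the class name and checkpoint path are hypothetical placeholders.
wal_conf = {
    # Turn on the receiver write-ahead log.
    "spark.streaming.receiver.writeAheadLog.enable": "true",
    # Pluggable WAL implementations (SPARK-7056); the class is made up here.
    "spark.streaming.receiver.writeAheadLog.class": "com.example.CustomWriteAheadLog",
    "spark.streaming.driver.writeAheadLog.class": "com.example.CustomWriteAheadLog",
}

# Checkpointing is not pluggable, so this still has to point at a
# Hadoop-compatible file system. In application code it would be passed
# to StreamingContext.checkpoint(...).
checkpoint_dir = "hdfs://namenode:8020/spark/checkpoints"


def to_submit_args(conf):
    """Render the settings as spark-submit style '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(wal_conf)))
```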



On 22 September 2015 at 19:57, Tathagata Das <tathagata.das1565@gmail.com>
wrote:

> 1. Currently, the WAL can be used only with checkpointing turned on,
> because it does not make sense to recover from the WAL if there is no
> checkpoint information to recover from.
>
> 2. Since the current implementation saves the WAL in the checkpoint
> directory, they share the same fate -- if the checkpoint directory is
> deleted, then both checkpoint info and WAL info are deleted.
>
> 3. Checkpointing is currently not pluggable. Why do you want that?
>
>
>
> On Tue, Sep 22, 2015 at 4:53 PM, Michal Čizmazia <micizma@gmail.com>
> wrote:
>
>> I am trying to use pluggable WAL, but it can be used only with
>> checkpointing turned on. Thus I still need to have a Hadoop-compatible
>> file system.
>>
>> Is there something like pluggable checkpointing?
>>
>> Or can WAL be used without checkpointing? What happens when WAL is
>> available but the checkpoint directory is lost?
>>
>> Thanks!
>>
>>
>> On 18 September 2015 at 05:47, Tathagata Das <tdas@databricks.com> wrote:
>>
>>> I don't think it would work with multipart upload either. The file is not
>>> visible until the multipart upload is explicitly closed. So even if each
>>> write is a part upload, none of the parts are visible until the multipart
>>> upload is closed.
>>>
>>> TD
>>>
>>> On Fri, Sep 18, 2015 at 1:55 AM, Steve Loughran <stevel@hortonworks.com>
>>> wrote:
>>>
>>>>
>>>> > On 17 Sep 2015, at 21:40, Tathagata Das <tdas@databricks.com> wrote:
>>>> >
>>>> > Actually, the current WAL implementation (as of Spark 1.5) does not
>>>> > work with S3 because S3 does not support flushing. Basically, the current
>>>> > implementation assumes that after write + flush, the data is immediately
>>>> > durable, and readable if the system crashes without closing the WAL file.
>>>> > This does not work with S3, as data is durable if and only if the S3 file
>>>> > output stream is cleanly closed.
>>>> >
>>>>
>>>>
>>>> more precisely, unless you turn multipart uploads on, the S3n/s3a
>>>> clients Spark uses *don't even upload anything to s3*.
>>>>
>>>> It's not a filesystem, and you have to bear that in mind.
>>>>
>>>> Amazon's own s3 client used in EMR behaves differently; it may be
>>>> usable as a destination (I haven't tested)
>>>>
>>>>
>>>
>>
>
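
The flush-versus-close behaviour described in the quoted messages can be illustrated with a toy model. This is a minimal sketch, not Spark or Hadoop code: `FlushableStore`, `CloseOnlyStore` and `records_after_crash` are hypothetical names invented for the example. It only shows why a log that relies on write + flush loses everything on a store that publishes data on close.

```python
# Toy model only: these classes are hypothetical, not Spark or Hadoop APIs.

class FlushableStore:
    """HDFS-like store: flushed records are readable even if the writer crashes."""

    def __init__(self):
        self.visible = {}  # path -> records a reader can see

    def open(self, path):
        return _Writer(self, path, publish_on_flush=True)


class CloseOnlyStore(FlushableStore):
    """S3-like store: nothing is published until close() completes."""

    def open(self, path):
        return _Writer(self, path, publish_on_flush=False)


class _Writer:
    def __init__(self, store, path, publish_on_flush):
        self.store = store
        self.path = path
        self.publish_on_flush = publish_on_flush
        self.buffer = []  # records held locally by the client

    def write(self, record):
        self.buffer.append(record)

    def flush(self):
        # On the close-only store, flush() is effectively a no-op.
        if self.publish_on_flush:
            self._publish()

    def close(self):
        self._publish()

    def _publish(self):
        self.store.visible[self.path] = list(self.buffer)


def records_after_crash(store, records):
    """Write and flush each record, then 'crash' before close() is called."""
    writer = store.open("wal/log-0")
    for record in records:
        writer.write(record)
        writer.flush()
    # No close(): the process is assumed to die here.
    return store.visible.get("wal/log-0", [])


print(records_after_crash(FlushableStore(), ["a", "b"]))  # ['a', 'b']
print(records_after_crash(CloseOnlyStore(), ["a", "b"]))  # []
```

In this model the close-only store recovers nothing after the simulated crash, which is the failure mode the WAL's write + flush assumption runs into on S3.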
