accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Accumulo on Google Cloud Storage
Date Fri, 22 Jun 2018 23:33:43 GMT
Unfortunately, that feature wasn't added until 2.0, which hasn't yet been
released, but I'm hoping it will be later this year.
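For anyone who wants to experiment with the 2.0 snapshot builds, the rough shape of the configuration would be something like the accumulo.properties fragment below. Treat this as a sketch only: the property names are taken from the 2.0 development branch and could still change before release, and the volume URIs are placeholders.

```properties
# Sketch (2.0-SNAPSHOT): keep table files on GCS but route write-ahead
# logs to a plain HDFS volume via a scoped volume chooser.
instance.volumes=hdfs://namenode:8020/accumulo,gs://my-bucket/accumulo
general.volume.chooser=org.apache.accumulo.server.fs.PreferredVolumeChooser
# Table files prefer the GCS volume
general.custom.volume.preferred.default=gs://my-bucket/accumulo
# WALs prefer real HDFS, which supports the append/hflush write pattern
general.custom.volume.preferred.logger=hdfs://namenode:8020/accumulo
```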

However, I'm not convinced this is a write pattern issue. I commented on
https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103#issuecomment-399608543
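
On the "bigger writes less frequently" question below: one server-side knob that might be worth experimenting with is tserver.mutation.queue.max, which controls how much mutation data a tablet server buffers before each WAL write and sync. I haven't verified that this helps on GCS or ADLS, so treat the values below as a starting point rather than a recommendation:

```properties
# Untested sketch: make WAL writes bigger and less frequent.
# Buffer more mutation data per WAL write/sync cycle.
tserver.mutation.queue.max=4M
# Larger WALs mean fewer file create/close cycles against the object store.
tserver.walog.max.size=2G
```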

On Fri, Jun 22, 2018 at 1:50 PM Stephen Meyles <smeyles@gmail.com> wrote:

> Knowing that HBase has been run successfully on ADLS, I went looking there
> (as they have the same WAL write pattern). This is informative:
>
>
> https://www.cloudera.com/documentation/enterprise/5-12-x/topics/admin_using_adls_storage_with_hbase.html
>
> which suggests a need to split the WALs off on HDFS proper versus ADLS (or
> presumably GCS) barring changes in the underlying semantics of each. AFAICT
> you can't currently configure Accumulo to send WAL logs to a separate
> cluster - is this correct?
>
> S.
>
>
> On Fri, Jun 22, 2018 at 9:07 AM, Stephen Meyles <smeyles@gmail.com> wrote:
>
>> > Did you try to adjust any Accumulo properties to do bigger writes less
>> frequently or something like that?
>>
>> We're using BatchWriters and sending reasonably large batches of
>> Mutations. Given the stack traces in both our cases are related to WAL
>> writes, it seems like batch size would be the only tweak available here
>> (though, without reading the code carefully, it's not even clear to me that
>> it is impactful), but if others have suggestions I'd be happy to try.
>>
>> Given we have this working well and stably in other clusters atop
>> traditional HDFS, I'm currently pursuing this further with Microsoft to
>> understand the variance on ADLS. Depending on what emerges from that, I may
>> circle back with more details and a bug report, and start digging more
>> deeply into the relevant code in Accumulo.
>>
>> S.
>>
>>
>> On Fri, Jun 22, 2018 at 6:09 AM, Maxim Kolchin <kolchinmax@gmail.com>
>> wrote:
>>
>>> > If somebody is interested in using Accumulo on GCS, I'd like to
>>> encourage them to submit any bugs they encounter, and any patches (if they
>>> are able) which resolve those bugs.
>>>
>>> I'd like to contribute a fix, but I don't know where to start. We tried
>>> to get help from Google Support about [1] over email, but they just say
>>> that GCS doesn't support such a write pattern. In the end, we can only
>>> guess at how to adjust Accumulo's behaviour to minimise broken
>>> connections to GCS.
>>>
>>> BTW, although we observe this exception, the tablet server doesn't fail,
>>> which means that after some retries it is able to write WALs to GCS.
>>>
>>> @Stephen,
>>>
>>> > as discussions with MS engineers have suggested, similar to the GCS
>>> thread, that small writes at high volume are, at best, suboptimal for ADLS.
>>>
>>> Did you try to adjust any Accumulo properties to do bigger writes less
>>> frequently or something like that?
>>>
>>> [1]: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103
>>>
>>> Maxim
>>>
>>> On Thu, Jun 21, 2018 at 7:17 AM Stephen Meyles <smeyles@gmail.com>
>>> wrote:
>>>
>>>> I think we're seeing something similar but in our case we're trying to
>>>> run Accumulo atop ADLS. When we generate sufficient write load we start to
>>>> see stack traces like the following:
>>>>
>>>> [log.DfsLogger] ERROR: Failed to write log entries
>>>> java.io.IOException: attempting to write to a closed stream;
>>>> at com.microsoft.azure.datalake.store.ADLFileOutputStream.write(ADLFileOutputStream.java:88)
>>>> at com.microsoft.azure.datalake.store.ADLFileOutputStream.write(ADLFileOutputStream.java:77)
>>>> at org.apache.hadoop.fs.adl.AdlFsOutputStream.write(AdlFsOutputStream.java:57)
>>>> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:48)
>>>> at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>>> at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
>>>> at org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87)
>>>> at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:537)
>>>>
>>>> We have developed a rudimentary LogCloser implementation that allows us
>>>> to recover from this, but overall performance is still significantly
>>>> impacted.
>>>>
>>>> > As for the WAL closing issue on GCS, I recall a previous thread
>>>> about that
>>>>
>>>> I searched more for this but wasn't able to find anything, nor similar
>>>> re: ADL. I am also curious about the earlier question:
>>>>
>>>> >> Does Accumulo have a specific write pattern [to WALs], so that file
>>>> system may not support it?
>>>>
>>>> since discussions with MS engineers have suggested, similar to the GCS
>>>> thread, that small writes at high volume are, at best, suboptimal for ADLS.
>>>>
>>>> Regards
>>>>
>>>> Stephen
>>>>
>>>>
>>>> On Wed, Jun 20, 2018 at 11:20 AM, Christopher <ctubbsii@apache.org>
>>>> wrote:
>>>>
>>>>> For what it's worth, this is an Apache project, not a Sqrrl project.
>>>>> Amazon is free to contribute to Accumulo to improve its support of their
>>>>> platform, just as anybody is free to do. Amazon may start contributing
>>>>> more as a result of their acquisition... or they may not. There is no
>>>>> reason to expect that their acquisition will have any impact whatsoever
>>>>> on the platforms Accumulo supports, because Accumulo is not, and has not
>>>>> ever been, a Sqrrl project (although some Sqrrl employees have
>>>>> contributed), and thus will not become an Amazon project. It has been,
>>>>> and will remain, a vendor-neutral Apache project. Regardless, we welcome
>>>>> contributions from anybody which would improve Accumulo's support of any
>>>>> additional platform alternatives to HDFS, whether it be GCS, S3, or
>>>>> something else.
>>>>>
>>>>> As for the WAL closing issue on GCS, I recall a previous thread about
>>>>> that... I think a simple patch might be possible to solve that issue,
>>>>> but to date, nobody has contributed a fix. If somebody is interested in
>>>>> using Accumulo on GCS, I'd like to encourage them to submit any bugs
>>>>> they encounter, and any patches (if they are able) which resolve those
>>>>> bugs. If they need help submitting a fix, please ask on the dev@ list.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 20, 2018 at 8:21 AM Geoffry Roberts <
>>>>> threadedblue@gmail.com> wrote:
>>>>>
>>>>>> Maxim,
>>>>>>
>>>>>> Interesting that you were able to run Accumulo on GCS. I never thought
>>>>>> of that--good to know.
>>>>>>
>>>>>> Since I am now an AWS guy (at least for the time being), in light of
>>>>>> the fact that Amazon purchased Sqrrl, I am interested to see what
>>>>>> develops.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 20, 2018 at 5:15 AM, Maxim Kolchin <kolchinmax@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Geoffry,
>>>>>>>
>>>>>>> Thank you for the feedback!
>>>>>>>
>>>>>>> Thanks to [1, 2], I was able to run an Accumulo cluster on Google VMs
>>>>>>> with GCS instead of HDFS, and I used Google Dataproc to run Hadoop
>>>>>>> jobs on Accumulo. Almost everything was fine until I faced some
>>>>>>> connection issues with GCS: quite often, the connection to GCS breaks
>>>>>>> while writing or closing WALs.
>>>>>>>
>>>>>>> To all,
>>>>>>>
>>>>>>> Does Accumulo have a specific write pattern that some file systems may
>>>>>>> not support? Are there Accumulo properties which I can play with to
>>>>>>> adjust the write pattern?
>>>>>>>
>>>>>>> [1]: https://github.com/cybermaggedon/accumulo-gs
>>>>>>> [2]: https://github.com/cybermaggedon/accumulo-docker
>>>>>>>
>>>>>>> Thank you!
>>>>>>> Maxim
>>>>>>>
>>>>>>> On Tue, Jun 19, 2018 at 10:31 PM Geoffry Roberts <
>>>>>>> threadedblue@gmail.com> wrote:
>>>>>>>
>>>>>>>> I tried running Accumulo on Google. I first tried running it on
>>>>>>>> Google's pre-made Hadoop. I found the various file paths one must
>>>>>>>> contend with are different on Google than on a straight download from
>>>>>>>> Apache. It seems they moved things around. To counter this, I
>>>>>>>> installed my own Hadoop along with Zookeeper and Accumulo on a Google
>>>>>>>> node. All went well until one fine day when I could no longer log in.
>>>>>>>> It seems Google had pushed out some changes overnight that broke my
>>>>>>>> client-side Google Cloud installation. Google referred the affected
>>>>>>>> to a lengthy, easy-to-make-a-mistake procedure for resolving the
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> I decided life was too short for this kind of thing and switched to
>>>>>>>> Amazon.
>>>>>>>>
>>>>>>>> On Tue, Jun 19, 2018 at 7:34 AM, Maxim Kolchin <
>>>>>>>> kolchinmax@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Does anyone have experience running Accumulo on top of Google
>>>>>>>>> Cloud Storage instead of HDFS? If you've never heard about this
>>>>>>>>> feature, see [1] for some details.
>>>>>>>>>
>>>>>>>>> I see some discussion (see [2], [3]) around this topic, but it
>>>>>>>>> looks to me that it isn't as popular as I believe it should be.
>>>>>>>>>
>>>>>>>>> [1]:
>>>>>>>>> https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
>>>>>>>>> [2]: https://github.com/apache/accumulo/issues/428
>>>>>>>>> [3]:
>>>>>>>>> https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Maxim
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> There are ways and there are ways,
>>>>>>>>
>>>>>>>> Geoffry Roberts
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> There are ways and there are ways,
>>>>>>
>>>>>> Geoffry Roberts
>>>>>>
>>>>>
>>>>
>>
>
