hbase-dev mailing list archives

From 张铎(Duo Zhang) <palomino...@gmail.com>
Subject Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)
Date Sat, 22 May 2021 01:53:13 GMT
So maybe we could introduce a .hfilelist directory, and put the hfilelist
files under this directory, so we do not need to list all the files under
the region directory.

And considering how typical object storages are implemented, listing only the
last directory on the path will be less expensive.
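
A minimal sketch of the listing I have in mind, using the generic Hadoop
FileSystem API (the .hfilelist layout and the class/method names here are
only illustrative assumptions, not the actual implementation):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileListDirSketch {
  /**
   * List only the hypothetical .hfilelist directory under a store directory,
   * instead of listing every file under the region directory. On an object
   * store this maps to a prefix (bucket index) listing over a small directory.
   */
  public static FileStatus[] listHFileListFiles(FileSystem fs, Path storeDir)
      throws IOException {
    Path hfileListDir = new Path(storeDir, ".hfilelist"); // assumed layout
    if (!fs.exists(hfileListDir)) {
      return new FileStatus[0];
    }
    return fs.listStatus(hfileListDir);
  }
}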

Andrew Purtell <andrew.purtell@gmail.com> wrote on Sat, May 22, 2021 at 9:35 AM:

>
> > On May 21, 2021, at 6:07 PM, 张铎 <palomino219@gmail.com> wrote:
> >
> > Since we just make use of the general FileSystem API to do listing, is it
> > possible to make use of 'bucket index listing'?
>
> Yes, those words mean the same thing.
>
> >
> > Andrew Purtell <andrew.purtell@gmail.com> wrote on Sat, May 22, 2021 at 6:34 AM:
> >
> >>
> >>
> >>> On May 20, 2021, at 4:00 AM, Wellington Chevreuil <wellington.chevreuil@gmail.com> wrote:
> >>>
> >>>
> >>>>
> >>>>
> >>>> IMO it should be a file per store.
> >>>> Per region is not suitable here as compaction is per store.
> >>>> Per file means we still need to list all the files. And usually, after
> >>>> compaction, we need to do an atomic operation to remove several old files
> >>>> and add a new file, or even several files for stripe compaction. It will be
> >>>> easy if we just write one file to commit these changes.
> >>>>
> >>>
> >>> Fine for me if it's simpler. Mentioned the per file approach because I
> >>> thought it could be easier/faster to do that, rather than having to update
> >>> the store file list on every flush. AFAIK, append is off the table, so
> >>> updating this file would mean: read it, write the original content plus the
> >>> new hfile to a temp file, delete the original file, rename it.
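
A rough sketch of that read/rewrite/rename cycle over the plain Hadoop
FileSystem API (the plain-text list format and the helper names below are
assumptions for illustration only):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreFileListUpdateSketch {
  /**
   * Add one hfile name to a plain-text store file list without append():
   * read the current content, write it plus the new entry to a temp file,
   * delete the original, then rename the temp file into place.
   */
  public static void addHFile(FileSystem fs, Path listFile, String newHFile)
      throws IOException {
    StringBuilder content = new StringBuilder();
    if (fs.exists(listFile)) {
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(listFile), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          content.append(line).append('\n');
        }
      }
    }
    content.append(newHFile).append('\n');
    Path tmp = new Path(listFile.getParent(), listFile.getName() + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      out.write(content.toString().getBytes(StandardCharsets.UTF_8));
    }
    fs.delete(listFile, false); // remove the original list
    fs.rename(tmp, listFile);   // move the rewritten list into place
  }
}
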
> >>>
> >>
> >> That sounds right to me.
> >>
> >> A minor potential optimization is the filename could have a timestamp
> >> component, so a bucket index listing at that path would pick up a list
> >> including the latest, and the latest would be used as the manifest of valid
> >> store files. The cloud object store is expected to provide an atomic
> >> listing semantic where the file is written and closed and only then is it
> >> visible, and it is visible at once to everyone. (I think this is available
> >> on most.) Old manifest file versions could be lazily deleted.
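
A rough sketch of how picking the latest manifest from a single listing
could look (the "manifest.<timestamp>" naming below is purely an assumed
convention for illustration, not an agreed layout):

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestManifestSketch {
  /**
   * Pick the newest manifest by the timestamp embedded in its file name,
   * e.g. "manifest.1621645993000" (assumed naming). A single listing of the
   * small manifest directory is enough; older manifest versions can be
   * deleted lazily afterwards.
   */
  public static Optional<Path> latestManifest(FileSystem fs, Path manifestDir)
      throws IOException {
    FileStatus[] candidates = fs.globStatus(new Path(manifestDir, "manifest.*"));
    if (candidates == null) {
      return Optional.empty();
    }
    return Arrays.stream(candidates)
        .map(FileStatus::getPath)
        .max(Comparator.comparingLong(
            (Path p) -> Long.parseLong(p.getName().substring("manifest.".length()))));
  }
}
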
> >>
> >>
> >>>> On Thu, May 20, 2021 at 02:57, 张铎(Duo Zhang) <palomino219@gmail.com> wrote:
> >>>>
> >>>> IIRC S3 was the only object storage which did not guarantee
> >>>> read-after-write consistency in the past...
> >>>>
> >>>> These are the quick results after googling:
> >>>>
> >>>> AWS [1]
> >>>>
> >>>>> Amazon S3 delivers strong read-after-write consistency automatically for
> >>>>> all applications
> >>>>
> >>>>
> >>>> Azure [2]
> >>>>
> >>>>> Azure Storage was designed to embrace a strong consistency model that
> >>>>> guarantees that after the service performs an insert or update operation,
> >>>>> subsequent read operations return the latest update.
> >>>>
> >>>>
> >>>> Aliyun [3]
> >>>>
> >>>>> A feature requires that object operations in OSS be atomic, which
> >>>>> indicates that operations can only either succeed or fail without
> >>>>> intermediate states. To ensure that users can access only complete data,
> >>>>> OSS does not return corrupted or partial data.
> >>>>>
> >>>>> Object operations in OSS are highly consistent. For example, when a user
> >>>>> receives an upload (PUT) success response, the uploaded object can be read
> >>>>> immediately, and copies of the object are written to multiple devices for
> >>>>> redundancy. Therefore, the situations where data is not obtained when you
> >>>>> perform the read-after-write operation do not exist. The same is true for
> >>>>> delete operations. After you delete an object, the object and its copies no
> >>>>> longer exist.
> >>>>>
> >>>>
> >>>> GCP [4]
> >>>>
> >>>>> Cloud Storage provides strong global consistency for the following
> >>>>> operations, including both data and metadata:
> >>>>>
> >>>>> Read-after-write
> >>>>> Read-after-metadata-update
> >>>>> Read-after-delete
> >>>>> Bucket listing
> >>>>> Object listing
> >>>>>
> >>>>
> >>>> I think these vendors could cover most end users in the world?
> >>>>
> >>>> 1. https://aws.amazon.com/cn/s3/consistency/
> >>>> 2. https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
> >>>> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
> >>>> 4. https://cloud.google.com/storage/docs/consistency
> >>>>
> >>>> Nick Dimiduk <ndimiduk@apache.org> wrote on Wed, May 19, 2021 at 11:40 PM:
> >>>>
> >>>>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <palomino219@gmail.com> wrote:
> >>>>>
> >>>>>> What about just storing the hfile list in a file? Since now S3 has
> >>>>>> strong consistency, we could safely overwrite a file then I think?
> >>>>>>
> >>>>>
> >>>>> My concern is about portability. S3 isn't the only blob store in town,
> >>>>> and consistent read-what-you-wrote semantics are not a standard feature,
> >>>>> as far as I know. If we want something that can work on 3 or 5 major
> >>>>> public cloud blobstore products as well as a smattering of on-prem
> >>>>> technologies, we should be selective about what features we choose to
> >>>>> rely on as foundational to our implementation.
> >>>>>
> >>>>> Or we are explicitly saying this will only work on S3 and we'll only
> >>>>> support other services when they can achieve this level of compatibility.
> >>>>>
> >>>>> Either way, we should be clear and up-front about what semantics we
> >>>>> demand. Implementing some kind of a test harness that can check
> >>>>> compatibility would help here, a similar effort to that of defining
> >>>>> standard behaviors of HDFS implementations.
> >>>>>
> >>>>> I love this discussion :)
> >>>>>
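
A very rough sketch of what one probe in such a harness could look like
(this is only an assumed example over the Hadoop FileSystem API, not an
existing contract test):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAfterWriteProbe {
  /**
   * Naive probe of read-after-write and list-after-write visibility: create
   * a file, then immediately check it via exists() and listStatus(). A real
   * harness would repeat this many times, from multiple clients, and cover
   * overwrite and delete as well.
   */
  public static boolean probe(FileSystem fs, Path dir) throws IOException {
    Path probeFile = new Path(dir, "probe-" + System.currentTimeMillis());
    try (FSDataOutputStream out = fs.create(probeFile, true)) {
      out.write("x".getBytes(StandardCharsets.UTF_8));
    }
    boolean visible = fs.exists(probeFile);
    boolean listed = false;
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.getPath().getName().equals(probeFile.getName())) {
        listed = true;
        break;
      }
    }
    fs.delete(probeFile, false);
    return visible && listed;
  }
}
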
> >>>>>> And since the hfile list file will be very small, renaming will not
> >>>>>> be a big problem.
> >>>>>>
> >>>>>
> >>>>> Would this be a file per store? A file per region? Ah. Below you imply
> >>>>> it's per store.
> >>>>>
> >>>>>> Wellington Chevreuil <wellington.chevreuil@gmail.com> wrote on Wed, May 19, 2021 at 10:43 PM:
> >>>>>>
> >>>>>>> Thank you, Andrew and Duo,
> >>>>>>>
> >>>>>>> Talking internally with Josh Elser, the initial idea was to rebase the
> >>>>>>> feature branch onto master (in order to catch up with the latest
> >>>>>>> commits), then focus on work to have a minimal functioning hbase; in
> >>>>>>> other words, together with the already committed work from HBASE-25391,
> >>>>>>> make sure flush, compactions, splits and merges all can take advantage
> >>>>>>> of the persistent store file manager and complete with no need to rely
> >>>>>>> on renames. These all map to the subtasks HBASE-25391, HBASE-25392 and
> >>>>>>> HBASE-25393. Once we can test and validate that this works well for our
> >>>>>>> goals, we can then focus on snapshots, bulkloading and tooling.
> >>>>>>>
> >>>>>>>> S3 now supports strong consistency, and I heard that they are also
> >>>>>>>> implementing atomic renaming currently, so maybe that's one of the
> >>>>>>>> reasons why the development is silent now..
> >>>>>>>>
> >>>>>>> Interesting, I had no idea this was being implemented. I know,
> >>>>>>> however, a version of this feature is already available on latest EMR
> >>>>>>> releases (at least from 6.2.0), and AWS team has published their own
> >>>>>>> blog post with their results:
> >>>>>>>
> >>>>>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> >>>>>>>
> >>>>>>>> But I do not think storing the hfile list in meta is the only
> >>>>>>>> solution. It will cause cyclic dependencies for hbase:meta, and then
> >>>>>>>> force us to have a fallback solution which makes the code a bit ugly.
> >>>>>>>> We should try to see if this could be done with only the FileSystem.
> >>>>>>>
> >>>>>>> This is indeed a relevant concern. One idea I had mentioned in the
> >>>>>>> original design doc was to track committed/non-committed files through
> >>>>>>> xattr (or tags), which may have its own performance issues as explained
> >>>>>>> by Stephen Wu, but is something that could be attempted.
> >>>>>>>
> >>>>>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <palomino219@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> S3 now supports strong consistency, and I heard that they are also
> >>>>>>>> implementing atomic renaming currently, so maybe that's one of the
> >>>>>>>> reasons why the development is silent now...
> >>>>>>>>
> >>>>>>>> For me, I also think deploying hbase on cloud storage is the future,
> >>>>>>>> so I would also like to participate here.
> >>>>>>>>
> >>>>>>>> But I do not think storing the hfile list in meta is the only
> >>>>>>>> solution. It will cause cyclic dependencies for hbase:meta, and then
> >>>>>>>> force us to have a fallback solution which makes the code a bit ugly.
> >>>>>>>> We should try to see if this could be done with only the FileSystem.
> >>>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>>
> >>>>>>>> Andrew Purtell <apurtell@apache.org> wrote on Wed, May 19, 2021 at 8:04 AM:
> >>>>>>>>
> >>>>>>>>> Wellington (et al.),
> >>>>>>>>>
> >>>>>>>>> S3 is also an important piece of our future production plans.
> >>>>>>>>> Unfortunately, we were unable to assist much with last year's work,
> >>>>>>>>> on account of being sidetracked by more immediate concerns.
> >>>>>>>>> Fortunately, this renewed interest is timely in that we have an HBase
> >>>>>>>>> 2 project where, if this can land in a 2.5 or a 2.6, it could be an
> >>>>>>>>> important cost to serve optimization, and one we could and would make
> >>>>>>>>> use of. Therefore I would like to restate my employer's interest in
> >>>>>>>>> this work too. It may just be Viraj and myself in the early days.
> >>>>>>>>>
> >>>>>>>>> I'm not sure how best to collaborate. We could review changes from
> >>>>>>>>> the original authors, new changes, and/or divide up the development
> >>>>>>>>> tasks. We can certainly offer our time for testing, and can afford
> >>>>>>>>> the costs of testing against the S3 service.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <wellington.chevreuil@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Greetings everyone,
> >>>>>>>>>>
> >>>>>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new
> >>>>>>>>>> StoreFile tracker as a way to allow for any hbase hfile
> >>>>>>>>>> modifications to be safely completed without needing a file system
> >>>>>>>>>> rename. This seems pretty relevant for deployments over S3 file
> >>>>>>>>>> systems, where rename operations are not atomic and can have a
> >>>>>>>>>> performance degradation when multiple requests get concurrently
> >>>>>>>>>> submitted to the same bucket. We had done superficial tests and ycsb
> >>>>>>>>>> runs, where individual renames of files larger than 5GB can take a
> >>>>>>>>>> few hundreds of seconds to complete. We also observed impacts on
> >>>>>>>>>> write load throughput, the bottleneck potentially being the renames.
> >>>>>>>>>>
> >>>>>>>>>> With S3 being an important piece of my employer's cloud solution, we
> >>>>>>>>>> would like to help it move forward. We plan to contribute new
> >>>>>>>>>> patches per the original design/Jira, but we’d also be happy to
> >>>>>>>>>> review changes from the original authors, too. Please let us know if
> >>>>>>>>>> anyone has any concerns, otherwise we’ll start to self-assign issues
> >>>>>>>>>> on HBASE-24749.
> >>>>>>>>>>
> >>>>>>>>>> Wellington
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best regards,
> >>>>>>>>> Andrew
> >>>>>>>>>
> >>>>>>>>> Words like orphans lost among the crosstalk, meaning torn from
> >>>>>>>>> truth's decrepit hands
> >>>>>>>>>  - A23, Crosstalk
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
>
