asterixdb-dev mailing list archives

From Akshay Manchale Sridhar <>
Subject Re: Question about LSM Disk Component LSN
Date Sat, 20 May 2017 06:12:07 GMT
Hi Chen,

The LSN only needs to indicate the last operation that was performed on the
flushed disk component so we know where to begin recovery for that index.
The flush operation is triggered only after the flush log record hits the
disk. While we wait for that to happen, the state of the mutable index
component is READABLE_UNWRITABLE - this prevents any new writes, whose LSNs
would be greater than the FLUSH_LSN, from going into that mutable component
(FLUSH_LSN is the LSN at which the flush was requested, not the LSN at which
it completed).

When the flush log record is persisted to disk, we can start the flush
operation for the component but now we also switch to its shadow buffer and
make that the current mutable component so that the ongoing flush operation
does not block incoming writes.

All new incoming writes with LSN > FLUSH_LSN will go into the second buffer
and not the one that is flushing. When the flush operation is complete, we
write the LSN corresponding to when the flush log was created (I'm not sure
how this information is pushed down to LSMBTreeIOOperationCallback).
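
To make the sequence above concrete, here is a minimal toy model in Python.
It is only a sketch of the protocol as described in this message, not
AsterixDB code: ToyIndex, MemoryComponent, request_flush,
on_flush_log_durable and complete_flush are hypothetical names, the log is
just a counter, and it assumes the shadow buffer is always writable when we
switch to it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    READABLE_WRITABLE = 1
    READABLE_UNWRITABLE = 2  # flush requested, flush log record not yet durable
    FLUSHING = 3             # flush log record durable, component being written


@dataclass
class MemoryComponent:
    state: State = State.READABLE_WRITABLE
    entries: list = field(default_factory=list)  # (lsn, key) pairs
    flush_lsn: Optional[int] = None


class ToyIndex:
    """Hypothetical two-buffer index; LSNs come from a simple counter."""

    def __init__(self):
        self.next_lsn = 1
        self.components = [MemoryComponent(), MemoryComponent()]
        self.current = 0
        self.disk_components = []  # (flush_lsn, entries) per flushed component

    def _append_log(self):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def write(self, key):
        comp = self.components[self.current]
        if comp.state is not State.READABLE_WRITABLE:
            raise RuntimeError("mutable component is not writable")
        lsn = self._append_log()
        comp.entries.append((lsn, key))
        return lsn

    def request_flush(self):
        # Block further writes, then log the flush request: this is FLUSH_LSN.
        comp = self.components[self.current]
        comp.state = State.READABLE_UNWRITABLE
        comp.flush_lsn = self._append_log()
        return comp

    def on_flush_log_durable(self, comp):
        # The flush may start now; new writes go to the shadow buffer.
        comp.state = State.FLUSHING
        self.current = 1 - self.current

    def complete_flush(self, comp):
        # Stamp the disk component with FLUSH_LSN, not the current LSN.
        self.disk_components.append((comp.flush_lsn, list(comp.entries)))
        comp.entries.clear()
        comp.flush_lsn = None
        comp.state = State.READABLE_WRITABLE
```

In this model, any write accepted after on_flush_log_durable lands in the
other buffer with an LSN greater than FLUSH_LSN, and the disk component
keeps FLUSH_LSN no matter how much later complete_flush runs.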

Regardless of when a set of index flushes hit the disk, the LSN stored in
each disk component should be that of the last operation that modified it,
not the LSN at which the flush completed. Between FLUSH_LSN and the LSN
that is current when the component hits the disk, no operations are
performed on the flushed component. If we instead set the LSN to the point
at which the component is flushed to disk, let's call this
FLUSH_COMPLETE_LSN, we will miss looking at transaction log records
generated between FLUSH_LSN and FLUSH_COMPLETE_LSN during recovery.
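
A small worked example of that gap, again as a hedged Python sketch rather
than the actual recovery code. The replay and recover functions and the
sample LSN values are made up for illustration, and it assumes a crash
happens before the second in-memory buffer is flushed.

```python
def replay(log, start_lsn):
    """Return the keys of log records with an LSN greater than start_lsn."""
    return [key for lsn, key in log if lsn > start_lsn]


def recover(disk_keys, log, component_lsn):
    """Rebuild the index contents from the disk component plus log replay."""
    return set(disk_keys) | set(replay(log, component_lsn))


# Update records in the transaction log; LSN 3 is the flush log record itself.
LOG = [(1, "a"), (2, "b"), (4, "c"), (5, "d")]
FLUSH_LSN = 3            # flush requested here
FLUSH_COMPLETE_LSN = 6   # hypothetical log position when the flush finished
DISK_KEYS = ["a", "b"]   # what the flushed component actually contains

with_flush_lsn = recover(DISK_KEYS, LOG, FLUSH_LSN)
with_complete_lsn = recover(DISK_KEYS, LOG, FLUSH_COMPLETE_LSN)
lost = with_flush_lsn - with_complete_lsn
```

Here with_flush_lsn contains all four keys, while with_complete_lsn
contains only "a" and "b": the writes at LSN 4 and 5 went into the shadow
buffer while the flush was running, so skipping ahead to
FLUSH_COMPLETE_LSN would lose them.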

So even though a set of flush operations may finish writing to disk in an
order different from the order in which they were requested, the timeline
for what went into the disk components with respect to the transaction log
stays consistent. Because recovery starts scanning at FLUSH_LSN rather than
FLUSH_COMPLETE_LSN, we do not risk losing the data logged between the two
in case of failures. I don't know if there are any other uses for the LSN
in the disk component, besides finding out where recovery needs to start,
that would break this assumption. Hope that

On Fri, May 19, 2017 at 10:29 PM, Chen Luo <> wrote:

> Hi Devs,
> Recently I was using LSN to set a component ID for every newly flushed disk
> component. From LSMBTreeIOOperationCallback (as well as other index
> callbacks), I saw that after every flush operation, the LSN was placed at
> the newly flushed disk component metadata. I was expecting that the LSN
> should be increasing for every newly flushed disk component. That is, if a
> disk component d1 is flushed later than another disk component d2, we
> should have d1.LSN>d2.LSN. (please correct me if I'm wrong)
> However, based on my testing, this condition does not always hold. It is
> possible that a later flushed disk component has a smaller LSN than the
> previously flushed disk components (I found this by recording the previous
> LSN, and throwing an exception when it is larger than the current LSN). Is
> this behavior expected? Or do we not have the guarantee that LSNs placed at
> flushed disk components are monotonically increasing?
> Any help is appreciated.
> Best regards,
> Chen Luo
