hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7430) Refactor the BlockScanner to use O(1) memory and use multiple threads
Date Tue, 16 Dec 2014 23:04:14 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249132#comment-14249132
] 

Colin Patrick McCabe commented on HDFS-7430:
--------------------------------------------

bq. getVolumeStats and VolumeScanner#getStatistics can be @VisibleForTesting

ok

bq. IIUC the getUnitTestLong is for the INTERNAL_ config keys, but it's not being used for
MAX_STALENESS_MS. This one is also seemingly not set by any unit tests, so maybe can be removed.

Yeah, we should use {{getUnitTestLong}} here.  Added.  I would prefer to keep this here in
case we ever wanted to expose it, or if we added a test for it later.  The unit tests does
test different staleness settings, but using the setter directly at the moment.

bq. Would prefer Precondition checks that abort with a message rather than Math.max'ing with
0. Users should know when the have an invalid config, maybe they typoed a minus sign rather
than wanting a value of zero. This isn't necessary for the INTERNAL_ ones though, since they're
only set by tests.

So, we already have a convention going on that setting {{dfs.datanode.scan.period.hours}}
to -1 indicates that the block scanner should be disabled.  It's not my favorite convention,
but I don't want to break user configs over a bikeshed issue.  I would also prefer to handle
all config keys uniformly, rather than throwing an exception if some were negative, but letting
others go negative.  I think the likelihood of accidentally setting a negative value is pretty
small... I would expect most mistakes to be leaving out a zero, or just not setting the key
at all, etc.

bq. Again a comment about locking would be nice. I see direct usage of FSVolumes which I think
is safe, but dunno really. 

[~eddyxu] and I have been discussing this in HDFS-7496.  Once that JIRA is complete, FsVolume
instances will be reference-counted, ensuring that we never perform an operation on a volume
that has already been removed.  That fits in nicely with the current code, because the volume
removal code instructs the volume scanner to shut down.

bq. Various races that lead to FNF too, which are already described in ScanResultHandler.

Yeah, these already exist in the current block scanner code.  Eventually maybe we can do a
spin-wait loop to make sure that we can distinguish real block FNF from race-condition-induced
FNF.  But I think that's something we should do in a follow-up... this patch is big enough
:)

bq. ScanResultHandler can be package-private

ok

bq. getStatistics and getVolume can be @VisibleForTesting

{{getStatistics}}, sure.  {{getVolume}} is called by {{BlockScanner}}

bq. I notice some Time.now() being used to calculate an interval in findNextUsableBlockIter,
can this be monotonicNow instead?

The iterator start time is saved in the cursor file, and it is a wall-clock time, not a monotonic
time.  This is necessary because if we shut down the datanode and reboot, we want to be able
to pick up in our scan where we left off.  But monotonic time resets every time we reboot.
 I added a comment in {{findNextUsableBlockIter}} to clarify this.

bq. scanBlock, the first series of try/catch, is this going to lead to lots of log prints?
I believe passing in the exception will print a stack trace. I noticed ScanResultHandler quashes
a lot of these.

Yeah, let's not print the stack trace... it's not interesting here.  Also, I will log a message
when {{FsVolumeSpi#getStoredBlock}} returns null instead of throwing {{FileNotFoundException}}
(apparently it's allowed to do that too :P )

bq. Seems like a mismatch here, since this is subtracting the average num bytes scanned per
minute, not per second

Yikes.  Yeah, we need the average bytes per *second* here, not the average bytes per *minute*,
which is what we'd get from just dividing the bytes in the last hour by 60.  Fixed.

bq. if (neededBytesPerSec <= conf.thresholdBytesPerSec) { ... should be just < 

Yeah, let's use < instead.  This will also allow setting the threshold to 0 to get continuous
scan.

bq. Comment seems unrelated.

So, actually the comment is related... we're taking one {{ExtendedBlock}} structure, which
always has 0 for its genstamp, and looking up another {{ExtendedBlock}} structure, which will
have a valid genstamp.  It's frustrating that we don't have a more expressive type system
that could express something like "this struct has poolId and blockId, but not genstamp."
 Instead we've got Block.java which has id / genstamp, and ExtendedBlock.java, which has all
that plus block pool id.

I will see if I can come up with a better comment in the code.

bq. Doesn't BlockIterator fit better in BlockPoolSlice, since it's tied to a block pool?

{{BlockPoolSlice}} is internal to {{FsDatasetImpl}}.  It's not a generic class, whereas the
volume scanner is intended to be able to work on any volume, not just ones from {{FsVolumeImpl}}.

bq. Implementing Iterator and Iterable might make a lot of sense here. For instance, hasNext
is more canonical than eof.

I did consider that, but I don't think it's really useful.  It's not like you'd want to put
this in a "for" loop or something.  {{Iterator}} is kind of annoying because it has a {{remove}}
method (which I certainly don't want to support here), and because {{hasNext}} is supposed
to return something valid even if you haven't called {{next}} yet.  This is annoying.  Another
annoying thing about {{hasNext}} is that someone could call {{hasNext}}, wait a long time,
and then call {{next}}.  Then I'd have to return "something" even if all blocks had been removed
from the volume.  It's a contract I don't really want to bind myself to here.

bq. Rather than hand-rolled serde, could we use Jackson to do JSON? I think it'll save some
lines of code too.

Hmm.  Let me think about that.  I think it's pretty readable as-is, both in the code and in
the saved file.  The serialization code is VERY short right now and I like having a greppable
file with one key per line.  If I saved as JSON, I'd want to at least pretty-print it so that
we had no more than one key per line as well.  And there would be some complexity around ensuring
that only appropriate values got saved.  I don't think the current format is really that problematic,
and if we want to add a field later that is a list or map, we can always make just that line
contain JSON.  Another nice thing about the current format is that it fits in with how we
do {{VERSION}}.

bq. Finally, have you considered implementing this as an iterator-of-iterators? The complexity
in this code comes from advancing curDir and curSubDir as appropriate, but we could encapsulate
the logic for each in its own iterator, e.g. BlockIterator -> SubdirIterator -> SubdirIterator
-> SubdirBlockIterator. Serializing would be saving the current iter position of each.
Right now this is coded in a very C-style with checking nulls.

My gut feeling is that splitting the code up would just make it longer and probably not any
less repetitious...  Instead of checking whether {{getNextCurFinalizedSubDir}} returned {{null}},
you'd check if {{finalizedSubDirIter#next}} returned {{null}}.  And all of the "encapsulated"
iterators would need references to the "encapsulating" iterators, in order to build a path.
 I think it would get kind of messy.

bq. Class javadoc for BlockIterator is a bit incorrect, it returns entries from a block pool,
not the entire volume.

Fixed

bq. Unused imports

Removed

one last set of comments to address...

> Refactor the BlockScanner to use O(1) memory and use multiple threads
> ---------------------------------------------------------------------
>
>                 Key: HDFS-7430
>                 URL: https://issues.apache.org/jira/browse/HDFS-7430
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, HDFS-7430.004.patch, HDFS-7430.005.patch,
memory.png
>
>
> We should update the BlockScanner to use a constant amount of memory by keeping track
of what block was scanned last, rather than by tracking the scan status of all blocks in memory.
 Also, instead of having just one thread, we should have a verification thread per hard disk
(or other volume), scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message