hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jakob Homan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-1448) Create multi-format parser for edits logs file, support binary and XML formats initially
Date Wed, 13 Oct 2010 19:01:45 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jakob Homan updated HDFS-1448:

    Attachment: Viewer hierarchy.pdf

Code review:
* General: 
** All classes should be categorized with audience and stability
** No need for all the brackets in messages.  Breaks with what passes for our style.
** Do we need to write to disk for tests? Just write to output stream
* editsStored.xml
** Indent/format to make more human readable
* TestOfflineImageViewer.java
** Convert getBuildDir and getCacheDir to fields, rather than re-evaluating the method each
** Would it be better to split the single, large test into four smaller tests, with more descriptive
** The commented-out code should be removed.  If it's useful for manual testing, it can be
included in a static main in the test
** Style: runOev method calls don't follow code convention, they can be all on one line
** The methods runOevXmlToBinary/runOevBinaryToXml can be refactored to remove common code,
which is most of it.
** There is no need for a separate printToScreen variable in those methods
** fileEqualIgnoreTrailingZeroes: Since largeFilename is just aliased to filename1, there
is no need for filename1. Just use that name as the method parameter.
** loadFile(): I'm surprised we don't have a utility method in the test package to do this.
It's a general operation and this method may be better located there.
** A larger problem is that this test doesn't use asserts to verify correctness, which will
make working with it difficult.  The exceptions should be converted to fully described JUnit
* OfflineEditsViewerHelper.java
** Class needs Javadoc
** Is it necessary to copy the edits file? Instead, can we just leave it in place and test
it there? A better option, though I don't believe supported by MiniDFSCluster, would be if
we could just write the edits to a memory buffer and avoid the disk altogether.
** Commented out code: fc = FileContext.getFileContext(cluster.getURI(), config);
* Tokenizer.java
** Tokenizer works specifically with EditsElements, may be good to give it more specific name.
Same comment for Token.
** I'm torn on the individual Token* classes.  I'd rather if there were a way of directly
integrating them into edits, but that's a bridge too far for this patch.  Scala case classes
would be quite helpful here...
** Several referances to static method encode/decodeBase64 via instance variable
** The duplicated edits enums should be re-factored into shared class rather than duplicated.

** Style: case OP_CLOSE doesn't need to be surrounded by braces, as do several other cases.
** The more involved cases should be refactored into separate classes to aid readability.
This may be reasonable for all the cases to be consistent.
** OP_UPDATE_MASTER_KEY this seems to be the only place we check for the possibility of working
with an unsupported version. Is there a reason for this?
** The pattern: {noformat}v.visit(t.read(new Tokenizer.Token{Whatever}(EditsElement.LENGTH)));{noformat}
is repeated quite a lot.  Can this be refactored into a helper method to aid in readability?
** By doing a static import of the various Tokenizer classes (which can be made static) such
as: {noformat}import static org.apache.hadoop.hdfs.tools.offlineEditsViewer.Tokenizer.TokenInt;{noformat}
you can avoid the extra reference to Tokenizer in the visit calls.
** I'm not sure that the statistics functionality adds any value to this class.  It may be
better to create a separate statistics viewer that provides this information. 
** Several unnecessary imports
* EditsVisitor.java
** The DepthCounter duplicates the same class in the oiv.  May as well create a common utility
class and share it.
** Commented out code: {noformat} // abstract void visit(EditsElement element, String value)
throws IOException;  {noformat}
** Unnecessary import of DeprecatedUTF8
* EditsVisitorXml.java
** Consistent naming with oiv would be XmlEditsVisitor
** I believe this class is quite ripe for a shared generic implementation with oiv's Xml viewer.
This is discussed more below.
** unnecessary import of Base64 class
* OfflineEditsViewer.java
** Typo:  This class implements and offline edits viewer, tool that  (and -> an)
** No need to mention note about OfflineImageViewer.
** The command line parsing and options shares quite a bit of code with the oiv and may be
easy to merge.
* EditsVisitorBinary.java
** The printToScreen option is ignored and doesn't make sense for this viewer.  It may be
fine to keep the option, but we should probably add documentation about it being ignored by
some visitors
** No need for commented-out debugging code
* Tokenizers.java
** Since the class is a factory perhaps TokenizerFactory is a better name?
** The file type determination can be simplified by checking for .endsWith(".xml") 
** Typo:* Tokenizers (factory) for different (implementatios -> implementations) of Tokenizer
* EditsVisitorText.java
** Consistent naming with oiv would be TextEditsVisitor
** It may be reasonably easy to merge the oiv's TextVisitor, which this class duplicates,
at this stage.
* TokenizerBinary.java
** Several unnecessary imports

I'll have more comments after these revisions.

Overall, this is a good effort of extending the oiv's approach.  However, there does end up
being quite a lot of duplicated code.  I believe I've come up with a type hierarchy that we
can use (attached) that would eliminate all of the duplicated code.  The green classes are
extensions of the oiv to write and read the image, which we had discussed creating when the
oiv was written, but haven't done yet.

For this patch, I think there are three options:
# Modulo patch revisions as outlined above and forthcoming, commit the patch as is and immediately
open a new JIRA to de-duplicate code and merge the viewers together. This has the advantage
of providing a greenfield for the new JIRA that doesn't have worry about creating the oev
at the same time of merging it with the existing infrastructure.
# Implement the de-dupe and merging in this patch and commit that.  This has the advantage
of no duplicate code going into the source, but will increase the size and complexity of this
# Re-factor the duplicated code into private methods and have both viewers call that, and
immediately open a new JIRA to de-dupe and merge, as in #1.  This has the advantage of no
duplicate code going into the source, but the refactoring could get messy.

Since there is no upcoming release where the oev would be exposed and thus cause deprecation
lock-in, I favor #1.  Regardless, we should probably open a JIRA to finish off merging the
oiv and edits loading code.

> Create multi-format parser for edits logs file, support binary and XML formats initially
> ----------------------------------------------------------------------------------------
>                 Key: HDFS-1448
>                 URL: https://issues.apache.org/jira/browse/HDFS-1448
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 0.22.0
>            Reporter: Erik Steffl
>            Priority: Minor
>             Fix For: 0.22.0
>         Attachments: editsStored, HDFS-1448-0.22.patch, Viewer hierarchy.pdf
> Create multi-format parser for edits logs file, support binary and XML formats initially.
> Parsing should work from any supported format to any other supported format (e.g. from
binary to XML and from XML to binary).
> The binary format is the format used by FSEditLog class to read/write edits file.
> Primary reason to develop this tool is to help with troubleshooting, the binary format
is hard to read and edit (for human troubleshooters).
> Longer term it could be used to clean up and minimize parsers for fsimage and edits files.
Edits parser OfflineEditsViewer is written in a very similar fashion to OfflineImageViewer.
Next step would be to merge OfflineImageViewer and OfflineEditsViewer and use the result in
both FSImage and FSEditLog. This is subject to change, specifically depending on adoption
of avro (which would completely change how objects are serialized as well as provide ways
to convert files to different formats).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message