giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eli Reisman" <initialcont...@gmail.com>
Subject Re: Review Request: GIRAPH-277: Text Vertex Input/Output Format base classes overhaul
Date Fri, 10 Aug 2012 00:59:39 GMT


> On Aug. 10, 2012, 12:47 a.m., Eli Reisman wrote:
> > Hey Jaeho, great work here. This was a lot and could not have been easy to wade
through. I love your attention to detail and fantastic comments that can lead the way for
others to figure out how to put together IO formats. The small things I noticed were:
> > 
> > - in AdjacencyListVertexOutputFormat, maybe call writeLine() writeVertex() like
it was, since it takes a vertex and is writing at the unit of one vertex per line. The other
formats consume input to write at this same granularity and it keeps the processing units
consistent throughout the code. Its just a detail that we are writing to text lines in this
family of OutputFormats, many folks here will be using other formats for IO that do not involve
text (including myself most of the time.) You might rename the enclosing class to indicate
lines and not just text or vertices are being written out here? Or just add some nice javadoc
comments to make it clear to users what is going to happen here.
> > 
> > - in TextVertexInputFormat some comments mix web-escaped sequences of greater-than
and less-than chars with the real angle-brackets.
> > 
> > - in VertexReaderFromEachLine, the getId(), getValue(), and getEdge() are a clean
solution, but in their Text input they force the user to reparse the same line over and over.
One pass for parsing is nice and easier not to mess up so I'm not sure this saved the users
any time. In the version that is preprocessed that takes a generic type, you could pass a
list of tokens, but then the user must be sure the number of tokens read is consistent with
what they found on the line or those methods will trusting the sequence in the token list.
This is not bad, but again it doesn't seem any easier for a user to debug and for the parsing
of single lines when the user is aware of the assumed formatting and can work with tokens
that don't make sense to the parser "in context" of the line. consuming each line in one place
as a stream is familiar to programmers in general, and they can break the work into methods
as they see fit. The place where this makes the code cleaner is the vertex creation piece
that puts all the parsed values together, but you've already written that for them, so it
doesn't really give the user less code to write. In the end, many users will and do use non-text
IO so I think just having access to the line information to parse might be enough here.
> > 
> > In some cases, while reading the old and new code in the diff, I got the feeling
a mix of both would be the best and tightest solution, mixed with a lot better commenting
of the sort you put into you new code to light the way for them.  How can we tweak this code
to focus on making the unfamiliar Giraph/Hadoop boilerplate go away without adding too many
components for a new user to remember to compose and wire up? I think the solution that would
make me do backflips would be the one that solves that puzzle, and I feel like you're already
most of the way there with this patch. 
> > 
> > I would be curious, have you run this code while mixing and matching these IO formats
and some vertices, and perhaps mixing in the wrong one sometimes to see if it breaks as it
should? I would like to verify that shifting some of the generic syntax around does not confuse
ReflectionUtils at any point in the game, this is why we have so many generic defs all over
the code right now, and its delicate. the utils have to successfully walk the inheritance
chain all the way to where the parameters are concretely set without missing a link in the
chain or they loose the type info. I think we're probably in the clear here, but I'd like
to try a dry run just to see what happens if you haven't already.
> > 
> > On the whole, I like this. It adds more clarity than it takes away. I think we could
go a step simpler here and it would be a win, but again thats just me, and with this in place
maybe the additional simplification would be the work of someone else in another JIRA.
> >

One note here, in the VertexReaderFromEachLine bit, I think my last point is this: we get
to say createVertex( getId(line), getValue(line), getEdge(line) .. ) etc but you do that for
them, and at the cost of them implementing 3 new methods and accepting that the tokens they
receive are in context, and in the order they expect. The alternative of just parsing their
own line seems like a familiar part of the puzzle for most programmers. The parts they need
help with are probably navigating the maze of less familiar Giraph/Hadoop objects to manipulate
and implement on the way to a finished IO format. Anything to make that checklist shorter
would be a big win.


- Eli


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6402/#review10101
-----------------------------------------------------------


On Aug. 6, 2012, 9:21 p.m., Jaeho Shin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6402/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 9:21 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> (copied from GIRAPH-277)
> 
> The current way of implementing VertexInputFormat and VertexReader had bad smell. It
required users to understand how these two classes are glued together, and forced similar
codes to be duplicated in every new input format. (Similarly for the VertexOutputFormat and
VertexWriter.) Anyone who wants to create a new format should create an underlying record
reader or writer at the right moment and delegate some calls to it, which seemed unnecessary
detail being exposed. Besides, type parameters had to appear all over every new format code,
which was extremely annoying for both reading existing code and writing a new one. I was very
frustrated writing my first format code especially when I compared it to writing a new vertex
code. I thought writing a new input/output format should be as simple as vertex.
> So, I have refactored TextVertexInputFormat and OutputFormat into new forms that have
no difference in their interfaces, but remove a lot of burden for subclassing. Instead of
providing static VertexReader base classes, I made it a non-static inner-class of its format
class, which helps eliminate the repeated code for gluing these two, already tightly coupled
classes. This has additional advantage of eliminating all the Generics type variables on the
VertexReader side, which makes overall code much more concise. I added several useful TextVertexReader
base classes that can save efforts for implementing line-oriented formats.
> 
> 
> This addresses bug GIRAPH-277.
>     https://issues.apache.org/jira/browse/GIRAPH-277
> 
> 
> Diffs
> -----
> 
>   trunk/checkstyle.xml 1369914 
>   trunk/src/main/java/org/apache/giraph/examples/IntIntNullIntTextInputFormat.java 1369914

>   trunk/src/main/java/org/apache/giraph/examples/SimplePageRankVertex.java 1369914 
>   trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 1369914 
>   trunk/src/main/java/org/apache/giraph/examples/SimpleTextVertexOutputFormat.java 1369914

>   trunk/src/main/java/org/apache/giraph/examples/VertexWithComponentTextOutputFormat.java
1369914 
>   trunk/src/main/java/org/apache/giraph/graph/VertexWriter.java 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexInputFormat.java PRE-CREATION

>   trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexOutputFormat.java
1369914 
>   trunk/src/main/java/org/apache/giraph/lib/AdjacencyListVertexReader.java 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/IdWithValueTextOutputFormat.java 1369914

>   trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexInputFormat.java 1369914

>   trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexOutputFormat.java 1369914

>   trunk/src/main/java/org/apache/giraph/lib/JsonLongDoubleFloatDoubleVertexInputFormat.java
1369914 
>   trunk/src/main/java/org/apache/giraph/lib/JsonLongDoubleFloatDoubleVertexOutputFormat.java
1369914 
>   trunk/src/main/java/org/apache/giraph/lib/LongDoubleDoubleAdjacencyListVertexInputFormat.java
1369914 
>   trunk/src/main/java/org/apache/giraph/lib/TextDoubleDoubleAdjacencyListVertexInputFormat.java
1369914 
>   trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java 1369914 
>   trunk/src/main/java/org/apache/giraph/lib/TextVertexOutputFormat.java 1369914 
>   trunk/src/test/java/org/apache/giraph/lib/TestAdjacencyListTextVertexOutputFormat.java
1369914 
>   trunk/src/test/java/org/apache/giraph/lib/TestIdWithValueTextOutputFormat.java 1369914

>   trunk/src/test/java/org/apache/giraph/lib/TestLongDoubleDoubleAdjacencyListVertexInputFormat.java
1369914 
>   trunk/src/test/java/org/apache/giraph/lib/TestTextDoubleDoubleAdjacencyListVertexInputFormat.java
1369914 
> 
> Diff: https://reviews.apache.org/r/6402/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jaeho Shin
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message