oodt-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: GSoC 2015
Date Mon, 23 Feb 2015 04:17:33 GMT
Hi Aditya,
Apologies for the delay on this one :(
Thank you for your patience. Please see my inline responses.

On Tue, Feb 17, 2015 at 12:31 AM, Aditya Dhulipala <adhulipa@usc.edu> wrote:

> Hi Lewis,
> I've been reading up on the doc you provided earlier.


> I've made some progress. I've looked into the filemgr component and run a
> few commands to ingest files etc. I understand how it works now.


> About the potential workflow -- (This is just my initial understanding. I
> could be wrong about this, please correct me)
> I think I have to rewrite the entire component to conform to the avro style
> specification. So this means, I need to define the schema for all the files
> inside filemanger/structs -- Product.java, ProductPage.java etc.

Yes, this is correct. The main data structures are documented in Avro
specification format as per the patch I attached to OODT-685.
Please check them out.
There is an issue here, as the data structures in filemgr are dependent upon
additional data structures, namely Metadata, which is contained within the
OODT metadata package.

> I should define the schema for each of these similar to that specified for
> "User" on this link -
> http://avro.apache.org/docs/current/gettingstartedjava.html#Defining+a+schema

Absolutely correct. Please see OODT-685
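
As a rough illustration of what "similar to User" could mean for Product,
here is a minimal sketch. It is not the schema from the OODT-685 patch: the
record name, the namespace and the choice of fields are assumptions made
purely for the example, and the schema is only held as a string so that the
snippet runs without the Avro library.

```java
// Sketch only: the record name, namespace and field list are illustrative
// assumptions, not the schema from the OODT-685 patch.
final class ProductSchemaSketch {

    // An Avro record schema is plain JSON, in the same shape as the "User"
    // example in the Avro getting-started guide.
    static final String PRODUCT_SCHEMA = String.join("\n",
        "{",
        "  \"namespace\": \"org.example.filemgr.avro\",",
        "  \"type\": \"record\",",
        "  \"name\": \"AvroProduct\",",
        "  \"fields\": [",
        "    {\"name\": \"productId\", \"type\": \"string\"},",
        "    {\"name\": \"productName\", \"type\": \"string\"},",
        "    {\"name\": \"productStructure\", \"type\": [\"null\", \"string\"], \"default\": null},",
        "    {\"name\": \"transferStatus\", \"type\": [\"null\", \"string\"], \"default\": null}",
        "  ]",
        "}");

    public static void main(String[] args) {
        // With the Avro library on the classpath, this string could be handed
        // to new org.apache.avro.Schema.Parser().parse(PRODUCT_SCHEMA).
        System.out.println(PRODUCT_SCHEMA);
    }
}
```

In a real build the schema would more likely live in its own .avsc file so
that the Avro code generation tools can pick it up.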

> Currently I think this piece of code (Product.java) constructs an xml file
> for each product so that the rpcClient can send it over the xml-rpc
> interface to the filemgr server.


> This project aims to redefine this process
> to send the data as a binary encoding (for smaller size, and thus smaller
> latency) by using the avro protocol.

Yes, this is correct. It reduces the amount of data sent over the wire and
gives a more flexible model for reading data which has been written by a
particular writer. Avro supports schema evolution as well, meaning that data
does not need to be static in nature if we consider it from the Avro point
of view. This is highly advantageous from a data archival and
interoperability point of view.
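
To give a feel for the size difference, here is a small hand-rolled sketch.
It does not use the Avro library. It only follows the rules from the Avro
specification for encoding a long (zig-zag, variable length) and a string
(length followed by UTF-8 bytes), and writes a two-field record as the
fields back to back with no field names. The XML payload, the field names
and the sample values are assumptions for illustration and are not the
actual messages filemgr sends over XML-RPC.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch only: a hand-written imitation of Avro's binary encoding rules,
// compared against a hypothetical XML-RPC style payload.
final class WireSizeSketch {

    // Avro writes a long as a zig-zag value in variable-length 7-bit groups,
    // least significant group first, high bit set on all but the last byte.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Avro writes a string as its UTF-8 byte length (a long) and then the bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A record is just its fields in schema order; no names are written.
    static byte[] encodeProduct(String productId, String productName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, productId);
        writeString(out, productName);
        return out.toByteArray();
    }

    // Hypothetical XML-RPC style struct carrying the same two values.
    static String xmlRpcStyle(String productId, String productName) {
        return "<struct>"
            + "<member><name>productId</name><value><string>"
            + productId + "</string></value></member>"
            + "<member><name>productName</name><value><string>"
            + productName + "</string></value></member>"
            + "</struct>";
    }

    public static void main(String[] args) {
        byte[] binary = encodeProduct("p-001", "sample.dat");
        byte[] xml = xmlRpcStyle("p-001", "sample.dat").getBytes(StandardCharsets.UTF_8);
        System.out.println("binary bytes: " + binary.length);
        System.out.println("xml bytes:    " + xml.length);
    }
}
```

The binary form carries only the values because both sides already agree on
the schema, which is where most of the saving comes from. Actual latency
gains would still need to be measured against a running filemgr.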

> And then I should invoke the avro code generation tools from within
> org.apache...system.XmlRpcFileManagerClient (probably have to rewrite this
> module to fit Avro client specification as well)

... probably yes. I would imagine that by the time this project is
finished, there will be absolutely no references to XML anywhere. It will
be entirely replaced by Avro schemas (JSON).

> I should also make the XmlRpcFileManager (server) fit to the avro specific
> implementation of the server interface.

Yes that is correct.

> I think this has to be repeated for all the components within oodt
> (workflow manager etc)

Absolutely. All key services, e.g. FileMgr, Workflow and Resource.

> I also have some questions:-
> 1. Is there any specific reason for picking Avro over Thrift or Protocol
> Buffers?

Please read up on some of Martin Kleppmann's blogs and commentary over the
years on this topic.
He did a bunch of work on Avro whilst @LinkedIn and it will really help you
to read through some of his work.

> 2. I also came across this answer on quora on Avro vs. XML-RPC
> http://www.quora.com/What-merits-does-Avro-RPC-have-over-XML-RPC/answer/Ted-Dunning-1?__snids__=959769040&__nsrc__=1&__filter__=all
> The author talks about another binary format - Simple Binary Encoding. And
> recommends using protocol buffers for their wide use and documentation. Can
> you share your thoughts about this?

Yes, I can.
 - Protocol Buffers is described as Google's interchange format. Does this
not sound a bit limiting? What happens if you want to change some of the
code to fit into OODT? Are you going to fork the project and maintain your
own Protocol Buffers implementation?
 - @Apache there is a saying: EAT YOUR OWN DOG FOOD. I would much rather we
implement a well founded Apache project, e.g. Avro, over Protocol Buffers any
day of the week.
Avro is also widely used. It also has a pretty excellent specification
document which as you've already seen has enabled you to understand schema

> I'd also like to run some more examples of the filemgr client/server. That
> way I can run some commands like these
> https://cwiki.apache.org/confluence/display/OODT/Exploring+the+OODT+File+Manager+XML-RPC+Interface
> and understand the overhead caused by xml-rpc or get a sense of what the
> latency of using xml-rpc is.

My main justification for moving towards a replacement for XML-RPC in OODT
is multi-faceted:
 - the library is dated,
 - the plethora of XML in OODT is cumbersome,
 - none of the XML is accompanied by XSD,
 - Avro has advanced significantly over the years and I am more familiar
with it than I am with other data serialization frameworks out there. It
defines a protocol layer which is a natural replacement for XML-RPC,
 - the Google Summer of Code project we are describing here is carving the
way for a complete Avro-RPC powered REST API for each OODT service. This is
a HUGE game changer for invoking remote OODT services.

> Can you also share examples of filemgr servers
> running in the real-world that I could query or use?

Most of the running servers I am aware of are on VPNs and internal,
secure networks, so the short answer is no.
This is something which we would get established once you were brought on as
the GSoC student for this project, I would think.

> Any other comments/suggestions are welcome! :)

I would say that it would be really nice for you to put some of this
correspondence into a proposal of sorts. You will require a working
proposal when you apply to Google.
Also, please feel free, if you have time, to pick up some issues on the
OODT Jira tracker. This will go a LONG way towards us backing you as the
preferred GSoC applicant.
Thank you
