avro-dev mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: My mainly idea about implement data communicate tool between json/xml/csv and avro data files
Date Wed, 24 Mar 2010 16:52:13 GMT
For CSV there is a disconnect on data types. "Ordinary" CSV is typically quoted per field
and requires escaping so that delimiters and quote characters never appear unescaped in
the data. Furthermore, it does not handle arrays, maps, unions, or nested
objects. I'm not sure it makes sense to map Avro data into CSV.

For example, an Avro schema can be a linked list, or even a binary tree with a structure at
each node, whereas CSV typically has a fixed set of fields per row. Extending CSV to handle
nested records, lists, unions, and maps would make it incompatible with generic CSV readers
and writers. Going the other way, mapping CSV into a subset of Avro, would work: Avro
could read CSV as a fixed simple record, but it could not write an arbitrary record as CSV.
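
To make the "CSV as a fixed simple record" direction concrete, here is a minimal sketch. It does not use the Avro library: a Map stands in for a flat record, every field is kept as a string, and the class and method names (CsvFlatRecordSketch, splitLine, toFlatRecord) are hypothetical. A real mapping would also need a type for each field, taken from the schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one CSV row becomes one flat record (field name -> string value).
class CsvFlatRecordSketch {

    // Split one CSV line into fields, honoring double-quoted fields and "" escapes.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"'); // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    // Pair a fixed list of field names with the values of one row.
    // A row with the wrong number of fields does not fit the fixed record and is rejected.
    static Map<String, String> toFlatRecord(List<String> fieldNames, String line) {
        List<String> values = splitLine(line);
        if (values.size() != fieldNames.size()) {
            throw new IllegalArgumentException(
                "expected " + fieldNames.size() + " fields but got " + values.size());
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            record.put(fieldNames.get(i), values.get(i));
        }
        return record;
    }

    public static void main(String[] args) {
        List<String> names = List.of("id", "name", "note");
        // prints {id=1, name=Doe, Jane, note=said "hi"}
        System.out.println(toFlatRecord(names, "1,\"Doe, Jane\",\"said \"\"hi\"\"\""));
    }
}
```

The reverse direction has no such simple form: a record containing an array, map, union, or nested record has no single cell to go into.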

XML is sufficiently rich to support an Avro serialization scheme, and one could probably
define an XML DTD for an Avro-compatible serialization format.

Currently the Avro data files only store the data internally as binary, since there has been
no need to store data in a larger and less efficient format.  However, reading the binary
file as JSON has been built as a command-line tool for debugging purposes.

On Mar 24, 2010, at 1:33 AM, Peng Cui wrote:

> Hi all,
> 
> I have almost finished my GSoC proposal for the project "a data communicate
> tool between json/xml/csv and avro data files". I will describe it here
> and look forward to your advice.
> 
> The tool has two main parts:
> 
> 1. Data communication module, i.e. read/write json/xml/csv records from/to
> avro data files
> There are two steps:
> 
> Step one: read/write json/xml/csv records to AVRO datum
> 
> For json:
> AVRO already supplies ParsingDecoder and JsonGenerator; we can use these two
> classes to convert data between AVRO datum and json data.
> 
> For XML:
> I must extend the abstract ParsingDecoder and build an XMLDecoder class to
> parse data from an XML file and convert it to AVRO datum. An
> XMLGenerator class, which converts AVRO datum to an XML data file, is
> also necessary. This part needs some XML parsing; Apache Xerces may be
> a good choice, and fortunately I am familiar with it.
> 
> For CSV:
> I must also build a CSVDecoder to convert CSV data to AVRO datum and a
> CSVGenerator class to convert AVRO datum to CSV files. This part needs
> some CSV handling; I think Apache Commons CSV can help us.
> 
> Step two: read/write AVRO datum to avro data files
> 
> AVRO has implemented this function already, so it will not cost me much time
> and energy.
> 
> 2. Command-tool interface design
> 
> Basic interface design:
> 
> The tool is based on Java Swing. It is made up of a command input text area
> and an information output panel that shows the current status, command
> execution results, data output, etc.
> 
> Command system design:
> 1). Each command is a class that implements an interface called
> BasicCommand, which has an execute function. Each command implementation
> class must implement its concrete operations in the execute function.
> 2). An XML configuration file registers command classes with the
> command system. At the beginning, this tool will have some basic commands (I
> will introduce them below). In the future, if we want to
> add more commands to the tool, we write the corresponding command
> class and then register it.
> 3). During initialization, the tool will parse the command configuration
> XML file, instantiate the command classes, and load them into the context.
> It will use an ArrayList to store all the system commands while running.
> 4). When the user inputs a command, the tool traverses the command list. If the
> command exists and its arguments have the correct format, the tool executes it
> (that is, it invokes the command instance's execute function). If the command
> exists but the arguments do not match its declaration, the tool prints
> usage information for the command. If the tool cannot find the
> command, it tells the user "the command is not an available command".
> 5). The tool uses an XML configuration file to store some system
> attributes, such as the default workspace, the default work mode (json, xml,
> or csv), the output font size, etc.
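
For reference, the command system described in points 1) to 4) could be sketched roughly as below. Only the BasicCommand interface, its execute function, and the ArrayList of commands come from the proposal; the name(), arity(), and usage() methods, the Registry and ModeCommand classes, and the message formats are assumptions made for illustration. Registration is done directly in code here rather than through the XML configuration file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed command system (no Swing, no XML config).
class CommandSystemSketch {

    // The proposal only specifies execute(); name, arity, and usage are assumed
    // here so the registry can look commands up and check their arguments.
    interface BasicCommand {
        String name();
        int arity();
        String usage();
        String execute(List<String> args);
    }

    // Example command: switch the work mode (json, xml, or csv).
    static final class ModeCommand implements BasicCommand {
        public String name() { return "mode"; }
        public int arity() { return 1; }
        public String usage() { return "mode <json|xml|csv>"; }
        public String execute(List<String> args) {
            return "work mode set to " + args.get(0);
        }
    }

    // Holds the registered commands in an ArrayList and dispatches input lines.
    static final class Registry {
        private final List<BasicCommand> commands = new ArrayList<>();

        void register(BasicCommand command) {
            commands.add(command);
        }

        String dispatch(String inputLine) {
            String[] tokens = inputLine.trim().split("\\s+");
            List<String> args = Arrays.asList(tokens).subList(1, tokens.length);
            for (BasicCommand command : commands) {
                if (command.name().equals(tokens[0])) {
                    if (args.size() != command.arity()) {
                        // Command exists but the arguments do not match.
                        return "usage: " + command.usage();
                    }
                    return command.execute(args);
                }
            }
            return "the command is not an available command";
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new ModeCommand());
        System.out.println(registry.dispatch("mode json")); // work mode set to json
        System.out.println(registry.dispatch("mode"));      // usage: mode <json|xml|csv>
        System.out.println(registry.dispatch("frobnicate"));// the command is not an available command
    }
}
```

A map keyed by command name would avoid the linear scan, but the list matches the proposal as written.
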
> 
> System initialization commands design:
> 1). Workspace setup command;
> 2). Get history workspace command;
> 3). Work mode change command;
> 4). List data files command;
> 5). Data output command:
> This command works differently in different work modes. For example, in json
> mode the data is output as a json string, but in xml mode the data is
> output as an xml file.
> The user can also select a specific output mode with an argument; the default
> output mode is the current work mode.
> This command can also take a specific output stream, either exporting the data
> to a data file or just printing it in the tool interface; the default output
> stream is the tool interface.
> 6). Data input command:
> This command is used to input data and convert it to an AVRO data file. It has
> four work modes, which the user selects with a command argument:
> 
> mode 1: input schema data and content data from an IO device;
> mode 2: input schema data from an IO device but content data from a data
> file on the local disk;
> mode 3: input schema data from a data file on the local disk but content
> data from an IO device;
> mode 4: input schema data and content data from data files on the local
> disk.
> The default work mode is mode 1. When the user enters this command and presses
> enter, a graphical Swing panel shows up in which the user can finish the input.
> Of course, each mode brings up a different Swing input panel, four in
> all.
> 7). Basic system setup command; this may include setting the font, font size,
> color, etc.
> 
> These are my main ideas. If anyone has advice or suggestions, please let me
> know. Thank you :-)
> 
> Peng
> On Mon, Mar 22, 2010 at 1:31 PM, Peng Cui <ajiu.009@gmail.com> wrote:
> 
>> Hi Doug,
>> 
>> My name is Cui Peng. I want to implement the data communicate tool between
>> json/xml/csv and avro data files as you described in the GSoC 2010 idea
>> list. I exported the AVRO source code and studied its design and architecture.
>> I now have a main idea for the tool, which I will describe here, and I look
>> forward to your advice :-)
>> 
>> I think there are two main parts to the work:
>> 
>> 1. Read/write json/xml/csv records from/to avro data files
>> There are two steps:
>> 
>> Step one: read/write json/xml/csv records to AVRO datum
>> 
>> For json:
>> AVRO already supplies ParsingDecoder and JsonGenerator; we can use these two
>> classes to convert data between AVRO datum and json data.
>> For XML:
>> I must extend the abstract ParsingDecoder and build an XMLDecoder class to
>> parse data from an XML file and convert it to AVRO datum. An
>> XMLGenerator class, which converts AVRO datum to an XML data file, is
>> also necessary. This part needs some XML parsing; Apache Xerces may be
>> a good choice, and fortunately I am familiar with it.
>> For CSV:
>> I must also build a CSVDecoder to convert CSV data to AVRO datum and a
>> CSVGenerator class to convert AVRO datum to CSV files. This part needs
>> some CSV handling; I think Apache Commons CSV can help us.
>> 
>> Step two: read/write AVRO datum to avro data files
>> AVRO has implemented this function already, so it will not cost me much time
>> and energy.
>> 
>> 2. A Swing-based command-line tool. This tool will help us execute
>> commands, collect data from user input, etc.
>> Step one gives us data conversion support between json/xml/csv data files
>> and avro data files. Then we should build the command-line tool and design
>> its command system.
>> 
>> 1). The tool will have three modes (json, xml, or csv); a special
>> command switches the working mode.
>> 2). The tool will support two data input modes: from the keyboard or from an
>> existing data file.
>> 3). Its commands take a command-and-argument form. For example, "input -f"
>> means import data from existing data files, and "input -k" gives the user
>> a graphical data input area where data can be entered through the keyboard.
>> 4). Data output format function.
>> 5). If an exception occurs, it will be shown in the tool.
>> 
>> 
>> That is all. If you have any ideas, please let me know. Thank you and best
>> regards
>> 

