Date: Wed, 26 Mar 2014 09:55:46 +0000
Subject: Re: Schema not getting saved along with Data
From: Lewis John Mcgibbney
To: user@avro.apache.org
In-Reply-To: <9d4289338e194635bf8a19400280a334@MBX1.impetus.co.in>
References: <3c8e3eb67efc40a0928db7f5b50fafd5@MBX1.impetus.co.in> <4ecbb69bc4d74da2a04e38ea528f61ec@HKNPR03MB401.apcprd03.prod.outlook.com> <9d4289338e194635bf8a19400280a334@MBX1.impetus.co.in>

Hi Sachneet,

On Wed, Mar 26, 2014 at 8:37 AM, Sachneet Singh Bains <sachneets.bains@impetus.co.in> wrote:

> Hi Sean,
>
> My use case is to store incoming data (various sources) into a database
> like Cassandra. The data will be serialized using AVRO.

It would be foolish of me NOT to put in a plug here for Apache Gora [0]. Gora is an acronym for Generic Object Representation using Avro, so it may well do exactly what you are trying to do out of the box. Cassandra is just one of the NoSQL databases we support in Gora. You can see more by reading the site documentation.

[0] http://gora.apache.org

> My questions are:
>
> 1. What is the best way to do this?

Right now the gora-cassandra module supports the following Avro data types: Type.STRING, Type.BOOLEAN, Type.BYTES, Type.DOUBLE, Type.FLOAT, Type.INT, Type.LONG, Type.FIXED, Type.ARRAY, Type.MAP, Type.UNION, Type.RECORD. For a more comprehensive overview of how we actually store the data, you can head over to dev@gora and post your question, and we will reply in full.

> 2. How should I keep the schema information along with each record?
> E.g. two columns, one storing data and another schema/fingerprints?

Well, this is certainly an option. Right now, though, it appears that we store (prepend) the Schema with the data as it is. The storage logic is focused on the data and not on the data schema/fingerprints. Therefore, when executing Gora queries in Cassandra, we query the Cassandra keyspace by families. When we add sub/supercolumns, Gora keys are mapped to Cassandra partition keys only. This is because we follow the Cassandra logic where column family data is partitioned across nodes based on row key. You would therefore need to change some aspect of the data modeling if you really wished to store metadata such as Schema and fingerprints separately.

> 3. I see fingerprints as one option, but how to make use of it; where to
> maintain the schema repository and how to add fingerprints to data

I've never used fingerprints so I cannot comment. Sorry!

> 4. Also, I am wondering if there is ant feature to automatically
> generate a schema from an incoming data (CSV format)?

Everything for Java is Mavenized. There will be no Ant target. You could possibly write an implementation for avro-tools which would achieve this for you. You can see the current options in avro-tools by looking at the Main#Main() method:
https://svn.apache.org/repos/asf/avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/Main.java

> 5. Is there any recommended database to store data in AVRO format
> (relational or NoSQL)?

No, there is no recommended DB. LOADS of use cases use many different DBs. I would suggest you consider your data and how you will be querying it before you choose your DB.

Hopefully some of the above gives food for thought.

Lewis
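
On question 4, here is a rough idea of what a CSV-to-schema generator might do. This is only a minimal sketch in plain Python, not an avro-tools implementation and not anything Gora or Avro ships. The function names (infer_avro_type, infer_avro_schema), the default record name CsvRecord, and the type-guessing rules are all assumptions made for illustration. It also assumes the CSV header row already contains valid Avro field names.

```python
import csv
import io
import json


def infer_avro_type(values):
    """Guess the narrowest Avro primitive that fits every non-empty value.

    Hypothetical helper: the order boolean -> long -> double -> string is an
    assumption of this sketch, not a rule from the Avro specification.
    """
    present = [v for v in values if v != ""]
    if not present:
        return "string"  # nothing to go on, fall back to string

    def all_parse(parser):
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in ("true", "false") for v in present):
        return "boolean"
    if all_parse(int):
        return "long"
    if all_parse(float):
        return "double"
    return "string"


def infer_avro_schema(csv_text, record_name="CsvRecord"):
    """Build an Avro record schema (as a dict) from CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]

    fields = []
    for i, name in enumerate(header):
        # Treat missing trailing cells as empty values.
        column = [row[i] if i < len(row) else "" for row in rows]
        avro_type = infer_avro_type(column)
        if "" in column:
            # Empty cells become a nullable union.
            avro_type = ["null", avro_type]
        fields.append({"name": name.strip(), "type": avro_type})

    return {"type": "record", "name": record_name, "fields": fields}


sample = "id,name,score,active\n1,alice,9.5,true\n2,bob,,false\n"
print(json.dumps(infer_avro_schema(sample), indent=2))
```

For the sample above, id comes out as long, name as string, score as a ["null", "double"] union because of the empty cell, and active as boolean. A real tool would need to sample large files rather than read everything, and would have to sanitize header names that are not legal Avro names.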