From: Amrith Kumar
To: "user@avro.apache.org"
Subject: RE: data missing in writing an AVRO file.
Date: Mon, 27 Jan 2014 17:19:12 +0000
Message-ID: <4edc11aa39254a439e75308db30202ad@BN1PR07MB022.namprd07.prod.outlook.com>
References: <75d44d342e9c40d98025dc52380f90fb@BN1PR07MB022.namprd07.prod.outlook.com>
Mailing-List: contact user-help@avro.apache.org; run by ezmlm
Reply-To: user@avro.apache.org
Content-Type: multipart/alternative; boundary="_000_4edc11aa39254a439e75308db30202adBN1PR07MB022namprd07pro_"
MIME-Version: 1.0

--_000_4edc11aa39254a439e75308db30202adBN1PR07MB022namprd07pro_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Thanks for your email Mika, I had downloaded 1.7.5 on Jan 8th and hadn't thought to check for an upgrade.

I've tried 1.7.6 and on a couple of files that I verified, the counts do match.

Many thanks!

-amrith

From: Mika Ristimaki [mailto:mika.ristimaki@gmail.com]
Sent: Monday, January 27, 2014 11:51 AM
To: user@avro.apache.org
Subject: Re: data missing in writing an AVRO file.

Hi,

This is most likely related to this issue https://issues.apache.org/jira/browse/AVRO-1364. It is fixed in Avro 1.7.6, so first try updating your Avro-C lib.

-Mika

On Jan 27, 2014, at 6:35 PM, Amrith Kumar <amrith@parelastic.com> wrote:

Here is some additional debugging information ...

I created this simple CSV file that looks thus.

ubuntu@petest1:/mnt/avrotest$ head maketest.csv
"data1", "data2",
0, 1804289383,
1, 846930886,
2, 1681692777,
3, 1714636915,
4, 1957747793,
5, 424238335,
6, 719885386,
7, 1649760492,
8, 596516649,
ubuntu@petest1:/mnt/avrotest$ tail maketest.csv
499990, 1910331393,
499991, 1091319779,
499992, 805782879,
499993, 1636478990,
499994, 1827956658,
499995, 1695362021,
499996, 1235853180,
499997, 208721086,
499998, 1836333752,
499999, 699496062,

Nothing fancy, just 500,000 rows of data with the row number in the first column and some random integer in the second.

Here is the avro conversion.

ubuntu@petest1:/mnt/avrotest$ csvtoavro -i maketest.csv -o maketest.avro
2014-01-27 11:28:40 csvtoavro: Processed maketest.csv with 500001 rows of data

Since there is a header row which gets counted it says 500,001.
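
A test file shaped like maketest.csv above could be produced with a few lines of C. The following is only a sketch: the message does not say how the file was generated, and the function name `write_test_csv` is hypothetical. The first values shown in the listing appear to coincide with what an unseeded `rand()` returns on glibc, so the sketch leaves the generator unseeded, but that is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Write a header line followed by `rows` data lines of the form
 * "<row number>, <pseudo-random integer>," to `out`.
 * Returns 0 on success, -1 if any write fails.
 * Name and signature are illustrative, not taken from the thread.
 */
int write_test_csv(FILE *out, long rows)
{
    /* Header row, matching the two column names shown in the listing. */
    if (fprintf(out, "\"data1\", \"data2\",\n") < 0)
        return -1;

    for (long i = 0; i < rows; i++) {
        /* rand() is deliberately left unseeded (an assumption about
         * how the original file was made); values are platform dependent. */
        if (fprintf(out, "%ld, %d,\n", i, rand()) < 0)
            return -1;
    }

    /* Report buffered write errors to the caller as well. */
    return fflush(out) == 0 ? 0 : -1;
}
```

Calling `write_test_csv(fp, 500000)` on a stream opened for writing would give 500,001 lines including the header, which is the count csvtoavro reports above.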

Now, here is the output from avrocat

ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | head -n 10
{"data1": "0", "data2": " 1804289383"}
{"data1": "1", "data2": " 846930886"}
{"data1": "2", "data2": " 1681692777"}
{"data1": "3", "data2": " 1714636915"}
{"data1": "4", "data2": " 1957747793"}
{"data1": "5", "data2": " 424238335"}
{"data1": "6", "data2": " 719885386"}
{"data1": "7", "data2": " 1649760492"}
{"data1": "8", "data2": " 596516649"}
{"data1": "9", "data2": " 1189641421"}
ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | tail -n 10
{"data1": "499944", "data2": " 929606694"}
{"data1": "499945", "data2": " 973636875"}
{"data1": "499946", "data2": " 1942285618"}
{"data1": "499947", "data2": " 2089133167"}
{"data1": "499948", "data2": " 213614747"}
{"data1": "499949", "data2": " 599060422"}
{"data1": "499950", "data2": " 1885053377"}
{"data1": "499951", "data2": " 2100042242"}
{"data1": "499952", "data2": " 1491280709"}
{"data1": "499953", "data2": " 1103081139"}
ubuntu@petest1:/mnt/avrotest$./maketest.avro
./maketest.avro 499954

For completeness, here is some data from the CSV file showing values around where the AVRO file appears to end.

499940, 1054581755,
499941, 600032353,
499942, 1997078786,
499943, 1508121989,
499944, 929606694,
499945, 973636875,
499946, 1942285618,
499947, 2089133167,
499948, 213614747,
499949, 599060422,
499950, 1885053377,
499951, 2100042242,
499952, 1491280709,
499953, 1103081139,
499954, 521709408,
499955, 494574550,
499956, 756884387,
499957, 2035729858,
499958, 1560742697,
499959, 923330093,

In other words, the last 46 rows of data appear to be missing.

-amrith

From: Amrith Kumar [mailto:amrith@parelastic.com]
Sent: Monday, January 27, 2014 11:23 AM
To: user@avro.apache.org
Subject: data missing in writing an AVRO file.

Greetings,

I'm attempting to convert some very large CSV files into AVRO format. To this end, I wrote a csvtoavro converter using C API v1.7.5.

The essence of the program is this:

// initialize line counter
lineno = 0;

// make a schema first
avro_schema_from_json_length (...);

// make a generic class from schema
iface = avro_generic_class_from_schema( schema );

// get the record size and verify that it is 109
avro_schema_record_size (schema);

// get a generic value
avro_generic_value_new (iface, &tuple);

// make me an output file
fp = fopen ( outputfile, "wb" );

// make me a filewriter
avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);

// now for the code to emit the data

while (...)
{
    avro_value_reset (&tuple);

    // get the CSV record into the tuple
    ...

    // write that tuple
    avro_file_writer_append_value (db, &tuple);

    lineno ++;

    // flush the file
    avro_file_writer_flush (db);
}

// close the output file
avro_file_writer_close (db);

// other cleanup
avro_value_iface_decref (iface);
avro_value_decref (&tuple);

// close output file
fflush (outfp);
fclose (outfp);

I read the file using a modified version of avrocat.c that looks like this.

wschema = avro_file_reader_get_writer_schema(reader);
iface = avro_generic_class_from_schema(wschema);
avro_generic_value_new(iface, &value);

int rval;
lineno = 0;

while ((rval = avro_file_reader_read_value(reader, &value)) == 0) {
    lineno ++;
    avro_value_reset(&value);
}

// If it was not an EOF that caused it to fail,
// print the error.
if (rval != EOF)
{
    fprintf(stderr, "Error: %s\n", avro_strerror());
}
else
{
    printf ( "%s %lld\n", filename, lineno );
}

On many files, I find no data is missing in the .AVRO file. However, quite often I get files where several dozen rows of data are missing.

I'm certain that I'm doing something wrong, and something very basic. Any help debugging would be most appreciated.

Thanks,

-amrith

--_000_4edc11aa39254a439e75308db30202adBN1PR07MB022namprd07pro_
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

--_000_4edc11aa39254a439e75308db30202adBN1PR07MB022namprd07pro_--
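
The gap reported in the thread follows from simple arithmetic: 500,001 CSV lines minus one header line gives 500,000 expected records, and the modified avrocat counted 499,954, leaving 46 unaccounted for. A minimal sketch of that check in C is shown here, assuming the Avro record count is obtained separately (for example from the reader loop in the message). The helper names `count_lines` and `missing_records` are hypothetical and not part of Avro-C.

```c
#include <stdio.h>

/*
 * Count newline-terminated lines in a text stream.
 * Returns the line count, or -1 if a read error occurred.
 */
long count_lines(FILE *in)
{
    long lines = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\n')
            lines++;
    }
    return ferror(in) ? -1 : lines;
}

/*
 * Given the number of lines in the CSV file, whether the first line is
 * a header, and the number of records read back from the Avro file,
 * return how many data rows did not make it into the Avro file.
 */
long missing_records(long csv_lines, int has_header, long avro_records)
{
    long expected = csv_lines - (has_header ? 1 : 0);
    return expected - avro_records;
}
```

With the figures from the thread, `missing_records(500001, 1, 499954)` evaluates to 46; a result of 0 would correspond to the matching counts reported after the upgrade to 1.7.6.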