avro-user mailing list archives

From Amrith Kumar <amr...@parelastic.com>
Subject RE: data missing in writing an AVRO file.
Date Mon, 27 Jan 2014 16:35:45 GMT

Here is some additional debugging information ...

I created a simple CSV file that looks like this:

ubuntu@petest1:/mnt/avrotest$ head maketest.csv
"data1", "data2",
0, 1804289383,
1, 846930886,
2, 1681692777,
3, 1714636915,
4, 1957747793,
5, 424238335,
6, 719885386,
7, 1649760492,
8, 596516649,
ubuntu@petest1:/mnt/avrotest$ tail maketest.csv
499990, 1910331393,
499991, 1091319779,
499992, 805782879,
499993, 1636478990,
499994, 1827956658,
499995, 1695362021,
499996, 1235853180,
499997, 208721086,
499998, 1836333752,
499999, 699496062,

Nothing fancy, just 500,000 rows of data with the row number in the first column and some
random integer in the second.
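
The generator for this file is not shown in the message. As a minimal sketch of how such a file could be produced, assuming plain `rand()` for the second column and the same trailing-comma layout as the listing, something like the following would do. The function name `write_test_csv` is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Write a header line plus `rows` data lines in the layout shown above:
 * row number, pseudo-random integer, and a trailing comma on every line.
 * Returns 0 on success, -1 if any write fails.
 * write_test_csv is a hypothetical name, not part of the original tooling.
 */
int write_test_csv(FILE *out, long rows)
{
    if (fprintf(out, "\"data1\", \"data2\",\n") < 0)
        return -1;

    for (long i = 0; i < rows; i++) {
        /* rand() is left unseeded, so the sequence is repeatable per libc. */
        if (fprintf(out, "%ld, %d,\n", i, rand()) < 0)
            return -1;
    }

    /* Surface buffered write errors before the caller closes the stream. */
    return fflush(out) == 0 ? 0 : -1;
}
```

Calling it with a stream opened on maketest.csv and a row count of 500000 would give a file of the same shape, which makes the missing-row problem easy to reproduce on another machine.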

Here is the avro conversion.

ubuntu@petest1:/mnt/avrotest$ csvtoavro -i maketest.csv -o maketest.avro
2014-01-27 11:28:40  csvtoavro: Processed maketest.csv with 500001 rows of data

The count is 500,001 because the header row is included.

Now, here is the output from avrocat

ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | head -n 10
{"data1": "0", "data2": " 1804289383"}
{"data1": "1", "data2": " 846930886"}
{"data1": "2", "data2": " 1681692777"}
{"data1": "3", "data2": " 1714636915"}
{"data1": "4", "data2": " 1957747793"}
{"data1": "5", "data2": " 424238335"}
{"data1": "6", "data2": " 719885386"}
{"data1": "7", "data2": " 1649760492"}
{"data1": "8", "data2": " 596516649"}
{"data1": "9", "data2": " 1189641421"}
ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | tail -n 10
{"data1": "499944", "data2": " 929606694"}
{"data1": "499945", "data2": " 973636875"}
{"data1": "499946", "data2": " 1942285618"}
{"data1": "499947", "data2": " 2089133167"}
{"data1": "499948", "data2": " 213614747"}
{"data1": "499949", "data2": " 599060422"}
{"data1": "499950", "data2": " 1885053377"}
{"data1": "499951", "data2": " 2100042242"}
{"data1": "499952", "data2": " 1491280709"}
{"data1": "499953", "data2": " 1103081139"}
ubuntu@petest1:/mnt/avrotest$./maketest.avro
./maketest.avro 499954

For completeness, here are the rows from the CSV file around the point where the
AVRO file appears to end.

499940, 1054581755,
499941, 600032353,
499942, 1997078786,
499943, 1508121989,
499944, 929606694,
499945, 973636875,
499946, 1942285618,
499947, 2089133167,
499948, 213614747,
499949, 599060422,
499950, 1885053377,
499951, 2100042242,
499952, 1491280709,
499953, 1103081139,
499954, 521709408,
499955, 494574550,
499956, 756884387,
499957, 2035729858,
499958, 1560742697,
499959, 923330093,

In other words, the last 46 rows of data appear to be missing.
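
The figure of 46 comes from 500,000 data rows in the CSV minus the 499,954 records counted in the AVRO file. A small sketch of that check, assuming the CSV has exactly one header line and one record per line, is below. Both function names are hypothetical.

```c
#include <stdio.h>

/*
 * Count the data rows in a CSV stream: every line except the header.
 * A final line without a trailing newline is still counted.
 */
long count_csv_data_rows(FILE *in)
{
    long lines = 0;
    int c;
    int last = '\n';

    while ((c = fgetc(in)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')
        lines++;                      /* unterminated last line */

    return lines > 0 ? lines - 1 : 0; /* drop the header row */
}

/* Rows present in the CSV but absent from the AVRO file. */
long missing_rows(long csv_data_rows, long avro_records)
{
    return csv_data_rows - avro_records;
}
```

With the numbers from this run, `missing_rows(500000, 499954)` gives 46, which matches rows 499954 through 499999.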

-amrith

From: Amrith Kumar [mailto:amrith@parelastic.com]
Sent: Monday, January 27, 2014 11:23 AM
To: user@avro.apache.org
Subject: data missing in writing an AVRO file.

Greetings,

I'm attempting to convert some very large CSV files into AVRO format. To this end, I wrote
a csvtoavro converter using the Avro C API v1.7.5.

The essence of the program is this:

// initialize line counter
lineno = 0;

// make a schema first
avro_schema_from_json_length (...);

// make a generic class from schema
iface = avro_generic_class_from_schema( schema );

// get the record size and verify that it is 109
avro_schema_record_size (schema);

// get a generic value
avro_generic_value_new (iface, &tuple);

// make me an output file
fp = fopen ( outputfile, "wb" );

// make me a filewriter
avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);

// now for the code to emit the data

while (...)
{
    avro_value_reset (&tuple);

    // get the CSV record into the tuple
    ...

    // write that tuple
    avro_file_writer_append_value (db, &tuple);

    lineno ++;

    // flush the file
    avro_file_writer_flush (db);
}

// close the file writer
avro_file_writer_close (db);

// other cleanup
avro_value_iface_decref (iface);
avro_value_decref (&tuple);

// flush and close the underlying output file
fflush (fp);
fclose (fp);
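
The writer outline above does not show whether the return codes of the append, flush, and close calls are checked. One way to narrow the problem down is to record every nonzero return code along with the row being written. This is a sketch, assuming those calls return zero on success. The names `check_rc`, `CHECK`, and `failed_calls` are hypothetical and not part of any library.

```c
#include <stdio.h>

/* Number of calls that returned a nonzero code so far. */
long failed_calls = 0;

/*
 * Report a nonzero return code together with the text of the call and
 * the input row being processed, then hand the code back to the caller.
 */
int check_rc(int rc, const char *what, long lineno)
{
    if (rc != 0) {
        failed_calls++;
        fprintf(stderr, "row %ld: %s returned %d\n", lineno, what, rc);
    }
    return rc;
}

/* Wrap any int-returning call; #call turns the call text into a string. */
#define CHECK(call, lineno) check_rc((call), #call, (lineno))
```

In the loop, each writer call would be wrapped, for example `CHECK(avro_file_writer_append_value (db, &tuple), lineno)`, and the message from avro_strerror() could be printed whenever the result is nonzero. If `failed_calls` is still zero after the close, the loss is more likely to be in how the file is finalized or read back than in an individual append.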

I read the file using a modified version of avrocat.c that looks like this:

wschema = avro_file_reader_get_writer_schema(reader);
iface = avro_generic_class_from_schema(wschema);
avro_generic_value_new(iface, &value);

int rval;
lineno = 0;

while ((rval = avro_file_reader_read_value(reader, &value)) == 0) {
    lineno ++;
    avro_value_reset(&value);
}

// If it was not an EOF that caused it to fail,
// print the error.
if (rval != EOF)
{
    fprintf(stderr, "Error: %s\n", avro_strerror());
}
else
{
    printf ( "%s %lld\n", filename, lineno );
}

On many files, I find no data is missing in the .AVRO file. However, quite often I get files
where several dozen rows of data are missing.

I'm certain that I'm doing something wrong, and something very basic. Any help debugging would
be most appreciated.

Thanks,

-amrith
