hadoop-hdfs-user mailing list archives

From Lenin Raj <emaille...@gmail.com>
Subject Re: typical JSON data sets
Date Thu, 04 Jul 2013 18:16:59 GMT
Hi John,

I have just started pulling Twitter conversations using Apache Flume, but I
have not started processing the pulled data yet. My answers are below:

1)      How large is each JSON document?

They average from 100 KB to 2 MB. Flume rolls a new file every 1 minute
(the interval is configurable), so the size depends on the number of events
that occurred during that interval.
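For reference, the rolling behavior is controlled on the Flume HDFS sink. A minimal sketch of the relevant properties might look like this (the agent and sink names here are placeholder examples, not from my actual setup; `hdfs.rollInterval` is in seconds):

```properties
# Hypothetical agent/sink names for illustration
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 60
# Setting rollSize and rollCount to 0 disables size- and count-based rolling,
# so files roll purely on the time interval
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
```

With only `rollInterval` active, each file covers a fixed time window, which is why file size tracks event volume.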

2)      Do they tend to be a single JSON doc per file, or multiples per
file?

Multiples per file - the largest file (3.2 MB) had about 1,100 JSON docs
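Since Flume's Twitter source writes one JSON event per line, counting docs per file is just a line-by-line parse. A small sketch (the sample data here is made up for illustration):

```python
import json
import tempfile

# Hypothetical sample file: three newline-delimited JSON docs,
# mimicking what Flume writes out for tweet events
sample = "\n".join(json.dumps({"id": i, "text": "tweet"}) for i in range(3))
with tempfile.NamedTemporaryFile("w+", suffix=".json", delete=False) as f:
    f.write(sample)
    path = f.name

def count_json_docs(path):
    """Count newline-delimited JSON documents in a file, skipping blank lines."""
    n = 0
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                json.loads(line)  # raises ValueError if a line is not valid JSON
                n += 1
    return n

print(count_json_docs(path))  # 3 for the sample above
```

The same loop is handy as a sanity check that a rolled file contains only complete JSON docs before feeding it downstream.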

3)      Do the JSON schemas change over time?

Nope. Since it's the standard Twitter API.

4)      Are there interesting public data sets you would recommend for
experiment?

Twitter API


Thanks,
Lenin


On Tue, Jul 2, 2013 at 9:34 PM, John Lilley <john.lilley@redpoint.net> wrote:

>  I would like to hear your experiences working with large JSON data sets,
> specifically:
>
> 1)      How large is each JSON document?
>
> 2)      Do they tend to be a single JSON doc per file, or multiples
> per file?
>
> 3)      Do the JSON schemas change over time?
>
> 4)      Are there interesting public data sets you would recommend
> for experiment?
>
> Thanks
>
> John
>
