flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Persist streams of data
Date Mon, 29 Sep 2014 16:21:32 GMT
Thanks Fabian for the support. See inline for answers:

On Mon, Sep 29, 2014 at 6:12 PM, Fabian Hueske <fhueske@apache.org> wrote:

> Hi,
> the right answer depends on (at least) two aspects:
> a) Do you have an actual streaming case or is it batch, i.e., does the
> data come from a potentially infinite stream or not? This basically
> determines the system to handle your data.
>   - Stream: I don't have much experience here, but Flink's new
> Streaming feature, Kafka or Flume might be worth looking at.
>   - Batch: A regular Flink job might work.

Stream: the triples are generated by an external program in batches of a given size.

> b) How do you want to access your data? This influences the format used to
> store the data.
>       - Full scans of some columns (large fraction of tuples) -> Parquet
> or ORC in HDFS
>       - Point access to certain tuples (also subsets of columns, few or
> many tuples) -> HBase
>       - always read all full tuples -> Avro, ProtoBufs in HDFS

Full scans of some columns. Is it possible to add a batch of rows to an
existing Parquet file, or do I need to create a new file for each batch?
Can I then read an entire directory containing those files at once?
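Parquet files are effectively immutable once closed, so the usual answer is: don't append to an existing file, write each batch to a new file in a common directory, and have readers scan the whole directory as one dataset. A minimal sketch of that one-file-per-batch pattern, using plain `java.nio` file I/O as a stand-in for an actual Parquet writer (the class, directory layout, and file-naming scheme are made up for illustration, not a Flink or Parquet API):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchPerFile {

    // Write one incoming batch to its own new file in the target directory.
    // With real Parquet, this would be one ParquetWriter per batch instead.
    static Path writeBatch(Path dir, int batchId, List<String> triples) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve(String.format("batch-%05d.txt", batchId));
        return Files.write(file, triples);
    }

    // Read the whole directory back as a single logical dataset,
    // the same way a Flink job pointed at the directory would see it.
    static List<String> readAll(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.sorted()
                        .flatMap(p -> {
                            try {
                                return Files.lines(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .collect(Collectors.toList());
        }
    }
}
```

Reading the directory at once then falls out for free: Flink's file input formats accept a directory path and process every file inside it, so each batch file simply becomes another split of the same dataset.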

> I don't know how much throughput these systems are able to handle though...
> Hope this helps,
> Fabian
> 2014-09-29 17:32 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>> Hi guys,
>> in my use case I have bursts of data coming into my system (RDF triples
>> generated from a CSV that I need to process in a further step) and I was
>> trying to figure out the best way to save them on HDFS.
>> Do you suggest saving them in HBase, or using a serialization format
>> like Avro/Parquet and similar? Do I need Flume as well, or is there a
>> Flink solution for that?
>> Best,
>> Flavio
