storm-user mailing list archives

From "Ding,Dongchao" <dingdongc...@baidu.com>
Subject Re: Re: write to the same file in bolt?
Date Wed, 08 Jan 2014 04:15:02 GMT
Hi Chen,
    “Right now I am actually thinking about adding a timer (say 6 min) in my bolt, collecting
all input in memory, and writing to a single file on timeout...”
I think that is OK.

    In my application, I use Storm to read data from an MQ.
In the spout, I package many messages (until the pack reaches 50M or I have been reading for 30 minutes) into one pack, and then
send it to the bolt.
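
The packing rule described above (emit a pack once it is large enough or old enough) can be sketched as follows. This is only an illustration in plain Python, not the code used in the application: the class name `Batcher`, its parameters, and the reading of "50M" as 50 MB are all assumptions.

```python
import time


class Batcher:
    """Hypothetical helper: collect messages until a size or age limit is hit."""

    def __init__(self, max_bytes=50 * 1024 * 1024, max_age_s=30 * 60,
                 clock=time.monotonic):
        self.max_bytes = max_bytes    # assumed meaning of "50M": 50 MB
        self.max_age_s = max_age_s    # 30 minutes
        self.clock = clock            # injectable so the logic can be tested
        self._items = []
        self._size = 0
        self._started = None          # time the first message of the pack arrived

    def add(self, message: bytes):
        """Add one message; return the finished pack if a limit was reached, else None."""
        if self._started is None:
            self._started = self.clock()
        self._items.append(message)
        self._size += len(message)
        too_big = self._size >= self.max_bytes
        too_old = self.clock() - self._started >= self.max_age_s
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self):
        """Return the current pack and start a new, empty one."""
        pack, self._items = self._items, []
        self._size = 0
        self._started = None
        return pack
```

Note that in this sketch the age limit is only checked when a message arrives, so a quiet source would also need a periodic call to `flush()`.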


From: Chen Wang [mailto:chen.apache.solr@gmail.com]
Sent: January 8, 2014 9:51
To: user@storm.incubator.apache.org
Subject: Re: Re: write to the same file in bolt?

Dongchao,
the problem is that I do not want to write each (very small) entry to HDFS on its own, since this would make
Hive loading very inefficient (though I could merge the files in a separate job). So ideally,
I would like to write all entries within the same 6 min to the same file.
Right now I am actually thinking about adding a timer (say 6 min) in my bolt, collecting all input
in memory, and writing it to a single file on timeout...
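
A minimal sketch of that idea, assuming entries carry a timestamp in seconds and that something calls `flush_expired` periodically (the timer). `WindowBuffer` and its methods are hypothetical names, and actually writing the returned entries to a file is left out:

```python
from collections import defaultdict

WINDOW_S = 6 * 60  # one "hourly tenth" in seconds


class WindowBuffer:
    """Hypothetical in-memory buffer that groups entries by 6-minute window."""

    def __init__(self, window_s=WINDOW_S):
        self.window_s = window_s
        self._buckets = defaultdict(list)  # window start (s) -> entries

    def add(self, timestamp_s, entry):
        # Round the timestamp down to the start of its window.
        start = int(timestamp_s) // self.window_s * self.window_s
        self._buckets[start].append(entry)

    def flush_expired(self, now_s):
        """Remove and return every window that has fully ended by now_s."""
        done = {}
        for start in sorted(self._buckets):
            if start + self.window_s <= now_s:
                done[start] = self._buckets.pop(start)
        return done
```

Each key in the returned dict would correspond to one output file. Anything still held in memory is lost if the worker dies before the flush, so this trades durability for fewer, larger files.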
Chen

On Tue, Jan 7, 2014 at 5:00 PM, Ding,Dongchao <dingdongchao@baidu.com> wrote:
Hi, some suggestions:
You don’t need to “instruct data within the same hourly tenth to the same bolt”;
just write the entries within the same hourly tenth (6 min) to the same HDFS directory,
because a Hive partition maps to one HDFS directory, not to one HDFS file.
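
As a rough illustration of this suggestion, each bolt task could derive the partition directory from the entry's timestamp and write its own file inside it. The directory layout (`dt=`, `hour=`, `tenth=`), the base path, and the `part-NNNNN` file naming below are assumptions made for the example, not something prescribed by Hive or Storm:

```python
from datetime import datetime, timezone


def partition_dir(base, timestamp_s):
    """Map a Unix timestamp to a per-hourly-tenth directory (hypothetical layout)."""
    t = datetime.fromtimestamp(timestamp_s, tz=timezone.utc)
    tenth = t.minute // 6  # 0..9: which 6-minute slice of the hour
    return f"{base}/dt={t:%Y-%m-%d}/hour={t:%H}/tenth={tenth}"


def part_file(directory, task_id):
    """One file per writer task, so tasks never share a file handle."""
    return f"{directory}/part-{task_id:05d}"
```

With this scheme several tasks can write entries for the same 6-minute window concurrently: they end up in the same directory, each in its own file.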
Thanks,
Ding
From: Chen Wang [mailto:chen.apache.solr@gmail.com]
Sent: January 8, 2014 7:47
To: user@storm.incubator.apache.org
Subject: write to the same file in bolt?

Hey Guys,
I am using Storm to read data from our socket server, entry by entry. Each entry has a
timestamp. In my bolt, I need to write the entries within the same hourly tenth (6 min) to
the same HDFS file, so that I can later load them into Hive (with the 6-min hourly tenth as the
partition).

In order to achieve that, I need to either
    1. route data within the same hourly tenth to the same bolt, or
    2. share the same file writer among all bolts that deal with data within the same hourly
tenth.
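
Option 1 amounts to key-based routing: derive a key from the timestamp and send every tuple with the same key to the same task. Storm's fields grouping is meant for this kind of routing; the plain-Python sketch below only mimics the idea and is not Storm's actual hashing. `bucket_key` and `task_for` are hypothetical names.

```python
import zlib


def bucket_key(timestamp_s, window_s=360):
    """Index of the 6-minute window that a Unix timestamp falls into."""
    return int(timestamp_s) // window_s


def task_for(key, num_tasks):
    """Pick a task from a stable hash, so every sender agrees on the target."""
    return zlib.crc32(str(key).encode("utf-8")) % num_tasks
```

One consequence of this design is that all entries of a given window land on a single task, so at any moment most of the load goes to one bolt instance.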

How can I achieve this? Or is there some other approach to this problem?
Thank you very much!
Chen


