storm-user mailing list archives

From Svend Vanderveken <svend.vanderve...@gmail.com>
Subject Re: Reply: write to the same file in bolt?
Date Wed, 08 Jan 2014 11:03:14 GMT
Chen,


Have a look at Pail (https://github.com/nathanmarz/dfs-datastores). I have no
experience with it, but according to a book written by its creator it's a
good library :D

I think it fits your model:


   - When writing, all your distributed data providers (bolts in your case)
   write to the same "pail", e.g. /data/logs/ts-1234567
   - Behind the scenes, /data/logs/ts-1234567 is actually an HDFS folder,
   and Pail makes sure each source appends to a different, potentially
   small, file inside that folder
   - When reading, you can ask Pail to "absorb" the
   pail /data/logs/ts-1234567 into one single stream of data that you can feed
   into Hive or wherever.
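
To make the folder-per-time-window idea concrete, here is a rough sketch in
plain Java. It does not use Pail's API (which I have not tried) and it writes
to a local temp directory instead of HDFS. The class name, the method names and
the folder layout are all made up for illustration. Each writer appends to its
own file inside the folder for its 6-minute window, and a reader gets the
whole folder back as one list.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: not Pail, not HDFS, just the same layout idea on a local disk.
class SharedFolderSketch {

    /** Maps an epoch-millis timestamp to its 6-minute ("hourly tenth") folder, in UTC. */
    static String bucketName(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        int tenth = t.getMinute() / 6; // 0..9
        return String.format(Locale.ROOT, "%04d-%02d-%02d/hour=%02d/tenth=%d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour(), tenth);
    }

    /** Appends one entry to this writer's own file inside the bucket folder. */
    static void append(Path root, String writerId, long epochMillis, String entry) {
        try {
            Path dir = root.resolve(bucketName(epochMillis));
            Files.createDirectories(dir);
            Files.write(dir.resolve(writerId + ".log"),
                    (entry + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads every file in one bucket folder back as a single list of entries. */
    static List<String> readBucket(Path root, long epochMillis) {
        Path dir = root.resolve(bucketName(epochMillis));
        List<String> entries = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            // Sort by file name so the result does not depend on listing order.
            for (Path f : files.sorted().collect(Collectors.toList())) {
                entries.addAll(Files.readAllLines(f, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("logs");
        long ts = 1389178994000L; // 2014-01-08T11:03:14Z
        append(root, "bolt-1", ts, "a");
        append(root, "bolt-2", ts + 1000, "b");
        append(root, "bolt-1", ts + 2000, "c");
        System.out.println(bucketName(ts));
        System.out.println(readBucket(root, ts));
    }
}
```

Since no two writers ever touch the same file, the bolts need no coordination.
The price is many small files per folder, which is the part a library like
Pail is supposed to take care of.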


Does this make sense for your use case?

Cheers

S

On Wed, Jan 8, 2014 at 2:51 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:

> Dongchao,
> The problem is that I don't want to write each (very small) entry to HDFS
> individually, since that would make Hive loading very inefficient (though I
> could merge the files in a separate job). So ideally, I would like to write
> all entries within the same 6 min to the same file.
> Right now I am actually thinking about adding a timer (say 6 min) in my
> bolt, collecting all input in memory, and writing it to a single file on
> timeout...
> Chen
>
>
> On Tue, Jan 7, 2014 at 5:00 PM, Ding,Dongchao <dingdongchao@baidu.com> wrote:
>
>>   Hi, some suggestions:
>>
>> You don't need to “instruct data within the same hourly tenth to the
>> same bolt”; just write the entries within the same hourly tenth (6
>> min) to the same HDFS directory.
>>
>> That works because a Hive partition maps to one HDFS directory, not one
>> HDFS file.
>>
>> Thanks,
>>
>> ding
>>
>> *From:* Chen Wang [mailto:chen.apache.solr@gmail.com]
>> *Sent:* January 8, 2014 7:47
>> *To:* user@storm.incubator.apache.org
>> *Subject:* write to the same file in bolt?
>>
>>
>>
>> Hey Guys,
>>
>> I am using Storm to read data from our socket server, entry by entry.
>> Each entry has a timestamp. In my bolt, I need to write the entries
>> within the same hourly tenth (6 min) to the same HDFS file, so that later I
>> can load them into Hive (with the 6-min hourly tenth as the partition).
>>
>>
>>
>> In order to achieve that, I will need to either
>>
>>     1. instruct data within the same hourly tenth to the same bolt
>>
>> or  2. share the same file writer among all bolts that deal with data
>> within the same hourly tenth.
>>
>>
>>
>> How can I achieve this? Or is there some other approach to this
>> problem?
>>
>> Thank you very much!
>>
>> Chen
>>
>>
>>
>>
>>
>
>
