storm-user mailing list archives

From Svend Vanderveken <svend.vanderve...@gmail.com>
Subject Re: Reply: write to the same file in bolt?
Date Wed, 08 Jan 2014 11:09:13 GMT
Oh, just realized: chapter 3 of the book I was referring to is actually
free on the publisher's web page. You'll find an illustrated explanation
of Pail there:

http://manning.com/marz/


On Wed, Jan 8, 2014 at 12:03 PM, Svend Vanderveken <
svend.vanderveken@gmail.com> wrote:

> Chen,
>
>
> Have a look at Pail (https://github.com/nathanmarz/dfs-datastores). I have
> no experience with it, but according to a book written by its creator it's a
> good library :D
>
> I think it fits your model:
>
>
>    - when writing, all your distributed data providers (bolts in your
>    case) write to the same "pail", e.g. /data/logs/ts-1234567
>    - Behind the scenes, /data/logs/ts-1234567 is actually an HDFS folder,
>    and Pail makes sure each source appends to a different, potentially
>    small, file inside that folder
>    - when reading, you can ask pail to "absorb" the
>    pail /data/logs/ts-1234567 into one single stream of data that you can feed
>    into hive or wherever.
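
To make the layout described above concrete, here is a minimal local-filesystem sketch of the same pattern: every writer appends to its own file inside one shared folder, and a reader treats the folder as a single stream. It does not use the Pail API or HDFS; the names append_record and read_all and the ".log" suffix are made up for illustration only.

```python
import os
import tempfile


def append_record(folder, writer_id, record):
    """Append one record to the file owned by this writer inside the shared folder."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{writer_id}.log")  # hypothetical naming scheme
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")


def read_all(folder):
    """Yield every record in the folder, file by file, as one stream."""
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        pail = os.path.join(root, "ts-1234567")
        append_record(pail, "bolt-1", "entry a")
        append_record(pail, "bolt-2", "entry b")
        append_record(pail, "bolt-1", "entry c")
        print(list(read_all(pail)))
```

Because no two writers ever share a file, no coordination between them is needed; the cost is many small files, which is why a merge step on the reading side matters.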
>
>
> Does this make sense for your use case?
>
> Cheers
>
> S
>
> On Wed, Jan 8, 2014 at 2:51 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
>> Dongchao,
>> The problem is that I do not want to write each (very small) entry to
>> HDFS separately, since that would make Hive loading very inefficient
>> (though I could merge the files in a separate job). Ideally, I would like
>> to write all entries within the same 6 min to the same file.
>> Right now I am actually thinking about adding a timer (say 6 min) to my
>> bolt, collecting all input in memory, and writing it to a single file on
>> timeout...
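
The timer idea above could look roughly like the following sketch, which is plain Python with no Storm or HDFS dependency. BucketBuffer, add and flush_expired are hypothetical names; the caller is assumed to invoke flush_expired periodically and to write each returned batch to a single file. Note that anything still held in memory is lost if the worker dies before the flush.

```python
from collections import defaultdict

BUCKET_SECONDS = 6 * 60  # one "hourly tenth"


class BucketBuffer:
    """Collects entries in memory, grouped by the 6-minute window of their timestamp."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, epoch_seconds, entry):
        # Round the timestamp down to the start of its 6-minute window.
        start = int(epoch_seconds) // BUCKET_SECONDS * BUCKET_SECONDS
        self._buckets[start].append(entry)

    def flush_expired(self, now_seconds):
        """Remove and return {window_start: entries} for windows that have ended."""
        expired = {
            start: entries
            for start, entries in self._buckets.items()
            if start + BUCKET_SECONDS <= now_seconds
        }
        for start in expired:
            del self._buckets[start]
        return expired
```

Grouping by the entry's own timestamp rather than by arrival time means a late entry still lands in the right window, as long as that window has not been flushed yet.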
>> Chen
>>
>>
>> On Tue, Jan 7, 2014 at 5:00 PM, Ding,Dongchao <dingdongchao@baidu.com> wrote:
>>
>>> Hi, some suggestions:
>>>
>>> You don't need to "instruct data within the same hourly tenth to the
>>> same bolt"; just write the entries within the same hourly tenth (6
>>> min) to the same HDFS directory.
>>>
>>> This works because a Hive partition maps to one HDFS directory, not to
>>> one HDFS file.
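
As a sketch of the directory-per-partition suggestion above: each bolt derives the partition directory from the entry's timestamp and writes its own file there. The dt=/hour=/tenth= layout, the function names, the "part-" file prefix and the use of UTC are all assumptions for illustration; the real layout has to match however the Hive table's partitions are declared.

```python
from datetime import datetime, timezone

BUCKET_MINUTES = 6  # one "hourly tenth"


def partition_dir(epoch_seconds, base="/data/logs"):
    """Return the directory for the 6-minute window that contains the timestamp."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    tenth = t.minute // BUCKET_MINUTES  # 0..9 within the hour
    return f"{base}/dt={t:%Y-%m-%d}/hour={t:%H}/tenth={tenth}"


def part_file(epoch_seconds, task_id, base="/data/logs"):
    """Return a per-task file path, so that no two bolts ever share a file."""
    return f"{partition_dir(epoch_seconds, base)}/part-{task_id}"
```

With this scheme all bolts handling the same window agree on the directory without talking to each other, and only the file name differs per task.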
>>>
>>> Thanks
>>>
>>> ding
>>>
>>> *From:* Chen Wang [mailto:chen.apache.solr@gmail.com]
>>> *Sent:* January 8, 2014 7:47
>>> *To:* user@storm.incubator.apache.org
>>> *Subject:* write to the same file in bolt?
>>>
>>>
>>>
>>> Hey Guys,
>>>
>>> I am using Storm to read data from our socket server, entry by entry.
>>> Each entry has a timestamp. In my bolt, I need to write the entries
>>> within the same hourly tenth (6 min) to the same HDFS file, so that I can
>>> later load them into Hive (with the 6-min hourly tenth as the partition).
>>>
>>>
>>>
>>> To achieve that, I need to either
>>>
>>>     1. route data within the same hourly tenth to the same bolt
>>>
>>> or  2. share the same file writer among all bolts that deal with data
>>> within the same hourly tenth.
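
For option 1, the usual idea is to derive a key from the timestamp and route on it, which is what a fields grouping on that key does in Storm. The sketch below only illustrates the principle in plain Python: bucket_key and task_for are hypothetical names, and the CRC32-modulo step is a stand-in, not Storm's actual hashing.

```python
import zlib

BUCKET_SECONDS = 6 * 60  # one "hourly tenth"


def bucket_key(epoch_seconds):
    """Return an integer identifying the 6-minute window of the timestamp."""
    return int(epoch_seconds) // BUCKET_SECONDS


def task_for(key, num_tasks):
    """Map a window key to a task index with a hash that is stable across processes."""
    return zlib.crc32(str(key).encode("ascii")) % num_tasks
```

One consequence worth keeping in mind: at any moment nearly all live traffic belongs to the current window, so routing by window alone sends almost everything to a single task.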
>>>
>>>
>>>
>>> How can I achieve this? Or is there some other approach to this
>>> problem?
>>>
>>> Thank you very much!
>>>
>>> Chen
>>>
>>>
>>>
>>>
>>>
>>
>>
>
