Dongchao,

The problem is that I do not want to write each entry (they are very small) to HDFS individually, because that would make Hive loading very inefficient (though I could merge the files in a separate job). Ideally, I would like to write all entries within the same 6 min window to the same file.

Right now I am thinking about adding a timer (say 6 min) to my bolt, collecting all input in memory, and writing it to a single file on timeout...

Chen

On Tue, Jan 7, 2014 at 5:00 PM, Ding,Dongchao <email@example.com> wrote:
Hi, some suggestions:
You don't need to “instruct data within the same hourly tenth to the same bolt”. Just write the entries within the same hourly tenth (6 min) to the same HDFS directory,
because a Hive partition maps to one HDFS directory, not to one HDFS file.
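
A minimal sketch of this suggestion, written in plain Python for brevity rather than against the Storm API: each bolt derives the partition directory from the entry's timestamp, so entries from the same 6 min window end up in the same directory regardless of which bolt handles them. The base path and the dt/hour/tenth layout are assumptions made for illustration; they do not come from this thread.

```python
from datetime import datetime

def partition_dir(ts: datetime, base: str = "/warehouse/entries") -> str:
    """Return the directory of the 6 min "hourly tenth" that contains ts.

    The base path and the dt=/hour=/tenth= naming are hypothetical.
    """
    tenth = ts.minute // 6  # 0..9, one value per 6 min slice of the hour
    return f"{base}/dt={ts:%Y-%m-%d}/hour={ts.hour:02d}/tenth={tenth}"

# Entries at 17:00 and 17:05 share a directory; 17:06 starts the next one.
print(partition_dir(datetime(2014, 1, 7, 17, 0)))
print(partition_dir(datetime(2014, 1, 7, 17, 5)))
print(partition_dir(datetime(2014, 1, 7, 17, 6)))
```

Each bolt would still need its own uniquely named file inside that directory, so this does not by itself reduce the number of small files.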
I am using Storm to read data from our socket server, entry by entry. Each entry has a timestamp. In my bolt, I need to write the entries within the same hourly tenth (6 min) to the same HDFS file, so that I can later load them into Hive (with the 6 min hourly tenth as the partition).
In order to achieve that, I will need to either:
1. instruct data within the same hourly tenth to the same bolt, or
2. share the same file writer among all bolts that deal with data within the same hourly tenth.
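
Option 1 amounts to routing on a key derived from the timestamp. The sketch below, in plain Python and not using the Storm API, shows the idea: compute a window id per entry and hash it to pick a task, so every entry of a given 6 min window goes to the same task. The function names and the CRC32 hash are illustrative assumptions. In Storm itself, a fields grouping on such a window id field should give the same effect, since it sends tuples with equal field values to the same task.

```python
import zlib

WINDOW_SECONDS = 6 * 60  # one "hourly tenth"

def bucket_id(epoch_seconds: int) -> int:
    """Number of the 6 min window that contains the timestamp."""
    return epoch_seconds // WINDOW_SECONDS

def task_for(epoch_seconds: int, num_tasks: int) -> int:
    """Pick a task index from the window id (hypothetical helper)."""
    key = str(bucket_id(epoch_seconds)).encode()
    return zlib.crc32(key) % num_tasks

# 1389114000 is 2014-01-07 17:00:00 UTC; both entries fall in the same window.
print(task_for(1389114000, 4) == task_for(1389114359, 4))
```

One consequence of this design is that all traffic of the current window lands on a single task, so the parallelism of that bolt is effectively one at any given moment.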
How can I achieve this? Or is there some other approach to this problem?
Thank you very much!
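
For reference, the timer idea from the top reply (collect entries in memory and write one file per 6 min window) can be sketched roughly as follows. This is plain Python, not Storm code; the WindowBuffer class and the write_file callback are hypothetical names, and a real bolt would have to call flush_closed periodically from its own timer.

```python
from collections import defaultdict

WINDOW_SECONDS = 6 * 60  # one "hourly tenth"

class WindowBuffer:
    """Buffers entries per 6 min window and writes each closed window once."""

    def __init__(self, write_file):
        # write_file(window_start, entries) stands in for the real HDFS write.
        self._write_file = write_file
        self._pending = defaultdict(list)

    def add(self, epoch_seconds, entry):
        start = epoch_seconds - epoch_seconds % WINDOW_SECONDS
        self._pending[start].append(entry)

    def flush_closed(self, now):
        """Write every window that ended at or before `now`."""
        for start in sorted(self._pending):  # sorted() copies the keys
            if start + WINDOW_SECONDS <= now:
                self._write_file(start, self._pending.pop(start))

written = []
buf = WindowBuffer(lambda start, entries: written.append((start, entries)))
buf.add(1389114000, "a")
buf.add(1389114100, "b")
buf.add(1389114400, "c")          # belongs to the next window
buf.flush_closed(now=1389114400)  # only the first window has ended
print(written)
```

Anything still in memory is lost if the worker dies before the flush, and entries that arrive after their window was written would start a second file for that window, so both cases need a decision.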