hadoop-common-user mailing list archives

From Vitaliy Semochkin <vitaliy...@gmail.com>
Subject Distributed Updateable Cache
Date Thu, 22 Jul 2010 11:56:40 GMT

I need to do calculations that would benefit from storing information in
a distributed, updatable cache.
What are the best practices for this in Hadoop?

In case there is no good solution for my problem, here are the details and
the ideas I have.
I'm going to count the unique visitors of a site several times per day (every
5 minutes). For that I will need a distributed cache, accessible from all
mappers, that stores the visitors already counted.

My plan is:
1. Store the unique visitors in a file on HDFS.
2. Each time a mapper JVM starts, load that file into a HashSet in that JVM
   (I use mapred.job.reuse.jvm.num.tasks=-1, so the JVM is reused across tasks).
3. After each map/reduce job, append the newly seen visitors to this file.
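
The plan above could be sketched roughly as follows. This is only a minimal
illustration in plain Java, not working Hadoop code: the class SeenVisitors
and its methods are hypothetical names, and a local file stands in for the
HDFS file. The static field is what would let the set survive across tasks
when the JVM is reused.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop.
// Assumption: a local file stands in for the file kept on HDFS.
final class SeenVisitors {
    // Static so the set is shared by all tasks that run in one reused JVM.
    private static Set<String> seen;

    // Load the visitor file once per JVM; later calls reuse the in-memory set.
    static synchronized Set<String> load(Path file) throws IOException {
        if (seen == null) {
            seen = new HashSet<>();
            if (Files.exists(file)) {
                seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }
        return seen;
    }

    // Return the visitors in this batch that were not seen before,
    // and remember them in the in-memory set.
    static synchronized Set<String> newVisitors(Path file, Iterable<String> batch)
            throws IOException {
        Set<String> known = load(file);
        Set<String> fresh = new LinkedHashSet<>();
        for (String id : batch) {
            if (known.add(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }

    // After the job: append the newly found visitors to the file, one per line.
    static void append(Path file, Set<String> fresh) throws IOException {
        Files.write(file, fresh, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

One thing the sketch makes visible: each JVM holds its own copy of the set, so
a visitor who shows up in two different mapper JVMs during the same job would
look new in both. The new visitors would probably still need to be
deduplicated (for example in the reducer) before they are counted and
appended.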

Any criticism and advice are welcome :-)

Vitaliy S
