hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ariel Rabkin <asrab...@gmail.com>
Subject Re: How to make data available in 10 minutes.
Date Fri, 10 Jul 2009 23:46:33 GMT
Chukwa uses a mapreduce job for this, with a daemon process to
identify the files to be merged.  It's unfortunately not as generic as
it could be; it assumes a lot about the way the directory structure is
laid out and the files are named.

I've been tempted to rewrite this to be more generic. But it'll be a
lot easier to do once we have appends, which will be comparatively
soon, so it doesn't necessarily seem worth it.


On Fri, Jul 10, 2009 at 12:27 AM, zsongbo<zsongbo@gmail.com> wrote:
> Ariel,
> Coud you please put more detail about how Chukwa merge 10min files -> 1hour
> files-> 1day files?
> 1. Is it run a background process/thread to do the merge periodically? How
> about the performance?
> 2. How about to run a MapReduce Job to do the merge periodically? How about
> the performance?
> Schubert
> On Fri, Jul 10, 2009 at 4:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> You are basically re-inventing lots of capabilities that others have solved
>> before.
>> The idea of building an index that refers to files which are constructed by
>> progressive merging is very standard and very similar to the way that
>> Lucene
>> works.
>> You don't say how much data you are moving, but I would recommend one of
>> the
>> following approaches:
>> a) if you have a small number of records (< 50 million), you can just stay
>> with your current system.  I would recommend that you attach your index to
>> the file and not update the large index when a new incremental file
>> appears.  The cost is that you have to search many small indexes near the
>> end of the day.  This can be solved by building a new incremental file
>> every
>> 10 minutes and then building a larger file by the same process every 2
>> hours.  When the 2 hour file is available, you can start searching that 2
>> hour increment instead of the 12 10 minute increments that it covers.  This
>> would result in log T files which is usually manageable.
>> You should investigate the MapFile object from Hadoop since it is likely to
>> be a better implementation of what you are doing now.
>> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.Reader.html
>> b) if you have a large number of records that need to be sharded across
>> multiple servers and only need key-value access, consider something like
>> Voldemort which allows instant switching to a new read-only index.
>> http://project-voldemort.com/
>> c) you should also consider using a real-time extension to Lucene.  The way
>> that these work (at least the IBM extensions) is that as documents arrive,
>> they are added to an index on disk and to an index in memory.  Every 10-30
>> minutes, the reference to the disk index is re-opened (causing the recent
>> documents to appear there) and the index in memory is cleared.  This gives
>> you very fast access to recent records and essentially real-time update.
>> See here for a slide show on IBM's work in this area:
>> https://www.box.net/file/296389642/encoded/26352752/04cc9b4b1385fa0a2419b87498bc887c
>> See here for information on how LinkedIn does this: http://
>> files.meetup.com/1460078/*Lucene*Near*Realtime*Search2.ppt
>> d) you should also consider a shard manager for Lucene such as Katta.  It
>> is
>> very easy for katta to search 50-100 indexes on a single host.  This would
>> allow you to generate new full indexes twice a day with small incremental
>> indexes every 10 minutes.  Katta can handle distribution and replication of
>> indexes very nicely.
>> http://katta.sourceforge.net/
>> On Mon, Jul 6, 2009 at 7:03 PM, zsongbo <zsongbo@gmail.com> wrote:
>>  > Hi all,
>> >
>> > We have buildup a system which use hadoop MapRdeuce to sort and index the
>> > input files. The index is straightforward blocked-key=>files+offsets.
>> > Then we can query the dataset with low lentacy.
>> > Usually, we run the MapReduce jobs in periodic of one day or hours.  Then
>> > the data before one day or hours will be available.
>> >
>> > But if we need the data to be available as soon as possible (such as 10
>> > minutes later). We need to run small MapReduce jobs with 10 minutes
>> > periodic. But in this way, the sorted files are small and many. When
>> query
>> > the data of one day, we must read many small files. Thus, the I/O
>> overhead
>> > is expensive.
>> >
>> > We considered to merge the small sorted data files to build big files to
>> > reduce the I/O overhead, but when merge the files, we must update the
>> > index.
>> > It is very expensive and slow to update the index now.
>> >
>> > Can anyone have such experience and solution?
>> >
>> > The Chukwa's paper says that they can make data available in 10 minutes,
>> > but
>> > there is no detail about their solution.
>> > I had used HBase to store and index the dataset, but HBase cannot support
>> > our volume of dataset (each node cannot server too many regions).
>> >
>> > Schubert
>> >

Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

View raw message