hadoop-common-user mailing list archives

From zsongbo <zson...@gmail.com>
Subject Re: How to make data available in 10 minutes.
Date Fri, 10 Jul 2009 07:27:40 GMT

Could you please give more detail about how Chukwa merges 10-minute files ->
1-hour files -> 1-day files?

1. Does it run a background process/thread to do the merge periodically? How
is the performance?
2. Or does it run a MapReduce job to do the merge periodically? How is the
performance?


On Fri, Jul 10, 2009 at 4:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> You are basically re-inventing lots of capabilities that others have solved
> before.
> The idea of building an index that refers to files which are constructed by
> progressive merging is very standard and very similar to the way that
> Lucene
> works.
> You don't say how much data you are moving, but I would recommend one of
> the
> following approaches:
> a) if you have a small number of records (< 50 million), you can just stay
> with your current system.  I would recommend that you attach your index to
> the file and not update the large index when a new incremental file
> appears.  The cost is that you have to search many small indexes near the
> end of the day.  This can be solved by building a new incremental file
> every
> 10 minutes and then building a larger file by the same process every 2
> hours.  When the 2 hour file is available, you can start searching that 2
> hour increment instead of the 12 10 minute increments that it covers.  This
> would result in log T files which is usually manageable.
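>
> To make the 10-minute / 2-hour tiering concrete, here is a rough sketch (the
> IncrementPicker class and file names are made up for illustration) of which
> increments a query over one day would have to open:
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Completed 2-hour merges replace the twelve 10-minute increments they
>     // cover, so only the unmerged tail of the day stays as small files.
>     public class IncrementPicker {
>         public static List<String> incrementsToSearch(int minutesSinceMidnight) {
>             List<String> files = new ArrayList<String>();
>             int mergedUpTo = (minutesSinceMidnight / 120) * 120;
>             // one 2-hour increment per completed 2-hour window
>             for (int start = 0; start < mergedUpTo; start += 120) {
>                 files.add("index-2h-" + start);
>             }
>             // 10-minute increments for the tail that is not merged yet
>             for (int start = mergedUpTo; start + 10 <= minutesSinceMidnight; start += 10) {
>                 files.add("index-10m-" + start);
>             }
>             return files;
>         }
>
>         public static void main(String[] args) {
>             // at 15:35 this prints seven 2-hour increments plus nine 10-minute ones
>             System.out.println(incrementsToSearch(15 * 60 + 35));
>         }
>     }
>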
> You should investigate the MapFile object from Hadoop since it is likely to
> be a better implementation of what you are doing now.
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.Reader.html
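>
> For illustration only (constructor and method names are from the old
> org.apache.hadoop.io.MapFile API; the path is made up), writing and then
> randomly reading a MapFile looks roughly like this:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.io.MapFile;
>     import org.apache.hadoop.io.Text;
>
>     // A MapFile is a sorted SequenceFile plus a small index, so a lookup
>     // seeks near the key instead of scanning the whole file.
>     public class MapFileExample {
>         public static void main(String[] args) throws Exception {
>             Configuration conf = new Configuration();
>             FileSystem fs = FileSystem.get(conf);
>             String dir = "/data/index/increment-0710-0700";  // hypothetical 10-minute increment
>
>             // keys must be appended in sorted order, which matches the already
>             // sorted output of the periodic MapReduce job
>             MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
>             writer.append(new Text("key-0001"), new Text("value for key-0001"));
>             writer.append(new Text("key-0002"), new Text("value for key-0002"));
>             writer.close();
>
>             // random access by key: the reader searches the index, then seeks
>             // into the data file
>             MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
>             Text value = new Text();
>             if (reader.get(new Text("key-0002"), value) != null) {
>                 System.out.println("found: " + value);
>             }
>             reader.close();
>         }
>     }
>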
> b) if you have a large number of records that need to be sharded across
> multiple servers and only need key-value access, consider something like
> Voldemort which allows instant switching to a new read-only index.
> http://project-voldemort.com/
> c) you should also consider using a real-time extension to Lucene.  The way
> that these work (at least the IBM extensions) is that as documents arrive,
> they are added to an index on disk and to an index in memory.  Every 10-30
> minutes, the reference to the disk index is re-opened (causing the recent
> documents to appear there) and the index in memory is cleared.  This gives
> you very fast access to recent records and essentially real-time update.
> See here for a slide show on IBM's work in this area:
> https://www.box.net/file/296389642/encoded/26352752/04cc9b4b1385fa0a2419b87498bc887c
> See here for information on how LinkedIn does this:
> http://files.meetup.com/1460078/*Lucene*Near*Realtime*Search2.ppt
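>
> A minimal sketch of that two-index pattern (plain Java, not the IBM or Lucene
> API; the DiskIndex interface and class names are hypothetical):
>
>     import java.util.NavigableMap;
>     import java.util.concurrent.ConcurrentSkipListMap;
>
>     // New records go to both a durable on-disk index and an in-memory buffer;
>     // every 10-30 minutes the disk reader is reopened (so it now covers the
>     // recent records) and the memory buffer is cleared.
>     public class NearRealtimeIndex {
>         /** Hypothetical handle on the on-disk index. */
>         interface DiskIndex {
>             void append(String key, String value); // durably add a record
>             DiskIndex reopen();                     // reopen to see recent appends
>             String get(String key);                 // null if absent
>         }
>
>         private volatile DiskIndex disk;
>         private volatile NavigableMap<String, String> memory =
>                 new ConcurrentSkipListMap<String, String>();
>
>         public NearRealtimeIndex(DiskIndex disk) { this.disk = disk; }
>
>         /** Writes go to both places, so they are durable and searchable at once. */
>         public void add(String key, String value) {
>             disk.append(key, value);
>             memory.put(key, value);
>         }
>
>         /** Reads prefer the in-memory buffer (most recent), then fall back to disk. */
>         public String get(String key) {
>             String v = memory.get(key);
>             return v != null ? v : disk.get(key);
>         }
>
>         /** Called every 10-30 minutes, once the reopened disk index covers the
>             recent records, so the memory buffer can simply be dropped. */
>         public synchronized void refresh() {
>             disk = disk.reopen();
>             memory = new ConcurrentSkipListMap<String, String>();
>         }
>     }
>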
> d) you should also consider a shard manager for Lucene such as Katta.  It is
> very easy for Katta to search 50-100 indexes on a single host.  This would
> allow you to generate new full indexes twice a day with small incremental
> indexes every 10 minutes.  Katta can handle distribution and replication of
> indexes very nicely.
> http://katta.sourceforge.net/
> On Mon, Jul 6, 2009 at 7:03 PM, zsongbo <zsongbo@gmail.com> wrote:
>  > Hi all,
> >
> > We have built up a system which uses Hadoop MapReduce to sort and index the
> > input files. The index is a straightforward blocked-key => files+offsets map.
> > Then we can query the dataset with low latency.
> > Usually, we run the MapReduce jobs periodically, once a day or once every few
> > hours. So the data only becomes available a day or several hours after it arrives.
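> >
> > Roughly, such a blocked-key => file+offset index looks like this (an
> > illustrative sketch only, names made up, not our actual code):
> >
> >     import java.util.Map;
> >     import java.util.TreeMap;
> >
> >     // A lookup finds the block whose first key is <= the wanted key, then
> >     // seeks to that offset in the data file and scans the block.
> >     public class BlockIndex {
> >         public static class BlockRef {
> >             public final String file;
> >             public final long offset;
> >             public BlockRef(String file, long offset) { this.file = file; this.offset = offset; }
> >         }
> >
> >         private final TreeMap<String, BlockRef> blocks = new TreeMap<String, BlockRef>();
> >
> >         public void addBlock(String firstKey, String file, long offset) {
> >             blocks.put(firstKey, new BlockRef(file, offset));
> >         }
> >
> >         /** Returns the file and offset of the block that may contain the key. */
> >         public BlockRef locate(String key) {
> >             Map.Entry<String, BlockRef> entry = blocks.floorEntry(key);
> >             return entry == null ? null : entry.getValue();
> >         }
> >     }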
> >
> > But if we need the data to be available as soon as possible (such as within
> > 10 minutes), we need to run small MapReduce jobs every 10 minutes. In this
> > way, the sorted files are small and numerous. When querying the data of one
> > day, we must read many small files, so the I/O overhead is expensive.
> >
> > We considered merging the small sorted data files into big files to reduce
> > the I/O overhead, but when we merge the files, we must update the index.
> > Updating the index is currently very expensive and slow.
> >
> > Does anyone have such experience or a solution?
> >
> > The Chukwa paper says that they can make data available in 10 minutes, but
> > there is no detail about their solution.
> > I had used HBase to store and index the dataset, but HBase cannot handle our
> > data volume (each node cannot serve too many regions).
> >
> > Schubert
> >
