hadoop-common-user mailing list archives

From Ariel Rabkin <asrab...@gmail.com>
Subject Re: How to make data available in 10 minutes.
Date Wed, 22 Jul 2009 21:01:19 GMT
They're designed to take a few minutes, and they seem to in operation here
and at Yahoo. Details, of course, will vary depending on data volumes
and hardware. More benchmarks welcome. :)

--Ari

On Mon, Jul 20, 2009 at 3:04 AM, zsongbo <zsongbo@gmail.com> wrote:
> Hi Ari,
>
> Thanks.
> In Chukwa, how is the performance of the MapReduce merge jobs?
> The 1-hour merge and the 1-day merge jobs would run simultaneously;
> how do they perform?
>
> Schubert
> On Sat, Jul 11, 2009 at 7:46 AM, Ariel Rabkin <asrabkin@gmail.com> wrote:
>
>> Chukwa uses a MapReduce job for this, with a daemon process to
>> identify the files to be merged.  It's unfortunately not as generic as
>> it could be; it assumes a lot about the way the directory structure is
>> laid out and the files are named.
>>
>> I've been tempted to rewrite this to be more generic. But it'll be a
>> lot easier to do once we have appends, which will be comparatively
>> soon, so it doesn't necessarily seem worth it.
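>>
>> Here is a minimal sketch of that pattern, assuming the old mapred API
>> and a hypothetical /logs directory layout (illustrative only, not
>> Chukwa's actual code): a daemon loop lists the closed 10-minute files
>> and submits an identity map/reduce job whose shuffle re-sorts them
>> into one set of larger files.
>>
>> import org.apache.hadoop.fs.FileStatus;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.SequenceFileInputFormat;
>> import org.apache.hadoop.mapred.SequenceFileOutputFormat;
>> import org.apache.hadoop.mapred.lib.IdentityMapper;
>> import org.apache.hadoop.mapred.lib.IdentityReducer;
>>
>> public class MergeDaemon {
>>   public static void main(String[] args) throws Exception {
>>     JobConf base = new JobConf(MergeDaemon.class);
>>     FileSystem fs = FileSystem.get(base);
>>     while (true) {
>>       // Hypothetical layout: closed 10-minute files land in /logs/10min.
>>       FileStatus[] ready = fs.listStatus(new Path("/logs/10min"));
>>       if (ready != null && ready.length > 0) {
>>         JobConf job = new JobConf(base);
>>         job.setJobName("merge-10min-to-1hour");
>>         for (FileStatus f : ready)
>>           FileInputFormat.addInputPath(job, f.getPath());
>>         FileOutputFormat.setOutputPath(job,
>>             new Path("/logs/1hour/" + System.currentTimeMillis()));
>>         job.setInputFormat(SequenceFileInputFormat.class);
>>         job.setOutputFormat(SequenceFileOutputFormat.class);
>>         job.setOutputKeyClass(Text.class);    // assumed record types
>>         job.setOutputValueClass(Text.class);
>>         // Identity map/reduce: the shuffle does the merge-sort.
>>         job.setMapperClass(IdentityMapper.class);
>>         job.setReducerClass(IdentityReducer.class);
>>         JobClient.runJob(job);
>>         for (FileStatus f : ready)
>>           fs.delete(f.getPath(), true);  // drop the merged inputs
>>       }
>>       Thread.sleep(10 * 60 * 1000);  // wake up every 10 minutes
>>     }
>>   }
>> }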
>>
>> --Ari
>>
>> On Fri, Jul 10, 2009 at 12:27 AM, zsongbo <zsongbo@gmail.com> wrote:
>> > Ariel,
>> >
>> > Could you please give more detail about how Chukwa merges 10-minute
>> > files -> 1-hour files -> 1-day files?
>> >
>> > 1. Does it run a background process/thread to do the merge
>> > periodically? How does that perform?
>> > 2. What about running a MapReduce job to do the merge periodically?
>> > How does that perform?
>> >
>> > Schubert
>> >
>> > On Fri, Jul 10, 2009 at 4:05 AM, Ted Dunning <ted.dunning@gmail.com>
>> > wrote:
>> >
>> >> You are basically re-inventing lots of capabilities that others have
>> >> solved before.
>> >>
>> >> The idea of building an index that refers to files which are
>> >> constructed by progressive merging is very standard and very similar
>> >> to the way that Lucene works.
>> >>
>> >> You don't say how much data you are moving, but I would recommend
>> >> one of the following approaches:
>> >>
>> >> a) if you have a small number of records (< 50 million), you can
>> >> just stay with your current system.  I would recommend that you
>> >> attach your index to the file and not update the large index when a
>> >> new incremental file appears.  The cost is that you have to search
>> >> many small indexes near the end of the day.  This can be solved by
>> >> building a new incremental file every 10 minutes and then building a
>> >> larger file by the same process every 2 hours.  When the 2-hour file
>> >> is available, you can start searching that 2-hour increment instead
>> >> of the twelve 10-minute increments that it covers.  This would result
>> >> in log T files, which is usually manageable.
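>> >>
>> >> As a sketch of the bookkeeping this implies (file names and layout
>> >> are hypothetical): at query time, use each complete 2-hour file and
>> >> fall back to 10-minute increments only for the tail not yet covered.
>> >>
>> >> import java.util.ArrayList;
>> >> import java.util.List;
>> >>
>> >> public class IncrementPicker {
>> >>   static final long TEN_MIN = 10L * 60 * 1000;
>> >>   static final long TWO_HOURS = 12 * TEN_MIN;
>> >>
>> >>   // Which index files cover [dayStart, now)?  Names illustrative.
>> >>   static List<String> filesToSearch(long dayStart, long now) {
>> >>     List<String> files = new ArrayList<String>();
>> >>     long t = dayStart;
>> >>     while (t + TWO_HOURS <= now) {      // complete 2-hour increments
>> >>       files.add("/index/2hour/" + t);
>> >>       t += TWO_HOURS;
>> >>     }
>> >>     while (t + TEN_MIN <= now) {        // 10-minute tail
>> >>       files.add("/index/10min/" + t);
>> >>       t += TEN_MIN;
>> >>     }
>> >>     return files;                       // at most 12 + 11 per day
>> >>   }
>> >> }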
>> >>
>> >> You should investigate the MapFile object from Hadoop since it is
>> >> likely to be a better implementation of what you are doing now.
>> >>
>> >>
>> >> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.Reader.html
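>> >>
>> >> A minimal lookup sketch (path and key are illustrative): a MapFile is
>> >> a sorted SequenceFile plus a small index, so get() seeks instead of
>> >> scanning.
>> >>
>> >> import org.apache.hadoop.conf.Configuration;
>> >> import org.apache.hadoop.fs.FileSystem;
>> >> import org.apache.hadoop.io.MapFile;
>> >> import org.apache.hadoop.io.Text;
>> >>
>> >> public class MapFileLookup {
>> >>   public static void main(String[] args) throws Exception {
>> >>     Configuration conf = new Configuration();
>> >>     FileSystem fs = FileSystem.get(conf);
>> >>     MapFile.Reader reader =
>> >>         new MapFile.Reader(fs, "/index/2hour/1248220800000", conf);
>> >>     Text value = new Text();
>> >>     // get() binary-searches the index, then seeks in the data file.
>> >>     if (reader.get(new Text("some-blocked-key"), value) != null) {
>> >>       System.out.println(value);
>> >>     }
>> >>     reader.close();
>> >>   }
>> >> }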
>> >>
>> >> b) if you have a large number of records that need to be sharded
>> >> across multiple servers and only need key-value access, consider
>> >> something like Voldemort, which allows instant switching to a new
>> >> read-only index.
>> >>
>> >> http://project-voldemort.com/
>> >>
>> >> c) you should also consider using a real-time extension to Lucene.
>> >> The way that these work (at least the IBM extensions) is that as
>> >> documents arrive, they are added to an index on disk and to an index
>> >> in memory.  Every 10-30 minutes, the reference to the disk index is
>> >> re-opened (causing the recent documents to appear there) and the
>> >> index in memory is cleared.  This gives you very fast access to
>> >> recent records and essentially real-time update.
>> >>
>> >> See here for a slide show on IBM's work in this area:
>> >>
>> >> https://www.box.net/file/296389642/encoded/26352752/04cc9b4b1385fa0a2419b87498bc887c
>> >>
>> >> See here for information on how LinkedIn does this:
>> >> http://files.meetup.com/1460078/LuceneNearRealtimeSearch2.ppt
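>> >>
>> >> A rough sketch of that disk-plus-memory pattern using plain Lucene
>> >> primitives, circa Lucene 2.4 (this is not IBM's actual extension;
>> >> error handling and synchronization around searches are elided):
>> >>
>> >> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> >> import org.apache.lucene.document.Document;
>> >> import org.apache.lucene.index.IndexReader;
>> >> import org.apache.lucene.index.IndexWriter;
>> >> import org.apache.lucene.index.MultiReader;
>> >> import org.apache.lucene.search.IndexSearcher;
>> >> import org.apache.lucene.store.FSDirectory;
>> >> import org.apache.lucene.store.RAMDirectory;
>> >>
>> >> public class NearRealtimeIndex {
>> >>   private final IndexWriter diskWriter;
>> >>   private IndexReader diskReader;
>> >>   private RAMDirectory ram = new RAMDirectory();
>> >>   private IndexWriter ramWriter;
>> >>
>> >>   public NearRealtimeIndex(String path) throws Exception {
>> >>     FSDirectory disk = FSDirectory.getDirectory(path);
>> >>     diskWriter = new IndexWriter(disk, new StandardAnalyzer(), true);
>> >>     diskWriter.commit();               // so a reader can open it
>> >>     diskReader = IndexReader.open(disk);
>> >>     ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
>> >>   }
>> >>
>> >>   // Each document goes to both indexes; it is searchable at once via
>> >>   // the RAM side and, after the next rollover, via the disk side.
>> >>   public void add(Document doc) throws Exception {
>> >>     diskWriter.addDocument(doc);
>> >>     ramWriter.addDocument(doc);
>> >>   }
>> >>
>> >>   // Search the (slightly stale) disk index plus the fresh RAM index.
>> >>   public IndexSearcher searcher() throws Exception {
>> >>     ramWriter.commit();                // make recent docs visible
>> >>     IndexReader ramReader = IndexReader.open(ram);
>> >>     return new IndexSearcher(
>> >>         new MultiReader(new IndexReader[] { diskReader, ramReader }));
>> >>   }
>> >>
>> >>   // Every 10-30 minutes: flush the disk writer, reopen its reader,
>> >>   // and discard the RAM index, whose documents are now on disk.
>> >>   public synchronized void rollover() throws Exception {
>> >>     diskWriter.commit();
>> >>     IndexReader fresh = diskReader.reopen();
>> >>     if (fresh != diskReader) {
>> >>       diskReader.close();
>> >>       diskReader = fresh;
>> >>     }
>> >>     ramWriter.close();
>> >>     ram = new RAMDirectory();
>> >>     ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
>> >>   }
>> >> }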
>> >>
>> >> d) you should also consider a shard manager for Lucene such as
>> >> Katta.  It is very easy for Katta to search 50-100 indexes on a
>> >> single host.  This would allow you to generate new full indexes twice
>> >> a day with small incremental indexes every 10 minutes.  Katta can
>> >> handle distribution and replication of indexes very nicely.
>> >>
>> >> http://katta.sourceforge.net/
>> >>
>> >> On Mon, Jul 6, 2009 at 7:03 PM, zsongbo <zsongbo@gmail.com> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > We have built a system which uses Hadoop MapReduce to sort and
>> >> > index the input files. The index is a straightforward
>> >> > blocked-key => files+offsets mapping. With it we can query the
>> >> > dataset with low latency.
>> >> > Usually, we run the MapReduce jobs with a period of one day or a
>> >> > few hours.  Then data becomes available one day or a few hours
>> >> > after it arrives.
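>> >> >
>> >> > For clarity, a sketch of the index shape (names are hypothetical):
>> >> > each blocked key maps to the sorted files and offsets that hold its
>> >> > records, so a query has to touch every file in the key's list,
>> >> > which is why many small files hurt.
>> >> >
>> >> > import java.util.List;
>> >> > import java.util.Map;
>> >> >
>> >> > public class BlockIndex {
>> >> >   // One posting: which sorted file holds the block, and where.
>> >> >   public static class FileOffset {
>> >> >     public final String file;
>> >> >     public final long offset;
>> >> >     public FileOffset(String file, long offset) {
>> >> >       this.file = file;
>> >> >       this.offset = offset;
>> >> >     }
>> >> >   }
>> >> >
>> >> >   private final Map<String, List<FileOffset>> postings;
>> >> >
>> >> >   public BlockIndex(Map<String, List<FileOffset>> postings) {
>> >> >     this.postings = postings;
>> >> >   }
>> >> >
>> >> >   // Every file region that must be read to answer a key lookup.
>> >> >   public List<FileOffset> lookup(String blockedKey) {
>> >> >     return postings.get(blockedKey);
>> >> >   }
>> >> > }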
>> >> >
>> >> > But if we need the data to be available as soon as possible (say,
>> >> > 10 minutes later), we need to run small MapReduce jobs on a
>> >> > 10-minute period. In that case, though, the sorted files are small
>> >> > and numerous. When querying one day's data, we must read many small
>> >> > files, so the I/O overhead is expensive.
>> >> >
>> >> > We considered merging the small sorted data files into big files
>> >> > to reduce the I/O overhead, but when we merge the files, we must
>> >> > update the index, and updating the index is currently very
>> >> > expensive and slow.
>> >> >
>> >> > Does anyone have experience with this, or a solution?
>> >> >
>> >> > The Chukwa paper says that they can make data available in 10
>> >> > minutes, but it gives no detail about their solution.
>> >> > I have used HBase to store and index the dataset, but HBase cannot
>> >> > support our data volume (each node cannot serve too many regions).
>> >> >
>> >> > Schubert
>> >> >
>> >>
>> >
>>
>>
>>
>>  --
>> Ari Rabkin asrabkin@gmail.com
>> UC Berkeley Computer Science Department
>>
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department
