hadoop-common-user mailing list archives

From Jérôme Thièvre INA <jthie...@ina.fr>
Subject Re: How to manage large record in MapReduce
Date Fri, 07 Jan 2011 13:16:11 GMT
Hi Sonal,

Thank you. I have just implemented a solution similar to yours (without
copying to a temp file, as I suggested in my initial post), and it seems to
work.
Best Regards,

Jérôme

2011/1/7 Sonal Goyal <sonalgoyal4@gmail.com>

> Jerome,
>
> You can take a look at FileStreamInputFormat at
>
> https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input
>
> This provides an input stream per file. In our case, we are using the input
> stream to load data into the database directly. Maybe you can use this or a
> similar approach for working with your videos.
>
> HTH
>
> Thanks and Regards,
> Sonal
> Connect Hadoop with databases, Salesforce, FTP servers and others
> <https://github.com/sonalgoyal/hiho>
> Nube Technologies <http://www.nubetech.co>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Thu, Jan 6, 2011 at 4:23 PM, Jérôme Thièvre <jthievre@gmail.com> wrote:
>
> > Hi,
> >
> > we are currently using Hadoop (version 0.20.2) to manage some web archiving
> > processes like full-text indexing, and it works very well with small records
> > that contain HTML.
> > Now we would like to work with other types of web data, such as videos. This
> > kind of data can be really large, and of course these records don't fit
> > in memory.
> >
> > Is it possible to manage records whose content doesn't reside in memory but
> > on disk?
> > One possibility would be to implement a Writable that reads its content from a
> > DataInput but doesn't load it into memory; instead it would copy that content
> > to a temporary file in the local file system and allow streaming of its
> > content through an InputStream (an InputStreamWritable).
> >
> > Has anybody tested a similar approach, and if not, do you think this method
> > could cause any big problems (ones that impact performance)?
> >
> > Thanks,
> >
> > Jérôme Thièvre
> >
>
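For readers of the archive, the temp-file idea from the original question can
be sketched roughly as follows. This is only an illustration under stated
assumptions, not the code either poster actually wrote: the class body, the
8-byte length prefix, and the 64 KB buffer are all hypothetical, and the class
deliberately uses only java.io so it runs without Hadoop on the classpath. In
a real job it would presumably declare "implements
org.apache.hadoop.io.Writable", whose readFields(DataInput) and
write(DataOutput) methods it mirrors.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch: a record whose payload is spooled to a local temp
// file instead of being held in memory.
class InputStreamWritable {
    private static final int BUFFER_SIZE = 64 * 1024; // assumed copy buffer size

    private File spool;  // local temp file holding the payload, or null if empty
    private long length; // payload length in bytes

    // Mirrors Writable.readFields(DataInput).
    // Assumption: the record is an 8-byte length followed by the raw payload.
    public void readFields(DataInput in) throws IOException {
        deleteSpool(); // the framework may reuse one instance for many records
        length = in.readLong();
        spool = File.createTempFile("record-", ".bin");
        spool.deleteOnExit();
        byte[] buf = new byte[BUFFER_SIZE];
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(spool))) {
            long remaining = length;
            while (remaining > 0) {
                // Copy at most one buffer at a time, so memory use stays constant.
                int n = (int) Math.min(buf.length, remaining);
                in.readFully(buf, 0, n);
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
    }

    // Mirrors Writable.write(DataOutput): length prefix, then streamed payload.
    public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        byte[] buf = new byte[BUFFER_SIZE];
        try (InputStream in = openStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // Streams the payload from the temp file; the caller closes the stream.
    public InputStream openStream() throws IOException {
        if (spool == null) {
            return new ByteArrayInputStream(new byte[0]);
        }
        return new BufferedInputStream(new FileInputStream(spool));
    }

    public long getLength() {
        return length;
    }

    // Releases the temp file once the record is no longer needed.
    public void close() {
        deleteSpool();
        length = 0;
    }

    private void deleteSpool() {
        if (spool != null) {
            spool.delete();
            spool = null;
        }
    }
}
```

The obvious cost of this design is that every large record is written to
local disk once and read back at least once, which is the extra copy that the
stream-per-file approach mentioned in the reply avoids.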
