hadoop-hdfs-user mailing list archives

From: 臧冬松 <donal0...@gmail.com>
Subject: Re: structured data split
Date: Fri, 11 Nov 2011 15:57:23 GMT
Thanks Bejoy, that helps a lot!

2011/11/11, Bejoy KS <bejoy.hadoop@gmail.com>:
> Hi Donal
>          I don't have much exposure to the domain you are working in, but
> here is how I, as a plain MapReduce developer, would look at processing
> such a data format with MapReduce:
> - If the data is flowing in continuously, I'd use Flume to collect the
> binary data, write it into sequence files, and load those into HDFS.
> - If it is already existing large data, I'd use a SequenceFile writer to
> write the binary data as sequence files into HDFS, where HDFS would take
> care of the splits (see the sketch below).
> - I'd use SequenceFileInputFormat for my MapReduce job.
> - If my application code is in a compatible language other than Java, I'd
> use the Streaming API to trigger my MapReduce job.
>
> If there are any specific constraints on reading your data, as Will
> mentioned, you may need to write your own custom InputFormat to process
> it.
>
>
> Hope it helps!...
>
>
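
A minimal sketch of the SequenceFile-writing step Bejoy describes above,
assuming the Events are already available as byte arrays. The HDFS path and
the readEvents() helper are hypothetical placeholders, not anything from this
thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class EventSequenceFileLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/user/donal/events.seq");  // hypothetical HDFS path

            // Hadoop 0.20/1.x style factory method; key = event number, value = raw bytes.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, BytesWritable.class);
            try {
                long eventId = 0;
                // readEvents() stands in for whatever ROOT/C++ bridge extracts each
                // independent Event from the large binary files as a byte[].
                for (byte[] event : readEvents()) {
                    writer.append(new LongWritable(eventId++), new BytesWritable(event));
                }
            } finally {
                writer.close();
            }
        }

        // Placeholder only: plug in the real Event extraction here.
        private static Iterable<byte[]> readEvents() {
            return java.util.Collections.<byte[]>emptyList();
        }
    }

Once the data is in this form, the job driver only needs
job.setInputFormatClass(SequenceFileInputFormat.class); each map() call then
receives one Event's raw bytes as its value.
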
> On Fri, Nov 11, 2011 at 8:12 PM, Charles Earl <charlescearl@me.com> wrote:
>
>> Hi,
>> Please also feel free to contact me. I'm working with the STAR project at
>> Brookhaven Lab, and we are trying to build a MR workflow for analysis of
>> particle data. I've done some preliminary experiments running ROOT and
>> other nuclear physics analysis software in MR and have been looking at
>> various file layouts.
>> Charles
>> On Nov 11, 2011, at 9:26 AM, Will Maier wrote:
>>
>> > Hi Donal-
>> >
>> > On Fri, Nov 11, 2011 at 10:12:44PM +0800, 臧冬松 wrote:
>> >> My scenario is that I have lots of files from a High Energy Physics
>> >> experiment. These files are in binary format, about 2 GB each, but
>> >> basically they are composed of lots of "Events", and each Event is
>> >> independent of the others. The physicists use a C++ program called
>> >> ROOT to analyze these files and write the output to a result file
>> >> (using open(), read(), write()). I'm considering how to store the
>> >> files in HDFS and use MapReduce to analyze them.
>> >
>> > May I ask which experiment you're working on? We run an HDFS cluster
>> > at one of the analysis centers for the CMS detector at the LHC. I'm
>> > not aware of anyone using Hadoop's MR for analysis, though about 10 PB
>> > of LHC data is now stored in HDFS. For your/our use case, I think that
>> > you would have to implement a domain-specific InputFormat yielding
>> > Events. ROOT files would be stored as-is in HDFS.
>> >
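
A minimal sketch of what such a domain-specific InputFormat could look like.
This is only an illustration, not code from the thread: the class names and
the fixed EVENT_SIZE are hypothetical, and real ROOT Events vary in size, so a
real reader would need an actual Event-boundary scheme:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    /** Illustrative only: pretends every Event is a fixed-size binary record. */
    public class EventInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

        private static final int EVENT_SIZE = 4096;  // assumption; real Events vary in size

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Only safe to split if Event boundaries can be located inside a block.
            return false;
        }

        @Override
        public RecordReader<LongWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new RecordReader<LongWritable, BytesWritable>() {
                private FSDataInputStream in;
                private long start, pos, end;
                private final LongWritable key = new LongWritable();
                private final BytesWritable value = new BytesWritable();

                @Override
                public void initialize(InputSplit s, TaskAttemptContext ctx) throws IOException {
                    FileSplit fileSplit = (FileSplit) s;
                    Path path = fileSplit.getPath();
                    in = path.getFileSystem(ctx.getConfiguration()).open(path);
                    start = fileSplit.getStart();
                    end = start + fileSplit.getLength();
                    pos = start;
                    in.seek(start);
                }

                @Override
                public boolean nextKeyValue() throws IOException {
                    if (end - pos < EVENT_SIZE) {
                        return false;               // no complete Event left in this split
                    }
                    byte[] buf = new byte[EVENT_SIZE];
                    in.readFully(buf);              // one Event handed to each map() call
                    key.set(pos);
                    value.set(buf, 0, buf.length);
                    pos += EVENT_SIZE;
                    return true;
                }

                @Override public LongWritable getCurrentKey()    { return key; }
                @Override public BytesWritable getCurrentValue() { return value; }
                @Override public float getProgress() {
                    return end == start ? 1.0f : (pos - start) / (float) (end - start);
                }
                @Override public void close() throws IOException {
                    if (in != null) {
                        in.close();
                    }
                }
            };
        }
    }

With splitting disabled, each binary file goes to a single mapper; returning
true from isSplitable would require a way to find Event boundaries inside an
arbitrary HDFS block.
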
>> > In CMS, we mostly run traditional HEP simulation and analysis
>> > workflows using plain batch jobs managed by common schedulers like
>> > Condor or PBS. These of course lack some of the features of the MR
>> > schedulers (like location awareness), but have some advantages. For
>> > example, we run Condor schedulers that transparently manage workflows
>> > of tens of thousands of jobs on dozens of heterogeneous clusters
>> > across North America.
>> >
>> > Feel free to contact me off-list if you have more HEP-specific
>> > questions about HDFS.
>> >
>> > Thanks!
>> >
>> > --
>> >
>> > Will Maier - UW High Energy Physics
>> > cel: 608.438.6162
>> > tel: 608.263.9692
>> > web: http://www.hep.wisc.edu/~wcmaier/
>>
>>
>
