hadoop-user mailing list archives

From Madhav Sharan <msha...@usc.edu>
Subject Re: Fast way to read thousands of double value in hadoop jobs
Date Fri, 19 Aug 2016 03:47:21 GMT
Thanks for your suggestion, Daniel. I was already using a SequenceFile, but
my format was poor: I was storing the file contents as Text in my SeqFile,
so all my map jobs repeatedly converted Text to double.

I resolved this by correcting the SequenceFile format. Now I store a
serialised Java object in the SeqFile, and my map jobs are faster.
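
A minimal sketch of the idea, not the actual project code: it round-trips
a double[][] through standard Java serialisation, so no text parsing is
needed on read. The class name MatrixCodec and its method names are
hypothetical, and the Hadoop side is left out so the example runs with
the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical helper (not from the thread): converts a matrix to and from bytes.
public class MatrixCodec {

    // Serialise the matrix with standard Java object serialisation.
    public static byte[] toBytes(double[][] matrix) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(matrix);
        }
        return buffer.toByteArray();
    }

    // Restore the matrix; the doubles are read in binary form, not parsed from text.
    public static double[][] fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (double[][]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Same shape as one vector described in the thread: 883 x 200.
        double[][] vector = new double[883][200];
        vector[0][0] = 1.5;
        vector[882][199] = -2.25;

        double[][] restored = fromBytes(toBytes(vector));
        System.out.println(restored.length + " x " + restored[0].length);
        System.out.println(restored[882][199]);
    }
}
```

In a Hadoop job the resulting byte[] could be wrapped in a BytesWritable
and used as the SequenceFile value; that wiring is an assumption about the
setup and is not shown here.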

--
Madhav Sharan


On Wed, Aug 17, 2016 at 11:07 PM, Daniel Haviv <danielrulez@gmail.com>
wrote:

> Store them within a SequenceFile.
>
>
> On Thursday, 18 August 2016, Madhav Sharan <msharan@usc.edu> wrote:
>
>> Hi, can someone please recommend a fast way in Hadoop to store and
>> retrieve a matrix of double values?
>>
>> As of now we store the values in text files and then read them in Java
>> using an HDFS input stream and Scanner. *[0]* These files are actually
>> vectors representing a video file. Each vector is 883 x 200, and for one
>> map job we read 4 such vectors, so the *job is to convert 706,400 values
>> to double*.
>>
>> Using this approach we take ~1.5 seconds to convert all these values. I
>> can use an external cache server to avoid repeated conversion, but I am
>> looking for a better solution.
>>
>> [0] - https://github.com/USCDataScience/hadoop-pot/blob/master/src/main/java/org/pooledtimeseries/PoT.java#L596
>>
>>
>> --
>> Madhav Sharan
>>
>>
