spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: reading LZO compressed file in spark
Date Thu, 26 Dec 2013 23:22:50 GMT
Rajeev,

You should have something like this in your core-site.xml file in Hadoop:

    <property>
        <name>io.compression.codecs</name>

<value>com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>

I also had to add the LZO jar into Spark with SPARK_CLASSPATH in
spark-env.sh so you may need to do that too.
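For reference, the spark-env.sh line looks something like this (the jar path below is just an example -- point it at wherever your hadoop-lzo jar actually lives):

```shell
# in conf/spark-env.sh -- example path; use the location of your hadoop-lzo jar
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/hadoop-lzo.jar
```

Note the jar needs to exist at that path on the worker machines too, not just on the driver.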

Cheers,
Andrew



On Thu, Dec 26, 2013 at 3:48 PM, Rajeev Srivastava <rajeev@silverline-da.com
> wrote:

> Hi Andrew,
>      Thanks for your example
> I used your command and I get the following errors from the workers (missing
> codec on the workers, I guess).
> How do I get the codecs over to the worker machines?
> regards
> Rajeev
> *******************************************************************
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to
> java.io.IOException: Codec for file
> hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not
> found, cannot run
>     at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
>     at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
>     at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>     at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
>     at spark.RDD.iterator(RDD.scala:196)
>     at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>     at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:724)
> 13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on
> executor 4: hadoop02
> (preferred)
> 13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
> 13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException:
> Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo
> not found, cannot run [duplicate 1]
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
>
>
> On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <andrew@andrewash.com> wrote:
>
>> Hi Berkeley,
>>
>> By RF=3 I mean replication factor of 3 on the files in HDFS, so each
>> block is stored 3 times across the cluster.  It's a pretty standard choice
>> for the replication factor in order to give a hardware team time to replace
>> bad hardware in the case of failure.  With RF=3 the cluster can sustain
>> the failure of any two nodes without data loss, but losing a third node
>> may cause data loss.
>>
>> When reading the LZO files with the newAPIHadoopFile() call I showed
>> below, the data in the RDD is already decompressed -- it transparently
>> looks the same to my Spark program as if I was operating on an uncompressed
>> file.
>>
>> Cheers,
>> Andrew
>>
>>
>> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <
>> berkeley@firestickgames.com> wrote:
>>
>>> Andrew, This is great.
>>>
>>> Excuse my ignorance, but what do you mean by RF=3? Also, after reading
>>> the LZO files, are you able to access the contents directly, or do you have
>>> to decompress them after reading them?
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2013, at 12:03 AM, Andrew Ash <andrew@andrewash.com> wrote:
>>>
>>> Hi Rajeev,
>>>
>>> I'm not sure if you ever got it working, but I just got mine up and
>>> going.  If you just use sc.textFile(...) the file will be read, but the
>>> LZO index won't be used, so a .count() on my 1B+ row file took 2483s.
>>> When I ran it like this, though:
>>>
>>> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>> classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>> classOf[org.apache.hadoop.io.LongWritable],
>>> classOf[org.apache.hadoop.io.Text]).count
>>>
>>> the LZO index file was used and the .count() took just 101s.  For
>>> reference this file is 43GB when .gz compressed and 78.4GB when .lzo
>>> compressed.  I have RF=3 and this is across 4 pretty beefy machines with
>>> Hadoop DataNodes and Spark both running on each machine.
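>>> In case it helps anyone following along: the LZO index file I mentioned is
>>> built with the indexer that ships in hadoop-lzo.  Roughly like this (the
>>> jar path is just an example -- point it at your own install's hadoop-lzo jar):

```shell
# writes myfile.lzo.index next to the .lzo so the input format can split it;
# jar path is an example -- use your hadoop-lzo jar
hadoop jar /path/to/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  hdfs:///path/to/myfile.lzo
```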
>>>
>>> Cheers!
>>> Andrew
>>>
>>>
>>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <
>>> rajeev@silverline-da.com> wrote:
>>>
>>>> Thanks for your suggestion. I will try this and update by late evening.
>>>>
>>>> regards
>>>> Rajeev
>>>>
>>>> Rajeev Srivastava
>>>> Silverline Design Inc
>>>> 2118 Walsh ave, suite 204
>>>> Santa Clara, CA, 95050
>>>> cell : 408-409-0940
>>>>
>>>>
>>>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <andrew@andrewash.com> wrote:
>>>>
>>>>> Hi Rajeev,
>>>>>
>>>>> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
>>>>> input format above, while Stephen referred to
>>>>> com.hadoop.mapreduce.LzoTextInputFormat
>>>>>
>>>>> I think the way to use this in Spark would be to use the
>>>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>>>>> the path and the InputFormat as parameters.  Can you give those a shot?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <
>>>>> rajeev@silverline-da.com> wrote:
>>>>>
>>>>>> Hi Stephen,
>>>>>> I tried the same lzo file with a simple hadoop script,
>>>>>> and it seems to work fine:
>>>>>>
>>>>>> HADOOP_HOME=/usr/lib/hadoop
>>>>>> /usr/bin/hadoop  jar
>>>>>> /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar
>>>>>> \
>>>>>> -libjars
>>>>>> /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
>>>>>> \
>>>>>> -input /tmp/ldpc.sstv3.lzo \
>>>>>> -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>>> -output wc_test \
>>>>>> -mapper 'cat' \
>>>>>> -reducer 'wc -l'
>>>>>>
>>>>>> This means hadoop is able to handle the lzo file correctly
>>>>>>
>>>>>> Can you suggest what I should do in Spark for it to work?
>>>>>>
>>>>>> regards
>>>>>> Rajeev
>>>>>>
>>>>>>
>>>>>> Rajeev Srivastava
>>>>>> Silverline Design Inc
>>>>>> 2118 Walsh ave, suite 204
>>>>>> Santa Clara, CA, 95050
>>>>>> cell : 408-409-0940
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <
>>>>>> stephen.haberman@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> > System.setProperty("spark.io.compression.codec",
>>>>>>> > "com.hadoop.compression.lzo.LzopCodec")
>>>>>>>
>>>>>>> This spark.io.compression.codec is a completely different setting
>>>>>>> than the
>>>>>>> codecs that are used for reading/writing from HDFS. (It is for
>>>>>>> compressing
>>>>>>> Spark's internal/non-HDFS intermediate output.)
>>>>>>>
>>>>>>> > Hope this helps and someone can help read a LZO file
>>>>>>>
>>>>>>> Spark just uses the regular Hadoop File System API, so any issues
>>>>>>> with reading LZO files would be Hadoop issues. I would search in the
>>>>>>> Hadoop issue tracker, and look for information on using LZO files
>>>>>>> with Hadoop/Hive, and whatever works for them should magically work
>>>>>>> for Spark as well.
>>>>>>>
>>>>>>> This looks like a good place to start:
>>>>>>>
>>>>>>> https://github.com/twitter/hadoop-lzo
>>>>>>>
>>>>>>> IANAE, but I would try passing one of these:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>>>
>>>>>>> To the SparkContext.hadoopFile method.
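>>>>>>> i.e., something like this in the spark-shell (untested sketch; the
>>>>>>> hdfs path is a placeholder, and the hadoop-lzo jar has to be on the
>>>>>>> classpath):

```scala
// sketch: read an LZO text file through the new Hadoop API input format;
// the path is a placeholder -- substitute your own file
val lines = sc.newAPIHadoopFile(
    "hdfs:///path/to/file.lzo",
    classOf[com.hadoop.mapreduce.LzoTextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text]
  ).map(_._2.toString)  // drop the byte-offset key, keep the line text
```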
>>>>>>>
>>>>>>> - Stephen
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
