spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: reading LZO compressed file in spark
Date Tue, 24 Dec 2013 08:03:35 GMT
Hi Rajeev,

I'm not sure if you ever got it working, but I just got mine up and going.
If you just use sc.textFile(...), the file will be read, but the LZO index
won't be used, so a .count() on my 1B+ row file took 2483s. When I ran it
like this, though:

sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text]).count

the LZO index file was used and the .count() took just 101s. For reference,
this file is 43GB when .gz-compressed and 78.4GB when .lzo-compressed. I
have RF=3 (HDFS replication factor), and this is across 4 pretty beefy
machines with Hadoop DataNodes and Spark both running on each machine.
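
In case it's useful, here's a minimal follow-up sketch (same hypothetical
path as above): the resulting RDD holds (LongWritable, Text) pairs, where
the key is the byte offset into the file, so map the values to Strings
before doing anything beyond counting.

val lines = sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
    classOf[com.hadoop.mapreduce.LzoTextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text])
  .map(_._2.toString)  // copy out the line text; Hadoop reuses Writables
lines.take(5).foreach(println)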

Cheers!
Andrew


On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava
<rajeev@silverline-da.com> wrote:

> Thanks for your suggestion. I will try this and update by late evening.
>
> regards
> Rajeev
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
>
>
> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <andrew@andrewash.com> wrote:
>
>> Hi Rajeev,
>>
>> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
>> input format above, while Stephen referred to
>> com.hadoop.mapreduce.LzoTextInputFormat.
>>
>> I think the way to use this in Spark would be to use the
>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>> the path and the InputFormat as parameters.  Can you give those a shot?
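>>
>> For the old-API format you're using, a rough sketch might be (path
>> hypothetical; I haven't run this):
>>
>> sc.hadoopFile("hdfs:///path/to/myfile.lzo",
>>   classOf[com.hadoop.mapred.DeprecatedLzoTextInputFormat],
>>   classOf[org.apache.hadoop.io.LongWritable],
>>   classOf[org.apache.hadoop.io.Text]).count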
>>
>> Andrew
>>
>>
>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava
>> <rajeev@silverline-da.com> wrote:
>>
>>> Hi Stephen,
>>>      I tried the same LZO file with a simple Hadoop streaming script,
>>> and it seems to work fine:
>>>
>>> HADOOP_HOME=/usr/lib/hadoop
>>> /usr/bin/hadoop jar \
>>>   /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>   -input /tmp/ldpc.sstv3.lzo \
>>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>   -output wc_test \
>>>   -mapper 'cat' \
>>>   -reducer 'wc -l'
>>>
>>> This means Hadoop is able to handle the LZO file correctly.
>>>
>>> Can you suggest what I should do in Spark to make it work?
>>>
>>> regards
>>> Rajeev
>>>
>>>
>>> Rajeev Srivastava
>>> Silverline Design Inc
>>> 2118 Walsh ave, suite 204
>>> Santa Clara, CA, 95050
>>> cell : 408-409-0940
>>>
>>>
>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman
>>> <stephen.haberman@gmail.com> wrote:
>>>
>>>>
>>>> > System.setProperty("spark.io.compression.codec",
>>>> > "com.hadoop.compression.lzo.LzopCodec")
>>>>
>>>> This spark.io.compression.codec is a completely different setting from
>>>> the codecs used for reading/writing from HDFS. (It is for compressing
>>>> Spark's internal, non-HDFS intermediate output.)
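>>>>
>>>> To illustrate the distinction, roughly (codec class names are
>>>> illustrative, not a recommendation):
>>>>
>>>> // Controls compression of Spark's own intermediate data, such as
>>>> // shuffle output; it never affects how HDFS input files are decoded:
>>>> System.setProperty("spark.io.compression.codec",
>>>>   "org.apache.spark.io.LZFCompressionCodec")
>>>>
>>>> // HDFS read/write codecs are configured on the Hadoop side instead,
>>>> // via io.compression.codecs in core-site.xml, e.g.:
>>>> //   com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec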
>>>>
>>>> > Hope this helps and someone can help read an LZO file
>>>>
>>>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>>>> reading LZO files would be Hadoop issues. I would search the Hadoop
>>>> issue tracker and look for information on using LZO files with
>>>> Hadoop/Hive; whatever works for them should magically work for Spark
>>>> as well.
>>>>
>>>> This looks like a good place to start:
>>>>
>>>> https://github.com/twitter/hadoop-lzo
>>>>
>>>> IANAE, but I would try passing one of these:
>>>>
>>>>
>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>
>>>> To the SparkContext.hadoopFile method.
>>>>
>>>> - Stephen
>>>>
>>>>
>>>
>>
>
