spark-dev mailing list archives

From Hector Yee <hector....@gmail.com>
Subject Re: Storing large data for MLlib machine learning
Date Wed, 01 Apr 2015 19:52:09 GMT
I just use sc.textFile followed by a .map(decode).
Yes, by default it is multiple files: our training data is 1 TB, gzipped
into 5000 shards.
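
For readers who want to see the shape of that pipeline, here is a minimal
sketch in plain Python rather than Spark. It assumes each line holds one
base64-encoded record. The record layout (a label followed by features,
packed as doubles with `struct`) is a hypothetical stand-in for the Thrift
struct, and `encode`, `decode`, `write_shard` and `read_shards` are
illustrative names, not anything from the thread.

```python
import base64
import glob
import gzip
import os
import struct
import tempfile


def encode(label, features):
    """Pack one record to bytes (stand-in for Thrift), then base64 it into one text line."""
    payload = struct.pack("<%dd" % (len(features) + 1), label, *features)
    return base64.b64encode(payload).decode("ascii")


def decode(line):
    """Inverse of encode: one base64 text line -> (label, features)."""
    payload = base64.b64decode(line.strip())
    values = struct.unpack("<%dd" % (len(payload) // 8), payload)
    return values[0], list(values[1:])


def write_shard(path, records):
    """Write records as base64 lines into one gzipped text file."""
    with gzip.open(path, "wt", encoding="ascii") as f:
        for label, features in records:
            f.write(encode(label, features) + "\n")


def read_shards(pattern):
    """Read every shard matching the glob and decode it line by line."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="ascii") as f:
            for line in f:
                yield decode(line)


def demo():
    records = [(float(i % 2), [i * 0.5, i * 1.5]) for i in range(6)]
    with tempfile.TemporaryDirectory() as d:
        # Two shards of three records each.
        write_shard(os.path.join(d, "part-00000.gz"), records[:3])
        write_shard(os.path.join(d, "part-00001.gz"), records[3:])
        return records, list(read_shards(os.path.join(d, "part-*.gz")))
```

In Spark, the per-line `decode` function would be what gets passed to `.map`
on the RDD returned by `sc.textFile`, which reads gzipped text files and
accepts a wildcard over the shards. Gzip files are not splittable, so each
shard is read as a single partition.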

On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander <alexander.ulanov@hp.com>
wrote:

> Thanks, sounds interesting! How do you load the files into Spark? Did you
> consider having multiple files instead of file lines?
>
>
>
> *From:* Hector Yee [mailto:hector.yee@gmail.com]
> *Sent:* Wednesday, April 01, 2015 11:36 AM
> *To:* Ulanov, Alexander
> *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
>
> *Subject:* Re: Storing large data for MLlib machine learning
>
>
>
> I use Thrift, then base64-encode the binary and save it as lines of a text
> file that is Snappy- or gzip-compressed.
>
>
>
> It makes it very easy to copy small chunks locally and play with subsets
> of the data, and to avoid dependencies on HDFS / Hadoop, for server stuff
> for example.
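
As a rough illustration of why line-oriented files make subsetting easy,
here is a minimal Python sketch using only the standard library. Because
each record is exactly one text line, a local sample is just the first few
lines of a shard. The payloads are arbitrary bytes standing in for
serialized Thrift records, and `write_lines`, `take_lines` and
`read_payloads` are hypothetical helper names.

```python
import base64
import gzip
import itertools
import os
import tempfile


def write_lines(path, payloads):
    """Write each binary payload as one base64 text line in a gzip file."""
    with gzip.open(path, "wt", encoding="ascii") as f:
        for payload in payloads:
            f.write(base64.b64encode(payload).decode("ascii") + "\n")


def take_lines(src, dst, n):
    """Copy the first n records of a shard into a smaller local file."""
    with gzip.open(src, "rt", encoding="ascii") as fin, \
            gzip.open(dst, "wt", encoding="ascii") as fout:
        fout.writelines(itertools.islice(fin, n))


def read_payloads(path):
    """Decode every line of a gzip file back into its binary payload."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        return [base64.b64decode(line.strip()) for line in f]


def demo():
    # Payloads contain newline (10) and NUL (0) bytes; base64 keeps them
    # from breaking the one-record-per-line framing.
    payloads = [bytes([i, 10, 0, 255]) for i in range(100)]
    with tempfile.TemporaryDirectory() as d:
        shard = os.path.join(d, "shard.gz")
        sample = os.path.join(d, "sample.gz")
        write_lines(shard, payloads)
        take_lines(shard, sample, 5)
        return payloads, read_payloads(sample)
```

The base64 step is what allows arbitrary binary records to live in a text
file: raw bytes could contain newlines, which would split a record across
lines.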
>
>
>
>
>
> On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <
> alexander.ulanov@hp.com> wrote:
>
> Thanks, Evan. What do you think about Protobuf? Twitter has a library for
> managing protobuf files in HDFS: https://github.com/twitter/elephant-bird
>
>
> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> Sent: Thursday, March 26, 2015 2:34 PM
> To: Stephen Boesch
> Cc: Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Storing large data for MLlib machine learning
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input, you couldn't pass it anything like an
> InputStream). I don't know if it has gotten any better.
>
> Parquet plays much more nicely, and there are lots of Spark-related
> projects using it already. Keep in mind that it's column-oriented, which
> might impact performance, but basically you're going to want your features
> in a byte array, and deserialization should be pretty straightforward.
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <javadba@gmail.com> wrote:
> There are some convenience methods you might consider, including:
>
>            MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoints
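
For context, `MLUtils.loadLibSVMFile` reads the LIBSVM text format, where
each line is `label index1:value1 index2:value2 ...` with one-based indices
in ascending order. The following is a minimal Python sketch of that line
format for dense vectors, not Spark's implementation; `parse_libsvm_line`
and `to_libsvm_line` are hypothetical helper names.

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label idx:val idx:val ...' (one-based indices) into (label, dense list)."""
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)
    return label, features


def to_libsvm_line(label, features):
    """Write a dense vector as a LIBSVM line, omitting zero entries."""
    items = ["%d:%r" % (i + 1, v) for i, v in enumerate(features) if v != 0.0]
    return " ".join([repr(label)] + items)
```

The format is text-based and stores only non-zero entries, so for dense
feature vectors it is convenient but not compact.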
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ulanov@hp.com>:
>
>
> > Hi,
> >
> > Could you suggest a reasonable file format for storing feature vector
> > data for machine learning in Spark MLlib? Are there any best practices
> > for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> > are that the format should be easily loaded/serialized, randomly
> > accessible, and have a small footprint (binary). I am considering
> > Parquet, HDF5, and protocol buffers (protobuf), but I have little to no
> > experience with them, so any suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>



-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*
