Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (athena.apache.org: domain of javadba@gmail.com designates
 209.85.213.173 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <9D5B00849D2CDA4386BDA89E83F69E6C0FE3AD2B@G4W3292.americas.hpqcorp.net>
References: 
 <9D5B00849D2CDA4386BDA89E83F69E6C0FE3AD2B@G4W3292.americas.hpqcorp.net>
Date: Thu, 26 Mar 2015 14:26:33 -0700
Message-ID: 
 <CACkSZy2z5FXiXtsO_O_r7xOdHF1tQ+-2RyWWWGO9oSxhw4ydUA@mail.gmail.com>
Subject: Re: Storing large data for MLlib machine learning
From: Stephen Boesch <javadba@gmail.com>
To: "Ulanov, Alexander" <alexander.ulanov@hp.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Content-Type: multipart/alternative; boundary=089e0149c0a084d314051237ab2b

--089e0149c0a084d314051237ab2b
Content-Type: text/plain; charset=UTF-8

There are some convenience methods in MLUtils you might consider, including:

           MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoints
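
For reference, loadLibSVMFile reads the LIBSVM text format, where each line
is "label index:value index:value ..." with one-based feature indices. Below
is a rough, Spark-free sketch of that layout in plain Python. The function
names parse_libsvm_line and format_libsvm_line are made up for illustration
and are not part of MLlib; this is not how Spark implements the loader.

```python
def parse_libsvm_line(line, num_features):
    """Parse one 'label index:value ...' line into (label, dense feature list).

    Hypothetical helper for illustration only; not an MLlib function.
    """
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices in the file are one-based; shift to zero-based positions.
        features[int(index) - 1] = float(value)
    return label, features


def format_libsvm_line(label, features):
    """Write a labeled dense vector as one LIBSVM line, skipping zero entries."""
    items = ["%d:%r" % (i + 1, v) for i, v in enumerate(features) if v != 0.0]
    return " ".join([repr(float(label))] + items)


if __name__ == "__main__":
    line = format_libsvm_line(1.0, [0.5, 0.0, 2.0])
    print(line)
    print(parse_libsvm_line(line, 3))
```

Note that this is a text format that stores only non-zero entries, so for
dense vectors it will likely be larger than a binary layout.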

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ulanov@hp.com>:

> Hi,
>
> Could you suggest a reasonable file format for storing feature vector
> data for machine learning in Spark MLlib? Are there any best practices
> for Spark?
>
> My data is dense feature vectors with labels. The requirements are that
> the format should be easy to load and serialize, randomly accessible, and
> have a small footprint (binary). I am considering Parquet, HDF5, and
> Protocol Buffers (protobuf), but I have little to no experience with them,
> so any suggestions would be really appreciated.
>
> Best regards, Alexander
>

--089e0149c0a084d314051237ab2b--