Date: Thu, 26 Mar 2015 14:26:33 -0700
Subject: Re: Storing large data for MLlib machine learning
From: Stephen Boesch
To: "Ulanov, Alexander"
Cc: "dev@spark.apache.org"

There are some convenience methods you might consider, including
MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoints.

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander:

> Hi,
>
> Could you suggest a reasonable file format for storing feature vector
> data for machine learning in Spark MLlib? Are there any best practices
> for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easily loaded/serialized, randomly accessible,
> and have a small footprint (binary). I am considering Parquet, HDF5, and
> Protocol Buffers (protobuf), but I have little to no experience with them,
> so any suggestions would be really appreciated.
>
> Best regards, Alexander
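
For readers unfamiliar with the input that MLUtils.loadLibSVMFile expects, here is a minimal sketch of the LIBSVM text layout: one line per example, a label followed by index:value pairs with one-based, ascending indices. It uses plain Python with no Spark dependency. The helper names to_libsvm_line and parse_libsvm_line are hypothetical and not part of any Spark API. Skipping zero entries is a convention of the format, not something the thread specifies.

```python
from typing import List, Tuple


def to_libsvm_line(label: float, features: List[float]) -> str:
    """Format one labeled dense vector as a LIBSVM text line.

    Indices are one-based and ascending; zero entries are omitted.
    (Hypothetical helper, for illustration only.)
    """
    pairs = [f"{i + 1}:{v!r}" for i, v in enumerate(features) if v != 0.0]
    return " ".join([repr(label)] + pairs)


def parse_libsvm_line(line: str, num_features: int) -> Tuple[float, List[float]]:
    """Parse one LIBSVM line back into (label, dense feature list)."""
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * num_features
    for token in tokens[1:]:
        index, value = token.split(":")
        dense[int(index) - 1] = float(value)  # one-based -> zero-based
    return label, dense


line = to_libsvm_line(1.0, [0.5, 0.0, 2.0])
print(line)
print(parse_libsvm_line(line, 3))
```

Since this is a text format, it is easy to produce and inspect, but it does not meet the binary, small-footprint, random-access requirements in the original question. Those point toward the other candidates listed there.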