Subject: Re: RDD Manipulation in Scala.
From: Sean Owen
To: user@spark.apache.org
Date: Tue, 4 Mar 2014 13:10:38 +0000
In-Reply-To: <1393938372226-2285.post@n3.nabble.com>

data.filter(_.split("\t")(1) == "A2") ?

--
Sean Owen | Director, Data Science | London

On Tue, Mar 4, 2014 at 1:06 PM, trottdw wrote:
> Hello, I am using Spark with Scala and I am attempting to understand the
> different filtering and mapping capabilities available. I haven't found an
> example of the specific task I would like to do.
>
> I am trying to read in a tab spaced text file and filter specific entries.
> I would like this filter to be applied to different "columns" and not lines.
> I was using the following to split the data but attempts to filter by
> "column" afterwards are not working.
> -----------------------------
> val data = sc.textFile("test_data.txt")
> var parsedData = data.map( _.split("\t").map(_.toString))
> ------------------------------
>
> To try to give a more concrete example of my goal,
> Suppose the data file is:
> A1 A2 A3 A4
> B1 B2 A3 A4
> C1 A2 C2 C3
>
>
> How would I filter the data based on the second column to only return those
> entries which have A2 in column two? So, that the resulting RDD would just
> be:
>
> A1 A2 A3 A4
> C1 A2 C2 C3
>
> Is there a convenient way to do this? Any suggestions or assistance would
> be appreciated.
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Manipulation-in-Scala-tp2285.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
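
In case a longer version of the one-liner is useful, here is a minimal sketch. It assumes the columns are tab-separated and counted from zero, so "column two" is index 1. A plain Seq[String] stands in for the RDD so the snippet runs without a cluster. ColumnFilterSketch, columnEquals and filterLines are names made up for illustration, not Spark API. The length check is an addition of mine: split("\t")(1) throws ArrayIndexOutOfBoundsException on a line with fewer than two fields.

```scala
// Local sketch: a Seq[String] stands in for sc.textFile("test_data.txt").
object ColumnFilterSketch {

  // Hypothetical helper (not part of Spark): true only if the zero-based
  // column `col` exists in the row and equals `value`.
  def columnEquals(fields: Array[String], col: Int, value: String): Boolean =
    fields.length > col && fields(col) == value

  // Keep the original lines whose column `col` equals `value`.
  // Note: split("\t") drops trailing empty fields; split("\t", -1) keeps them.
  def filterLines(lines: Seq[String], col: Int, value: String): Seq[String] =
    lines.filter(line => columnEquals(line.split("\t"), col, value))

  def main(args: Array[String]): Unit = {
    // The three rows from the question, tab-separated (an assumption about
    // the real file, which the question describes as "tab spaced").
    val lines = Seq("A1\tA2\tA3\tA4", "B1\tB2\tA3\tA4", "C1\tA2\tC2\tC3")

    // Option 1: filter the raw lines; the result is still a Seq[String].
    val filteredLines = filterLines(lines, 1, "A2")
    filteredLines.foreach(println)

    // Option 2: split once, then filter the parsed rows; the result is a
    // Seq[Array[String]], matching the shape of parsedData in the question.
    val parsed = lines.map(_.split("\t"))
    val filteredRows = parsed.filter(columnEquals(_, 1, "A2"))
    filteredRows.foreach(row => println(row.mkString("\t")))
  }
}
```

On the real data the same function literals can be passed to the RDD's filter, for example data.filter(line => columnEquals(line.split("\t"), 1, "A2")). Filtering data keeps an RDD[String], while filtering parsedData yields an RDD[Array[String]], so which one to use depends on whether the split fields are needed afterwards.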