Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id E648B200D3B for ; Fri, 10 Nov 2017 15:38:06 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id E4BC6160BCB; Fri, 10 Nov 2017 14:38:06 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id ACAB6160BEE for ; Fri, 10 Nov 2017 15:38:05 +0100 (CET) Received: (qmail 3956 invoked by uid 500); 10 Nov 2017 14:38:04 -0000 Mailing-List: contact reviews-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@spark.apache.org Received: (qmail 3661 invoked by uid 99); 10 Nov 2017 14:38:04 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Nov 2017 14:38:04 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 8AF9BDFC36; Fri, 10 Nov 2017 14:38:04 +0000 (UTC) From: thunterdb To: reviews@spark.apache.org Reply-To: reviews@spark.apache.org References: In-Reply-To: Subject: [GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea... Content-Type: text/plain Message-Id: <20171110143804.8AF9BDFC36@git1-us-west.apache.org> Date: Fri, 10 Nov 2017 14:38:04 +0000 (UTC) archived-at: Fri, 10 Nov 2017 14:38:07 -0000 Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150250118 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. + * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { + val flagName = FileInputFormat.INPUT_DIR_RECURSIVE + val hadoopConf = spark.sparkContext.hadoopConfiguration + val old = Option(hadoopConf.get(flagName)) + hadoopConf.set(flagName, value.toString) + try f finally { + old match { + case Some(v) => hadoopConf.set(flagName, v) + case None => hadoopConf.unset(flagName) + } + } + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- Actually, I take this comment back, since it would break on some pathological cases such as all the names being the same. When users want some samples, they most probably want a result that is a fraction of the original, whatever it may contain. @jkbradley do you prefer a something that may not be deterministic (using random numbers) or deterministic but not respecting the sampling ratio in pathological cases? The only way to do both that I can think of is deduplicating, which requires a shuffle. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org For additional commands, e-mail: reviews-help@spark.apache.org