spark-reviews mailing list archives

From cfregly <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support
Date Tue, 29 Jul 2014 22:04:10 GMT
Github user cfregly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1434#discussion_r15555717
  
    --- Diff: extras/spark-kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.streaming.kinesis
    +
    +import java.net.InetAddress
    +import java.nio.ByteBuffer
    +import java.util.UUID
    +
    +import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
    +import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor
    +import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorFactory
    +import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    +import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
    +import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.streaming.receiver.Receiver
    +import org.apache.spark.streaming.util.SystemClock
    +
    +/**
    + * Custom AWS Kinesis-specific implementation of Spark Streaming's Receiver.
    + * This implementation relies on the Kinesis Client Library (KCL) Worker as described here:
    + * https://github.com/awslabs/amazon-kinesis-client
    + * This is a custom receiver used with StreamingContext.receiverStream(Receiver) as described here:
    + * http://spark.apache.org/docs/latest/streaming-custom-receivers.html
    + * Instances of this class will get shipped to the Spark Streaming Workers to run within a Spark Executor.
    + *
    + * @param app Kinesis application name
    + * @param stream Kinesis stream name
    + * @param endpoint url of the Kinesis service
    + * @param checkpointIntervalMillis interval (millis) for Kinesis checkpointing (not Spark checkpointing).
    + *   See the Kinesis Spark Streaming documentation for more details on the different types of checkpoints.
    + * @param initialPositionInStream in the absence of Kinesis checkpoint info, the worker's initial starting position in the stream.
    + *   The values are either the beginning of the stream per Kinesis' 24-hour retention limit (InitialPositionInStream.TRIM_HORIZON)
    + *   or the tip of the stream (InitialPositionInStream.LATEST).
    + * @param storageLevel persistence strategy for RDDs and DStreams.
    + */
    +private[streaming] class KinesisReceiver(
    +  app: String,
    +  stream: String,
    +  endpoint: String,
    +  checkpointIntervalMillis: Long,
    +  initialPositionInStream: InitialPositionInStream,
    +  storageLevel: StorageLevel)
    +  extends Receiver[Array[Byte]](storageLevel) with Logging { receiver =>
    +
    +  /**
    +   *  The lazy vals below will be instantiated in the remote Executor after the closure is shipped to the Spark Worker.
    +   *  These are all lazy because they come from third-party Amazon libraries and are not Serializable.
    +   *  If they're not marked lazy, they will cause NotSerializableExceptions when shipped to the Spark Worker.
    +   */
    +
    +  /**
    +   *  workerId is lazy because we want the address of the actual Worker where the code runs, not the Driver's IP address.
    +   *  This makes a difference when running in a cluster.
    +   */
    +  lazy val workerId = InetAddress.getLocalHost.getHostAddress() + ":" + UUID.randomUUID()
    --- End diff --
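
    To see where this class fits: per the scaladoc above, a custom receiver like this one is handed to StreamingContext.receiverStream, which ships it to an Executor and calls onStart(). A minimal sketch of that wiring, with hypothetical stream name, endpoint, and intervals (the class is private[streaming], so end users would presumably go through a public helper rather than this constructor):

        import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
        import org.apache.spark.SparkConf
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val ssc = new StreamingContext(new SparkConf().setAppName("KinesisExample"), Seconds(2))

        // receiverStream() ships the receiver to an Executor, where the KCL Worker runs.
        val byteStream = ssc.receiverStream(
          new KinesisReceiver(
            "KinesisExample",                          // app name
            "myKinesisStream",                         // Kinesis stream name (hypothetical)
            "https://kinesis.us-east-1.amazonaws.com", // endpoint url (hypothetical)
            1000L,                                     // Kinesis (not Spark) checkpoint interval
            InitialPositionInStream.LATEST,            // start at the tip of the stream
            StorageLevel.MEMORY_AND_DISK_2))           // replicated persistence

        byteStream.map(bytes => new String(bytes, "UTF-8")).print()
        ssc.start()
        ssc.awaitTermination()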
    
    There can be multiple workers per host, so I can't just use the host address alone. But to answer your question, I guess I don't really need the host address at all, since I'm generating a random UUID.
    
    However, I found it useful when reviewing logs for debugging purposes. I'll keep it for now unless you have a strong objection.
    
    Good catch.
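
    For reference, the trade-off in miniature (illustrative sketch, not part of the diff; the UUID alone already guarantees uniqueness across workers on one host):

        import java.net.InetAddress
        import java.util.UUID

        // Unique even with several workers on the same host:
        val uuidOnly = UUID.randomUUID().toString

        // The host prefix adds no uniqueness, but lets you trace a
        // worker's log lines back to the machine it ran on:
        val workerId = InetAddress.getLocalHost.getHostAddress + ":" + UUID.randomUUID()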


