spark-user mailing list archives

From Pralabh Kumar <pralabhku...@gmail.com>
Subject Re: Best alternative for Category Type in Spark Dataframe
Date Fri, 16 Jun 2017 15:28:30 GMT
Hi Saatvik

You can write your own transformer to make sure the column contains only the
values you provide, and filter out rows that don't.

Something like this (Spark 2.x; imports included, and transform takes a
Dataset[_] so the override resolves correctly):


import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class CategoryTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("categoryTrans"))

  // Keep only the rows whose col1 value is in the allowed set
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.select("col1").filter("col1 in ('happy')")

  // defaultCopy works because of the no-arg constructor above
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)

  // Filtering rows does not change the schema
  override def transformSchema(schema: StructType): StructType = schema
}


Usage (in spark-shell; `import spark.implicits._` is needed for `.toDF`)

import spark.implicits._

val data = List("abce", "happy").toDF("col1")
val trans = new CategoryTransformer("1")
data.show()
trans.transform(data).show()  // only the "happy" row remains


This transformer ensures col1 only ever contains the values you allow.
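The same allowed-set check can also be expressed without a custom Transformer,
e.g. with `data.filter(col("col1").isin(allowed.toSeq: _*))`. A minimal sketch
of the underlying idea on plain Scala collections (the names and the EMOTION
values are illustrative, taken from Saatvik's example below):

```scala
// Illustrative: keep only values drawn from a fixed set of categories.
val allowed = Set("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")
val rows = List("abce", "HAPPY", "SAD")
val kept = rows.filter(allowed.contains)
println(kept)  // List(HAPPY, SAD)
```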


Regards
Pralabh Kumar

On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <saatvikshah1994@gmail.com>
wrote:

> Hi Pralabh,
>
> I want the ability to create a column whose values are restricted to a
> specific set of predefined values.
> For example, suppose I have a column called EMOTION: I want to ensure each
> row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA.
>
> Thanks and Regards,
> Saatvik Shah
>
>
> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhkumar@gmail.com>
> wrote:
>
>> Hi Saatvik
>>
>> Can you please provide an example of exactly what you want?
>>
>>
>>
>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1994@gmail.com> wrote:
>>
>>> Hi Yan,
>>>
>>> Basically the reason I was looking for the categorical datatype is as
>>> given here
>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>> ability to fix column values to specific categories. Is it possible to
>>> create a user defined data type which could do so?
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai.yan@gmail.com>
>>> wrote:
>>>
>>>> You can use some Transformers to handle categorical data,
>>>> For example,
>>>> StringIndexer encodes a string column of labels to a column of label
>>>> indices:
>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>>
>>>>
>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>>> saatvikshah1994@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns
>>>>> I have
>>>>> is of the Category type in Pandas. But there does not seem to be
>>>>> support for
>>>>> this same type in Spark. What is the best alternative?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Saatvik Shah,*
>>> *1st  Year,*
>>> *Masters in the School of Computer Science,*
>>> *Carnegie Mellon University*
>>>
>>> *https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*
>>>
>>
>
>
>
