flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: reading csv file from null value
Date Mon, 26 Oct 2015 10:01:07 GMT
Hi Philip,

the CsvInputFormat does not support reading empty fields.

I see two ways to achieve this functionality:
- Use a TextInputFormat that returns each line as a String and do the
parsing in a subsequent MapFunction
- Extend the CsvInputFormat to support empty fields
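
A minimal sketch of the first option, assuming the pipe-delimited, six-column layout from the original question. The names WebClickOpt, WebClickParser, and optInt are hypothetical and not part of Flink; empty fields are modelled as Option[Int] so that a missing value becomes None instead of causing a parse error.

```scala
// Hypothetical variant of the WebClick case class from the thread:
// columns that may be empty are modelled as Option[Int].
case class WebClickOpt(clickDate: Long, clickTime: Long, sales: Option[Int],
                       item: Option[Int], page: Option[Int], user: Option[Int])

object WebClickParser {
  // An empty (or whitespace-only) field becomes None; anything else is parsed as Int.
  private def optInt(s: String): Option[Int] = {
    val t = s.trim
    if (t.isEmpty) None else Some(t.toInt)
  }

  def parse(line: String): WebClickOpt = {
    // "|" is a regex metacharacter, so it is escaped;
    // the limit -1 keeps trailing empty fields instead of dropping them.
    val f = line.split("\\|", -1)
    require(f.length == 6, s"expected 6 fields, got ${f.length}")
    WebClickOpt(f(0).toLong, f(1).toLong,
      optInt(f(2)), optInt(f(3)), optInt(f(4)), optInt(f(5)))
  }
}
```

In a Flink job this function would typically be applied in a map over the lines of the file, for example env.readTextFile(webClickPath).map(WebClickParser.parse _) with org.apache.flink.api.scala._ imported. That assumes the Flink version in use can derive TypeInformation for case classes with Option fields. A filter such as _.sales.isEmpty then plays the role of the "_sales == null" check.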

Cheers,
Fabian

2015-10-26 10:43 GMT+01:00 Philip Lee <philjjoon@gmail.com>:

> Thanks for your reply.
>
> What if I do not use the Table API?
> The error happens when using just env.readCsvFile().
>
> I heard that using RowSerializer would handle this null value, but a
> TypeInformation error occurs when the data is converted.
>
> On Mon, Oct 26, 2015 at 10:26 AM, Maximilian Michels <mxm@apache.org>
> wrote:
>
>> As far as I know the null support was removed from the Table API because
>> it was not consistently supported across all operations. See
>> https://issues.apache.org/jira/browse/FLINK-2236
>>
>> On Fri, Oct 23, 2015 at 7:18 PM, Shiti Saxena <ssaxena.ece@gmail.com>
>> wrote:
>>
>>> For a similar problem where we wanted to preserve and track null
>>> entries, we load the CSV as a DataSet[Array[Object]] and then transform it
>>> into DataSet[Row] using a custom RowSerializer (
>>> https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null.
>>>
>>> The Table API (which supports null) can then be used on the resulting
>>> DataSet[Row].
>>>
>>>
>>> On Fri, Oct 23, 2015 at 7:38 PM, Maximilian Michels <mxm@apache.org>
>>> wrote:
>>>
>>>> Hi Philip,
>>>>
>>>> How about making the empty field of type String? Then you can read the
>>>> CSV into a DataSet and treat the empty string as a null value. Not very
>>>> nice but a workaround. As of now, Flink deliberately doesn't support null
>>>> values.
>>>>
>>>> Regards,
>>>> Max
>>>>
>>>>
>>>> On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee <philjjoon@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to load a dataset in which some fields are null, using
>>>>> readCsvFile().
>>>>>
>>>>> // e.g  _date|_click|_sales|_item|_web_page|_user
>>>>>
>>>>> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int, _item: Int, _page: Int, _user: Int)
>>>>>
>>>>> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
>>>>>
>>>>>   env.readCsvFile[WebClick](
>>>>>     webClickPath,
>>>>>     fieldDelimiter = "|",
>>>>>     includedFields = Array(0, 1, 2, 3, 4, 5)
>>>>>     // optionally also: lenient = true
>>>>>   )
>>>>> }
>>>>>
>>>>>
>>>>> I know there is an option to ignore malformed values, but I have to
>>>>> read the dataset even though it contains null values.
>>>>>
>>>>> For example, a row whose third column is null looks like this:
>>>>> 37794|24669||16705|23|54810
>>>>> I have to read the null value as well because I need to use it in a
>>>>> filter or where function ( _sales == null ).
>>>>>
>>>>> Is there a recommended way to do this?
>>>>>
>>>>> Thanks,
>>>>> Philip
>>>>>
>>>>> --
>>>>>
>>>>> ==========================================================
>>>>>
>>>>> *Hae Joon Lee*
>>>>>
>>>>>
>>>>> Now, in Germany,
>>>>>
>>>>> M.S. Candidate, Interested in Distributed System, Iterative Processing
>>>>>
>>>>> Dept. of Computer Science, Informatik in German, TUB
>>>>>
>>>>> Technical University of Berlin
>>>>>
>>>>>
>>>>> In Korea,
>>>>>
>>>>> M.S. Candidate, Computer Architecture Laboratory
>>>>>
>>>>> Dept. of Computer Science, KAIST
>>>>>
>>>>>
>>>>> Rm# 4414 CS Dept. KAIST
>>>>>
>>>>> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>>>>>
>>>>>
>>>>> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>>>>>
>>>>> ==========================================================
>>>>>
>>>>
>>>>
>>>
>>
>
