flink-dev mailing list archives

From Punit Naik <naik.puni...@gmail.com>
Subject Re: Read JSON file as input
Date Wed, 27 Apr 2016 10:35:34 GMT
I have my Apache Flink program:

import org.apache.flink.api.scala._
import scala.util.parsing.json._
object numHits extends App {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data=env.readTextFile("file:///path/to/json/file")
    val j=data.map { x => ("\"\"\""+x+"\"\"\"") }
    /*1*/ println( ((j.first(1).collect())(0)).getClass() )

    /*2*/ println( ((j.first(1).collect())(0)) )

    /*3*/ println( JSON.parseFull((j.first(1).collect())(0)) )
    }

I want to parse the input JSON file into a normal Scala Map, and for that I
am using the default scala.util.parsing.json._ library.

The output of the first println statement is class java.lang.String, which
is the type required by the JSON parsing function.

The output of the second println statement is the actual JSON string with
"\"\"\"" prepended and appended, which is also required by the JSON parser.

At this point, if I copy the output of the second println statement from
the console and pass it to the JSON.parseFull() function, it parses
properly.

Therefore the third println statement should properly parse the same
string, but it does not: it prints "None", which means parsing failed.

Why does this happen and how can I make it work?
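
A likely explanation, sketched under assumptions: the triple quotes are Scala
source syntax for a string literal, not something the JSON parser expects.
Pasting the console output into source code turns the triple-quoted text into
a literal whose value is just the JSON, which parses. At runtime, the mapped
string really does begin and end with three quote characters, so it is no
longer valid JSON and JSON.parseFull returns None. The sample record below is
hypothetical, and the snippet assumes scala.util.parsing.json is on the
classpath (on newer Scala versions it may need a separate dependency).

```scala
import scala.util.parsing.json.JSON

object TripleQuoteDemo {
  // Hypothetical sample record; stands in for one line of the input file.
  val line: String = "{\"hits\": 3}"

  // What the map in the program above produces: the same text with three
  // literal quote characters added at the start and at the end.
  val wrapped: String = "\"\"\"" + line + "\"\"\""

  def main(args: Array[String]): Unit = {
    // The unwrapped line is valid JSON, so this should print Some(...).
    println(JSON.parseFull(line))
    // The wrapped string no longer starts with '{' or '[', so parsing fails.
    println(JSON.parseFull(wrapped))
  }
}
```

For the program above, this would suggest dropping the map that adds the
quotes and passing (data.first(1).collect())(0) to JSON.parseFull directly.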

On Wed, Apr 27, 2016 at 12:41 PM, Punit Naik <naik.punit44@gmail.com> wrote:

> I just tried it and it still cannot parse it. It still takes the input as
> a dataset object rather than a string.
>
> On Wed, Apr 27, 2016 at 12:36 PM, Punit Naik <naik.punit44@gmail.com>
> wrote:
>
>> Okay Thanks a lot Fabian!
>>
>> On Wed, Apr 27, 2016 at 12:34 PM, Fabian Hueske <fhueske@gmail.com>
>> wrote:
>>
>>> You should do the parsing in a Map operator. Map applies the MapFunction
>>> to each element in the DataSet.
>>> So you can either implement another MapFunction or extend the one you
>>> have to call the JSON parser.
>>>
>>> 2016-04-27 6:40 GMT+02:00 Punit Naik <naik.punit44@gmail.com>:
>>>
>>> > Hi
>>> >
>>> > So I managed to do the map part. I stuck with the "import
>>> > scala.util.parsing.json._" library for parsing.
>>> >
>>> > First I read my JSON:
>>> >
>>> > val data=env.readTextFile("file:///home/punit/vik-in")
>>> >
>>> > Then I transformed it so that it can be parsed to a map:
>>> >
>>> > val j=data.map { x => ("\"\"\"").+(x).+("\"\"\"") }
>>> >
>>> >
>>> > I checked it by printing the first value of "j" and it looks correct.
>>> >
>>> > But when I tried to parse "j" like this:
>>> >
>>> > JSON.parseFull(j.first(1)) ; it did not parse because the object
>>> > "j.first(1)" is still a Dataset object and not a String object.
>>> >
>>> > So how can I get the underlying java object from the dataset object?
>>> >
>>> > On Tue, Apr 26, 2016 at 3:32 PM, Fabian Hueske <fhueske@gmail.com>
>>> > wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > you need to implement the MapFunction interface [1].
>>> > > Inside the MapFunction you can use any JSON parser library such as
>>> > > Jackson to parse the String.
>>> > > The exact logic depends on your use case.
>>> > >
>>> > > However, you should be careful to not initialize a new parser in each
>>> > > map() call, because that would be quite expensive.
>>> > > I recommend to extend the RichMapFunction and instantiate a parser
>>> > > in the open() method.
>>> > >
>>> > > Best, Fabian
>>> > >
>>> > > [1]
>>> > > https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html#map
>>> > > [2]
>>> > > https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-transformation-functions
>>> > >
>>> > > 2016-04-26 10:44 GMT+02:00 Punit Naik <naik.punit44@gmail.com>:
>>> > >
>>> > > > Hi Fabian
>>> > > >
>>> > > > Thanks for the reply. Yes, my JSON is separated by new lines. It
>>> > > > would have been great if you had explained the function that goes
>>> > > > inside the map. I tried to use the 'scala.util.parsing.json._'
>>> > > > library but had no luck.
>>> > > >
>>> > > > On Tue, Apr 26, 2016 at 1:11 PM, Fabian Hueske <fhueske@gmail.com>
>>> > > > wrote:
>>> > > >
>>> > > > > Hi Punit,
>>> > > > >
>>> > > > > JSON can be hard to parse in parallel due to its nested structure.
>>> > > > > It depends on the schema and (textual) representation of the JSON
>>> > > > > whether and how it can be done. The problem is that a parallel
>>> > > > > input format needs to be able to identify record boundaries
>>> > > > > without context information. This can be very easy, if your JSON
>>> > > > > data is a list of JSON objects which are separated by a new line
>>> > > > > character. However, this is hard to generalize. That's why Flink
>>> > > > > does not offer tooling for it (yet).
>>> > > > >
>>> > > > > If your JSON objects are separated by new line characters, the
>>> > > > > easiest way is to read it as text file, where each line results
>>> > > > > in a String and parse each object using a standard JSON parser.
>>> > > > > This would look like:
>>> > > > >
>>> > > > > ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>>> > > > >
>>> > > > > DataSet<String> text = env.readTextFile("/path/to/jsonfile");
>>> > > > > DataSet<YourObject> json = text.map(new YourMapFunctionWhichParsesJSON());
>>> > > > >
>>> > > > > Best, Fabian
>>> > > > >
>>> > > > > 2016-04-26 8:06 GMT+02:00 Punit Naik <naik.punit44@gmail.com>:
>>> > > > >
>>> > > > > > Hi
>>> > > > > >
>>> > > > > > I am new to Flink. I was experimenting with the DataSet API and
>>> > > > > > found out that there is no explicit method for loading a JSON
>>> > > > > > file as input. Can anyone please suggest a workaround?
>>> > > > > >
>>> > > > > > --
>>> > > > > > Thank You
>>> > > > > >
>>> > > > > > Regards
>>> > > > > >
>>> > > > > > Punit Naik
>>> > > > > >
>>> > > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > > --
>>> > > > Thank You
>>> > > >
>>> > > > Regards
>>> > > >
>>> > > > Punit Naik
>>> > > >
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > Thank You
>>> >
>>> > Regards
>>> >
>>> > Punit Naik
>>> >
>>>
>>
>>
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
>
>
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>



-- 
Thank You

Regards

Punit Naik
