flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Punit Naik <naik.puni...@gmail.com>
Subject Re: Read JSON file as input
Date Thu, 28 Apr 2016 10:34:57 GMT
I managed to fix this error. I basically had to do val j=data.map { x => (
x.replaceAll("\"","\\\"")) } instead of val j=data.map { x => ("\"\"\""+x+
"\"\"\"") }

On Wed, Apr 27, 2016 at 4:05 PM, Punit Naik <naik.punit44@gmail.com> wrote:

> I have my Apache Flink program:
>
> import org.apache.flink.api.scala._import scala.util.parsing.json._
> object numHits extends App {
>     val env = ExecutionEnvironment.getExecutionEnvironment
>     val data=env.readTextFile("file:///path/to/json/file")
>     val j=data.map { x => ("\"\"\""+x+"\"\"\"") }
>     /*1*/ println( ((j.first(1).collect())(0)).getClass() )
>
>     /*2*/ println( ((j.first(1).collect())(0)) )
>
>     /*3*/ println( JSON.parseFull((j.first(1).collect())(0)) )
>     }
>
> I want to parse the input JSON file into normal scala Map and for that I
> am using the default scala.util.parsing.json._ library.
>
> The output of the first println statement is class java.lang.String which
> is required by the JSON parsing function.
>
> Output of the second println function is the actual JSON string appended
> and prepended by "\"\"\"" which is also required by the JSON parser.
>
> Now at this point if I copy the output of the second println command
> printed in the console and pass it to the JSON.parseFull() function, it
> properly parses it.
>
> Therefore the third println function should properly parse the same
> string passed to it but it does not as it outputs a "None" string which
> means it failed.
>
> Why does this happen and how can I make it work?
>
> On Wed, Apr 27, 2016 at 12:41 PM, Punit Naik <naik.punit44@gmail.com>
> wrote:
>
>> I just tried it and it still cannot parse it. It still takes the input as
>> a dataset object rather than a string.
>>
>> On Wed, Apr 27, 2016 at 12:36 PM, Punit Naik <naik.punit44@gmail.com>
>> wrote:
>>
>>> Okay Thanks a lot Fabian!
>>>
>>> On Wed, Apr 27, 2016 at 12:34 PM, Fabian Hueske <fhueske@gmail.com>
>>> wrote:
>>>
>>>> You should do the parsing in a Map operator. Map applies the
>>>> MapFunction to
>>>> each element in the DataSet.
>>>> So you can either implement another MapFunction or extend the one you
>>>> have
>>>> to call the JSON parser.
>>>>
>>>> 2016-04-27 6:40 GMT+02:00 Punit Naik <naik.punit44@gmail.com>:
>>>>
>>>> > Hi
>>>> >
>>>> > So I managed to do the map part. I stuc with the "import
>>>> > scala.util.parsing.json._" library for parsing.
>>>> >
>>>> > First I read my JSON:
>>>> >
>>>> > val data=env.readTextFile("file:///home/punit/vik-in")
>>>> >
>>>> > Then I transformed it so that it can be parsed to a map:
>>>> >
>>>> > val j=data.map { x => ("\"\"\"").+(x).+("\"\"\"") }
>>>> >
>>>> >
>>>> > I check it by printing "j"s 1st value and its proper.
>>>> >
>>>> > But when I tried to parse "j" like this:
>>>> >
>>>> > JSON.parseFull(j.first(1)) ; it did not parse because the object
>>>> > "j.first(1)" is still a Dataset object and not a String object.
>>>> >
>>>> > So how can I get the underlying java object from the dataset object?
>>>> >
>>>> > On Tue, Apr 26, 2016 at 3:32 PM, Fabian Hueske <fhueske@gmail.com>
>>>> wrote:
>>>> >
>>>> > > Hi,
>>>> > >
>>>> > > you need to implement the MapFunction interface [1].
>>>> > > Inside the MapFunction you can use any JSON parser library such
as
>>>> > Jackson
>>>> > > to parse the String.
>>>> > > The exact logic depends on your use case.
>>>> > >
>>>> > > However, you should be careful to not initialize a new parser in
>>>> each
>>>> > map()
>>>> > > call, because that would be quite expensive.
>>>> > > I recommend to extend the RichMapFunction and instantiate a parser
>>>> in the
>>>> > > open() method.
>>>> > >
>>>> > > Best, Fabian
>>>> > >
>>>> > > [1]
>>>> > >
>>>> > >
>>>> >
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html#map
>>>> > > [2]
>>>> > >
>>>> > >
>>>> >
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-transformation-functions
>>>> > >
>>>> > > 2016-04-26 10:44 GMT+02:00 Punit Naik <naik.punit44@gmail.com>:
>>>> > >
>>>> > > > Hi Fabian
>>>> > > >
>>>> > > > Thanks for the reply. Yes my json is separated by new lines.
It
>>>> would
>>>> > > have
>>>> > > > been great if you had explained the function that goes inside
the
>>>> map.
>>>> > I
>>>> > > > tried to use the 'scala.util.parsing.json._' library but got
no
>>>> luck.
>>>> > > >
>>>> > > > On Tue, Apr 26, 2016 at 1:11 PM, Fabian Hueske <fhueske@gmail.com
>>>> >
>>>> > > wrote:
>>>> > > >
>>>> > > > > Hi Punit,
>>>> > > > >
>>>> > > > > JSON can be hard to parse in parallel due to its nested
>>>> structure. It
>>>> > > > > depends on the schema and (textual) representation of
the JSON
>>>> > whether
>>>> > > > and
>>>> > > > > how it can be done. The problem is that a parallel input
format
>>>> needs
>>>> > > to
>>>> > > > be
>>>> > > > > able to identify record boundaries without context information.
>>>> This
>>>> > > can
>>>> > > > be
>>>> > > > > very easy, if your JSON data is a list of JSON objects
which are
>>>> > > > separated
>>>> > > > > by a new line character. However, this is hard to generalize.
>>>> That's
>>>> > > why
>>>> > > > > Flink does not offer tooling for it (yet).
>>>> > > > >
>>>> > > > > If your JSON objects are separated by new line characters,
the
>>>> > easiest
>>>> > > > way
>>>> > > > > is to read it as text file, where each line results in
a String
>>>> and
>>>> > > parse
>>>> > > > > each object using a standard JSON parser. This would
look like:
>>>> > > > >
>>>> > > > > ExecutionEnvironment env =
>>>> > > > ExecutionEnvironment.getExecutionEnvironment();
>>>> > > > >
>>>> > > > > DataSet<String> text = env.readTextFile("/path/to/jsonfile");
>>>> > > > > DataSet<YourObject> json = text.map(new
>>>> > > > YourMapFunctionWhichParsesJSON());
>>>> > > > >
>>>> > > > > Best, Fabian
>>>> > > > >
>>>> > > > > 2016-04-26 8:06 GMT+02:00 Punit Naik <naik.punit44@gmail.com>:
>>>> > > > >
>>>> > > > > > Hi
>>>> > > > > >
>>>> > > > > > I am new to Flink. I was experimenting with the
Dataset API
>>>> and
>>>> > found
>>>> > > > out
>>>> > > > > > that there is no explicit method for loading a JSON
file as
>>>> input.
>>>> > > Can
>>>> > > > > > anyone please suggest me a workaround?
>>>> > > > > >
>>>> > > > > > --
>>>> > > > > > Thank You
>>>> > > > > >
>>>> > > > > > Regards
>>>> > > > > >
>>>> > > > > > Punit Naik
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > > >
>>>> > > >
>>>> > > > --
>>>> > > > Thank You
>>>> > > >
>>>> > > > Regards
>>>> > > >
>>>> > > > Punit Naik
>>>> > > >
>>>> > >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Thank You
>>>> >
>>>> > Regards
>>>> >
>>>> > Punit Naik
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Thank You
>>>
>>> Regards
>>>
>>> Punit Naik
>>>
>>
>>
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
>
>
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>



-- 
Thank You

Regards

Punit Naik

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message