drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pradeeban Kathiravelu <kk.pradee...@gmail.com>
Subject Re: Nested/n-dimensional Arrays in Mongo/Json and Drill
Date Wed, 14 Sep 2016 14:03:13 GMT
Hi Sudeesh,
Probably I was not clear in my previous email. The error is due to
accessing an array from the SELECT directly, which is not supported by
Drill. My question was, how to overcome this hurdle without using the
performance-hungry FLATTEN keyword or using the array indices (impractical
as the data is not structured and we do not know the number of elements,
and it is going to be a lot and a 3-D array). I am planning to use
Protobufs with Drill to serialize and deserialize the data.

Anyway, since you asked, given below is the error logs when accessing the
nested array element using SELECT. (The entire query and the sample data
were given in the previous email).


Error: SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 98390161-00e7-4982-b40b-171c8f1029c3 on llovizna:31010]

  (java.lang.NullPointerException) null
    org.apache.drill.exec.vector.complex.impl.MapOrListWriterImpl.list():73
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():139
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():744 (state=,code=0)


Thank you.
Regards,
Pradeeban.

On Tue, Sep 13, 2016 at 8:34 PM, Sudheesh Katkam <skatkam@maprtech.com>
wrote:

> Hi Pradeeban,
>
> Can you post the detailed error message?
>
> First set the option:
> > SET `exec.errors.verbose` = true;
>
> And then run the query. The detailed output will point us to where the
> error occurred.
>
> Thank you,
> Sudheesh
>
> > On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu <
> kk.pradeeban@gmail.com> wrote:
> >
> > Hi,
> > I have a complex json data stored in Mongo.
> >
> > The data has nested arrays as shown below ****.
> >
> > here when I try to access the geometry field in select queries, the Drill
> > query fails.
> >
> > select camic._id, camic.type, camic.parent_id, camic.randval,
> > camic.creation_date, camic.object_type, camic.x, camic.y,
> camic.normalized,
> > camic.bbox, camic.geometry, camic.footprint, camic.properties,
> > camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2`
> as
> > camic WHERE ((camic.provenance.image.case_id =
> 'TCGA-02-0001-01Z-00-DX1')
> > AND (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND
> (camic.y
> >> = 0) AND (camic.y <= 1));
> >
> > would fail with a below error:
> >
> >
> >
> >
> >
> > *Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id:
> > 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*
> >
> > However, after removing the geometry from the select query, it would work
> > just fine.
> >
> > select camic._id, camic.type, camic.parent_id, camic.randval,
> > camic.creation_date, camic.object_type, camic.x, camic.y,
> camic.normalized,
> > camic.bbox, camic.footprint, camic.properties, camic.provenance,
> > camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as camic WHERE
> > ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') AND
> > (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND
> (camic.y
> >> = 0) AND (camic.y <= 1));
> >
> >
> > I know this is due to the array in the output. I am also aware of the
> > commonly suggested options.
> > 1. Using the array indexes in the select query: This is *impractical*. I
> do
> > not know how many elements I would have in this geojson - the
> coordinates.
> > It may be millions or as low as 3.
> >
> > 2. Flatten keyword: I am using Drill on top of Mongo - and finding an
> > interesting case where Drill outperforms certain queries in a distributed
> > execution than just using Mongo. Using Flatten basically kills all the
> > performance benefits I have with Drill otherwise. Flatten is just plain
> > *expensive* operation for the scale of my data (around 48 GB. But I can
> > split them into a few GB each).
> >
> > Now given that I have explained why I cannot use the commonly suggested
> > options to deal with the complex Drill queries involving arrays, what are
> > the other alternatives?
> >
> > I am thinking of using Protobufs with Drill to serialize the nested
> arrays.
> > Has anyone else tried it before, and any pointers, or anyother
> suggestions
> > on this overall requirement of querying the nested arrays?
> >
> > Thank you.
> > Regards,
> > Pradeeban.
> >
> >
> > **** The sample data in Mongo (data anonymized).
> >> db.testUAIM2.findOne()
> > {
> >    "_id" : ObjectId("22"),
> >    "bbox" : [
> >        0.11,
> >        0.33,
> >        0.44,
> >        0.44
> >    ],
> >    "*geometry*" : {
> >        "type" : "Polygon",
> >        "coordinates" : [
> >            [
> >                [
> >                    0.04272915795445442,
> >                    0.9368849396705627
> >                ],
> >                [
> >                    0.04272915795445442,
> >                    0.9369083046913147
> >                ],
> >                [
> >                    0.042739588767290115,
> >                    0.9369083046913147
> >                ],
> >                [
> >                    0.042739588767290115,
> >                    0.9369550347328186
> >                ],
> >                [
> >                    0.04275001958012581,
> >                    0.9369550347328186
> >                ],
> >                [
> >                    0.04275001958012581,
> >                    0.9370251893997192
> >                ],
> >                [
> >                    0.042760446667671204,
> >                    0.9370251893997192
> >                ],
> >                [
> >                    0.042760446667671204,
> >                    0.9371420741081238
> >                ],
> >                [
> >                    0.04275001958012581,
> >                    0.9371420741081238
> >                ],
> >                [
> >                    0.04275001958012581,
> >                    0.9372121691703796
> >                ],
> >                [
> >                    0.042739588767290115,
> >                    0.9372121691703796
> >                ],
> >                [
> >                    0.042739588767290115,
> >                    0.9372823238372803
> >                ],
> >                [
> >                    0.04272915795445442,
> >                    0.9372823238372803
> >                ],
> >                [
> >                    0.04272915795445442,
> >                    0.9373056888580322
> >                ],
> >                [
> >                    0.04272915795445442,
> >                    0.9368849396705627
> >                ]
> >            ]
> >        ]
> >    },
> >    "footprint" : 5,
> >    "properties" : {
> >        "scalar_features" : [
> >            {
> >                "nv" : [
> >                    {
> >                        "name" : "f",
> >                        "value" : 16.7036895751953
> >                    }
> >                ]
> >            }
> >        ]
> >    }}
> >
> >
> > --
> > Pradeeban Kathiravelu.
> > PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
> > INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
> > Portugal.
> > Biomedical Informatics Software Engineer, Emory University School of
> > Medicine.
> >
> > Blog: [Llovizna] http://kkpradeeban.blogspot.com/
> > LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
>
>


-- 
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Biomedical Informatics Software Engineer, Emory University School of
Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message