From: takako shimamoto
Date: Tue, 5 Dec 2017 11:00:45 +0900
Subject: Re: Data lost from HBase to DataSource
To: user@predictionio.apache.org

Which version of HBase are you using? I suspect the cause is that the libraries in the storage/hbase subproject are too old. If you are using HBase 1.2.6, running the assembly task against hbase-common, hbase-client, and hbase-server 1.2.6 should work.

2017-11-30 17:25 GMT+09:00 Huang, Weiguang:
> Hi Pat,
>
> We have compared the format of 2 records (attached) from the json file for
> import. The first one was imported and successfully read in $pio train, as we
> printed out its entityId in the logger; the other cannot have been read
> into pio successfully, as its entityId is absent from the logger. But the two records
> have the same json format, as every record was generated by the same
> program.
> Here is a quick illustration of a record in json, with "encodedImage"
> shortened from its actual 262,156 characters:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties":
> {"label": "n01484850", "encodedImage": "AAABAAA…..Oynz4="}}
>
> Only "entityId" and the "properties" fields ("label", "encodedImage") differ
> between records.
>
> We also noticed another weird thing. After the one-time $pio import of 6500
> records, we ran $pio export immediately and got 399 + 399 = 798 records in 2
> exported files.
> After we ran $pio train for a couple of rounds, the number of records in pio
> increased to 399 + 399 + 399 = 1197 in 3 exported files,
> and then to 399 + 399 + 399 + 399 = 1596 after more $pio train.
>
> Please see below the system logger output for $pio import. It seems everything is
> all right.
>
> $pio import --appid 8 --input ../imageNetTemplate/data/imagenet_5_class_resized.json
>
> /opt/work/spark-2.1.1 is probably an Apache Spark development tree. Please
> make sure you are using at least 1.3.0.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> [INFO] [Runner$] Submission command: /opt/work/spark-2.1.1/bin/spark-submit --class org.apache.predictionio.tools.imprt.FileToEvents --jars file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-localfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-jdbc-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-elasticsearch1-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hbase-assembly-0.11.0-incubating.jar --files file:/opt/work/PredictionIO-0.11.0-incubating/conf/log4j.properties,file:/opt/work/hbase-1.3.1/conf/hbase-site.xml --driver-class-path /opt/work/PredictionIO-0.11.0-incubating/conf:/opt/work/hbase-1.3.1/conf --driver-java-options -Dpio.log.dir=/root file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar --appid 8 --input file:/opt/work/arda-data/pio-templates/dataImportTest/../imageNetTemplate/data/imagenet_5_class_resized.json --env PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_SOURCES_HBASE_HOSTS=Gondolin-Node-050,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_VERSION=0.11.0,PIO_FS_BASEDIR=/root/.pio_store,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost,PIO_STORAGE_SOURCES_HBASE_HOME=/opt/work/hbase-1.3.1,PIO_HOME=/opt/work/PredictionIO-0.11.0-incubating,PIO_FS_ENGINESDIR=/root/.pio_store/engines,PIO_STORAGE_SOURCES_LOCALFS_PATH=/root/.pio_store/models,PIO_STORAGE_SOURCES_HBASE_PORTS=16000,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/opt/work/elasticsearch-1.7.6,PIO_FS_TMPDIR=/root/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/opt/work/PredictionIO-0.11.0-incubating/conf,PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
> [INFO] [log] Logging initialized @4913ms
> [INFO] [Server] jetty-9.2.z-SNAPSHOT
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,AVAILABLE,@Spark}
> [snip: more ServletContextHandler "Started" lines for the other Spark UI endpoints]
> [INFO] [ServerConnector] Started Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
> [INFO] [Server] Started @5086ms
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@4f114b{/metrics/json,null,AVAILABLE,@Spark}
> [INFO] [FileToEvents$] Events are imported.
> [INFO] [FileToEvents$] Done.
> [INFO] [ServerConnector] Stopped Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
> [snip: the matching ServletContextHandler "Stopped" lines as the Spark UI shut down]
>
> Thanks for your advice.
>
> Weiguang
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Thursday, November 30, 2017 2:06 AM
> To: user@predictionio.apache.org
> Cc: Shi, Dongjie
> Subject: Re: Data lost from HBase to DataSource
>
> 1596 is how many events were accepted by the EventServer. Look at the
> exported format and compare it with the events you imported. There must be a
> formatting error or an error during import (did you check the response for
> each event import?)
>
> Looking below I see you are importing JPEG??? This is almost always a bad
> idea. Image data is usually kept in a filesystem like HDFS with a reference
> kept in the DB; in my experience there are too many serialization questions to do
> otherwise. If your Engine requires this, you are asking for the kind of
> trouble you are seeing.
>
> On Nov 28, 2017, at 7:16 PM, Huang, Weiguang wrote:
>
> Hi Pat,
>
> Here is the result when we tried out your suggestion.
>
> We checked the data in HBase, and the count of records is exactly
> the same as we imported into HBase, that is, 6500:
>
> 2017-11-29 10:42:19 INFO DAGScheduler:54 - Job 0 finished: count at
> ImageDataFromHBaseChecker.scala:27, took 12.016679 s
> Number of Records found : 6500
>
> We exported data from Pio and checked, but got only 1596 – see the bottom
> of the screen record below.
>
> $ ls -al
> total 412212
> drwxr-xr-x 2 root root 4096 Nov 29 02:48 .
> drwxr-xr-x 23 root root 4096 Nov 29 02:48 ..
> -rw-r--r-- 1 root root 8 Nov 29 02:48 ._SUCCESS.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00000.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00001.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00002.crc
> -rw-r--r-- 1 root root 817976 Nov 29 02:48 .part-00003.crc
> -rw-r--r-- 1 root root 0 Nov 29 02:48 _SUCCESS
> -rw-r--r-- 1 root root 104699844 Nov 29 02:48 part-00000
> -rw-r--r-- 1 root root 104699877 Nov 29 02:48 part-00001
> -rw-r--r-- 1 root root 104699843 Nov 29 02:48 part-00002
> -rw-r--r-- 1 root root 104699863 Nov 29 02:48 part-00003
> $ wc -l part-00000
> 399 part-00000
> $ wc -l part-00001
> 399 part-00001
> $ wc -l part-00002
> 399 part-00002
> $ wc -l part-00003
> 399 part-00003
>
> That is 399 * 4 = 1596.
>
> Is this data loss caused by a schema change, ill-formed data contents, or other
> possible reasons? We would appreciate your thoughts.
>
> Thanks,
> Weiguang
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Wednesday, November 29, 2017 10:16 AM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
> Try my suggestion with export and see if the number of events looks correct.
> I am suggesting that you may not be counting what you think you are when using
> HBase.
>
> On Nov 28, 2017, at 5:53 PM, Huang, Weiguang wrote:
>
> Hi Pat,
>
> Thanks for your advice. However, we are not using HBase directly. We use
> pio to import data into HBase with the command below:
>
> pio import --appid 7 --input hdfs://[host]:9000/pio/applicationName/recordFile.json
>
> Could things go wrong here or somewhere else?
>
> Thanks,
> Weiguang
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Tuesday, November 28, 2017 11:54 PM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
> It is dangerous to use HBase directly because the schema may change at any
> time. Export the data as json and examine it there. To see how many events
> are in the stream you can just export them, then use bash to count lines (wc
> -l); each line is a JSON event. Or import the data as a dataframe in Spark
> and use Spark SQL.
>
> There is no published contract about how events are stored in HBase.
>
> On Nov 27, 2017, at 9:24 PM, Sachin Kamkar wrote:
>
> We are also facing the exact same issue. We have confirmed 1.5 million
> records in HBase. However, I see only 19k records being fed for training
> (eventsRDD.count()).
>
> With Regards,
> Sachin
> ⚜KTBFFH⚜
>
> On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang wrote:
>
> Hi guys,
>
> I have encoded some JPEG images in json and imported them to HBase, which shows
> 6500 records. When I read that data in DataSource with Pio, however, only
> some 1500 records were fed into PIO.
> I use PEventStore.find(appName, entityType, eventNames), and all the records
> have the same entityType and eventNames.
>
> Any idea what could go wrong? The encoded string from a JPEG is very long
> (hundreds of thousands of characters); could this be a reason for the data
> loss?
>
> Thank you for looking into my question.
>
> Best,
> Weiguang