predictionio-user mailing list archives

From takako shimamoto <chiboch...@gmail.com>
Subject Re: Data lost from HBase to DataSource
Date Tue, 05 Dec 2017 02:00:45 GMT
Which version of HBase are you using?
I suspect this happens because the libraries in the storage/hbase subproject
are too old. If you are using HBase 1.2.6, running the assembly task against
hbase-common, hbase-client and hbase-server 1.2.6 should work.
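For illustration, pinning the versions in the storage/hbase subproject's build
definition might look like the sketch below (these are the standard HBase
artifact coordinates; treat the exact setup as an assumption and match the
version to your cluster):

// in the storage/hbase subproject's build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.hbase" % "hbase-common" % "1.2.6",
  "org.apache.hbase" % "hbase-client" % "1.2.6",
  "org.apache.hbase" % "hbase-server" % "1.2.6"
)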


2017-11-30 17:25 GMT+09:00 Huang, Weiguang <weiguang.huang@intel.com>:
> Hi Pat,
>
>
>
> We have compared the format of two records (attached) from the json file used
> for import. The first one was imported and successfully read by $pio train, as
> we printed its entityId in the logger; the other was apparently not read into
> pio successfully, since its entityId is absent from the logger. Yet the two
> records have the same json format, as every record was generated by the same
> program.
>
> And here is a quick illustration of a record in json, with "encodedImage"
> shortened from its actual 262,156 characters:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties":
> {"label": "n01484850", "encodedImage": "AAABAAA…..Oynz4=”}}
>
> Only "entityId", "properties": {"label", "encodedImage"} could be different
> among every record.
>
>
>
> We also noticed another weird thing. After the one-time $pio import of 6500
> records, we ran $pio export immediately and got 399 + 399 = 798 records in 2
> exported files.
>
> After we ran $pio train for a couple of rounds, the number of records in pio
> increased to 399 + 399 + 399 = 1197 in 3 exported files,
>
> and then to 399 + 399 + 399 + 399 = 1596 after more rounds of $pio train.
>
>
>
> Please see the system log for $pio import below. Everything seems to be all
> right.
>
> $pio import --appid 8 --input
> ../imageNetTemplate/data/imagenet_5_class_resized.json
>
>
>
> /opt/work/spark-2.1.1 is probably an Apache Spark development tree. Please
> make sure you are using at least 1.3.0.
>
> SLF4J: Class path contains multiple SLF4J bindings.
>
> SLF4J: Found binding in
> [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: Found binding in
> [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
>
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
> [INFO] [Runner$] Submission command: /opt/work/spark-2.1.1/bin/spark-submit
> --class org.apache.predictionio.tools.imprt.FileToEvents --jars
> file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-localfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-jdbc-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-elasticsearch1-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hbase-assembly-0.11.0-incubating.jar
> --files
> file:/opt/work/PredictionIO-0.11.0-incubating/conf/log4j.properties,file:/opt/work/hbase-1.3.1/conf/hbase-site.xml
> --driver-class-path
> /opt/work/PredictionIO-0.11.0-incubating/conf:/opt/work/hbase-1.3.1/conf
> --driver-java-options -Dpio.log.dir=/root
> file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar
> --appid 8 --input
> file:/opt/work/arda-data/pio-templates/dataImportTest/../imageNetTemplate/data/imagenet_5_class_resized.json
> --env
> PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_SOURCES_HBASE_HOSTS=Gondolin-Node-050,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_VERSION=0.11.0,PIO_FS_BASEDIR=/root/.pio_store,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost,PIO_STORAGE_SOURCES_HBASE_HOME=/opt/work/hbase-1.3.1,PIO_HOME=/opt/work/PredictionIO-0.11.0-incubating,PIO_FS_ENGINESDIR=/root/.pio_store/engines,PIO_STORAGE_SOURCES_LOCALFS_PATH=/root/.pio_store/models,PIO_STORAGE_SOURCES_HBASE_PORTS=16000,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/opt/work/elasticsearch-1.7.6,PIO_FS_TMPDIR=/root/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/opt/work/PredictionIO-0.11.0-incubating/conf,PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
>
> [INFO] [log] Logging initialized @4913ms
>
> [INFO] [Server] jetty-9.2.z-SNAPSHOT
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@11787b64{/storage,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@319642db{/environment,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@357bc488{/static,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@4ea17147{/,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,AVAILABLE,@Spark}
>
> [INFO] [ServerConnector] Started Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
>
> [INFO] [Server] Started @5086ms
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@4f114b{/metrics/json,null,AVAILABLE,@Spark}
>
> [INFO] [FileToEvents$] Events are imported.
>
> [INFO] [FileToEvents$] Done.
>
> [INFO] [ServerConnector] Stopped Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@4ea17147{/,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@357bc488{/static,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@319642db{/environment,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@11787b64{/storage,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,UNAVAILABLE,@Spark}
>
>
>
> Thanks for your advice.
>
>
>
> Weiguang
>
>
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Thursday, November 30, 2017 2:06 AM
> To: user@predictionio.apache.org
> Cc: Shi, Dongjie <dongjie.shi@intel.com>
>
>
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> 1596 is how many events were accepted by the EventServer. Look at the
> exported format and compare it with the events you imported. There must be a
> formatting error or an error during import (did you check the responses for
> each event import?)
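>
> As a rough sketch of the kind of per-line check I mean (assuming json4s,
> which PredictionIO ships with; the object name and field list are
> illustrative, not a pio tool):
>
> import org.json4s._
> import org.json4s.native.JsonMethods._
> import scala.io.Source
>
> object ValidateImportFile {
>   def main(args: Array[String]): Unit = {
>     // pio import expects one self-describing JSON event per line
>     val required = Seq("event", "entityId", "entityType")
>     Source.fromFile(args(0)).getLines().zipWithIndex.foreach { case (line, i) =>
>       try {
>         val json = parse(line)
>         val missing = required.filter(f => (json \ f) == JNothing)
>         if (missing.nonEmpty) println(s"line ${i + 1}: missing ${missing.mkString(", ")}")
>       } catch {
>         case e: Exception => println(s"line ${i + 1}: does not parse (${e.getMessage})")
>       }
>     }
>   }
> }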
>
>
>
> Looking below I see you are importing JPEGs??? This is almost always a bad
> idea. Image data is usually kept in a filesystem like HDFS, with a reference
> kept in the DB; there are too many serialization questions to do otherwise, in
> my experience. If your Engine requires this, you are asking for the kind of
> trouble you are seeing.
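>
> For illustration only (the "imagePath" property name is made up, not part of
> any template), an event that stores a reference instead of the bytes would
> look like:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties":
> {"label": "n01484850", "imagePath": "hdfs://[host]:9000/images/10004.jpg"}}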
>
>
>
>
>
> On Nov 28, 2017, at 7:16 PM, Huang, Weiguang <weiguang.huang@intel.com>
> wrote:
>
>
>
> Hi Pat,
>
>
>
> Here is the result when we tried out your suggestion.
>
>
>
> We checked the data in HBase, and the record count is exactly what we
> imported into HBase, that is, 6500.
>
> 2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at
> ImageDataFromHBaseChecker.scala:27, took 12.016679 s
>
> Number of Records found : 6500
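>
> For reference, the checker amounts to the sketch below (a reconstruction, not
> the exact code; the table name pio_event:events_8 is our guess at how pio
> names the event table for app id 8):
>
> import org.apache.hadoop.hbase.HBaseConfiguration
> import org.apache.hadoop.hbase.client.Result
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.spark.{SparkConf, SparkContext}
>
> object ImageDataFromHBaseChecker {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("hbase-count"))
>     val conf = HBaseConfiguration.create()
>     conf.set(TableInputFormat.INPUT_TABLE, "pio_event:events_8") // assumed name
>     // Scan the whole table as (rowkey, Result) pairs and count the rows
>     val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>       classOf[ImmutableBytesWritable], classOf[Result])
>     println("Number of Records found : " + rdd.count())
>   }
> }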
>
>
>
> We exported data from pio and checked, but got only 1596 records; see the
> bottom of the screen capture below.
>
> $ ls -al
>
> total 412212
>
> drwxr-xr-x  2 root root      4096 Nov 29 02:48 .
>
> drwxr-xr-x 23 root root      4096 Nov 29 02:48 ..
>
> -rw-r--r--  1 root root         8 Nov 29 02:48 ._SUCCESS.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00000.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00001.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00002.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00003.crc
>
> -rw-r--r--  1 root root         0 Nov 29 02:48 _SUCCESS
>
> -rw-r--r--  1 root root 104699844 Nov 29 02:48 part-00000
>
> -rw-r--r--  1 root root 104699877 Nov 29 02:48 part-00001
>
> -rw-r--r--  1 root root 104699843 Nov 29 02:48 part-00002
>
> -rw-r--r--  1 root root 104699863 Nov 29 02:48 part-00003
>
> $ wc -l part-00000
>
> 399 part-00000
>
> $ wc -l part-00001
>
> 399 part-00001
>
> $ wc -l part-00002
>
> 399 part-00002
>
> $ wc -l part-00003
>
> 399 part-00003
>
> That is 399 * 4 = 1596
>
>
>
> Is this data loss caused by a schema change, bad data contents, or some other
> reason? We would appreciate your thoughts.
>
>
>
> Thanks,
>
> Weiguang
>
>
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Wednesday, November 29, 2017 10:16 AM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> Try my suggestion with export and see if the number of events looks correct.
> I am suggesting that you may not be counting what you think you are when you
> query HBase directly.
>
>
>
>
>
> On Nov 28, 2017, at 5:53 PM, Huang, Weiguang <weiguang.huang@intel.com>
> wrote:
>
>
>
> Hi Pat,
>
>
>
> Thanks for your advice. However, we are not using HBase directly. We use
> pio to import data into HBase with the command below:
>
> pio import --appid 7 --input
> hdfs://[host]:9000/pio/applicationName/recordFile.json
>
> Could things go wrong here or somewhere else?
>
>
>
> Thanks,
>
> Weiguang
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Tuesday, November 28, 2017 11:54 PM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> It is dangerous to use HBase directly because the schema may change at any
> time. Export the data as json and examine it there. To see how many events
> are in the stream, you can just export and then use bash to count lines (wc
> -l); each line is a JSON event. Or load the data as a dataframe in Spark
> and use Spark SQL.
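>
> A minimal sketch of the Spark route (the path is illustrative):
>
> import org.apache.spark.sql.SparkSession
>
> // Load the files written by `pio export` (one JSON event per line)
> val spark = SparkSession.builder().appName("event-check").getOrCreate()
> val events = spark.read.json("/path/to/exported/part-*")
> println(events.count())                 // events actually stored
> events.groupBy("event").count().show()  // breakdown by event name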
>
>
>
> There is no published contract about how events are stored in HBase.
>
>
>
>
>
> On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <sachinkamkar@gmail.com> wrote:
>
>
>
> We are also facing the exact same issue. We have confirmed 1.5 million
> records in HBase. However, I see only 19k records being fed for training
> (eventsRDD.count()).
>
>
> With Regards,
>
>
>
>      Sachin
>
> ⚜KTBFFH⚜
>
>
>
> On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <weiguang.huang@intel.com>
> wrote:
>
> Hi guys,
>
>
>
> I have encoded some JPEG images in json and imported them to HBase, which
> shows 6500 records. When I read that data in the DataSource with pio, however,
> only some 1500 records were fed into PIO.
>
> I use PEventStore.find(appName, entityType, eventNames), and all the records
> have the same entityType and eventNames.
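>
> Roughly, the read looks like this sketch (the appName is illustrative; ours
> comes from engine.json):
>
> import org.apache.predictionio.data.storage.Event
> import org.apache.predictionio.data.store.PEventStore
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
>
> def readEvents(sc: SparkContext): RDD[Event] = {
>   val eventsRDD = PEventStore.find(
>     appName = "imageNetApp",             // illustrative
>     entityType = Some("JPEG"),
>     eventNames = Some(List("imageNet"))
>   )(sc)
>   println("events read: " + eventsRDD.count())
>   eventsRDD
> }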
>
>
>
> Any idea what could be going wrong? The encoded string from a JPEG is very
> long, hundreds of thousands of characters; could this be a reason for the
> data loss?
>
>
>
> Thank you for looking into my question.
>
>
>
> Best,
>
> Weiguang
>
>
