predictionio-user mailing list archives

From "Huang, Weiguang" <weiguang.hu...@intel.com>
Subject RE: Data lost from HBase to DataSource
Date Tue, 05 Dec 2017 05:40:51 GMT
Thanks Takako. I will give it a try.

Best,
Weiguang

-----Original Message-----
From: takako shimamoto [mailto:chibochibo@gmail.com] 
Sent: Tuesday, December 5, 2017 10:01 AM
To: user@predictionio.apache.org
Subject: Re: Data lost from HBase to DataSource

Which version of HBase are you using?
I guess it is because the libraries of the storage/hbase subproject are too old.
If you are using HBase 1.2.6, running the assembly task against hbase-common, hbase-client and hbase-server 1.2.6 should work.
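
A minimal sketch of what that could look like, assuming you pin the versions
in the storage/hbase subproject's build definition and then re-run its
assembly task; the exact build file layout in your checkout may differ:

    // Hedged sketch: pin the HBase client libraries to the cluster's version
    // (1.2.6 per the suggestion above), then re-run the subproject's assembly.
    libraryDependencies ++= Seq(
      "org.apache.hbase" % "hbase-common" % "1.2.6",
      "org.apache.hbase" % "hbase-client" % "1.2.6",
      "org.apache.hbase" % "hbase-server" % "1.2.6"
    )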


2017-11-30 17:25 GMT+09:00 Huang, Weiguang <weiguang.huang@intel.com>:
> Hi Pat,
>
>
>
> We have compared the format of 2 records (attached) from the JSON file used
> for import. The first one was imported and successfully read in $pio train,
> as we printed its entityId in the logger; the other was apparently not read
> into pio successfully, as its entityId is absent from the logger. Yet the
> two records have the same JSON format, since every record was generated by
> the same program.
>
> And here is a quick illustration of a record in JSON, with "encodedImage"
> shortened from its actual 262,156 characters:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties": {"label": "n01484850", "encodedImage": "AAABAAA…..Oynz4="}}
>
> Only "entityId", "properties": {"label", "encodedImage"} could be 
> different among every record.
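>
> One way to double-check that is to verify that every line of the import
> file parses as JSON. A minimal sketch for spark-shell, assuming the json4s
> library that Spark already ships with and the input path from the import
> command below:
>
>     import org.json4s.jackson.JsonMethods._
>     // Count lines that fail to parse as JSON; 0 is expected if all
>     // records really share the same well-formed format.
>     val lines = sc.textFile("file:/opt/work/imageNetTemplate/data/imagenet_5_class_resized.json")
>     val bad = lines.filter(l => scala.util.Try(parse(l)).isFailure)
>     println(s"total=${lines.count()} unparseable=${bad.count()}")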
>
>
>
> We also noticed another weird thing. After the one-time $pio import of 6500
> records, we ran $pio export immediately and got 399 + 399 = 798 records in
> 2 exported files.
>
> After we ran $pio train for a couple of rounds, the number of records in
> pio increased to 399 + 399 + 399 = 1197 in 3 exported files, and then to
> 399 + 399 + 399 + 399 = 1596 after further $pio train runs.
>
>
>
> Please see below the system log for $pio import. It seems everything
> there is all right.
>
> $pio import --appid 8 --input ../imageNetTemplate/data/imagenet_5_class_resized.json
>
>
>
> /opt/work/spark-2.1.1 is probably an Apache Spark development tree. Please make sure you are using at least 1.3.0.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
> [INFO] [Runner$] Submission command: /opt/work/spark-2.1.1/bin/spark-submit --class org.apache.predictionio.tools.imprt.FileToEvents
>   --jars file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-localfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-jdbc-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-elasticsearch1-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hbase-assembly-0.11.0-incubating.jar
>   --files file:/opt/work/PredictionIO-0.11.0-incubating/conf/log4j.properties,file:/opt/work/hbase-1.3.1/conf/hbase-site.xml
>   --driver-class-path /opt/work/PredictionIO-0.11.0-incubating/conf:/opt/work/hbase-1.3.1/conf
>   --driver-java-options -Dpio.log.dir=/root
>   file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar
>   --appid 8 --input file:/opt/work/arda-data/pio-templates/dataImportTest/../imageNetTemplate/data/imagenet_5_class_resized.json
>   --env PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_SOURCES_HBASE_HOSTS=Gondolin-Node-050,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_VERSION=0.11.0,PIO_FS_BASEDIR=/root/.pio_store,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost,PIO_STORAGE_SOURCES_HBASE_HOME=/opt/work/hbase-1.3.1,PIO_HOME=/opt/work/PredictionIO-0.11.0-incubating,PIO_FS_ENGINESDIR=/root/.pio_store/engines,PIO_STORAGE_SOURCES_LOCALFS_PATH=/root/.pio_store/models,PIO_STORAGE_SOURCES_HBASE_PORTS=16000,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/opt/work/elasticsearch-1.7.6,PIO_FS_TMPDIR=/root/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/opt/work/PredictionIO-0.11.0-incubating/conf,PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
>
> [INFO] [log] Logging initialized @4913ms
> [INFO] [Server] jetty-9.2.z-SNAPSHOT
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@11787b64{/storage,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@319642db{/environment,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@357bc488{/static,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@4ea17147{/,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,AVAILABLE,@Spark}
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,AVAILABLE,@Spark}
> [INFO] [ServerConnector] Started Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
> [INFO] [Server] Started @5086ms
> [INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@4f114b{/metrics/json,null,AVAILABLE,@Spark}
>
> [INFO] [FileToEvents$] Events are imported.
>
> [INFO] [FileToEvents$] Done.
>
> [INFO] [ServerConnector] Stopped Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@4ea17147{/,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@357bc488{/static,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@319642db{/environment,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@11787b64{/storage,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,UNAVAILABLE,@Spark}
> [INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,UNAVAILABLE,@Spark}
>
>
> Thanks for your advice.
>
>
>
> Weiguang
>
>
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Thursday, November 30, 2017 2:06 AM
> To: user@predictionio.apache.org
> Cc: Shi, Dongjie <dongjie.shi@intel.com>
>
>
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> 1596 is how many events were accepted by the EventServer. Look at the
> exported format and compare it with the events you imported. There must be
> a formatting error or an error during import (did you check the response
> for each event import?)
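>
> As a quick way to quantify the gap, one could compare line counts in
> spark-shell. A minimal sketch, assuming the input path from this thread;
> the export directory is hypothetical:
>
>     // One event per line in both the import file and the export.
>     val in  = sc.textFile("file:/opt/work/imageNetTemplate/data/imagenet_5_class_resized.json").count()
>     val out = sc.textFile("file:/path/to/pio-export/part-*").count()
>     println(s"imported file lines: $in, exported events: $out")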
>
>
>
> Looking below, I see you are importing JPEGs??? This is almost always a
> bad idea. Image data is usually kept in a filesystem like HDFS with a
> reference kept in the DB; there are too many serialization questions to do
> otherwise, in my experience. If your Engine requires this, you are asking
> for the kind of trouble you are seeing.
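>
> For illustration, a record following that advice might keep only a
> reference to the image; the "imagePath" field name here is hypothetical:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties": {"label": "n01484850", "imagePath": "hdfs://[host]:9000/images/10004.jpg"}}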
>
>
>
>
>
> On Nov 28, 2017, at 7:16 PM, Huang, Weiguang <weiguang.huang@intel.com> wrote:
>
>
>
> Hi Pat,
>
>
>
> Here is the result when we tried out your suggestion.
>
>
>
> We checked the data in HBase, and the count of records is exactly the same
> as what we imported into HBase, that is, 6500.
>
> 2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at ImageDataFromHBaseChecker.scala:27, took 12.016679 s
>
> Number of Records found : 6500
>
>
>
> We exported data from pio and checked, but got only 1596 – see the bottom
> of the terminal capture below.
>
> $ ls -al
> total 412212
> drwxr-xr-x  2 root root      4096 Nov 29 02:48 .
> drwxr-xr-x 23 root root      4096 Nov 29 02:48 ..
> -rw-r--r--  1 root root         8 Nov 29 02:48 ._SUCCESS.crc
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00000.crc
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00001.crc
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00002.crc
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00003.crc
> -rw-r--r--  1 root root         0 Nov 29 02:48 _SUCCESS
> -rw-r--r--  1 root root 104699844 Nov 29 02:48 part-00000
> -rw-r--r--  1 root root 104699877 Nov 29 02:48 part-00001
> -rw-r--r--  1 root root 104699843 Nov 29 02:48 part-00002
> -rw-r--r--  1 root root 104699863 Nov 29 02:48 part-00003
> $ wc -l part-00000
> 399 part-00000
> $ wc -l part-00001
> 399 part-00001
> $ wc -l part-00002
> 399 part-00002
> $ wc -l part-00003
> 399 part-00003
>
> That is 399 * 4 = 1596
>
>
>
> Is this data loss caused by a schema change, bad data contents, or some
> other reason? We appreciate your thoughts.
>
>
>
> Thanks,
>
> Weiguang
>
>
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Wednesday, November 29, 2017 10:16 AM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> Try my suggestion with export and see if the number of events looks correct.
> I am suggesting that you may not be counting what you think you are when
> you query HBase directly.
>
>
>
>
>
> On Nov 28, 2017, at 5:53 PM, Huang, Weiguang <weiguang.huang@intel.com> wrote:
>
>
>
> Hi Pat,
>
>
>
> Thanks for your advice. However, we are not using HBase directly. We use
> pio to import data into HBase with the command below:
>
> pio import --appid 7 --input hdfs://[host]:9000/pio/applicationName/recordFile.json
>
> Could things go wrong here or somewhere else?
>
>
>
> Thanks,
>
> Weiguang
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Tuesday, November 28, 2017 11:54 PM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> It is dangerous to use HBase directly because the schema may change at any
> time. Export the data as JSON and examine it there. To see how many events
> are in the stream, you can just export and then use bash to count lines
> (wc -l); each line is a JSON event. Or import the data as a dataframe in
> Spark and use Spark SQL.
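>
> A minimal sketch of the Spark route, assuming spark-shell 2.x and a
> hypothetical export directory (pio export writes one JSON event per line,
> so spark.read.json can load it directly):
>
>     val events = spark.read.json("file:/path/to/pio-export/part-*")
>     events.createOrReplaceTempView("events")
>     // Break the counts down by event and entityType to spot missing groups.
>     spark.sql("SELECT event, entityType, COUNT(*) AS n FROM events GROUP BY event, entityType").show()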
>
>
>
> There is no published contract about how events are stored in HBase.
>
>
>
>
>
> On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <sachinkamkar@gmail.com> wrote:
>
>
>
> We are also facing the exact same issue. We have confirmed 1.5 million 
> records in HBase. However, I see only 19k records being fed for 
> training (eventsRDD.count()).
>
>
> With Regards,
>
>
>
>      Sachin
>
> ⚜KTBFFH⚜
>
>
>
> On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <weiguang.huang@intel.com> wrote:
>
> Hi guys,
>
>
>
> I have encoded some JPEG images in JSON and imported them to HBase, which
> shows 6500 records. When I read those data in the DataSource with pio,
> however, only some 1500 records were fed into PIO.
>
> I use PEventStore.find(appName, entityType, eventNames), and all the
> records have the same entityType and eventNames.
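>
> For reference, a sketch of that read as it would appear in a DataSource's
> readTraining, per the PEventStore API in PredictionIO 0.11.0; the appName
> value here is an example:
>
>     import org.apache.predictionio.data.store.PEventStore
>     // Read all "imageNet" events of entityType "JPEG" for this app.
>     val eventsRDD = PEventStore.find(
>       appName = "imageNetApp",
>       entityType = Some("JPEG"),
>       eventNames = Some(List("imageNet"))
>     )(sc)
>     println(s"records read: ${eventsRDD.count()}")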
>
>
>
> Any idea what could go wrong? The encoded string from each JPEG is very
> long (hundreds of thousands of characters); could this be a reason for the
> data loss?
>
>
>
> Thank you for looking into my question.
>
>
>
> Best,
>
> Weiguang
>
>