predictionio-user mailing list archives

From "Huang, Weiguang" <>
Subject RE: Data lost from HBase to DataSource
Date Wed, 29 Nov 2017 01:53:51 GMT
Hi Pat,

Thanks for your advice. However, we are not using HBase directly. We use pio to import data
into HBase with the command below:
pio import --appid 7 --input hdfs://[host]:9000/pio/applicationName/recordFile.json
Could things go wrong here, or somewhere else?
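One thing worth ruling out before blaming the import step is the input file itself: pio import expects one JSON event per line, so the line count of the file should match the record count you later see in HBase. A minimal sketch of such a pre-import sanity check (the file path and event fields here are made up for illustration):

```shell
# Hypothetical sample of the newline-delimited JSON event format pio import expects.
cat > /tmp/recordFile.json <<'EOF'
{"event":"image","entityType":"photo","entityId":"1","properties":{"data":"...base64..."}}
{"event":"image","entityType":"photo","entityId":"2","properties":{"data":"...base64..."}}
EOF

# One JSON event per line, so wc -l gives the expected record count.
wc -l < /tmp/recordFile.json

# Length of the longest line; very long property values (e.g. base64-encoded
# images) are worth checking when records seem to go missing.
awk '{ if (length > max) max = length } END { print max }' /tmp/recordFile.json
```

If the line count already disagrees with what you expect, the problem is upstream of HBase.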

From: Pat Ferrel []
Sent: Tuesday, November 28, 2017 11:54 PM
Subject: Re: Data lost from HBase to DataSource

It is dangerous to use HBase directly because the schema may change at any time. Export the
data as JSON and examine it there. To see how many events are in the stream you can just export
them and use bash to count lines (wc -l) — each line is a JSON event. Or load the data as a
dataframe in Spark and use Spark SQL.
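The export-then-count workflow might look like the sketch below. The app id and paths are examples, and the export produces a directory of part files rather than a single file (the part files here are fabricated so the counting step can be shown end to end; check pio export --help for the exact flags on your version):

```shell
# Hypothetical export command (run against a live PredictionIO event server):
#   pio export --appid 7 --output /tmp/pio-export --format json

# Simulated export output: a directory of part files, one JSON event per line.
mkdir -p /tmp/pio-export
printf '{"event":"image","entityId":"1"}\n' >  /tmp/pio-export/part-00000
printf '{"event":"image","entityId":"2"}\n' >  /tmp/pio-export/part-00001
printf '{"event":"image","entityId":"3"}\n' >> /tmp/pio-export/part-00001

# Total events across all part files:
cat /tmp/pio-export/part-* | wc -l
```

Comparing this number against eventsRDD.count() in the DataSource narrows down whether events are missing from the store or being dropped while reading.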

There is no published contract about how events are stored in HBase.

On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <<>>

We are also facing the exact same issue. We have confirmed 1.5 million records in HBase. However,
I see only 19k records being fed for training (eventsRDD.count()).

With Regards,


On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <<>>
Hi guys,

I have encoded some JPEG images as JSON and imported them into HBase, which shows 6500 records. However,
when I read the data in my DataSource with PIO, only about 1500 records are fed in.
I use PEventStore.find(appName, entityType, eventNames), and all the records have the same
entityType and eventNames.

Any idea what could go wrong? The encoded string from a JPEG is very long, hundreds of thousands
of characters — could this be a reason for the data loss?

Thank you for looking into my question.

