hbase-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Loading into hbase from csv file issue
Date Tue, 04 Oct 2016 07:23:54 GMT
Thanks again.

If I store the stock information once in a row keyed "TSCO" and do not
repeat it in the rest of the rows, how would that work for the row key?

Currently this is the way table tsco is defined:

 create 'tsco','stock_daily'

and these are the attributes of the stock_daily column family:

hbase(main):144:0* scan 'tsco', LIMIT => 1
ROW                  COLUMN+CELL
 TSCO-1-Apr-08       column=stock_daily:Date, timestamp=1475525222488, value=1-Apr-08
 TSCO-1-Apr-08       column=stock_daily:close, timestamp=1475525222488, value=405.25
 TSCO-1-Apr-08       column=stock_daily:high, timestamp=1475525222488, value=406.75
 TSCO-1-Apr-08       column=stock_daily:low, timestamp=1475525222488, value=379.25
 TSCO-1-Apr-08       column=stock_daily:open, timestamp=1475525222488, value=380.00
 TSCO-1-Apr-08       column=stock_daily:stock, timestamp=1475525222488, value=TESCO PLC
 TSCO-1-Apr-08       column=stock_daily:ticker, timestamp=1475525222488, value=TSCO
 TSCO-1-Apr-08       column=stock_daily:volume, timestamp=1475525222488, value=49664486

Note that column=stock_daily:stock and column=stock_daily:ticker are
repeated in every row, which may not be efficient.

Kindly suggest the best way of creating the row key, and whether it is
necessary to store those two columns at all.
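
To make the question concrete, here is a minimal sketch of how I understand
the suggestion: one row keyed by the ticker alone holds the stock name, and
the per-day rows keyed by ticker-date hold only the daily figures. The file
names, the temporary directory and the two sample lines are assumptions made
up for illustration; only the column order comes from my csv file.

```shell
#!/usr/bin/env bash
# Sketch only: split the raw csv into one "info" row keyed by the ticker
# and one "daily" row per date keyed by ticker-date.
# Assumed (not from the thread): file names, temp directory, sample lines.
set -euo pipefail

workdir=$(mktemp -d)
raw="$workdir/tsco_raw.csv"

# Two sample lines in the layout Date,Open,High,Low,Close,Volume
cat > "$raw" <<'EOF'
Date,Open,High,Low,Close,Volume
1-Apr-08,380.00,406.75,379.25,405.25,49664486
1-Apr-09,331.10,334.60,326.50,333.30,24877341
EOF

ticker="TSCO"
stock="TESCO PLC"

# One row for the whole stock: key = ticker, name and ticker stored once.
printf '%s,%s,%s\n' "$ticker" "$stock" "$ticker" > "$workdir/tsco_info.csv"

# One row per trading day: key = ticker-date, no stock name or ticker columns.
# tail skips the header; IFS is set only for the read, not exported globally.
tail -n +2 "$raw" | while IFS=, read -r d o h l c v; do
  printf '%s-%s,%s,%s,%s,%s,%s,%s\n' "$ticker" "$d" "$d" "$o" "$h" "$l" "$c" "$v"
done > "$workdir/tsco_daily.csv"

cat "$workdir/tsco_info.csv" "$workdir/tsco_daily.csv"
```

The two files would then presumably be loaded with two separate ImportTsv
runs, each with its own -Dimporttsv.columns list: HBASE_ROW_KEY plus the
stock name and ticker for the info file, and HBASE_ROW_KEY plus the six daily
columns for the daily file. I have not tested that part.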

regards

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 October 2016 at 01:53, Jean-Marc Spaggiari <jean-marc@spaggiari.org>
wrote:

> Hi Mich,
>
> that's better already, but now you have to think about the read pattern.
> How do you want to read this data? Are you going to read just one column at
> a time? Like reading stock_daily:high without reading stock_daily:close? If
> so, fine, keep it that way. But if you mostly read all of them together,
> then why not just keep them together instead of separating them into
> different columns? That way you save the key overhead storage for each new
> column...
>
> Also, I suspect you will have one row per stock per day, right? Does it
> mean you will repeat the stock_info information again and again and again?
> If so, why not just store it once for the row "TSCO" and not repeat
> it for "TSCO-DATE"? That way you store it just once, you have an easy way to
> retrieve it, and you can save one column family.
>
> HTH,
>
> JMS
>
> 2016-10-03 11:16 GMT-04:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>
> > Hi Jean-Marc
> >
> > I decided to create a composite key *ticker-date* from the csv file
> >
> > I just did some manipulation on CSV file
> >
> > export IFS=","; sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f;
> > do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp;
> > mv -f temp tsco.csv
> >
> > This takes the csv file, tells the shell that the field separator is
> > IFS=",", drops the header, reads every field in every line (a, b, c ...),
> > creates the composite key TSCO-$a, and adds the stock name and ticker to
> > the csv file. The whole process can be automated and parameterised.
> >
> > Once the csv file is put into HDFS, I run the following command:
> >
> > $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
> >   -Dimporttsv.separator=',' \
> >   -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
> >   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> >
> > The Hbase table is created as below
> >
> > create 'tsco','stock_info','stock_daily'
> >
> > and this is the data (2 rows, each with 2 column families and 8 columns)
> >
> > hbase(main):132:0> scan 'tsco', LIMIT => 2
> > ROW                  COLUMN+CELL
> >  TSCO-1-Apr-08       column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
> >  TSCO-1-Apr-08       column=stock_daily:close, timestamp=1475507091676, value=405.25
> >  TSCO-1-Apr-08       column=stock_daily:high, timestamp=1475507091676, value=406.75
> >  TSCO-1-Apr-08       column=stock_daily:low, timestamp=1475507091676, value=379.25
> >  TSCO-1-Apr-08       column=stock_daily:open, timestamp=1475507091676, value=380.00
> >  TSCO-1-Apr-08       column=stock_daily:volume, timestamp=1475507091676, value=49664486
> >  TSCO-1-Apr-08       column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
> >  TSCO-1-Apr-08       column=stock_info:ticker, timestamp=1475507091676, value=TSCO
> >
> >  TSCO-1-Apr-09       column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
> >  TSCO-1-Apr-09       column=stock_daily:close, timestamp=1475507091676, value=333.30
> >  TSCO-1-Apr-09       column=stock_daily:high, timestamp=1475507091676, value=334.60
> >  TSCO-1-Apr-09       column=stock_daily:low, timestamp=1475507091676, value=326.50
> >  TSCO-1-Apr-09       column=stock_daily:open, timestamp=1475507091676, value=331.10
> >  TSCO-1-Apr-09       column=stock_daily:volume, timestamp=1475507091676, value=24877341
> >  TSCO-1-Apr-09       column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
> >  TSCO-1-Apr-09       column=stock_info:ticker, timestamp=1475507091676, value=TSCO
> >
> >
> > What do you think?
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> > On 3 October 2016 at 15:10, Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> > wrote:
> >
> > > Hi Mich,
> > >
> > > As you said, it's most probably because it's all the same key... If you
> > > want to be 200% sure, just alter VERSIONS => '1' to be greater (like, 10)
> > > and scan all the versions of the cells. You should see the others.
> > >
> > > JMS
> > >
> > > 2016-10-03 3:41 GMT-04:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
> > >
> > > > Hi,
> > > >
> > > > when I use the command line utility ImportTsv to load a file into HBase
> > > > with the following table format
> > > >
> > > > describe 'marketDataHbase'
> > > > Table marketDataHbase is ENABLED
> > > > marketDataHbase
> > > > COLUMN FAMILIES DESCRIPTION
> > > > {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY =>
> > > > 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL
> > > > => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE =>
> > > > 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
> > > > 1 row(s) in 0.0930 seconds
> > > >
> > > >
> > > > hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
> > > >   -Dimporttsv.separator=',' \
> > > >   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
> > > >   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> > > >
> > > > There are 1200 rows in the csv file, *but it only loads the first row!*
> > > >
> > > > scan 'tsco'
> > > > ROW                  COLUMN+CELL
> > > >  Tesco PLC           column=stock_daily:close, timestamp=1475447365118, value=325.25
> > > >  Tesco PLC           column=stock_daily:high, timestamp=1475447365118, value=332.00
> > > >  Tesco PLC           column=stock_daily:low, timestamp=1475447365118, value=324.00
> > > >  Tesco PLC           column=stock_daily:open, timestamp=1475447365118, value=331.75
> > > >  Tesco PLC           column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
> > > >  Tesco PLC           column=stock_daily:tradedate, timestamp=1475447365118, value=3-Jan-06
> > > >  Tesco PLC           column=stock_daily:volume, timestamp=1475447365118, value=46935045
> > > > 1 row(s) in 0.0390 seconds
> > > >
> > > > Is this because the hbase_row_key --> Tesco PLC is the same for all?
> > > > I thought that the row key can be anything.
> > > >
> > > > Thanks
> > > >
> > > > Dr Mich Talebzadeh
> > > >
> > >
> >
>
