Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
From: Mich Talebzadeh
Date: Tue, 4 Oct 2016 08:23:54 +0100
Subject: Re: Loading into hbase from csv file issue
To: user@hbase.apache.org
Content-Type: multipart/alternative;
boundary=001a11459402769493053e04f117
archived-at: Tue, 04 Oct 2016 07:24:02 -0000

--001a11459402769493053e04f117
Content-Type: text/plain; charset=UTF-8

Thanks again.

If I wanted to store the stock details once for a row "TSCO" and not bother
repeating them in the rest of the rows, how would that work for the row key?

Currently this is the way table tsco is defined:

create 'tsco','stock_daily'

and these are the attributes of the stock_daily column family:

hbase(main):144:0* scan 'tsco', LIMIT => 1
ROW              COLUMN+CELL
 TSCO-1-Apr-08   column=stock_daily:Date, timestamp=1475525222488, value=1-Apr-08
 TSCO-1-Apr-08   column=stock_daily:close, timestamp=1475525222488, value=405.25
 TSCO-1-Apr-08   column=stock_daily:high, timestamp=1475525222488, value=406.75
 TSCO-1-Apr-08   column=stock_daily:low, timestamp=1475525222488, value=379.25
 TSCO-1-Apr-08   column=stock_daily:open, timestamp=1475525222488, value=380.00
 TSCO-1-Apr-08   column=stock_daily:stock, timestamp=1475525222488, value=TESCO PLC
 TSCO-1-Apr-08   column=stock_daily:ticker, timestamp=1475525222488, value=TSCO
 TSCO-1-Apr-08   column=stock_daily:volume, timestamp=1475525222488, value=49664486

Note that column=stock_daily:stock and column=stock_daily:ticker are repeated
in every row. That may not be efficient?

Kindly suggest the best way of creating the row key, and whether it is
necessary to store the above columns at all.

regards

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On 4 October 2016 at 01:53, Jean-Marc Spaggiari wrote:

> Hi Mich,
>
> that's better already, but now you have to think about the read pattern.
> How do you want to read this data?
> Are you going to read just one column at
> a time? Like reading stock_daily:high without reading stock_daily:close? If
> so, fine, keep it that way. But if you mostly read all of them together,
> then why not just keep them together instead of separating them into
> different columns? That way you save the key storage overhead for each new
> column...
>
> Also, I suspect you will have one row per stock per day, right? Does it
> mean you will repeat the stock_info information again and again and again?
> If so, why not also store it just once, in the row "TSCO", and not repeat
> it for "TSCO-DATE"? That way you store it just once, you have an easy way to
> retrieve it, and you can save one column family.
>
> HTH,
>
> JMS
>
> 2016-10-03 11:16 GMT-04:00 Mich Talebzadeh:
>
> > Hi Jean-Marc
> >
> > I decided to create a composite key *ticker-date* from the csv file
> >
> > I just did some manipulation on the CSV file
> >
> > export IFS=","; sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f;
> > do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp;
> > mv -f temp tsco.csv
> >
> > Which basically takes the csv file, tells the shell that the field separator
> > is IFS=",", drops the header, reads every field in every line (a, b, c ...),
> > creates the composite key TSCO-$a, and adds the stock name and ticker to the
> > csv file. The whole process can be automated and parameterised.
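The CSV manipulation described above could also be sketched with awk, which avoids changing IFS for the whole shell and leaves the input file untouched. This is only a sketch under stated assumptions: the helper name add_row_key, its two arguments and the header row in the sample are hypothetical; the six data columns (date, open, high, low, close, volume) and the sample values follow the thread.

```shell
#!/bin/sh
# Sketch: prefix each CSV data line with a composite row key "<ticker>-<date>",
# followed by the stock name and the ticker.
# add_row_key is a hypothetical helper, not part of HBase or ImportTsv.
# Usage: add_row_key TICKER "STOCK NAME" < input.csv > output.csv
add_row_key() {
  awk -F',' -v OFS=',' -v ticker="$1" -v name="$2" '
    NR > 1 {                                   # skip the header line
      print ticker "-" $1, name, ticker, $0    # key, name, ticker, original fields
    }'
}

# Sample input: the header names are assumed, the values are taken from
# the scan output quoted in this thread.
add_row_key TSCO "TESCO PLC" <<'EOF'
Date,Open,High,Low,Close,Volume
1-Apr-08,380.00,406.75,379.25,405.25,49664486
1-Apr-09,331.10,334.60,326.50,333.30,24877341
EOF
```

With real data it might be run as add_row_key TSCO "TESCO PLC" < tsco.csv > tsco_keyed.csv before the result is copied into HDFS; tsco_keyed.csv is only an illustrative file name. Taking the ticker and name as arguments is one way to parameterise the step for other stocks.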
> >
> > Once the csv file is put into HDFS, I run the following command
> >
> > $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> > -Dimporttsv.separator=','
> > -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
> > tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> >
> > The Hbase table is created as below
> >
> > create 'tsco','stock_info','stock_daily'
> >
> > and this is the data (2 rows, each with 2 column families and 8 attributes)
> >
> > hbase(main):132:0> scan 'tsco', LIMIT => 2
> > ROW              COLUMN+CELL
> >  TSCO-1-Apr-08   column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
> >  TSCO-1-Apr-08   column=stock_daily:close, timestamp=1475507091676, value=405.25
> >  TSCO-1-Apr-08   column=stock_daily:high, timestamp=1475507091676, value=406.75
> >  TSCO-1-Apr-08   column=stock_daily:low, timestamp=1475507091676, value=379.25
> >  TSCO-1-Apr-08   column=stock_daily:open, timestamp=1475507091676, value=380.00
> >  TSCO-1-Apr-08   column=stock_daily:volume, timestamp=1475507091676, value=49664486
> >  TSCO-1-Apr-08   column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
> >  TSCO-1-Apr-08   column=stock_info:ticker, timestamp=1475507091676, value=TSCO
> >
> >  TSCO-1-Apr-09   column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
> >  TSCO-1-Apr-09   column=stock_daily:close, timestamp=1475507091676, value=333.30
> >  TSCO-1-Apr-09   column=stock_daily:high, timestamp=1475507091676, value=334.60
> >  TSCO-1-Apr-09   column=stock_daily:low, timestamp=1475507091676, value=326.50
> >  TSCO-1-Apr-09   column=stock_daily:open, timestamp=1475507091676, value=331.10
> >  TSCO-1-Apr-09   column=stock_daily:volume, timestamp=1475507091676, value=24877341
> >  TSCO-1-Apr-09   column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
> >  TSCO-1-Apr-09   column=stock_info:ticker,
> > timestamp=1475507091676, value=TSCO
> >
> > What do you think?
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> > On 3 October 2016 at 15:10, Jean-Marc Spaggiari wrote:
> >
> > > Hi Mich,
> > >
> > > As you said, it's most probably because it's all the same key... If you
> > > want to be 200% sure, just alter VERSIONS => '1' to be greater (like 10)
> > > and scan all the versions of the cells. You should see the others.
> > >
> > > JMS
> > >
> > > 2016-10-03 3:41 GMT-04:00 Mich Talebzadeh:
> > >
> > > > Hi,
> > > >
> > > > when I use the command line utility ImportTsv to load a file into Hbase
> > > > with the following table format
> > > >
> > > > describe 'marketDataHbase'
> > > > Table marketDataHbase is ENABLED
> > > > marketDataHbase
> > > > COLUMN FAMILIES DESCRIPTION
> > > > {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY =>
> > > > 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL
> > > > => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE =>
> > > > 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
> > > > 1 row(s) in 0.0930 seconds
> > > >
> > > > hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> > > > -Dimporttsv.separator=','
> > > > -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
> > > > tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> > > >
> > > > There are 1200 rows in the csv file, *but it only loads the first row!*
> > > >
> > > > scan 'tsco'
> > > > ROW          COLUMN+CELL
> > > >  Tesco PLC   column=stock_daily:close, timestamp=1475447365118, value=325.25
> > > >  Tesco PLC   column=stock_daily:high, timestamp=1475447365118, value=332.00
> > > >  Tesco PLC   column=stock_daily:low, timestamp=1475447365118, value=324.00
> > > >  Tesco PLC   column=stock_daily:open, timestamp=1475447365118, value=331.75
> > > >  Tesco PLC   column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
> > > >  Tesco PLC   column=stock_daily:tradedate, timestamp=1475447365118, value=3-Jan-06
> > > >  Tesco PLC   column=stock_daily:volume, timestamp=1475447365118, value=46935045
> > > > 1 row(s) in 0.0390 seconds
> > > >
> > > > Is this because the hbase_row_key --> Tesco PLC is the same for all? I
> > > > thought that the row key can be anything.
> > > >
> > > > Thanks
> > > >
> > > > Dr Mich Talebzadeh

--001a11459402769493053e04f117--