hbase-user mailing list archives

From Schubert Zhang <zson...@gmail.com>
Subject Re: Indexed Table in Hbase
Date Tue, 18 Aug 2009 16:41:52 GMT
The two approaches from Gary.H and Travis.H both work.
But I think Travis.H's (columns) approach carries a risk when there are many
row keys for a single column value: the total size of one index table row may
then grow larger than the region size. I don't think it is a general-purpose
approach; you need to be very clear about your application before using it.
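
To make the difference between the two layouts concrete, here is a minimal
sketch in plain Java collections. It does not use the HBase API; the class
and method names (IndexLayouts, compositeKeyRows, wideRow, wideRowBytes) are
hypothetical and only for illustration. It assumes the composite key is the
indexed value followed by the base row key, as in Gary's example, and that
the columns approach stores each base row key as one column qualifier.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: models the two index layouts in memory.
class IndexLayouts {

    // Composite-key layout: one index row per base row,
    // index row key = indexed value + base row key.
    static List<String> compositeKeyRows(String value, List<String> baseRowKeys) {
        List<String> indexRowKeys = new ArrayList<>();
        for (String rowKey : baseRowKeys) {
            indexRowKeys.add(value + rowKey);
        }
        return indexRowKeys;
    }

    // Columns layout: one index row per indexed value,
    // every base row key becomes a column in that single row.
    static Map<String, List<String>> wideRow(String value, List<String> baseRowKeys) {
        Map<String, List<String>> index = new TreeMap<>();
        index.put(value, new ArrayList<>(baseRowKeys));
        return index;
    }

    // Rough lower bound on the bytes one wide index row must hold:
    // the sum of the stored base row key lengths. Real storage adds
    // per-cell overhead, so the actual row is larger than this.
    static long wideRowBytes(List<String> baseRowKeys) {
        long total = 0;
        for (String rowKey : baseRowKeys) {
            total += rowKey.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("row1", "row2");
        System.out.println(compositeKeyRows("one", keys)); // [onerow1, onerow2]
        System.out.println(wideRow("one", keys));          // {one=[row1, row2]}
        System.out.println(wideRowBytes(keys));            // 8
    }
}
```

With the composite-key layout the entries for one value are spread over many
small rows, so they can span regions. With the columns layout all of them sit
in one row, and the byte count above grows with every base row that shares the
value.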

In one of our implementations, we also use timestamps to store multiple
row keys in the index table, just as Bharath says. But this has risks too:
(1) If two rows with the same index column value are inserted at the same time,
their timestamps may be identical, and the later index entry will then
overwrite the earlier one. (2) The same row-size risk as Travis.H's (columns)
approach.
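
Risk (1) can be shown with a small model. This is only a sketch under stated
assumptions, not HBase code: it treats the versions of one index cell as a
map from timestamp to stored base row key, and the names (TimestampCollision,
storeAsVersions) are hypothetical.

```java
import java.util.TreeMap;

// Hypothetical illustration only: one index cell whose versions are
// addressed by timestamp, modeled as a sorted map.
class TimestampCollision {

    // Stores each base row key as a version of the same cell.
    // A later put with an already used timestamp replaces the earlier value.
    static TreeMap<Long, String> storeAsVersions(long[] timestamps, String[] baseRowKeys) {
        TreeMap<Long, String> versions = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            versions.put(timestamps[i], baseRowKeys[i]);
        }
        return versions;
    }

    public static void main(String[] args) {
        long ts = 1250613712000L;

        // Two rows indexed in the same millisecond: only one entry survives.
        TreeMap<Long, String> collided =
            storeAsVersions(new long[] {ts, ts}, new String[] {"row1", "row2"});
        System.out.println(collided.size());  // 1
        System.out.println(collided.get(ts)); // row2

        // Distinct timestamps keep both entries.
        TreeMap<Long, String> distinct =
            storeAsVersions(new long[] {ts, ts + 1}, new String[] {"row1", "row2"});
        System.out.println(distinct.size());  // 2
    }
}
```

Appending the base row key to the index row key avoids this, because the two
entries then no longer share a row and column.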

Schubert

On Tue, Aug 18, 2009 at 6:39 PM, bharath vissapragada <
bharathvissapragada1990@gmail.com> wrote:

> Thanks Gary .. for explaining .. I got it ...
>
> On Tue, Aug 18, 2009 at 12:02 AM, Gary Helmling <ghelmling@gmail.com>
> wrote:
>
> > Hi Bharath,
> >
> > If you're using the default key generator
> > (org.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator),
> > it actually appends the base table row key for you.  So even though
> > the column value may be the same for multiple rows, the secondary
> > index table will still have 1 row for each row with the value in the
> > original table.  Here is relevant method from SimpleIndexKeyGenerator:
> >
> >  public byte[] createIndexKey(byte[] rowKey, Map<byte[], byte[]> columns) {
> >    return Bytes.add(columns.get(column), rowKey);
> >  }
> >
> > So, say you have a table "mytable", with the columns:
> >    info:keycol       (say this is the one you want to index)
> >    info:col2
> >    info:col3
> >
> > If you define your table with the index specification -- new
> > IndexSpecification("keycol", Bytes.toBytes("info:keycol")) -- then
> > HBase will create the secondary index table named "mytable-by_keycol".
> >
> > Then, say you add the following rows to "mytable":
> >
> > "row1":  info:keycol="one", info:col2="abc", info:col3="def"
> > "row2":  info:keycol="one", info:col2="ghi", info:col3="jkl"
> >
> > At this point, your index table ("mytable-by_keycol") will have the
> > following rows:
> >
> > "onerow1": info:keycol="one", __INDEX__:ROW="row1"
> > "onerow2": info:keycol="one", __INDEX__:ROW="row2"
> >
> > So you wind up with 2 rows in the index table (with unique row keys)
> > pointing back at the original table rows, even though we've only
> > stored a single distinct value for info:keycol.
> >
> > To access the rows by the secondary index, create a scanner using
> > IndexedTable.getIndexedScanner(...).  I don't think there's support
> > for using the indexes when performing a random read with
> > HTable.getRow()/HTable.get().  (But maybe I'm wrong?)
> >
> > As Travis mentions, you could always use an alternate approach to
> > implement your own indexing (use the index value as the row key for
> > your own index table and store the original table row keys as
> > individual columns).  I'm using the same approach for one access
> > pattern and so far it seems to work very well.
> >
> > But as far as I know the built in secondary indexing assumes 1
> > secondary index table row -> 1 original table row.
> >
> > Sorry if this got a bit long-winded.  It gets a little complicated to
> > explain in text...
> >
> > --gh
> >
> >
> > On Mon, Aug 17, 2009 at 1:46 PM, bharath
> > vissapragada<bharathvissapragada1990@gmail.com> wrote:
> > > Thanks for ur explanation Gary ,
> > >
> > > > Consider my case where i can have repetitions of values .. So u say that i
> > > > edit the IndexKeyGenerator in such a way that instead of storing
> > > > (column->rowkey) i should do in such a way that (column-> rowkey1,rowkey2)
> > > > as diff timestamps ... if yes is that a good way ?
> > >
> > > > On Mon, Aug 17, 2009 at 10:53 PM, Gary Helmling <ghelmling@gmail.com> wrote:
> > >
> > >> When defining the IndexSpecification for your table, you can pass your
> > >> own implementation of
> > >> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
> > >>
> > >> This allows you to control how the row keys are generated for the
> > >> secondary index table.  For example, you could append the original
> > >> table's row key to the indexed value to ensure uniqueness in
> > >> referencing the original rows.
> > >>
> > >> When you create an indexed scanner, the secondary index code opens and
> > >> wraps a scanner on the secondary index table, based on the start row
> > >> you specify (the indexed value you're looking up).  It applies any
> > >> filter passed to rows on the secondary index table, so make sure
> > >> anything you want to filter on is listed in the "indexed columns" in
> > >> your IndexSpecification.
> > >>
> > >> For any rows returned by the wrapped scanner, the client code then
> > >> does a get for the original table record (the original row key is
> > >> stored in the "__INDEX__" column family I think).
> > >>
> > >> So in total, when using secondary indexes, you wind up with 1 scan + N
> > >> gets to look at N rows.
> > >>
> > >> At least, this was my understanding of how things worked as of 0.19.
> > >> I'm actually moving indexing into my app layer as I update to 0.20.
> > >>
> > >> Hope this helps.
> > >>
> > >> --gh
> > >>
> > >>
> > > >> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray<jlist@streamy.com> wrote:
> > >> > I'm actually unsure about that.  Look at the code or experiment.
> > >> >
> > > >> > Seems to me that there would be a uniqueness requirement, otherwise what do
> > > >> > you expect the behavior to be?  A get can only return a single row, so
> > > >> > multiple index hits doesn't really make sense.
> > >> >
> > >> > Clint?  You out there? :)
> > >> >
> > >> > JG
> > >> >
> > >> > bharath vissapragada wrote:
> > >> >>
> > > >> >> I got it ... I think this is definitely useful in my app because I am
> > > >> >> performing a full table scan every time for selecting the rowkeys based on
> > > >> >> some column values .
> > > >> >>
> > > >> >> BUT ..
> > > >> >>
> > > >> >> we can have more than one rowkey for the same column value . Can you please
> > > >> >> tell me how they are stored .
> > >> >>
> > >> >> Thanks in advance
> > >> >>
> > > >> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray <jlist@streamy.com> wrote:
> > >> >>
> > > >> >>> It's not an actual hash or btree index, but rather secondary indexes in
> > > >> >>> HBase are implemented by creating an additional HBase table.
> > > >> >>>
> > > >> >>> If I have a table "users" (row key is userid) with family "data" and column
> > > >> >>> "email", and I want to index the value in that column...
> > > >> >>>
> > > >> >>> I can create a table "users_email" where the row key is the email address
> > > >> >>> (value from the column in "users" table) and a single column that contains
> > > >> >>> the userid.
> > > >> >>>
> > > >> >>> Doing an "index lookup" would mean doing a get on "users_email" and then
> > > >> >>> using that userid to do a lookup on the "users" table.
> > > >> >>>
> > > >> >>> IndexedTable does this transparently, but still does require two queries.
> > > >> >>> So it's slower than a single query, but certainly faster than a full table
> > > >> >>> scan.
> > > >> >>>
> > > >> >>> If you need hash-level performance on the index lookup, there are lots of
> > > >> >>> solutions outside of HBase that would work... In-memory Java HashMap, Tokyo
> > > >> >>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text indexing,
> > > >> >>> you can use Lucene or the like.
> > >> >>>
> > >> >>> Make sense?
> > >> >>>
> > >> >>> JG
> > >> >>>
> > >> >>>
> > >> >>> bharath vissapragada wrote:
> > >> >>>
> > > >> >>>> But i have read somewhere that Secondary indexes are somewhat slow compared
> > > >> >>>> to normal Hbase tables ..Does that affect the performance ?
> > > >> >>>>
> > > >> >>>> Also do you know the type of index created on the column(i mean Hash type or
> > > >> >>>> Btree etc)
> > > >> >>>>
> > > >> >>>> On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov <e2k_1@yahoo.com> wrote:
> > >> >>>>
> > >> >>>>  Hi!
> > >> >>>>>
> > > >> >>>>> As far as I understand you are talking about the secondary indexes. Yes,
> > > >> >>>>> they can be used to quickly get the rowkey by a value in the indexed
> > > >> >>>>> column.
> > >> >>>>>
> > >> >>>>> --Kirill
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> bharath vissapragada wrote:
> > >> >>>>>
> > >> >>>>>  Hi all ,
> > >> >>>>>>
> > > >> >>>>>> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API .. I
> > > >> >>>>>> have seen some methods used to create an Indexed Table (on some column).. I
> > > >> >>>>>> have some doubts regarding the same ...
> > > >> >>>>>>
> > > >> >>>>>> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can easily
> > > >> >>>>>> lookup a column value and find its corresponding rowkey(s)
> > > >> >>>>>> 2) Can i find any performance gain when i use IndexedTable to search for a
> > > >> >>>>>> particular column value .. instead of scanning an entire normal HTable ..
> > >> >>>>>>
> > >> >>>>>> Kindly clarify my doubts
> > >> >>>>>>
> > >> >>>>>> Thanks in advance
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>
> > >> >>
> > >> >
> > >>
> > >
> >
>
