hbase-user mailing list archives

From Bin Wang <binwang...@gmail.com>
Subject Able to search by all the columns and faster than impala
Date Wed, 11 May 2016 21:21:53 GMT
Hi there,

I have a use case where I have a table with a few billion rows and fewer
than 50 columns. It is a very popular data source internally, with huge
demand from people who want to query it. The queries are nothing more
complex than "select * from ... where .. and ..".

However, not everyone can write SQL, so we are thinking about building
this into a web application. We usually use Hive and Impala, but neither
of them is fast enough to be the backend of a web application. (Think
about clicking a button and getting the result back minutes later...
annoying!)

We tried dumping the data into SolrCloud, and it worked really well when
the number of returned results is small... but the performance gets really
bad once the row count reaches even a few thousand...

We also tried HBase, but it is really hard to design a row-key schema that
serves queries on arbitrary columns, and a full table scan takes too long
anyway.

Then I had an idea: write a MapReduce job to turn the data into a
melted <http://www.statmethods.net/management/reshape.html> (long) format,
group it by column_name and cell_value, and build the list of row ids
where each value appears, much like Lucene's inverted index. So we would
have two tables in HBase: the original data keyed by row id, and the
inverted index.
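To make the "melt + invert" step concrete, here is a minimal sketch in plain Python, using dicts to stand in for the two HBase tables. The sample rows, column names, and values are made up for illustration; the real job would be a MapReduce over the source table.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Melt each row into (column_name, cell_value) pairs and invert:
    the result maps (column_name, cell_value) -> set of row ids."""
    index = defaultdict(set)
    for row_id, columns in rows.items():
        for column_name, cell_value in columns.items():
            index[(column_name, cell_value)].add(row_id)
    return index

# Hypothetical sample data standing in for the main HBase table.
rows = {
    "r1": {"city": "NYC", "status": "active"},
    "r2": {"city": "NYC", "status": "closed"},
    "r3": {"city": "SF",  "status": "active"},
}

index = build_inverted_index(rows)
print(sorted(index[("city", "NYC")]))  # ['r1', 'r2']
```

In HBase terms, a natural layout would be to key the index table by something like `column_name + separator + cell_value` and store the row-id list (or one column qualifier per row id) in the value.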

So when a user issues a query with, say, three where conditions, it first
goes to the inverted index to look up the row ids matching each condition,
then intersects those three lists. That gives a list of the row ids that
meet all three criteria. Finally, we look up those row ids in the main
HBase table to retrieve the rows.
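The query path above can be sketched the same way, again with plain Python dicts standing in for the two HBase tables (the data and condition values are made up; in practice each posting-list fetch would be an HBase Get/Scan against the index table):

```python
from collections import defaultdict

# Hypothetical main table and its inverted index.
rows = {
    "r1": {"city": "NYC", "status": "active", "tier": "gold"},
    "r2": {"city": "NYC", "status": "active", "tier": "silver"},
    "r3": {"city": "NYC", "status": "closed", "tier": "gold"},
}
index = defaultdict(set)
for row_id, columns in rows.items():
    for column_name, cell_value in columns.items():
        index[(column_name, cell_value)].add(row_id)

def query(conditions):
    """conditions: list of (column_name, cell_value) pairs, ANDed together.
    Intersect the posting lists, then fetch the surviving rows."""
    postings = [index.get(cond, set()) for cond in conditions]
    matching_ids = set.intersection(*postings) if postings else set()
    return {rid: rows[rid] for rid in matching_ids}

result = query([("city", "NYC"), ("status", "active"), ("tier", "gold")])
print(sorted(result))  # ['r1']
```

One practical note: intersecting is cheapest if you start from the smallest posting list, so storing the list size alongside each index entry lets the query planner order the conditions by selectivity.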

Can anyone shed some light on this?

Best regards,

Bin
