hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject RE: Lucene from HBase - raw values in Lucene index or not?
Date Wed, 17 Dec 2008 16:46:05 GMT
Tim,

Your scenario is certainly a familiar one; distributing your MySQL
solution will be cumbersome at best, and performance will degrade quickly.

Since ranking is not so important, you really won't have any issues with
sharding your index in Solr, or in Katta and Lucene.  Distributed Solr
queries really just perform the query on each index and merge the
results; this should scale fairly well for you, it seems.  And yes, by
relevancy ranking I mean search ranking (the relevance of the results to
your original query).
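
To make that concrete, here is a rough sketch of what the merge step
amounts to.  ShardClient and Hit are hypothetical stand-ins for
whatever per-shard search call you end up with; Solr's distributed
handler or Katta's client does the equivalent internally:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  // Hypothetical per-shard search interface; Solr/Katta provide the
  // real equivalent.
  interface ShardClient {
      List<Hit> search(String query, int topN);
  }

  class Hit {
      final String rowKey;  // e.g. the HBase row key kept in the index
      final float score;
      Hit(String rowKey, float score) {
          this.rowKey = rowKey;
          this.score = score;
      }
  }

  class ShardMerger {
      // Ask every shard for its local top-N, then keep the global top-N.
      static List<Hit> searchAll(List<ShardClient> shards, String query,
                                 int topN) {
          List<Hit> all = new ArrayList<Hit>();
          for (ShardClient shard : shards) {
              all.addAll(shard.search(query, topN));  // scored locally
          }
          Collections.sort(all, new Comparator<Hit>() {
              public int compare(Hit a, Hit b) {
                  return Float.compare(b.score, a.score);  // descending
              }
          });
          return new ArrayList<Hit>(all.subList(0, Math.min(topN, all.size())));
      }
  }

Note the scores being merged are only comparable within each shard,
which is exactly the global-relevancy caveat from my earlier mail below.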

You are dealing with a wide range of potential result sizes (most, you
say, are small, but some could be 10M+ records), so you'll need either
to design a principled approach that gives you the best average case, or
to start thinking about handling the extremes differently.  It certainly
seems clear that HBase is a great persistent storage solution for you
regardless of how you handle querying.

Tokyo Cabinet is a very simple piece of software: C implementations of
disk-based key/value hash tables and B-trees.  You can get constant-time
access to a very large set of data with surprisingly efficient use of
disk space and memory (a small index is kept in memory, so fetches are
generally single-seek).
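
For reference, a minimal sketch with the Tokyo Cabinet Java bindings
(the tokyocabinet.HDB hash database; the file name and key are just
placeholders):

  import tokyocabinet.HDB;

  public class TokyoSketch {
      public static void main(String[] args) {
          HDB hdb = new HDB();
          // Open (creating if absent) a disk-based hash database file.
          if (!hdb.open("records.tch", HDB.OWRITER | HDB.OCREAT)) {
              System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
              return;
          }
          hdb.put("rowkey-123", "raw record bytes or serialized fields");
          // Single-seek fetch in the common case; null if absent.
          System.out.println(hdb.get("rowkey-123"));
          hdb.close();
      }
  }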

My suggestion would be to start (or continue) experimenting.  For one,
again, it seems clear that you should insert/persist your data in HBase
and query from there to build your indexes.  After that, you should look
at the tradeoffs between keeping the raw data in Lucene versus keeping
it external.  Rather than focusing on the time to fetch the records when
they are kept external (lots can be done to speed that up), let's see
how it affects the size of your index and the time to execute the
queries.  I still think the hardest piece to scale is the sharded index,
because performance can degrade quickly as it grows, but I'm unsure what
effect storing the raw data will have on index size and query time.
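
The switch in Lucene is per-field: Field.Store.YES bakes the raw value
into the index, Field.Store.NO keeps only the inverted index, and in
the external case you store just the HBase row key for the second hop.
A sketch against the Lucene 2.4-era API (the field names are made up):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class IndexingOptions {
      // Option (a): raw value stored in the index -- bigger index, but
      // no HBase hop at query time.
      static Document withRawValue(String name) {
          Document doc = new Document();
          doc.add(new Field("scientificName", name,
                  Field.Store.YES, Field.Index.ANALYZED));
          return doc;
      }

      // Option (b): index-only field, plus the stored HBase row key so
      // the raw record can be fetched in a second hop.
      static Document indexOnly(String name, String rowKey) {
          Document doc = new Document();
          doc.add(new Field("scientificName", name,
                  Field.Store.NO, Field.Index.ANALYZED));
          doc.add(new Field("rowKey", rowKey,
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
          return doc;
      }
  }

Comparing index size and query time between the two variants on a real
slice of your data should answer the question quickly.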

As far as getting involved goes, this is about the extent to which I'm
able to.  Keep sending your findings and questions to the list, and I or
others will be more than happy to respond.  If you need someone to
review your code, scope out your cluster, answer a quick question, etc.,
you might also jump into the IRC channel #hbase on freenode, where many
of us hang out.

Good luck with everything.  Looking forward to seeing your results.

JG


> -----Original Message-----
> From: tim robertson [mailto:timrobertson100@gmail.com]
> Sent: Wednesday, December 17, 2008 8:11 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
> 
> Hey Jonathan
> 
> Firstly, thank you for your considered response and offer of help...
> this is much appreciated.
> 
> History:
> So, we have a MySQL solution now, on a big fat server (64GB RAM), with
> 150M or so records in the biggest table.  It acts as an index of
> several thousand resources that are harvested from around the world
> (all biodiversity data, all of it open source / 'open access with
> citation').  Because it is an index, and because of the nature of the
> data (150M specimens tied to hierarchical trees - "taxonomy"), it does
> not partition nicely (geospatially, taxonomically or temporally), since
> every partition would need to be hit for one query or another.  As we
> have grown, we have had to limit the fields we offer as search fields.
> We expect and aim to grow to billions of records in the coming months.
> This is probably a familiar scenario, huh?
> I am working alone, in my 'spare time', to try to produce a prototype
> HDFS-based version as I described, and I am learning as fast as I can -
> I must admit that Hadoop, HBase and Lucene are all new to me as of
> recent months.  I blog things (not eloquently, as it is normally late
> on Sunday night) here: http://biodivertido.blogspot.com/
> 
> To specifically answer your questions:
>  - ranking is largely irrelevant for us, since all records are
> considered equal
>    - I am not entirely sure what "relevancy ranking" means, but I
> assume it is the search ranking?
> 
>  - regarding the percentage - well, this really depends:
>    - most searches will return small result sets - so an index makes
> sense
>    - also, I am looking to generate maps quickly, so I might need
> latitude and longitude in the indexes
>    - we need to offer slicing of the data
>      - we are an international network, and countries / thematic
> networks want data dumps for their own analysis
>      - the USA is therefore 50%+ of the data, but a specialist
> butterfly site is small.... so perhaps use an index to do a count and
> then pick a slicing strategy?
> 
> So you have now added Tokyo Cabinet to the long list of things I need
> to learn - but I will look into it...
> 
> My work so far is in my spare time, on EC2 and S3, and I am working
> alone - I am not sure how interested you are in getting involved?  I
> would greatly appreciate a knowledgeable pair of eyes to sanity-check
> the things I am doing (e.g. scanning HBase, parsing XML to raw fields,
> then scanning again and parsing to interpreted values, etc.), but I
> would welcome anyone who is interested in contributing.  It might take
> some set-up time, though, as I have all the pieces working in isolation
> and don't keep EC2 running all the time.
> 
> Thanks again - I'll continue to post progress.
> 
> Tim
> 
>
> On Wed, Dec 17, 2008 at 4:20 PM, Jonathan Gray <jlist@streamy.com> wrote:
> > Hey Tim,
> >
> > I have dabbled with sharding of a Solr index.  We applied a
> > consistent hashing algorithm to our IDs to determine which node to
> > insert to.
> >
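> > Roughly, the ring logic is along these lines (the hash choice here
> > is just a placeholder; a real setup would use a stronger hash and
> > virtual nodes for balance):
> >
> >   import java.util.SortedMap;
> >   import java.util.TreeMap;
> >
> >   // Minimal consistent-hash ring sketch.
> >   public class Ring {
> >       private final TreeMap<Integer, String> ring =
> >               new TreeMap<Integer, String>();
> >
> >       public void addNode(String node) {
> >           ring.put(node.hashCode(), node);
> >       }
> >
> >       // The shard for an ID is the first node clockwise from its
> >       // hash, wrapping around to the start of the ring.
> >       public String nodeFor(String id) {
> >           SortedMap<Integer, String> tail = ring.tailMap(id.hashCode());
> >           Integer key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
> >           return ring.get(key);
> >       }
> >   }
> >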
> > One downside - I'm not sure if this exists with Katta - is that you
> > don't have good relevancy across indexes.  For example, distributed
> > querying is really just querying each shard.  Unfortunately, the
> > relevancy ranking is only relevant within each individual index;
> > there is no global rank.  One idea is to shard based on another
> > parameter, which might allow you to apply relative-relevancy ;) given
> > any domain-specific information.
> >
> > I'm very interested in your problem.  Right now our indexes are
> > small enough that we can get by with 1 or 2 well-equipped nodes, but
> > soon enough we will outgrow that and be looking at sharding across
> > 5-10 nodes.  Our results are usually "page" size (10-20), so we don't
> > have the same issue with how to efficiently fetch them.
> >
> > In these cases where you might be looking for 10M records, what
> > percentage of the total dataset is that?  100M, 1B?  If you turn to
> > a full scan as your solution, you're going to seriously limit how
> > fast you can go, even with good caching and faster IO.  But if
> > you're returning a significant fraction of the total rows, then a
> > scan would definitely make sense.
> >
> > If your data is relatively static, you might look at a very simple
> > disk-based key/val cache like Berkeley DB or my favorite, Tokyo
> > Cabinet.  These can handle high numbers of records, stored on disk
> > but accessible in sub-ms time.  I have C and Java code to work with
> > Tokyo and HBase together.  With such a high number of records, it's
> > probably not feasible to keep them in memory, so a solution like
> > this could be your best bet.  Also, stay tuned to this issue, as
> > using Direct IO would create a situation similar to running a
> > disk-based key/val store (preliminary testing shows a 10X
> > random-read improvement):
> > https://issues.apache.org/jira/browse/HADOOP-4801
> >
> > And this is an old issue that will have new life soon:
> > https://issues.apache.org/jira/browse/HBASE-80
> >
> > Like I said, I have an interest in seeing how to solve this problem,
> > so let me know if you have any other questions or if we can help in
> > any way.
> >
> > Jonathan Gray
> >
> >> -----Original Message-----
> >> From: tim robertson [mailto:timrobertson100@gmail.com]
> >> Sent: Tuesday, December 16, 2008 11:42 PM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
> >>
> >> Hi,
> >>
> >> Thanks for the help.
> >>
> >> My Lucene indexes are for sure going to be too large for one
> >> machine, so I plan to put the indexes on HDFS and then let Katta
> >> distribute them around a few machines.  Because of Katta's ability
> >> to do this, I went for Lucene and not SOLR, which would require me
> >> to do all the sharding myself, if I understand distributed SOLR
> >> correctly - though I would much prefer SOLR's handling of
> >> primitives, as right now I convert all dates and ints manually.  If
> >> someone has distributed SOLR (where it really is too big for one
> >> machine, since the indexes are over 50G), I'd love to hear how they
> >> sharded it nicely and manage it.
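> >>
> >> (The manual conversion I do is along these lines, with Lucene's
> >> DateTools and NumberTools turning dates and longs into
> >> lexicographically sortable terms - a rough sketch, assuming a
> >> Lucene 2.x setup:)
> >>
> >>   import java.util.Date;
> >>   import org.apache.lucene.document.DateTools;
> >>   import org.apache.lucene.document.NumberTools;
> >>
> >>   public class KeyEncoding {
> >>       public static void main(String[] args) {
> >>           // Dates become sortable strings such as "20081217".
> >>           String day = DateTools.dateToString(new Date(),
> >>                   DateTools.Resolution.DAY);
> >>           // Longs become fixed-width strings that sort correctly
> >>           // as terms.
> >>           String count = NumberTools.longToString(150000000L);
> >>           System.out.println(day + " " + count);
> >>       }
> >>   }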
> >>
> >> Regarding performance... well, for "reports" that return 10M
> >> records, I will be quite happy with minutes as a response time, as
> >> this is typically data download for scientific analysis, and people
> >> are therefore happy to wait.  The results get put onto Amazon S3,
> >> GZipped, for download.  What worries me is that if I have 10-100
> >> reports running at one time, there will be an awful lot of
> >> single-record requests hitting HBase.  I guess I will try and blog
> >> the findings.
> >>
> >> I am following the HBase, Katta and Hadoop trunks, so I will also
> >> try to always use the latest code, as this is a research project
> >> and not production right now (production is still MySQL-based).
> >>
> >> The alternative, of course, is to always open a scanner and do a
> >> full table scan for each report...
> >>
> >> Thanks
> >>
> >> Tim
> >>
> >> On Wed, Dec 17, 2008 at 12:22 AM, Jonathan Gray <jlist@streamy.com> wrote:
> >> > If I understand your system (and Lucene) correctly, you obviously
> >> > must input all queried fields to Lucene.  And the indexes will be
> >> > stored for the documents.
> >> >
> >> > Your question is whether to also store the raw fields in Lucene,
> >> > or to store only the indexes in Lucene?
> >> >
> >> > A few things you might consider...
> >> >
> >> > - Scaling Lucene is much more difficult than scaling HBase.
> >> > Storing indexes and raw content is going to grow your Lucene
> >> > instance fast.  Scaling HBase is easy, and you're going to have
> >> > constant performance, whereas Lucene performance will degrade
> >> > significantly as it grows.
> >> >
> >> > - Random access to HBase currently leaves something to be
> >> > desired.  What kind of performance are you looking for with 1M
> >> > random fetches?  There is major work being done for 0.19 and 0.20
> >> > that will really help with performance, as stack mentioned.
> >> >
> >> > - With 1M random reads, you might never get the performance out
> >> > of HBase that you want, certainly not if you're expecting 1M
> >> > fetches to be done in "realtime" (~100ms or so).  However,
> >> > depending on your dataset and access patterns, you might be able
> >> > to get sufficient performance with caching (either the block
> >> > caching that is currently available, or the record caching slated
> >> > for 0.20, likely with a patch available sooner).
> >> >
> >> > We are using Lucene by way of Solr and are not storing the raw
> >> > data in Lucene.  We have an external Memcached-like cache so that
> >> > our raw content fetches are sufficiently quick.  My team is
> >> > currently working on building this cache into HBase.
> >> >
> >> > I'm not sure if the highlighting features are only part of Solr
> >> > or also in Lucene, but of course you lose the ability to do those
> >> > things if you don't put the raw content into Lucene.
> >> >
> >> > JG
> >> >
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: stack [mailto:stack@duboce.net]
> >> >> Sent: Tuesday, December 16, 2008 2:37 PM
> >> >> To: hbase-user@hadoop.apache.org
> >> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
> >> >>
> >> >> Interesting question.
> >> >>
> >> >> Would be grand if you didn't have to duplicate the hbase data in
> >> >> the lucene index - just store the hbase locations, or store only
> >> >> the small stuff in the lucene index and leave the big stuff back
> >> >> in hbase - but perhaps the double hop, lucene first and then
> >> >> hbase, will not perform well enough?  0.19.0 hbase will be better
> >> >> than 0.18.0, if you can wait a week or so for the release
> >> >> candidate to test.
> >> >>
> >> >> Let us know how it goes Tim,
> >> >> St.Ack
> >> >>
> >> >>
> >> >> tim robertson wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I have HBase running now and am building Lucene indexes on
> >> >> > Hadoop successfully; next I will get Katta running to
> >> >> > distribute my indexes.
> >> >> >
> >> >> > I have around 15 search fields indexed that I wish to extract
> >> >> > and return to the user in the result set - my result sets will
> >> >> > be up to millions of records...
> >> >> >
> >> >> > Should I:
> >> >> >
> >> >> >   a) store the values in the Lucene index, which will make it
> >> >> > slower to search but returns the results immediately, in pages,
> >> >> > without hitting HBase
> >> >> >
> >> >> > or
> >> >> >
> >> >> >   b) not store the data in the index, but page over the Lucene
> >> >> > index and do millions of "get by ROWKEY" calls on HBase
> >> >> >
> >> >> > Obviously this is not happening synchronously while the user
> >> >> > waits, but I look forward to hearing whether people have done
> >> >> > similar scenarios and what worked out nicely...
> >> >> >
> >> >> > Lucene degrades in performance at large page numbers (the
> >> >> > 100th page of 1000 results), right?
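> >> >> >
> >> >> > (My understanding of why: fetching page P of size S means
> >> >> > collecting the top P*S hits first.  A rough sketch against the
> >> >> > Lucene 2.4-era API:)
> >> >> >
> >> >> >   import java.io.IOException;
> >> >> >   import org.apache.lucene.document.Document;
> >> >> >   import org.apache.lucene.search.IndexSearcher;
> >> >> >   import org.apache.lucene.search.Query;
> >> >> >   import org.apache.lucene.search.TopDocs;
> >> >> >
> >> >> >   public class Paging {
> >> >> >       // Page (1-based) of the given size: the searcher must
> >> >> >       // collect page*size hits, so deep pages cost more.
> >> >> >       static Document[] fetchPage(IndexSearcher searcher,
> >> >> >               Query q, int page, int size) throws IOException {
> >> >> >           TopDocs top = searcher.search(q, null, page * size);
> >> >> >           int from = (page - 1) * size;
> >> >> >           int to = Math.min(page * size, top.scoreDocs.length);
> >> >> >           Document[] docs = new Document[Math.max(0, to - from)];
> >> >> >           for (int i = from; i < to; i++) {
> >> >> >               docs[i - from] = searcher.doc(top.scoreDocs[i].doc);
> >> >> >           }
> >> >> >           return docs;
> >> >> >       }
> >> >> >   }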
> >> >> >
> >> >> > Thanks for any insights,
> >> >> >
> >> >> > Tim
> >> >> >
> >> >
> >> >
> >
> >

