hbase-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Lucene from HBase - raw values in Lucene index or not?
Date Wed, 17 Dec 2008 16:53:04 GMT
Thanks Jonathan,

Your feedback really is appreciated - I will just keep pestering the
list and try and get to more specific problems as I come across them!

Cheers,

Tim


On Wed, Dec 17, 2008 at 5:46 PM, Jonathan Gray <jlist@streamy.com> wrote:
> Tim,
>
> Your scenario is certainly a familiar one, distribution of your mysql
> solution will be cumbersome at best and performance will degrade quickly.
>
> Since ranking is not so important, you really won't have any issues
> with sharding your index in Solr, or Katta and Lucene.  Distributed Solr
> queries are really just performing the query on each index and merging the
> results; this should scale fairly well for you, it seems.  And yes, by
> relevancy ranking I mean search ranking (relevancy of the results to your
> original query).
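The scatter-gather pattern described above (query every shard, merge the per-shard results) can be sketched roughly as below. This is an illustrative Python toy, not any actual Solr or Katta API; note it merges by raw per-shard score, which, as discussed later in the thread, gives no global relevancy guarantee:

```python
import heapq

def query_shard(shard, query):
    """Stand-in for a per-shard search call; each shard returns
    (score, doc_id) pairs already sorted by descending score."""
    return sorted(
        ((score, doc_id) for doc_id, score in shard.items()
         if query in doc_id),
        reverse=True,
    )

def distributed_query(shards, query, top_n=3):
    """Scatter the query to every shard, then merge the per-shard
    hit lists into one global top-N list."""
    per_shard_hits = [query_shard(s, query) for s in shards]
    merged = heapq.merge(*per_shard_hits, reverse=True)
    return [doc_id for score, doc_id in list(merged)[:top_n]]

# Two toy shards: doc id -> score.
shard_a = {"aves:1": 0.9, "aves:2": 0.4}
shard_b = {"aves:3": 0.7, "insecta:1": 0.8}
print(distributed_query([shard_a, shard_b], "aves"))
```

Because each shard's list is already sorted, the merge step is a cheap k-way merge rather than a re-sort of everything.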
>
> With the wide range of potential result sizes you are dealing with (most,
> you say, are small, but some could be 10M+), you'll need to either design a
> principled approach that gives you the best average case, or start
> thinking about handling the large ones differently.  It certainly does seem
> clear that HBase is a great persistent storage solution for you regardless
> of how you handle querying.
>
> Tokyo Cabinet is a very simple piece of software: C implementations of
> disk-based key/val hash tables and B-trees.  You can get constant-time
> access to a very large set of data with surprisingly efficient use of disk
> space and memory (a small index is kept in memory, so fetches are generally
> single-seek).
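To get a feel for the disk-based key/val pattern Tokyo Cabinet implements, here is a sketch using Python's stdlib `dbm` module as a stand-in (the real thing is C with language bindings; the file path and record contents here are made up):

```python
import dbm
import os
import tempfile

# Open (create) a disk-backed key/val store; contents persist on disk.
path = os.path.join(tempfile.mkdtemp(), "records")
with dbm.open(path, "c") as db:
    # Store raw record bytes keyed by row key, as one might mirror
    # HBase rows for fast single-seek fetches.
    db[b"row:0001"] = b"<specimen>Aves ...</specimen>"
    db[b"row:0002"] = b"<specimen>Insecta ...</specimen>"

# Reopen and fetch: a point lookup, no full scan of the data.
with dbm.open(path, "r") as db:
    print(db[b"row:0001"])
```

The small in-memory index described above is what lets stores like this resolve a key to a disk offset without scanning, so a fetch is typically a single seek.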
>
> My suggestion to you would be to start (or continue) experimenting.  For
> one, again, it seems clear that you should insert/persist your data in HBase
> and query from there to build your indexes.  After that, you should look at
> the tradeoffs between keeping the raw data in Lucene versus keeping it
> external.  Rather than focusing on the time to fetch the records when
> keeping external (lots can be done to speed that up), let's see how it
> affects the size of your index and time to execute the queries.  I still
> think the hardest piece to scale is the sharded index because performance
> can degrade quickly as it grows, but I'm unsure what effect storing the raw
> data will have on index size and query time.
>
> As far as getting involved, this is about the extent to which I'm able to.
> Keep sending your findings and questions to the list and I or others will be
> more than happy to respond.  If you need someone to review your code, scope
> out your cluster, answer a quick question, etc., you might also jump into
> the IRC channel #hbase on freenode where many of us hang out.
>
> Good luck with everything.  Looking forward to seeing your results.
>
> JG
>
>
>
>> -----Original Message-----
>> From: tim robertson [mailto:timrobertson100@gmail.com]
>> Sent: Wednesday, December 17, 2008 8:11 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>>
>> Hey Jonathan
>>
>> Firstly, thank you for your considered response and offer for help...
>> this is much appreciated.
>>
>> History:
>> So, we have a mysql solution now, and a big fat server (64G ram), with
>> 150M or so records in the biggest table.  It acts as an index of
>> several thousand resources that are harvested from around the world
>> (all biodiversity data, and all of this is open source / 'open access
>> with citation').  Because it is an index, and because of the nature of
>> the data (150M specimens, but tied to hierarchical trees - "taxonomy"),
>> it does not partition nicely (geospatially, taxonomically, temporally)
>> since all partitions would need to be hit for one query or another.  As
>> we have grown we have had to limit the fields we offer as search fields.
>> We expect and aim to grow to billions in the coming months.  This is
>> probably a familiar scenario, huh?
>> I am working alone and in my 'spare time' to try to produce a
>> prototype HDFS-based version as I described, learning as fast as I
>> can - I must admit that Hadoop, HBase and Lucene are all new to me as of
>> recent months.  I blog things (I am not an eloquent blogger, as it is
>> normally late on Sunday night) here: http://biodivertido.blogspot.com/
>>
>> To specifically answer your questions:
>>  - ranking is largely irrelevant for us, since all records are
>> considered equal
>>    - I am not entirely sure what "relevancy ranking" means, and assume
>> it is the search ranking?
>>
>>  - regarding the percentage - well this really depends:
>>    - most searches will be small results - so index makes sense
>>    - also, I look to generate maps quickly so might need latitude and
>> longitude in indexes
>>    - we need to offer slicing of data
>>      - we are an international network, and countries / thematic
>> networks want data dumps for their own analysis
>>      - therefore USA is 50%+ of the data, but a specialist butterfly
>> site is small... so perhaps use an index to do a count and then
>> pick a slicing strategy?
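The idea in that last bullet - count first, then choose how to slice - could look something like this. The threshold and the strategy names are invented for illustration and would need tuning against real hardware:

```python
def pick_slice_strategy(estimated_hits, total_rows, scan_threshold=0.10):
    """Choose how to materialise a data slice: if the slice is a large
    fraction of the table, one sequential full scan beats millions of
    random gets; otherwise fetch just the matching rows by key.
    The 10% threshold is a guess, not a measured crossover point."""
    if estimated_hits / total_rows >= scan_threshold:
        return "full-scan-and-filter"
    return "index-then-random-gets"

# A USA-sized slice (50%+ of the data) vs a small butterfly-site slice.
print(pick_slice_strategy(80_000_000, 150_000_000))
print(pick_slice_strategy(12_000, 150_000_000))
```

The cheap count from the index is what makes the decision possible before committing to either access pattern.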
>>
>> So you have now added Tokyo to the long list of things I need to learn
>> - but I will look into it...
>>
>> My work so far has been in my spare time, on EC2 and S3, and alone -
>> I am not sure how interested you are in getting involved?  I
>> would greatly appreciate a knowledgeable pair of eyes to sanity-check
>> things I am doing (e.g. scanning HBase, parsing XML to raw fields,
>> then scanning again and parsing to interpreted values, etc.), but would
>> welcome anyone who is interested in contributing.  It might take some
>> set-up time though, as I have all the pieces working in isolation and
>> don't keep EC2 running all the time.
>>
>> Thanks again, I'll continue to post progress
>>
>> Tim
>>
>>
>>
>>
>>
>> On Wed, Dec 17, 2008 at 4:20 PM, Jonathan Gray <jlist@streamy.com>
>> wrote:
>> > Hey Tim,
>> >
>> > I have dabbled with sharding of a Solr index.  We applied a consistent
>> > hashing algorithm to our IDs to determine which node to insert to.
>> >
>> > One downside, not sure if this exists with Katta, is that you don't
>> > have good relevancy across indexes.  For example, distributed querying
>> > is really just querying each shard.  Unfortunately the relevancy
>> > ranking is only relevant within each individual index; there is no
>> > global rank.  One idea is to shard based on another parameter which
>> > might allow you to apply relative-relevancy ;) given any
>> > domain-specific information.
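The consistent-hashing shard assignment mentioned above can be sketched as a minimal hash ring. The node names, replica count, and choice of MD5 are illustrative, not a description of what Streamy actually ran:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each node gets several virtual
    points on the ring; a key belongs to the first node point
    clockwise from the key's own hash."""

    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Find the first ring point after the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
# Assignment is deterministic, so indexer and querier agree on the shard.
print(ring.node_for("doc:12345"))
```

The virtual points per node are what keep key movement small when a shard is added or removed, which is the property plain `hash(id) % n` lacks.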
>> >
>> > I'm very interested in your problem.  Right now our indexes are small
>> > enough that we will be able to get by with 1 or 2 well-equipped nodes,
>> > but soon enough we will outgrow that and be looking at sharding across
>> > 5-10 nodes.  Our results are usually "page" size (10-20), so we don't
>> > have the same issue with how to efficiently fetch them.
>> >
>> > In these cases where you might be looking for 10M records, what
>> > percentage of the total dataset is that?  100M, 1B?  If you turn to a
>> > full scan as your solution, you're going to seriously limit how fast
>> > you can go even with good caching and faster IO.  But if you're
>> > returning a significant fraction of the total rows, then this would
>> > definitely make sense.
>> >
>> > If your data is relatively static, you might look at writing a very
>> > simple disk-based key/val cache like Berkeley DB or my favorite Tokyo
>> > Cabinet.  These can handle high numbers of records, stored on disk,
>> > but accessible in sub-ms time.  I have C and Java code to work with
>> > Tokyo and HBase together.  With such a high number of records, it's
>> > probably not feasible to keep them in memory, so a solution like this
>> > could be your best bet.  Also, stay tuned to this issue, as it would
>> > create a situation similar to running a disk-based key/val store by
>> > using Direct IO (preliminary testing shows 10X random-read
>> > improvement):  https://issues.apache.org/jira/browse/HADOOP-4801
>> >
>> > And this is an old issue that will have new life soon:
>> > https://issues.apache.org/jira/browse/HBASE-80
>> >
>> > Like I said, I have an interest in seeing how to solve this problem,
>> > so let me know if you have any other questions or if we can help in
>> > any way.
>> >
>> > Jonathan Gray
>> >
>> >> -----Original Message-----
>> >> From: tim robertson [mailto:timrobertson100@gmail.com]
>> >> Sent: Tuesday, December 16, 2008 11:42 PM
>> >> To: hbase-user@hadoop.apache.org
>> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>> >>
>> >> Hi,
>> >>
>> >> Thanks for the help.
>> >>
>> >> My Lucene indexes are for sure going to be too large for one
>> >> machine, so I plan to put the indexes on HDFS and then let Katta
>> >> distribute them around a few machines.  Because of Katta's ability
>> >> to do this, I went for Lucene and not SOLR, which requires me to do
>> >> all the sharding myself, if I understand distributed SOLR correctly -
>> >> I would much prefer SOLR's primitive handling, as right now I convert
>> >> all dates and ints manually.  If someone has distributed SOLR (where
>> >> it really is too big for one machine, since indexes are >50G) I'd
>> >> love to hear how they sharded nicely and manage it.
>> >>
>> >> Regarding performance... well, for "reports" that will return 10M
>> >> records, I will be quite happy with minutes as a response time, as
>> >> this is typically data download for scientific analysis, and
>> >> therefore people are happy to wait.  The results get put onto Amazon
>> >> S3 gzipped for download.  What worries me is that if I have 10-100
>> >> reports running at one time, that is an awful lot of single-record
>> >> requests on HBase.  I guess I will try to blog the findings.
>> >>
>> >> I am following the HBase, Katta and Hadoop code trunks, so will also
>> >> try to always use the latest, as this is a research project and not
>> >> production right now (production is still mysql-based).
>> >>
>> >> The alternative of course is to always open a scanner and then do a
>> >> full table scan for each report...
>> >>
>> >> Thanks
>> >>
>> >> Tim
>> >>
>> >> On Wed, Dec 17, 2008 at 12:22 AM, Jonathan Gray <jlist@streamy.com>
>> >> wrote:
>> >> > If I understand your system (and Lucene) correctly, you obviously
>> >> > must input all queried fields to Lucene.  And the indexes will be
>> >> > stored for the documents.
>> >> >
>> >> > Your question is about whether to also store the raw fields in
>> >> > Lucene or just store indexes in Lucene?
>> >> >
>> >> > A few things you might consider...
>> >> >
>> >> > - Scaling Lucene is much more difficult than scaling HBase.
>> >> > Storing indexes and raw content is going to grow your Lucene
>> >> > instance fast.  Scaling HBase is easy, and you're going to have
>> >> > constant performance, whereas Lucene performance will degrade
>> >> > significantly as it grows.
>> >> >
>> >> > - Random access to HBase currently leaves something to be desired.
>> >> > What kind of performance are you looking for with 1M random
>> >> > fetches?  There is major work being done for 0.19 and 0.20 that
>> >> > will really help with performance, as stack mentioned.
>> >> >
>> >> > - With 1M random reads, you might never get the performance out of
>> >> > HBase that you want, certainly not if you're expecting 1M fetches
>> >> > to be done in "realtime" (~100ms or so).  However, depending on
>> >> > your dataset and access patterns, you might be able to get
>> >> > sufficient performance with caching (either block caching, which is
>> >> > currently available, or record caching, slated for 0.20 but likely
>> >> > with a patch available soon).
>> >> >
>> >> > We are using Lucene by way of Solr and are not storing the raw
>> >> > data in Lucene.  We have an external memcached-like cache so that
>> >> > our raw content fetches are sufficiently quick.  My team is
>> >> > currently working on building this cache into HBase.
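The architecture described here - index hits from Lucene, raw records from a memcached-like layer in front of the store - might look roughly like this. A toy in-process LRU stands in for memcached, and the backing-store lookup is hypothetical:

```python
from collections import OrderedDict

class RecordCache:
    """Toy LRU cache standing in for a memcached-like layer: raw-record
    fetches hit the cache first and fall back to the slow store."""

    def __init__(self, fetch_fn, capacity=1000):
        self.fetch_fn = fetch_fn      # slow path, e.g. a random get
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, row_key):
        if row_key in self.cache:
            self.cache.move_to_end(row_key)   # mark as recently used
            return self.cache[row_key]
        self.misses += 1
        value = self.fetch_fn(row_key)
        self.cache[row_key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return value

# Pretend backing store (would be HBase in the thread's setup).
store = {"row:1": "raw record 1", "row:2": "raw record 2"}
cache = RecordCache(store.__getitem__, capacity=2)
cache.get("row:1")
cache.get("row:1")    # second read served from the cache, not the store
```

For popular records this turns repeated random reads against the store into memory hits, which is the point of putting such a layer in front of HBase.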
>> >> >
>> >> > I'm not sure if the highlighting features in Solr are only part of
>> >> > Solr or also in Lucene, but of course you lose the ability to do
>> >> > those things if you don't put the raw content into Lucene.
>> >> >
>> >> > JG
>> >> >
>> >> >
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: stack [mailto:stack@duboce.net]
>> >> >> Sent: Tuesday, December 16, 2008 2:37 PM
>> >> >> To: hbase-user@hadoop.apache.org
>> >> >> Subject: Re: Lucene from HBase - raw values in Lucene index or
>> not?
>> >> >>
>> >> >> Interesting question.
>> >> >>
>> >> >> Would be grand if you didn't have to duplicate the hbase data in
>> >> >> the lucene index, just store the hbase locations -- or, just store
>> >> >> small stuff in the lucene index and leave big stuff back in hbase
>> >> >> -- but perhaps the double hop of lucene first and then to hbase
>> >> >> will not perform well enough?  0.19.0 hbase will be better than
>> >> >> 0.18.0 if you can wait a week or so for the release candidate to
>> >> >> test.
>> >> >>
>> >> >> Let us know how it goes Tim,
>> >> >> St.Ack
>> >> >>
>> >> >>
>> >> >> tim robertson wrote:
>> >> >> > Hi All,
>> >> >> >
>> >> >> > I have HBase running now, building Lucene indexes on Hadoop
>> >> >> > successfully and then I will get Katta running for distributing
>> my
>> >> >> > indexes.
>> >> >> >
>> >> >> > I have around 15 search fields indexed that I wish to extract
>> >> >> > and return those 15 to the user in the result set - my result
>> >> >> > sets will be up to millions of records...
>> >> >> >
>> >> >> > Should I:
>> >> >> >
>> >> >> >   a) have the values stored in the Lucene index, which will
>> >> >> > make it slower to search but returns the results immediately in
>> >> >> > pages without hitting HBase
>> >> >> >
>> >> >> > or
>> >> >> >
>> >> >> >   b) not store the data in the index, but page over the Lucene
>> >> >> > index and do millions of "get by ROWKEY" on HBase
>> >> >> >
>> >> >> > Obviously this is not happening synchronously while the user
>> >> >> > waits, but I am looking forward to hearing if people have done
>> >> >> > similar scenarios and what worked out nicely...
>> >> >> >
>> >> >> > Lucene degrades in performance at large page numbers (100th
>> >> >> > page of 1000 results), right?
>> >> >> >
>> >> >> > Thanks for any insights,
>> >> >> >
>> >> >> > Tim
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
