lucene-solr-dev mailing list archives

From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (SOLR-1375) BloomFilter on a field
Date Wed, 26 Aug 2009 01:37:59 GMT


Jason Rutherglen commented on SOLR-1375:

The other attribute to add is the ability to set the hash function to use (e.g. Murmur).

> BloomFilter on a field
> ----------------------
>                 Key: SOLR-1375
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>   Original Estimate: 120h
>  Remaining Estimate: 120h
> * A bloom filter is a read-only probabilistic set. It's useful
> for verifying that a key exists in a set, though it can return
> false positives.
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster, which (when using the term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (I just realized we'll
> need to add in the core name.)
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files are
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner rather than later, after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> * Distributed code is added and seems to work; I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to set up
> multiple Tomcat servers and manually execute tests.
> * More of the bloom filter options need to be configurable via
> solrconfig.
> * I'll add more test cases
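
The per-segment scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: the class and method names (SegmentBloomFilters, SimpleBloomFilter, onCommit, onMergedAway) are hypothetical, the filter is a hand-rolled one rather than Hadoop's BloomFilter, and the hash is FNV-1a with double hashing rather than Murmur. Filters are kept in memory here instead of being written to .blm files.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry holding one Bloom filter per segment (not the patch's classes). */
class SegmentBloomFilters {

    /** Minimal Bloom filter; the k bit positions come from double hashing two FNV-1a values. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(String term) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(term, i));
            }
        }

        /** False means definitely absent; true means possibly present (false positives happen). */
        boolean mightContain(String term) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(term, i))) {
                    return false;
                }
            }
            return true;
        }

        private int index(String term, int i) {
            byte[] data = term.getBytes(StandardCharsets.UTF_8);
            int h1 = fnv1a(data, 0x811C9DC5);
            int h2 = fnv1a(data, 0x01000193);
            // Integer overflow is fine here; floorMod keeps the index in [0, numBits).
            return Math.floorMod(h1 + i * h2, numBits);
        }

        private static int fnv1a(byte[] data, int seed) {
            int h = seed;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }
    }

    private final Map<String, SimpleBloomFilter> bySegment = new LinkedHashMap<>();

    /** On commit, build a filter for a new segment from the terms of the chosen field. */
    void onCommit(String segment, Iterable<String> terms) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 4); // sizes are arbitrary
        for (String term : terms) {
            filter.add(term);
        }
        bySegment.put(segment, filter);
    }

    /** When a segment is merged away, drop its filter as well. */
    void onMergedAway(String segment) {
        bySegment.remove(segment);
    }

    /** Returns the segments whose filter reports the term as possibly present. */
    List<String> segmentsPossiblyContaining(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, SimpleBloomFilter> e : bySegment.entrySet()) {
            if (e.getValue().mightContain(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

A duplicate check would call segmentsPossiblyContaining(id) and treat an empty result as "definitely new"; a non-empty result still needs confirmation against the index, because the filter can report false positives.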

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
