incubator-blur-dev mailing list archives

From rahul challapalli <>
Subject Re: [jira] [Commented] (BLUR-112) Allow for types to be set on blur tables
Date Wed, 26 Jun 2013 07:12:46 GMT
Hi Aaron,

Thanks for your response. Below are a few of my thoughts.

MapReduce Indexing: Maybe we should implement some locking mechanism on
TableDescriptor/AnalyzerDefinition so that it cannot be changed until
MapReduce indexing is done.

Maybe my understanding is a little naive here, but we are storing the
analyzer definition as a whole in ZooKeeper. If that is the case, how can we
use ZkCachedMap to store values in ZooKeeper (apart from memory) at a lower
granularity, such as types or column definitions (unless we modify how the
analyzer definition is stored)?

We should also add this subtask, apart from the ones you mentioned:
  -- Modify the Column struct to add a type; once defined, we should not allow
updates to existing column definitions.
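
A rough sketch of the "define once" rule, just to illustrate the idea. The
class and method names here (ColumnTypeRegistry, define, typeOf) are made up
for this example and are not existing Blur classes; an in-memory map stands in
for whatever store we end up using. The first definition of a column wins,
repeating the same type is harmless, and a conflicting type is rejected.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; ColumnTypeRegistry is not part of the Blur code base.
class ColumnTypeRegistry {

  // Key is "family.column"; value is the type name, for example "text" or "int".
  private final ConcurrentMap<String, String> types =
      new ConcurrentHashMap<String, String>();

  // Registers a type for a column. The first definition wins; a later call
  // succeeds only if it repeats the type that is already stored.
  void define(String family, String column, String type) {
    String key = family + "." + column;
    String existing = types.putIfAbsent(key, type);
    if (existing != null && !existing.equals(type)) {
      throw new IllegalStateException("Column [" + key + "] is already defined as ["
          + existing + "] and cannot be redefined as [" + type + "]");
    }
  }

  // Returns the stored type, or null if the column has not been defined yet.
  String typeOf(String family, String column) {
    return types.get(family + "." + column);
  }
}
```

putIfAbsent makes the check and the insert a single atomic step, so two
threads defining the same column at the same time cannot both succeed with
different types. This only covers a single JVM; the cross-server case would
still need the shared store.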

* Then a ZK watcher would fire in the ZooKeeperClusterStatus class (all the
shard servers would also fire on the watcher) that would update the
TableDescriptor, then perhaps we would clear the table context cache -- I
thought about this but didn't know how to achieve it. This clarifies it.

Also, what would be the problem with refreshing the same Blur analyzer
object instead of creating a new one? Something like
analyzer.reload(updatedAnalyzerDefinition). This avoids updating all
references held to the analyzer. If for some reason this does not work, then
apart from passing it to BlurIndexWriter, we should also tweak the
DistributedIndexServer, as it holds _tableAnalyzers and _tableDescriptors.
What do you think?
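
To make the reload idea concrete, here is a minimal sketch. ReloadableAnalyzer
and typeFor are hypothetical names, and a plain field-to-type map stands in for
the real AnalyzerDefinition; none of this is existing Blur code. The object
keeps its current definition in an AtomicReference, so anything already holding
the analyzer sees the new definition on its next lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; ReloadableAnalyzer is not an existing Blur class, and a
// plain field-to-type map stands in for the real AnalyzerDefinition.
class ReloadableAnalyzer {

  private final AtomicReference<Map<String, String>> definition =
      new AtomicReference<Map<String, String>>();

  ReloadableAnalyzer(Map<String, String> initialDefinition) {
    reload(initialDefinition);
  }

  // Swaps in a read-only copy of the updated definition. Callers that already
  // hold a reference to this object see the change on their next lookup.
  void reload(Map<String, String> updatedDefinition) {
    definition.set(Collections.unmodifiableMap(
        new HashMap<String, String>(updatedDefinition)));
  }

  // Falls back to "text", the default type proposed in BLUR-112.
  String typeFor(String field) {
    String type = definition.get().get(field);
    return type == null ? "text" : type;
  }
}
```

With this shape the index writer and the index server could keep the same
analyzer reference, and the ZK watcher would only need to call reload(...)
with the new definition.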

- Rahul

On Tue, Jun 25, 2013 at 6:12 PM, Aaron McCurry (JIRA) <> wrote:

> Aaron McCurry commented on BLUR-112:
> ------------------------------------
> Rahul,
> These are all good questions. To be honest, this is a feature that I know
> we need, but I am unsure of what the correct implementation should be.  I
> like your line of thinking though.  After giving this some thought, I
> have tried to document some of my thoughts below.  Like I said, I am not
> sold on any implementation, so I am open to other ideas on how we should
> proceed.
> I believe there are 2 fundamental operating modes when it comes to column
> definitions/types/analyzers.
> * The first mode is when you know all the columns and definitions up front
> and you define a column with a type and an optional analyzer if that type
> is a TextField.  This mode is required for MapReduce indexing because it
> needs to know up front how to index all the data.
> * The second mode is when no columns or types are known up front but as
> they are discovered they need to be added to the table descriptor/analyzer
> def.
> I believe we need to support both modes at the same time.  So back to some
> of your questions.
> - Once defined, we do not allow updates for existing column definitions.
> We can safely check whether any Inbound Column types have a conflict with
> any defined types in AnalyzerDefinition as we do not have to worry about
> AnalyzerDefinition object on the shard server being out of sync with that
> stored on Zookeeper.
> This is true.
> - However, for dynamic columns, as you explained, the same column might have
> been added by a different shard server. This need not happen at the exact
> same time (correct me). Even if ShardServer-B added a dynamic column 5
> seconds before, the AnalyzerDefinition in ShardServer-A does not reflect
> that change.
> So this is where the ZkCachedMap comes in. Basically, it is an in-memory
> cache of fields (or any other values really) that can only be set once and
> is persistent.  It also serves as a consistent store for all the types
> across all the shard servers.
> - Also a little confused about how to use ZkCachedMap. What values will we
> be caching/storing using this? Are they only the ones which can be
> overwritten on Zookeeper?
> I think that we should store all the types (maybe more information like
> the entire column definition etc) in ZkCachedMap for all cases.  I think
> that this will make things more consistent.  Perhaps the ZkCachedMap needs
> to become the storage mechanism for the AnalyzerDefinition.  Also,
> ZkCachedMap is probably a bad name for that class; we might need to come up
> with something else.  As I look at your pseudo code, you are right, we will
> need to reload the analyzer somehow. It will likely need to be driven from
> a ZK watcher on the update of the ZkCachedMap.
> Operations at a high level
> * From an external api (Thrift), I would think we would need a method that
> looks like "addColumnDefinition(family,name,type,analyzer,fulltext)" or
> something like that
> * Next it would call a method to update the ZkCachedMap and store to ZK
> * Then a ZK watcher would fire in the ZooKeeperClusterStatus class (all
> the shard servers would also fire on the watcher) that would update the
> TableDescriptor, then perhaps we would clear the table context cache
> * And the clearing of the cache would force the recreation of the
> BlurAnalyzer (if it doesn't we should make it)
> * We would also need to figure out how to get the new analyzer into the
> index writer (maybe with an atomic reference inside an analyzer decorator?)
> As we talk about this feature I think we need to break this one up into
> sub tasks.  Let me know what you think.
> Thanks,
> Aaron
> > Allow for types to be set on blur tables
> > ----------------------------------------
> >
> >                 Key: BLUR-112
> >                 URL:
> >             Project: Apache Blur
> >          Issue Type: Improvement
> >    Affects Versions: 0.2.0, 0.3.0
> >            Reporter: Aaron McCurry
> >             Fix For: 0.3.0
> >
> >
> > Create the ability for Blur to handle the default Lucene field types.
> > This should not be tied to the table descriptor because types should be
> > allowed to be added at runtime.  Also 2 new fields should be added to the
> > TableDescriptor:
> > 1. A strict types attribute.  If set to true and a new column is added
> > to the table with no type mapping for it, throw an exception.  Set
> > to false by default.
> > 2. Default type if strict is set to false.  The default type should be
> > text.
> > Also, dynamic columns could be allowed if their name included the type.
> > Such as:
> > The column name could be "col1" with a type of "int", in the Column
> > struct in thrift the name would be "col1/int" and if the type did not exist
> > before the call it would be added.
> > Thoughts?
