incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: [jira] [Commented] (BLUR-112) Allow for types to be set on blur tables
Date Thu, 27 Jun 2013 02:12:22 GMT
On Wed, Jun 26, 2013 at 3:12 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi Aaron,
>
> Thanks for your response. Below are a few of my thoughts.
>
> MapReduce indexing: Maybe we should implement some locking mechanism on
> TableDescriptor/AnalyzerDefinition so that it is not changed until
> MapReduce Indexing is done.
>

The way MR indexing works, it gets the current table descriptor and packs
the object into the job, so it never goes back to ZK or the cluster for more
information.  A job that is already running therefore does not see later
changes to the descriptor.
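
As a rough illustration of that snapshot behaviour (a sketch only: DescriptorSnapshot and IndexingJobSketch are made-up names, not Blur classes, and a plain map of column name to type name stands in for the table descriptor), the job copies the column types when it is created, so later changes to the live definition are invisible to it:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a table descriptor: column name -> type name.
final class DescriptorSnapshot {
    private final Map<String, String> columnTypes;

    DescriptorSnapshot(Map<String, String> liveColumnTypes) {
        // Copy at construction time so later edits to the live map are not seen.
        this.columnTypes = Collections.unmodifiableMap(new HashMap<>(liveColumnTypes));
    }

    String typeOf(String column) {
        return columnTypes.get(column);
    }
}

// Hypothetical job: it only ever consults the snapshot it was built with.
final class IndexingJobSketch {
    private final DescriptorSnapshot snapshot;

    IndexingJobSketch(Map<String, String> liveColumnTypes) {
        this.snapshot = new DescriptorSnapshot(liveColumnTypes);
    }

    String typeFor(String column) {
        return snapshot.typeOf(column);
    }

    public static void main(String[] args) {
        Map<String, String> live = new HashMap<>();
        live.put("col1", "int");
        IndexingJobSketch job = new IndexingJobSketch(live);
        live.put("col1", "text"); // change made after the job was created
        System.out.println(job.typeFor("col1"));
    }
}
```

The copy in the constructor is the whole point: the job's view is fixed at submission time, whatever happens to the live definition afterwards.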


>
> Maybe my understanding is a little naive here, but we are storing the analyzer
> definition as a whole in ZooKeeper. If that is the case, how can we use
> ZkCachedMap to store values in ZooKeeper (apart from memory) at a lower
> granularity, like types or column definitions (unless we modify how the
> analyzer definition is stored)?
>

Yes, I'm suggesting that we change the way the analyzer is stored in ZK.
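
A minimal sketch of what per-column, set-once storage could look like, assuming an in-memory map in place of ZooKeeper.  SetOnceColumnStore and its methods are hypothetical names for illustration and are not the ZkCachedMap API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical store: one entry per column, and an entry can only be set once.
// A real version would persist each entry to ZooKeeper; this one is memory only.
final class SetOnceColumnStore {
    private final ConcurrentMap<String, String> types = new ConcurrentHashMap<>();

    private static String key(String family, String column) {
        return family + "." + column;
    }

    // Defines the column type if it is not defined yet and returns the type
    // that is in effect. A conflicting redefinition is rejected.
    String define(String family, String column, String type) {
        String existing = types.putIfAbsent(key(family, column), type);
        if (existing != null && !existing.equals(type)) {
            throw new IllegalStateException("Column " + key(family, column)
                    + " is already defined as " + existing);
        }
        return existing == null ? type : existing;
    }

    // Returns the defined type, or null if the column is unknown.
    String typeOf(String family, String column) {
        return types.get(key(family, column));
    }
}
```

Because putIfAbsent is atomic, two callers racing to define the same column agree on one winner, which is the property wanted across shard servers.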


>
> We should also add this subtask apart from the ones you mentioned:
>   -- Modify Column struct to add type and once defined we should not allow
> updates for existing column definitions
>

> * Then a ZK watcher would fire in the ZooKeeperClusterStatus class (all the
> shard servers would also fire on the watcher) that would update the
> TableDescriptor, then perhaps we would clear the table context cache -- I
> thought about this but I didn't know how to achieve it. This clarifies it.
>
> Also, what would be the problem with refreshing the same Blur analyzer
> object instead of creating a new one? Something like
> analyzer.reload(updatedAnalyzerDefinition). This avoids updating all
> references held to the analyzer. If for some reason this does not work then
> apart from passing it to BlurIndexWriter, we should also tweak the
> DistributedIndexServer as it holds _tableAnalyzers and _tableDescriptors.
> What do you think?
>

Well, the biggest reason I said we should clear the TableContext cache is
that it re-reads the state of everything from ZK.  It may sound like a
lot to rebuild, but creating a new column type is going to be very
infrequent.  Also, if we do it this way (clearing the cache and letting
it be re-created), then we don't have to worry about any pesky concurrency
issues that would likely come into play during a reload because of active
writes, reads, queries, etc.  It basically makes the context immutable.
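
A rough sketch of the clear-and-rebuild idea, under stated assumptions: TableContextSketch and ContextCacheSketch are hypothetical names, and the loader function stands in for re-reading state from ZK.  It is not how Blur's TableContext is actually written:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical immutable view of a table's column types.
final class TableContextSketch {
    private final Map<String, String> columnTypes;

    TableContextSketch(Map<String, String> columnTypes) {
        this.columnTypes = Collections.unmodifiableMap(new HashMap<>(columnTypes));
    }

    String typeOf(String column) {
        return columnTypes.get(column);
    }
}

// Hypothetical cache: contexts are never mutated, only dropped and rebuilt.
final class ContextCacheSketch {
    private final ConcurrentHashMap<String, TableContextSketch> cache = new ConcurrentHashMap<>();
    private final Function<String, Map<String, String>> loader; // stands in for a ZK read

    ContextCacheSketch(Function<String, Map<String, String>> loader) {
        this.loader = loader;
    }

    TableContextSketch get(String table) {
        return cache.computeIfAbsent(table, t -> new TableContextSketch(loader.apply(t)));
    }

    // Would be called from the watcher when the stored definition changes.
    void invalidate(String table) {
        cache.remove(table);
    }
}
```

Readers that already hold an old context keep a consistent view until they ask the cache again, so nothing is modified underneath an active read or write.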

The only issue with this is getting the new analyzer information into the
index writer.  But we may have to close and reopen it anyway because I don't
really know what it's doing with the analyzer.  Could probably figure it
out though.  I'm sort of undecided on that one.
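
For reference, the "atomic reference inside an analyzer decorator" alternative could be sketched roughly as below.  This is an assumption-heavy illustration: FieldAnalyzerSketch is a made-up interface, not Lucene's Analyzer, and whether the real index writer would tolerate its analyzer changing underneath it is exactly the open question:

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical analyzer interface for illustration; not Lucene's Analyzer.
interface FieldAnalyzerSketch {
    String analyze(String field, String value);
}

// Decorator that the writer would hold. The delegate can be swapped at runtime
// without the writer's reference ever changing.
final class SwappableAnalyzer implements FieldAnalyzerSketch {
    private final AtomicReference<FieldAnalyzerSketch> delegate;

    SwappableAnalyzer(FieldAnalyzerSketch initial) {
        this.delegate = new AtomicReference<>(Objects.requireNonNull(initial));
    }

    @Override
    public String analyze(String field, String value) {
        // Each call reads the current delegate, so a swap takes effect on the next call.
        return delegate.get().analyze(field, value);
    }

    void swap(FieldAnalyzerSketch next) {
        delegate.set(Objects.requireNonNull(next));
    }
}
```

The swap itself is atomic, but a document analyzed partly before and partly after a swap could still see two definitions, which is one reason closing and reopening the writer may be the safer route.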

Aaron


> - Rahul
>
>
>
>
> On Tue, Jun 25, 2013 at 6:12 PM, Aaron McCurry (JIRA) <jira@apache.org>
> wrote:
>
> >
> >     [ https://issues.apache.org/jira/browse/BLUR-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693558#comment-13693558 ]
> >
> > Aaron McCurry commented on BLUR-112:
> > ------------------------------------
> >
> > Rahul,
> >
> > These are all good questions. To be honest, this is a feature that I know
> > we need, but I am unsure of what the correct implementation should be.  I
> > like your line of thinking though.  After giving this some thought, I
> > have tried to document some of my thoughts below.  Like I said, I am not
> > sold on any implementation, so I am open to other ideas on how we should
> > proceed.
> >
> > I believe there are 2 fundamental operating modes when it comes to column
> > definitions/types/analyzers.
> >
> > * The first mode is when you know all the columns and definitions up
> > front and you define a column with a type and an optional analyzer if
> > that type is a TextField.  This mode is required for MapReduce indexing
> > because it needs to know up front how to index all the data.
> >
> > * The second mode is when no columns or types are known up front but as
> > they are discovered they need to be added to the table
> > descriptor/analyzer def.
> >
> > I believe we need to support both modes at the same time.  So back to
> > some of your questions.
> >
> > - Once defined, we do not allow updates for existing column definitions.
> > We can safely check whether any Inbound Column types have a conflict with
> > any defined types in AnalyzerDefinition as we do not have to worry about
> > AnalyzerDefinition object on the shard server being out of sync with that
> > stored on Zookeeper.
> >
> > This is true.
> >
> > - However for dynamic columns, as you explained, the same column might
> > have been added by a different shard server. This need not happen at the
> > exact same time (correct me if I am wrong). Even if ShardServer-B added a
> > dynamic column 5 seconds before, the AnalyzerDefinition in ShardServer-A
> > does not reflect that change.
> >
> > So this is where the ZkCachedMap comes in. Basically it is an in-memory
> > cache of fields (or any other values really) that can only be set once
> > and is persistent.  It also serves as a consistent store for all the
> > types across all the shard servers.
> >
> > - Also a little confused about how to use ZkCachedMap. What values will
> > we be caching/storing using this? Are they only the ones which can be
> > overwritten on ZooKeeper?
> >
> > I think that we should store all the types (maybe more information like
> > the entire column definition etc.) in ZkCachedMap for all cases.  I think
> > that this will make things more consistent.  Perhaps the ZkCachedMap
> > needs to become the storage mechanism for the AnalyzerDefinition.  Also,
> > ZkCachedMap is probably a bad name for that class; we might need to come
> > up with something else.  As I look at your pseudo code, you are right
> > that we will need to reload the analyzer somehow. It will likely need to
> > be driven from a ZK watcher on the update of the ZkCachedMap.
> >
> > Operations at a high level
> >
> > * From an external API (Thrift), I would think we would need a method
> > that looks like "addColumnDefinition(family,name,type,analyzer,fulltext)"
> > or something like that
> > * Next it would call a method to update the ZkCachedMap and store to ZK
> > * Then a ZK watcher would fire in the ZooKeeperClusterStatus class (all
> > the shard servers would also fire on the watcher) that would update the
> > TableDescriptor, then perhaps we would clear the table context cache
> > * And the clearing of the cache would force the recreation of the
> > BlurAnalyzer (if it doesn't we should make it)
> > * We would also need to figure out how to get the new analyzer into the
> > index writer (maybe with an atomic reference inside an analyzer
> > decorator?)
> >
> > As we talk about this feature I think we need to break this one up into
> > sub tasks.  Let me know what you think.
> >
> > Thanks,
> > Aaron
> >
> >
> >
> >
> >
> >
> > > Allow for types to be set on blur tables
> > > ----------------------------------------
> > >
> > >                 Key: BLUR-112
> > >                 URL: https://issues.apache.org/jira/browse/BLUR-112
> > >             Project: Apache Blur
> > >          Issue Type: Improvement
> > >    Affects Versions: 0.2.0, 0.3.0
> > >            Reporter: Aaron McCurry
> > >             Fix For: 0.3.0
> > >
> > >
> > > Create the ability for Blur to handle the default Lucene field types.
> > > This should not be tied to the table descriptor because types should be
> > > allowed to be added at runtime.  Also 2 new fields should be added to
> > > the TableDescriptor:
> > > 1. A strict types attribute.  If set to true and a new column is added
> > > to the table with no type mapping for it, throw an exception.  Set to
> > > false by default.
> > > 2. Default type if strict is set to false.  The default type should be
> > > text.
> > > Also, dynamic columns could be allowed if their name included the type.
> > > Such as:
> > > The column name could be "col1" with a type of "int", in the Column
> > > struct in thrift the name would be "col1/int" and if the type did not
> > > exist before the call it would be added.
> > > Thoughts?
> >
>
