lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3535) Add block support for XMLLoader
Date Wed, 13 Jun 2012 21:55:42 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294678#comment-13294678 ]

Hoss Man commented on SOLR-3535:
--------------------------------

bq. I don't feel that this rich model is covered with single level parent-child well.

who said anything about a "single level" ? .. if SolrInputDocument can have a List<SolrInputDocument>
of children, then those children can have other children, etc..
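
As a rough illustration of that point, here is a minimal, self-contained sketch. `NestedDoc`, `addChild`, `getChildren` and `depth` are hypothetical names invented for this example, not the actual SolrInputDocument API; the only idea taken from the discussion is that a document holds a list of documents of its own type, so the nesting depth is unbounded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for SolrInputDocument; NOT the real Solr class.
final class NestedDoc {
    private final Map<String, Object> fields = new LinkedHashMap<>();
    private final List<NestedDoc> children = new ArrayList<>();

    NestedDoc setField(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    Object getField(String name) {
        return fields.get(name);
    }

    // Hypothetical method: hangs a child document off of this one.
    NestedDoc addChild(NestedDoc child) {
        children.add(child);
        return this;
    }

    List<NestedDoc> getChildren() {
        return children;
    }

    // Number of levels in the hierarchy rooted here (1 means no children).
    int depth() {
        int deepest = 0;
        for (NestedDoc child : children) {
            deepest = Math.max(deepest, child.depth());
        }
        return 1 + deepest;
    }
}

final class NestedDocDemo {
    // product -> sku -> price: three levels built from one recursive type.
    static NestedDoc sample() {
        NestedDoc price = new NestedDoc().setField("type", "price").setField("amount", 10);
        NestedDoc sku = new NestedDoc().setField("type", "sku").setField("color", "red").addChild(price);
        return new NestedDoc().setField("id", "product-1").setField("type", "product").addChild(sku);
    }

    public static void main(String[] args) {
        System.out.println(sample().depth()); // prints 3
    }
}
```

Only the top-level document in the sample carries an "id" field; the children are reachable solely through their parent.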

bq. PK field is a blocker for transparent handling scoped docs by the current processors.
i.e. I don't think it's mandatory to provide PK field for every child document (most time
it's useless and redundant info)

Agreed, but i don't see how it's a blocker - if the children hang off of the top most
parent, then as long as that parent has a uniqueKey, all of the distributed stuff (and any
update processors that care about uniqueKey) should be fine ... processors that want to be
aware of sub-documents might have to worry about it, and we have to think through how deletes
by id should work (so that children are automatically removed and not inherited by the adjacent
parent doc) but those are going to be issues that need to be thought through/solved regardless
of how we model the nested docs in the processor chain API.

bq. field update processors can work wrong if the same field name is present in several scopes
- name clash between different relations/scopes

a) that seems like an argument in favor of continuing to give the processors a single top
level SolrInputDocument with all of its children hanging off of it in a hierarchy, instead
of adding a new AddBlockCommand that contains a flattened list of documents -- because with
a flat list the processors won't have any way of knowing if/when to treat some docs differently.

b) like other things i mentioned earlier, that really seems like a secondary concern -- for
many use cases either the field names will be distinct, or can be made distinct for the purposes
of using this feature.  Update processors can (eventually) be made smarter to know to only
operate on certain documents by "type" but any solution like that which would work on a sequential
list of documents like in your "AddBlockCommand" suggestion could also work on a true hierarchy
of SolrInputDocuments (where it would have the actual hierarchy to help inform its behavior).

bq. why new api/property is necessary? is solrInputDoc.addField("skus", new Object[]{sku1,
sku2, sku3}) not enough?

Are you suggesting we model child documents as objects (SolrInputDocuments i guess?) in a
special field? ... what if i put child documents in multiple fields? would that signify the
different types of child?  how would solr model that in the (lucene) Documents when giving
them to the IndexWriter?  How would solr know how to order the children from multiple
fields/lists when creating the block?  Wouldn't the "type of child" information be better
living in the child documents themselves?  (particularly since that "type" information needs to
be in the child documents anyway so that the filter query for a BJQ can be specified.)
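
For illustration only: if each document carries its own "type" and the children hang off of their parent, the block order can fall out of the hierarchy itself. The sketch below assumes the Lucene block-join convention that children precede their parent and the parent is the last document in its block. `BlockDoc` and `toBlock` are hypothetical names, not Solr or Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical document node; "type" lives on the document itself.
final class BlockDoc {
    final String type;
    final String label;
    final List<BlockDoc> children = new ArrayList<>();

    BlockDoc(String type, String label) {
        this.type = type;
        this.label = label;
    }

    BlockDoc add(BlockDoc child) {
        children.add(child);
        return this;
    }
}

final class BlockOrder {
    // Post-order walk: all descendants of a document are emitted before the
    // document itself, so the root (top-level parent) ends up last.
    static List<BlockDoc> toBlock(BlockDoc root) {
        List<BlockDoc> block = new ArrayList<>();
        append(root, block);
        return block;
    }

    private static void append(BlockDoc doc, List<BlockDoc> block) {
        for (BlockDoc child : doc.children) {
            append(child, block);
        }
        block.add(doc);
    }

    public static void main(String[] args) {
        BlockDoc product = new BlockDoc("product", "p1")
            .add(new BlockDoc("sku", "s1").add(new BlockDoc("price", "pr1")))
            .add(new BlockDoc("sku", "s2"));
        for (BlockDoc doc : toBlock(product)) {
            System.out.println(doc.type + " " + doc.label);
        }
    }
}
```

With a single ordered list of children there is no ambiguity about how to interleave several fields, and a parent filter could select on the "type" value of the last document in each block.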

It also seems like it would require code that wants to know what children exist in a document
to do a lot of work to find that out (need to iterate every field in the SolrInputDocument
and do reflection to see if they are child-documents or not).
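
A hedged sketch of what that scan might look like, assuming children were stored as ordinary field values. `FieldBagDoc` and `findChildren` are hypothetical names made up for this example (not Solr code), and plain `instanceof` checks stand in for the reflection mentioned above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document that keeps everything, children included, as field values.
final class FieldBagDoc {
    final Map<String, Object> fields = new LinkedHashMap<>();
}

final class ChildScan {
    // Walks every field value and type-checks it, because nothing else marks
    // a value as a child document rather than a regular (multi)value.
    static List<FieldBagDoc> findChildren(FieldBagDoc doc) {
        List<FieldBagDoc> found = new ArrayList<>();
        for (Object value : doc.fields.values()) {
            if (value instanceof FieldBagDoc) {
                found.add((FieldBagDoc) value);
            } else if (value instanceof Object[]) {
                for (Object element : (Object[]) value) {
                    if (element instanceof FieldBagDoc) {
                        found.add((FieldBagDoc) element);
                    }
                }
            } else if (value instanceof Collection) {
                for (Object element : (Collection<?>) value) {
                    if (element instanceof FieldBagDoc) {
                        found.add((FieldBagDoc) element);
                    }
                }
            }
        }
        return found;
    }
}
```

Every consumer that cares about children would need some version of this loop, whereas a dedicated child list would hand them over directly.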

Another concern off the top of my head is that a lot of existing code (including any custom
update processors people might have) would assume those child documents are multivalued field
values and would probably break -- hence a new method on SolrInputDocument seems wiser (code
that doesn't know about it may not do what you want, but at least it won't break).

bq. there is a *pre*processors chain which deal with scoped documents and flatten them - there
should be two of them: block-join (bjq counterpart); denormalizer (grouping counterpart);
fk-copier for query-time join;

i don't really understand the need for this.  i'm at a complete loss as to what you mean by "fk-copier
for query-time join", but your suggestion for a new type of processor chain that can flatten/denormalize
documents seems like it could easily be implemented using the existing UpdateProcessorChain
code -- assuming we let SolrInputDocuments have other SolrInputDocuments as children.  Couldn't
you just write a new "FlattenDocumentUpdateProcessor" such that anytime it gets a SolrInputDocument
with children, it creates new AddDocCommands containing those children (adding whatever flattened
fields from the parent that it wants) and executes them?
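
To make the idea concrete, a minimal sketch of the flattening step on its own. It does not use the real update processor or AddDocCommand classes: `TreeDoc`, `flatten`, and the "parent_" field prefix are all hypothetical choices for this example, and only direct children are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical nested document; NOT the real SolrInputDocument.
final class TreeDoc {
    final Map<String, Object> fields = new LinkedHashMap<>();
    final List<TreeDoc> children = new ArrayList<>();

    TreeDoc set(String name, Object value) {
        fields.put(name, value);
        return this;
    }
}

final class FlattenSketch {
    // Produces one flat field map per direct child. Each map holds the child's
    // own fields plus the selected parent fields, copied under a "parent_"
    // prefix (an assumed naming scheme). Grandchildren are ignored here.
    static List<Map<String, Object>> flatten(TreeDoc parent, Set<String> inherited) {
        List<Map<String, Object>> flat = new ArrayList<>();
        for (TreeDoc child : parent.children) {
            Map<String, Object> doc = new LinkedHashMap<>(child.fields);
            for (String name : inherited) {
                if (parent.fields.containsKey(name)) {
                    doc.put("parent_" + name, parent.fields.get(name));
                }
            }
            flat.add(doc);
        }
        return flat;
    }
}
```

In a real chain each resulting map would presumably become its own add command; which parent fields get copied is left to the caller via `inherited`.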

bq. for distributed processor AddBlockCommand should have PK - it's preprocessors' duty

but that doesn't address the issues yonik and i raised about all of the distributed update
& transaction log code that already exists revolving around forwarding *documents* and
recording their unique key.  What is the advantage of introducing a new AddBlockCommand that
also has to have a unique key, and would need to be forwarded around atomically when we could
just use the top level parent document with all of the existing distributed update code as
is?
                
> Add block support for XMLLoader
> -------------------------------
>
>                 Key: SOLR-3535
>                 URL: https://issues.apache.org/jira/browse/SOLR-3535
>             Project: Solr
>          Issue Type: Sub-task
>          Components: update
>    Affects Versions: 4.1, 5.0
>            Reporter: Mikhail Khludnev
>            Priority: Minor
>         Attachments: SOLR-3535.patch
>
>
> I'd like to add the following update xml message:
> <add-block>
>     <doc>....</doc>
>     <doc>....</doc>
> </add-block>
> out of scope for now: 
> * other update formats
> * update log support (NRT), should not be a big deal
> * overwrite feature support for block updates - it's more complicated, I'll tell you why
> Alt
> * wdyt about adding attribute to the current tag {pre}<add block="true">{pre} 
> * or we can establish RunBlockUpdateProcessor which treat every <add> ....</add> as a block.
> *Test is included!!*
> How you'd suggest to improve the patch?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

