lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12298) Index Full nested document Hierarchy For Queries (umbrella issue)
Date Wed, 09 May 2018 19:44:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469375#comment-16469375 ]

David Smiley commented on SOLR-12298:
-------------------------------------

Quoting [~hossman] here inline (hoping for his input):
{quote}Are you suggesting we model child documents as objects (SolrInputDocuments i guess?)
in a special field?
{quote}
Yes, though not in a special field.  _Anonymous_ children (those with no label, i.e. no
named relationship) could use the _childDocuments_ key, since that is consistent with the
existing use of this label.
  
{quote}... what if i put child documents in multiple fields? would that signify the different
types of child?
{quote}
Yes indeed.  This is largely the point of this approach, since the current anonymous relationship
loses the semantics of the relationship.

 
{quote}how would solr model that in the (lucene) Documents when giving them to the IndexWriter?
{quote}
In this issue, Moshe has proposed a labeled path field, e.g. "post.comment".  This path
would be added in a URP, or perhaps it would be done by {{AddUpdateCommand.flatten/recUnwrap}}
right when the URP chain is done.
{quote}How would solr know how to order the children from multiple fields/lists when creating
the block?
{quote}
Ah, I think that's a non-issue, as they are indexed in the order given (notwithstanding the
hierarchy flattening, with the parent last).  If you meant how the order might be reconstituted
later at retrieval time, then we can rely on the docID order, since the documents of a block
are kept in order and never broken up.
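
To make the two answers above concrete, here is a rough sketch of how such a flattening might behave.  It is plain Java with a stand-in {{Doc}} class, not the SolrJ API and not the actual {{AddUpdateCommand.flatten/recUnwrap}} code; the field name {{nest_path}} and the rule of joining the parent path and the field name with a dot are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; this is NOT SolrInputDocument or AddUpdateCommand.
final class NestedFlattenSketch {
    // Hypothetical name for the labeled path field.
    static final String PATH_FIELD = "nest_path";

    // Minimal multi-valued document; a field value may itself be a Doc (a child).
    static final class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        Doc add(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            return this;
        }
    }

    // Flattens a document into one block: children in the order given,
    // each subtree before its own parent, so the root comes last.
    static List<Doc> flatten(Doc doc, String path) {
        List<Doc> block = new ArrayList<>();
        Doc flat = new Doc();
        for (Map.Entry<String, List<Object>> field : doc.fields.entrySet()) {
            for (Object value : field.getValue()) {
                if (value instanceof Doc) {
                    // The field name becomes the relationship label in the child's path.
                    block.addAll(flatten((Doc) value, path + "." + field.getKey()));
                } else {
                    flat.add(field.getKey(), value);
                }
            }
        }
        flat.add(PATH_FIELD, path);
        block.add(flat); // parent last
        return block;
    }

    public static void main(String[] args) {
        Doc post = new Doc().add("id", "p1")
                .add("comment", new Doc().add("id", "c1"))
                .add("comment", new Doc().add("id", "c2"));
        for (Doc d : flatten(post, "post")) {
            System.out.println(d.fields);
        }
    }
}
```

Because the block is emitted in a fixed order, the position of each flattened document within the block plays the role that docID order would play at retrieval time.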
{quote}Wouldn't the "type of child" information be better living in the child documents itself?
(particularly since that "type" information needs to be in the child documents anyway so that
the filter query for a BJQ can be specified.)
{quote}
_Ultimately_ it does live there, in the generated Lucene Document.  
{quote}It also seems like it would require code that wants to know what children exist in
a document to do a lot of work to find that out (need to iterate every field in the SolrInputDocument
and do reflection to see if they are child-documents or not)
{quote}
I looked at this; the affected code is AddSchemaFieldsUpdateProcessorFactory and
AddUpdateCommand.flatten/recUnwrap.  I'm not concerned about the former, as it's for
schema-guessing; only the latter.  Perhaps this is no big deal, since the cost is only
proportional to the number of distinct field names in the average document?  Also, if the
schema contained special "ChildDoc" fields or some such, then the schema could guide these
code paths to know which field names to look up in the incoming document.
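
As a rough illustration of the two lookup strategies just mentioned, the sketch below contrasts scanning every field for child-document values with consulting only the field names a schema declares.  It uses a stand-in {{Doc}} class rather than SolrInputDocument, and the list of declared child fields is a hypothetical substitute for whatever a "ChildDoc" field type in the schema would provide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; this is NOT the SolrJ or schema API.
final class ChildLookupSketch {
    static final class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        Doc add(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            return this;
        }
    }

    // Schema-agnostic: visit every field and type-check its values.
    // Cost grows with the number of distinct field names in the document.
    static List<String> childFieldsByScan(Doc doc) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, List<Object>> field : doc.fields.entrySet()) {
            for (Object value : field.getValue()) {
                if (value instanceof Doc) {
                    names.add(field.getKey());
                    break; // one child value is enough to classify the field
                }
            }
        }
        return names;
    }

    // Schema-guided: only look up names the (hypothetical) schema declares
    // as child-document fields; all other fields are never inspected.
    static List<String> childFieldsBySchema(Doc doc, List<String> declaredChildFields) {
        List<String> names = new ArrayList<>();
        for (String name : declaredChildFields) {
            if (doc.fields.containsKey(name)) {
                names.add(name);
            }
        }
        return names;
    }
}
```

The scan needs no schema support but touches every field; the schema-guided variant touches only declared names, at the price of requiring those declarations to exist.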
{quote}Another concern off the top of my head is that a lot of existing code (including any
custom update processors people might have) would assume those child documents are multivalued
field values and would probably break – hence a new method on SolrInputDocument seems wiser
(code that doesn't know about it may not do what you want, but at least it won't break it)
{quote}
Fixable on a case by case basis.  If this is worse than I imagine it is, then what URP would
be the worst offender?

In summary, the current approach doesn't retain the semantic information of relationships,
and I believe removing SolrInputDocument._childDocuments will result in something _simpler_.
It also allows a cleaner separation between the format-specific input (JSON vs XML vs ...)
and logic that should be ignorant of it.

The next-best alternative I can think of that doesn't disturb SolrInputDocument._childDocuments
would be if hypothetically SolrInputDocument had an overloaded addChildDocument that accepts a
relationship string.  The impl would add the child document and also mutate it to
have the fields Moshe has spoken of.  But this seems trappy to me, since some methods would
do this and the existing ones wouldn't, and so the format loader would need to be careful
to always use one or the other.
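
A small sketch of why that alternative is trappy, again with a stand-in class rather than the real SolrInputDocument: the two-argument addChildDocument and the {{nest_path}} field name are hypothetical, since no such overload exists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; the labeled overload below is hypothetical.
final class LabeledChildSketch {
    // Hypothetical name for the field carrying the relationship label.
    static final String PATH_FIELD = "nest_path";

    static final class Doc {
        final Map<String, Object> fields = new LinkedHashMap<>();
        final List<Doc> childDocuments = new ArrayList<>();

        // Existing-style method: the child stays anonymous.
        void addChildDocument(Doc child) {
            childDocuments.add(child);
        }

        // Hypothetical overload: mutates the child to carry the label.
        void addChildDocument(String relationship, Doc child) {
            child.fields.put(PATH_FIELD, relationship);
            childDocuments.add(child);
        }
    }

    public static void main(String[] args) {
        Doc post = new Doc();
        Doc labeled = new Doc();
        Doc anonymous = new Doc();
        post.addChildDocument("comment", labeled);
        post.addChildDocument(anonymous); // easy to call by mistake: no label recorded
        System.out.println(labeled.fields.containsKey(PATH_FIELD));
        System.out.println(anonymous.fields.containsKey(PATH_FIELD));
    }
}
```

Both calls compile and both children end up in the same list, so nothing flags a loader that mixes the two methods; only the labeled child keeps its relationship.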

> Index Full nested document Hierarchy For Queries (umbrella issue)
> -----------------------------------------------------------------
>
>                 Key: SOLR-12298
>                 URL: https://issues.apache.org/jira/browse/SOLR-12298
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: mosh
>            Priority: Major
>
> Solr ought to have the ability to index deeply nested objects, while storing the original
> document hierarchy.
>  Currently the client has to index the child document's full path and level to manually
> reconstruct the original document structure, since the children are flattened and returned
> in the reserved "__childDocuments__" key.
> Ideally you could index a nested document, having Solr transparently add the required
> fields while providing a document transformer to rebuild the original document's hierarchy.
>  
> This issue is an umbrella issue for the particular tasks that will make it all happen
> – either subtasks or issue linking.



