lucene-solr-dev mailing list archives

From "Fergus McMenemie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
Date Thu, 01 Oct 2009 07:58:23 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761124#action_12761124 ]

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

I am quite pleased with it as far as it goes and think it would be good for 1.4. I tested
it against my test set of 3000 XML documents, replacing:

{code}
        <field column="para1" name="text"                xpath="/record/sect1/para" flatten="true"/>
        <field column="para2" name="text"                xpath="/record/list/listitem/para"
flatten="true"/>
        <field column="para32"     name="text"                        xpath="/record/address/para"
 flatten="true" />
        <field column="para40"     name="text"                        xpath="/record/authoredBy/para"
 flatten="true" />
        <field column="para43"     name="text"                        xpath="/record/dataGroup/address/para"
 flatten="true" />
        <field column="para47"     name="text"                        xpath="/record/dataGroup/keyPersonnel/doubleList/first/para"
 flatten="true" />
        <field column="para49"     name="text"                        xpath="/record/dataGroup/keyPersonnel/doubleList/second/para"
 flatten="true" />
        <field column="para50"     name="text"                        xpath="/record/dataGroup/keyPersonnel/para"
 flatten="true" />
        <field column="para51"     name="text"                        xpath="/record/dataGroup/para"
 flatten="true" />
        <field column="para57"     name="text"                        xpath="/record/doubleList/first/para"
 flatten="true" />
        <field column="para59"     name="text"                        xpath="/record/doubleList/second/para"
 flatten="true" />
        <field column="para63"     name="text"                        xpath="/record/keyPersonnel/doubleList/first/para"
 flatten="true" />
        <field column="para65"     name="text"                        xpath="/record/keyPersonnel/doubleList/second/para"
 flatten="true" />
        <field column="para68"     name="text"                        xpath="/record/list/listItem/para"
 flatten="true" />
        <field column="para75"     name="text"                        xpath="/record/mediaBlock/doubleList/first/para"
 flatten="true" />
        <field column="para77"     name="text"                        xpath="/record/mediaBlock/doubleList/second/para"
 flatten="true" />
        <field column="para172"     name="text"                        xpath="/record/noteGroup/note/para"
 flatten="true" />
        <field column="para174"     name="text"                        xpath="/record/para"
 flatten="true" />
        <field column="para179"     name="text"                        xpath="/record/relatedInfo/list/listItem/relatedArticle/para"
 flatten="true" />
        <field column="para184"     name="text"                        xpath="/record/sect1/address/dataGroup/para"
 flatten="true" />
        <field column="para185"     name="text"                        xpath="/record/sect1/address/para"
 flatten="true" />
        <field column="para195"     name="text"                        xpath="/record/sect1/dataGroup/address/para"
 flatten="true" />
        <field column="para199"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/first/para"
 flatten="true" />
        <field column="para201"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/second/para"
 flatten="true" />
        <field column="para202"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/para"
 flatten="true" />
        <field column="para203"     name="text"                        xpath="/record/sect1/dataGroup/para"
 flatten="true" />
        <field column="para208"     name="text"                        xpath="/record/sect1/doubleList/first/para"
 flatten="true" />
        <field column="para212"     name="text"                        xpath="/record/sect1/doubleList/second/list/listItem/para"
 flatten="true" />
        <field column="para213"     name="text"                        xpath="/record/sect1/doubleList/second/para"
 flatten="true" />
        <field column="para217"     name="text"                        xpath="/record/sect1/keyPersonnel/doubleList/first/para"
 flatten="true" />
        <field column="para219"     name="text"                        xpath="/record/sect1/keyPersonnel/doubleList/second/para"
 flatten="true" />
        <field column="para220"     name="text"                        xpath="/record/sect1/keyPersonnel/para"
 flatten="true" />
        <field column="para225"     name="text"                        xpath="/record/sect1/list/listItem/list/listItem/para"
 flatten="true" />
        <field column="para226"     name="text"                        xpath="/record/sect1/list/listItem/para"
 flatten="true" />
        <field column="para240"     name="text"                        xpath="/record/sect1/para"
 flatten="true" />
        <field column="para244"     name="text"                        xpath="/record/sect1/sect2/doubleList/first/para"
 flatten="true" />
        <field column="para246"     name="text"                        xpath="/record/sect1/sect2/doubleList/second/para"
 flatten="true" />
        <field column="para251"     name="text"                        xpath="/record/sect1/sect2/list/listItem/list/listItem/para"
 flatten="true" />
        <field column="para252"     name="text"                        xpath="/record/sect1/sect2/list/listItem/para"
 flatten="true" />
        <field column="para258"     name="text"                        xpath="/record/sect1/sect2/noteGroup/note/para"
 flatten="true" />
        <field column="para259"     name="text"                        xpath="/record/sect1/sect2/para"
 flatten="true" />
        <field column="para265"     name="text"                        xpath="/record/sect1/sect2/sect3/list/listItem/list/listItem/para"
 flatten="true" />
        <field column="para266"     name="text"                        xpath="/record/sect1/sect2/sect3/list/listItem/para"
 flatten="true" />
        <field column="para271"     name="text"                        xpath="/record/sect1/sect2/sect3/para"
 flatten="true" />
        <field column="para275"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/list/listItem/para"
 flatten="true" />
        <field column="para279"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/para"
 flatten="true" />
        <field column="para284"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/sect5/para"
 flatten="true" />
        <field column="para295"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para"
 flatten="true" />
        <field column="para297"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para"
 flatten="true" />
        <field column="para301"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para"
 flatten="true" />
        <field column="para312"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para"
 flatten="true" />
        <field column="para315"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para"
 flatten="true" />
        <field column="para316"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para"
 flatten="true" />
        <field column="para318"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/para"
 flatten="true" />
        <field column="para322"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/thead/row/entry/para"
 flatten="true" />
        <field column="para341"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para"
 flatten="true" />
        <field column="para342"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para"
 flatten="true" />
        <field column="para344"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/para"
 flatten="true" />
        <field column="para348"     name="text"                        xpath="/record/sect1/table/tgroup/thead/row/entry/para"
 flatten="true" />
        <field column="para371"     name="text"                        xpath="/record/table/tgroup/tbody/row/entry/noteGroup/note/para"
 flatten="true" />
        <field column="para373"     name="text"                        xpath="/record/table/tgroup/tbody/row/entry/para"
 flatten="true" />
        <field column="para377"     name="text"                        xpath="/record/table/tgroup/thead/row/entry/para"
 flatten="true" />
{code}

with 

{code}
       <field column="text"                             xpath="//para" flatten="true"/>
{code}

The indexes seemed equivalent and time to index was also equivalent.
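
As a rough illustration of why the two configurations above can produce the same field values, here is a minimal sketch using the JDK's standard javax.xml.xpath API. It is not XPathRecordReader, which is a streaming reader with its own limited XPath support. The sample record and the class name DescendantXPathDemo are hypothetical, invented only for this example; getTextContent() stands in loosely for flatten="true".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DescendantXPathDemo {
    // Hypothetical sample record, loosely modelled on the paths in the config above.
    static final String SAMPLE =
        "<record>"
      + "<para>top</para>"
      + "<sect1><para>in sect1</para>"
      + "<list><listItem><para>in list</para></listItem></list></sect1>"
      + "</record>";

    /** Evaluates an XPath expression against an XML string and returns the text of each matched node. */
    static List<String> collectText(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() concatenates all descendant text of the matched element.
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // One explicit path per location where <para> occurs, joined as an XPath union.
        String explicit =
            "/record/para | /record/sect1/para | /record/sect1/list/listItem/para";
        System.out.println(collectText(SAMPLE, explicit));
        // A single descendant expression that matches <para> at any depth.
        System.out.println(collectText(SAMPLE, "//para"));
    }
}
```

In this sketch both expressions select the same three para elements. The explicit form has to be extended whenever a new nesting appears in the documents, while //para does not.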

I have one concern which should be addressed before any 1.4 release. I still do not understand
the purpose of the HashSet childrenFound and putNulls. If it is important, then I suspect that
whatever is done to childNodes when an end_element is parsed also needs to be done to descNodes;
but I have a feeling the whole lot may be unnecessary and can be removed. If it is required,
we need to explain it.

The last change I would like to see, which I am happy to leave to 1.5, involves making sure
emitted records do not contain tags from parent nodes unless they are stipulated by "commonField".

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html
> it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

