lucene-solr-user mailing list archives

From "Paul, Noble" <>(by way of Fergus McMenemie)
Subject Re: Extract info from parent node during data import (redirect:)
Date Wed, 16 Sep 2009 12:16:25 GMT

Implementing wildcards (//tagname) is definitely possible, and I would love
to see it working. If you wish to take a stab at it, I shall do
whatever I can to help.

>What is the use case that makes flow through so useful?
We do not know which forEach xpath a given field is associated with.
Currently you can clean up the fields using a transformer. There is an
implicit field '$forEach' which tells you about the xpath tag for each
record that is emitted.
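
To make the transformer clean-up concrete, here is a minimal sketch. It is
written in Python purely for illustration; it is not an actual DIH
transformer. It assumes that '$forEach' holds the xpath string that
triggered the record. The column names and the FIELDS_BY_FOREACH table are
invented for the example.

```python
# Hypothetical table: which columns belong to each forEach xpath.
# The column names are invented for illustration only.
FIELDS_BY_FOREACH = {
    "/record": {"title", "text"},
    "/record/mediaBlock": {"caption", "imageRef"},
}


def clean_row(row):
    """Keep only the columns that belong to the record's own forEach path."""
    path = row.get("$forEach")  # assumed to hold the matching forEach xpath
    allowed = FIELDS_BY_FOREACH.get(path)
    if allowed is None:
        return row  # unknown path: leave the row untouched
    # Implicit '$...' fields are kept so later steps can still see them.
    return {k: v for k, v in row.items() if k in allowed or k.startswith("$")}


row = {"$forEach": "/record/mediaBlock", "title": "Doc", "caption": "Pic 1"}
print(clean_row(row))
```

With a table like this, a mediaBlock record loses the columns it inherited
from /record, while a /record record is left with its own columns.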

>The recently added comments in XPathRecordReader are a great help and I
>was planning to add more. Might this be an issue?
I would love to have it. Give a patch and I shall commit it.
XPathRecordReader is a black box and AFAIK I am the only one who knows
it. I would love to have more eyes on it.

>I would like to open a JIRA for improving XPathRecordReader.
Please go ahead. You can paste the contents of this mail in the list.
There may be others with similar ideas.

-----Original Message-----
>>>>/document/category/item | /document/category
>>>>means there are two paths which trigger a new doc (it is possible to
>>>>have more). Whenever it encounters the closing tag of that xpath, it
>>>>emits all the fields it collected since the opening of the same tag.
>>>>After that it clears all the fields it collected since the opening of
>>>>the tag.
>>>>If there are fields it collected before the opening of the same tag, it
>>>>retains them.
>>> Nice and clear, but that is not what I see.
>>> With my test case with forEach="/record | /record/mediaBlock"
>>> I see that for each /record/mediaBlock "document" indexed it contains
>>> all fields from the parent "/record" document as well. A search over
>>> mediaBlocks returns lots of extra fields from the parent which did
>>> not have the commonField attribute. I will try and produce a test case.
>>Yes, it does. /record/mediaBlock will have all the fields collected
>>from /record as well.  *****It is by design******
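
The emit/clear/retain behaviour discussed above can be sketched in a few
lines. This is a toy model in Python, not the XPathRecordReader code: the
function name, the sample XML and the column names are all made up, and the
'$forEach' key simply records which path triggered each record. It follows
the "by design" reading, where a child record also carries the fields
already collected from its parent.

```python
import io
import xml.etree.ElementTree as ET


def read_records(xml_text, for_each, fields):
    """Emit one record per closing tag of a forEach path (toy model)."""
    stack = []      # element names from the root down to the current element
    collected = []  # (column, value) pairs gathered so far, in document order
    marks = []      # len(collected) at the opening of each active forEach tag
    records = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if "/" + "/".join(stack) in for_each:
                marks.append(len(collected))
            continue
        path = "/" + "/".join(stack)
        if path in fields:
            collected.append((fields[path], (elem.text or "").strip()))
        if path in for_each:
            # Emit everything collected so far, parent fields included.
            record = {"$forEach": path}
            for column, value in collected:
                record.setdefault(column, []).append(value)
            records.append(record)
            # Clear what was collected since this tag opened; keep the rest.
            del collected[marks.pop():]
        stack.pop()
    return records


SAMPLE = (
    "<record><title>Doc</title>"
    "<mediaBlock><caption>Pic 1</caption></mediaBlock>"
    "<mediaBlock><caption>Pic 2</caption></mediaBlock>"
    "</record>"
)
docs = read_records(
    SAMPLE,
    for_each={"/record", "/record/mediaBlock"},
    fields={"/record/title": "title", "/record/mediaBlock/caption": "caption"},
)
for doc in docs:
    print(doc)
```

In this model both mediaBlock records carry the parent's title, which is
the flow-through being questioned, while the final /record record no longer
contains the captions because they were cleared when each mediaBlock closed.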
>I had always considered it a bug or at least a limitation. After all if
>we have the "commonField" attribute why do we need an automatic flow
>through of all collected fields from parent nodes. This feature is as
>far as I can see undocumented and at the same time unintuitive.
>It also, in my case, causes tons more information to be indexed than is
>I have spent a while thinking through possible use cases. My use case
>involves having documents we want to search as a whole and behave as
>normal. At the same time these documents contain inner sections we wish
>to treat as sub-documents; in my case I have pictures with associated
>captions which I wish to search separately. Having indexed the documents
>with forEach="/record | /record/mediaBlock" my picture search works
>nicely but I have a nasty side effect when performing searches over the
>rest of the document. Because fields from the parent node are also
>present in the children, when I search for any text the same document
>gets returned many times, once due to the text in the parent node and
>again for each picture placed in the document. I have a workaround for
>this issue but have always considered it a bug.
>What is the use case that makes flow through so useful?
>I had just started playing with the code to see how easy this would be
>to change. The recently added comments in XPathRecordReader are a great
>help and I was planning to add more. Might this be an issue?
>I have noted, while lurking on the solr mail lists, that requests for
>this type of functionality keep coming up; to be able to restrict
>searches to a sub section of a document. I have really needed this sort
>of thing many times with the type of stuff I work with.
>My other planned activity was to see how easy xpaths such as //tagname
>would be to implement. Since my latest data-config.xml looks like:-
><field column="para32"   name="text" xpath="/record/address/para"
>flatten="true" />
><field column="para40"   name="text" xpath="/record/authoredBy/para"
>flatten="true" />
><field column="para43"   name="text"
>xpath="/record/dataGroup/address/para"  flatten="true" />
><field column="para47"   name="text"
>flatten="true" />
><field column="para49"   name="text"
>flatten="true" />
><field column="para50"   name="text"
>xpath="/record/dataGroup/keyPersonnel/para"  flatten="true" />
><field column="para51"   name="text" xpath="/record/dataGroup/para"
>flatten="true" />
><field column="para57"   name="text"
>xpath="/record/doubleList/first/para"  flatten="true" />
><field column="para59"   name="text"
>xpath="/record/doubleList/second/para"  flatten="true" />
><field column="para63"   name="text"
>xpath="/record/keyPersonnel/doubleList/first/para"  flatten="true" />
><field column="para65"   name="text"
>xpath="/record/keyPersonnel/doubleList/second/para"  flatten="true" />
><field column="para68"   name="text" xpath="/record/list/listItem/para"
>flatten="true" />
><field column="para75"   name="text"
>xpath="/record/mediaBlock/doubleList/first/para"  flatten="true" />
><field column="para77"   name="text"
>xpath="/record/mediaBlock/doubleList/second/para"  flatten="true" />
><field column="para172"  name="text" xpath="/record/noteGroup/note/para"
>flatten="true" />
><field column="para174"  name="text"
>xpath="/record/para"  flatten="true" />
><field column="para179"
>flatten="true" />
><field column="para184"  name="text"
>xpath="/record/sect1/address/dataGroup/para"  flatten="true" />
><field column="para185"  name="text" xpath="/record/sect1/address/para"
>flatten="true" />
><field column="para195"  name="text"
>xpath="/record/sect1/dataGroup/address/para"  flatten="true" />
><field column="para199"  name="text"
>flatten="true" />
><field column="para201"  name="text"
>flatten="true" />
><field column="para202"  name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/para"  flatten="true" />
><field column="para203"  name="text"
>xpath="/record/sect1/dataGroup/para"  flatten="true" />
><field column="para208"  name="text"
>xpath="/record/sect1/doubleList/first/para"  flatten="true" />
><field column="para212"  name="text"
>flatten="true" />
><field column="para213"  name="text"
>xpath="/record/sect1/doubleList/second/para"  flatten="true" />
><field column="para217"  name="text"
>xpath="/record/sect1/keyPersonnel/doubleList/first/para"  flatten="true" />
><field column="para219"  name="text"
>flatten="true" />
><field column="para220"  name="text"
>xpath="/record/sect1/keyPersonnel/para"  flatten="true" />
><field column="para225"  name="text"
>xpath="/record/sect1/list/listItem/list/listItem/para"  flatten="true" />
><field column="para226"  name="text"
>xpath="/record/sect1/list/listItem/para"  flatten="true" />
><field column="para240"  name="text" xpath="/record/sect1/para"
>flatten="true" />
><field column="para244"  name="text"
>xpath="/record/sect1/sect2/doubleList/first/para"  flatten="true" />
><field column="para246"  name="text"
>xpath="/record/sect1/sect2/doubleList/second/para"  flatten="true" />
><field column="para251"  name="text"
>flatten="true" />
><field column="para252"  name="text"
>xpath="/record/sect1/sect2/list/listItem/para"  flatten="true" />
><field column="para258"  name="text"
>xpath="/record/sect1/sect2/noteGroup/note/para"  flatten="true" />
><field column="para259"  name="text" xpath="/record/sect1/sect2/para"
>flatten="true" />
><field column="para265"  name="text"
>flatten="true" />
><field column="para266"  name="text"
>xpath="/record/sect1/sect2/sect3/list/listItem/para"  flatten="true" />
><field column="para271"  name="text"
>xpath="/record/sect1/sect2/sect3/para"  flatten="true" />
><field column="para275"  name="text"
>flatten="true" />
><field column="para279"  name="text"
>xpath="/record/sect1/sect2/sect3/sect4/para"  flatten="true" />
><field column="para284"  name="text"
>xpath="/record/sect1/sect2/sect3/sect4/sect5/para"  flatten="true" />
><field column="para295"  name="text"
>note/para"  flatten="true" />
><field column="para297"  name="text"
>flatten="true" />
><field column="para301"  name="text"
>flatten="true" />
><field column="para312"  name="text"
>ra"  flatten="true" />
><field column="para315"  name="text"
>ara"  flatten="true" />
><field column="para316"  name="text"
>flatten="true" />
><field column="para318"  name="text"
>flatten="true" />
><field column="para322"  name="text"
>flatten="true" />
><field column="para341"  name="text"
>flatten="true" />
><field column="para342"  name="text"
>flatten="true" />
><field column="para344"  name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/para"  flatten="true" />
><field column="para348"  name="text"
>xpath="/record/sect1/table/tgroup/thead/row/entry/para"  flatten="true" />
><field column="para371"  name="text"
>flatten="true" />
><field column="para373"  name="text"
>xpath="/record/table/tgroup/tbody/row/entry/para"  flatten="true" />
><field column="para377"  name="text"
>xpath="/record/table/tgroup/thead/row/entry/para"  flatten="true" />
>Which is nuts!
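
For what it is worth, the matching side of a //tagname wildcard could look
roughly like the sketch below. This is only an illustration in Python of
the idea, under the assumption that the reader tracks the current element
path as a stack of tag names. The function name path_matches is made up,
and nothing here reflects how XPathRecordReader is actually written.

```python
def path_matches(pattern, stack):
    """Return True if the current element path satisfies the pattern.

    pattern: an absolute path such as "/record/para", or a wildcard
             such as "//para" that matches at any depth.
    stack:   tag names from the root down to the current element.
    """
    if pattern.startswith("//"):
        tail = pattern[2:].split("/")
        # Wildcard: the innermost tags must equal the pattern's segments.
        return stack[-len(tail):] == tail
    # Absolute path: the whole stack must match exactly.
    return pattern == "/" + "/".join(stack)


print(path_matches("//para", ["record", "sect1", "sect2", "para"]))
print(path_matches("/record/para", ["record", "sect1", "para"]))
```

With matching of this kind, a single pattern such as "//para" would cover
every absolute .../para path listed above.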
>I would like to open a JIRA for improving XPathRecordReader.
>Regds Fergus.
