lucene-solr-user mailing list archives

From Dan Davis <...@danizen.net>
Subject Re: DIH XPathEntityProcessor question
Date Mon, 08 Dec 2014 22:54:55 GMT
Yes, that worked quite well.   I still need the "//tagname" but that is the
only DIH incantation I need.   This will substantially accelerate things.
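
For readers of the archive, a minimal sketch of the kind of entity this
describes, not the actual configuration used. The file paths, the entity
name, and the /docs/doc output shape are assumptions for illustration; it
also assumes the stylesheet turns the url attribute into a child element,
so that the "//tagname" form can reach it.

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- xsl= is applied to the whole file first; forEach and the field
         xpaths then see the transformed output, not the original XML.
         Paths and names below are hypothetical. -->
    <entity name="topic"
            processor="XPathEntityProcessor"
            url="/path/to/health-topics.xml"
            xsl="/path/to/flatten-topics.xsl"
            forEach="/docs/doc">
      <!-- "//tagname" shorthand for child elements of each record -->
      <field column="url"   xpath="//url"/>
      <field column="title" xpath="//title"/>
    </entity>
  </document>
</dataConfig>
```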

On Mon, Dec 8, 2014 at 5:37 PM, Dan Davis <dan@danizen.net> wrote:

> The problem is that XPathEntityProcessor implements XPath on its own, and
> implements only a subset of XPath.  So, if the input document is small
> enough, it makes no sense to fight it.   One possibility is to apply an XSLT
> to the file before processing it.
>
> This blog post
> <http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx>
> shows a worked example.   The XSL transform takes place before the forEach
> or field specifications, which is the principal question I had about it
> from the documentation.  This is also illustrated in the initQuery()
> private method of XPathEntityProcessor.    You can see the transformation
> being applied before the forEach.  This will not scale to extremely large
> XML documents including millions of rows - that is why they have the
> stream="true" argument there, so that you don't preprocess the document.
> In my case, the entire XML file is 29M, and so I think I could do the XSL
> transformation and then do for each document.
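>
> A rough sketch of such a stylesheet, for illustration only.  It assumes
> the input layout from the forEach quoted further down, a <title> child
> element, and a made-up /docs/doc output shape; none of these names come
> from a real schema.
>
> ```xml
> <xsl:stylesheet version="1.0"
>                 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>   <xsl:output method="xml" indent="yes"/>
>   <!-- Emit one flat <doc> per English health-topic -->
>   <xsl:template match="/">
>     <docs>
>       <xsl:for-each select="/medical-topics/medical-topic/health-topic[@language='English']">
>         <doc>
>           <!-- Promote the attribute to an element -->
>           <url><xsl:value-of select="@url"/></url>
>           <title><xsl:value-of select="title"/></title>
>         </doc>
>       </xsl:for-each>
>     </docs>
>   </xsl:template>
> </xsl:stylesheet>
> ```
>
> With output like this, forEach="/docs/doc" would select each record.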
>
> This potentially shortens my time frame of moving to Apache Solr
> substantially, because the common case with our previous indexer is to run
> XSLT to transform to the document format desired by the indexer.
>
> On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch <arafalov@gmail.com>
> wrote:
>
>> I don't believe there are any alternatives. At least I could not get
>> anything but the full path to work.
>>
>> Regards,
>>    Alex.
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>> On 8 December 2014 at 17:01, Dan Davis <dansmood@gmail.com> wrote:
>> > In experimentation with a much simpler and smaller XML file, it doesn't
>> > look like "//health-topic/@url" will work, nor will "//@url" etc.  So
>> > far, only spelling it all out will work.
>> > With child elements, such as <title>, an xpath of "//title" works fine,
>> > but it is beginning to seem dangerous.
>> >
>> > Is there any short-hand for the current node or the match?
>> >
>> > On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis <dansmood@gmail.com> wrote:
>> >
>> >> When I have a forEach attribute like the following:
>> >>
>> >>      forEach="/medical-topics/medical-topic/health-topic[@language='English']"
>> >>
>> >> And then need to match an attribute of that, is there any alternative to
>> >> spelling it all out:
>> >>
>> >>      <field column="url"
>> >>             xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
>> >>
>> >> I suppose I could do "//health-topic/@url" since the document should then
>> >> have a single health-topic (as long as I know they don't nest).
>> >>
>> >>
>>
>
>
