lucene-solr-user mailing list archives

From Francisco Fernandez <fra...@gmail.com>
Subject Re: Pubmed XML indexing
Date Fri, 27 Sep 2013 13:44:59 GMT
Many thanks to both Mike and Alexandre.
I'll take a look at those tools.
Lux seems like a good option.
Thanks again,

Francisco

On 27/09/2013, at 09:33, Michael Sokolov wrote:

> You might be interested in Lux (http://luxdb.org), which is designed for indexing and
> querying XML using Solr and Lucene.  It can run index-supported XPath/XQuery over your
> documents, and you can define arbitrary XPath indexes.
> 
> -Mike
> 
> On 9/27/13 6:28 AM, Francisco Fernandez wrote:
>> Hi, I'm a newbie trying to index PubMed texts obtained as XML with a structure similar to:
>> 
>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=23864173,22073418
>> 
>> The nodes I need to extract, expressed as XPaths, would be:
>> 
>> //PubmedArticle/MedlineCitation/PMID
>> //PubmedArticle/MedlineCitation/DateCreated/Year
>> //PubmedArticle/MedlineCitation/Article/ArticleTitle
>> //PubmedArticle/MedlineCitation/Article/Abstract/AbstractText
>> //PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading
>> 
>> I think one way to index them in Solr is to create another XML structure similar to:
>> <add>
>> <doc>
>>  <field name="id">PMID</field>
>>  <field name="year_i">Year</field>
>>  <field name="name">ArticleTitle</field>
>>  <field name="abstract_s">AbstractText</field>
>>  <field name="cat">MeshHeading1</field>
>>  <field name="cat">MeshHeading2</field>
>> </doc>
>> </add>
>> 
>> where "PMID" = '23864173', "ArticleTitle" = 'Cost-effectiveness of low-molecular-weight
>> heparin compared with aspirin for prophylaxis against venous thromboembolism after total
>> joint arthroplasty', and so on.
>> With that structure, I would post it to Solr by running the following command in the
>> documents folder:
>> java -jar post.jar *.xml
>> 
>> I'm wondering whether there is a more direct way to perform the same task that does not
>> involve an 'iterate -> parse -> restructure -> write to disk -> post' cycle.
>> Many thanks
>> 
>> Francisco
> 
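
The XPath-to-field mapping in the original question can also be done in memory, which removes the write-to-disk step from the cycle. What follows is only a minimal sketch, assuming Python with the standard library. The field names come from the question; the Solr update URL, the helper names, and the choice to take the DescriptorName child of each MeshHeading as the "cat" value are assumptions made for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: a local Solr update handler. Adjust host, port, and core to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def text_of(node):
    """Return all text under node with whitespace collapsed, or '' if node is None."""
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def add_field(doc, name, value):
    """Append <field name="...">value</field> to doc, skipping empty values."""
    if value:
        ET.SubElement(doc, "field", name=name).text = value


def pubmed_to_solr_add(pubmed_xml):
    """Convert efetch-style PubMed XML (str or bytes) into Solr <add> XML bytes."""
    root = ET.fromstring(pubmed_xml)
    add = ET.Element("add")
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        doc = ET.SubElement(add, "doc")
        add_field(doc, "id", text_of(citation.find("PMID")))
        add_field(doc, "year_i", text_of(citation.find("DateCreated/Year")))
        add_field(doc, "name", text_of(citation.find("Article/ArticleTitle")))
        # An abstract may be split into several AbstractText sections; join them.
        sections = citation.findall("Article/Abstract/AbstractText")
        add_field(doc, "abstract_s", " ".join(text_of(s) for s in sections))
        # Assumption: use the DescriptorName child as the value of each "cat" field.
        for heading in citation.findall("MeshHeadingList/MeshHeading"):
            add_field(doc, "cat", text_of(heading.find("DescriptorName")))
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(add_xml, url=SOLR_UPDATE_URL):
    """POST the <add> payload to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=add_xml, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A possible use would be to read the efetch response with urllib.request.urlopen(...).read(), pass the bytes to pubmed_to_solr_add, and hand the result to post_to_solr, so nothing is written to disk in between.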

