stanbol-dev mailing list archives

From Juan Vargas <juan.var...@appstylus.com>
Subject Creating a spanish index for Stanbol (doubts)
Date Tue, 13 Nov 2012 12:20:33 GMT
Hello.

I'm Juan Vargas, a web developer at Notedlinks S.L. in Spain. (Issue:
https://issues.apache.org/jira/browse/STANBOL-804)

I've been trying for a few days to create a Spanish index from the DBpedia
3.8 files, for use with the Stanbol Enhancer, following the instructions at
https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/dbpedia/README.md
That is:

*1. Build the indexing tool*
   - cd {stanbol-source}/entityhub/indexing/genericrdf/  ({stanbol-source} is
where Stanbol is installed; this requires Stanbol, see
http://stanbol.apache.org/docs/trunk/tutorial.html)
   - mvn assembly:single
   - move
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
to the target directory where I plan to build the index

*2. Create the sub-folders in the target directory*
   - java -jar
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
init

*3. Download the DBpedia dump files and copy them to* 'indexing/resources/rdfdata':

   - http://downloads.dbpedia.org/3.8/dbpedia_3.6.owl.bz2    (general for
   any language)
   - http://downloads.dbpedia.org/3.8/en/instance_types_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/labels_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/short_abstracts_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/long_abstracts_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/geo_coordinates_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/persondata_es.nt.bz2  (this does not
   seem to exist for Spanish; is it a problem if it is not used?)
   - http://downloads.dbpedia.org/3.8/es/article_categories_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/category_labels_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/es/skos_categories_es.nt.bz2
   - http://downloads.dbpedia.org/3.8/en/redirects_es.nt.bz2
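
Before running the indexer, it may help to confirm that every dump actually landed in 'indexing/resources/rdfdata'. The following is only a minimal sketch: missing_dumps is a hypothetical helper of my own, not part of Stanbol, and the file names passed to it are simply the ones from the list above.

```shell
# Hypothetical helper (not part of Stanbol): print each expected dump
# file that is NOT present in the given directory.
# Usage: missing_dumps DIR FILE...
missing_dumps() {
    dir=$1
    shift
    for name in "$@"; do
        # Print the name only when no regular file with that name exists.
        [ -f "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

For example, `missing_dumps indexing/resources/rdfdata labels_es.nt.bz2 redirects_es.nt.bz2` prints nothing when both files are in place, and prints the name of each file that is absent otherwise.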


*4. Generate the entity scores and copy them to* 'indexing/resources':
  - curl http://downloads.dbpedia.org/3.8/es/page_links_en.nt.bz2 | bzcat |
sed -e 's/.*<http\:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' | sort
| uniq -c | sort -nr > incoming_links.txt

(Changes for Spanish: the resource URL, with 'en' replaced by 'es'; see the
suggested notes on the web page linked above.)
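
To make clear what the pipeline in step 4 computes, here is a small sketch of the same idea that reads N-Triples from standard input, so it can be tried on a few sample lines without downloading anything. count_incoming_links is a hypothetical name of my own; unlike the command above, it uses `sed -n ... p` so that lines which do not end in a Spanish DBpedia resource are dropped rather than passed through unchanged.

```shell
# Hypothetical helper: read page-link triples on stdin and print
# "<count> <entity>" lines, the most frequently linked entity first.
count_incoming_links() {
    # Keep only the local name of the last es.dbpedia.org resource on
    # each line (the link target), then count how often each one occurs.
    sed -n 's|.*<http://es\.dbpedia\.org/resource/\([^>]*\)> \.$|\1|p' \
        | sort \
        | uniq -c \
        | sort -nr
}
```

With the real data this would be used as `bzcat page_links_es.nt.bz2 | count_incoming_links > incoming_links.txt`, assuming the Spanish page-links dump has been saved locally under that name.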

*5. Configure the index:*
 - I left the defaults, since I don't really understand how to configure
it.

*6. Run the jar to create the index:*
  - java -jar
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
index

The execution crashes, and the trace is as follows:

10:42:36,037 [Thread-3] ERROR source.ResourceLoader - Unable to load
resource
/home/juan/stanbol-index/indexing/resources/rdfdata/redirects_es.nt.bz2
org.openjena.riot.RiotException: [line: *5854*, col: 103] *Broken token*:
http://es.dbpedia.org/resource/Pactos_de_
    at
org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
    at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
    at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
    at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
    at
com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
    at
com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
    at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
    at
org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
    at
org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
    at
org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
    at
org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
    at
org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
    at java.lang.Thread.run(Thread.java:679)

Looking at the redirects_es.nt.bz2 file:

   5852 <http://es.dbpedia.org/resource/Tratados_Lateranos>
        <http://dbpedia.org/ontology/wikiPageRedirects>
        <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
   5853 <http://es.dbpedia.org/resource/Tratado_Laterano>
        <http://dbpedia.org/ontology/wikiPageRedirects>
        <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
 *5854* <http://es.dbpedia.org/resource/Tratado_Lateranense>
        <http://dbpedia.org/ontology/wikiPageRedirects>
        <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
   5855 <http://es.dbpedia.org/resource/Tratados_Lateranenses>
        <http://dbpedia.org/ontology/wikiPageRedirects>
        <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .

I don't see any error. Could someone help me and tell me whether there is
anything unusual here?
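
For anyone who wants to look at the same spot, this is roughly how the reported position can be pulled out of the decompressed stream. It is only a sketch: show_position is a hypothetical helper of my own that takes the line and column numbers from the error message and prints that line up to the given column.

```shell
# Hypothetical helper: read text on stdin and print line $1, cut off
# after column $2, so the raw input can be compared with the token
# quoted in the error message.
show_position() {
    line=$1
    col=$2
    # Print only the requested line and stop reading, then keep the
    # characters from column 1 up to the reported column.
    sed -n "${line}{p;q;}" | cut -c "1-${col}"
}
```

For the error above this would be `bzcat redirects_es.nt.bz2 | show_position 5854 103`. As a separate sanity check, `bzip2 -t redirects_es.nt.bz2` tests the integrity of the archive itself, assuming the standard bzip2 tool is installed.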

I also tried building an English DBpedia 3.8 index, to check whether I was
doing something wrong in the Spanish version. It seemed fine at first, but a
few minutes later I got:

11:23:32,576 [Thread-3] ERROR source.ResourceLoader - Unable to load
resource
/home/juan/stanbol-index/indexing/resources/rdfdata/short_abstracts_en.nt.bz2
org.openjena.riot.RiotException: [line: *1880*, col: 96] *Broken token*:
Bambara, also known as Bamana, and Bamanankan by speakers of the language,
is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal
by as many as six million people (in
    at
org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
    at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
    at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
    at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
    at
com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
    at
com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
    at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
    at
org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
    at
org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
    at
org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
    at
org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
    at
org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
    at java.lang.Thread.run(Thread.java:679)

Looking at short_abstracts_en.nt.bz2:

1879 <http://dbpedia.org/resource/Bernard_of_Clairvaux> <
http://www.w3.org/2000/01/rdf-schema#comment> "Bernard of Clairvaux, O.
Cist (1090 \u2013 August 20, 1153) was a French abbot and the primary
builder of the reforming Cistercian order. After the death of his mother,
Bernard sought admission into the Cistercian order. Three years later, he
was sent to found a new abbey at an isolated clearing in a glen known as
the Val d'Absinthe, about 15\u00A0km southeast of Bar-sur-Aube. According
to tradition, Bernard founded the monastery on 25 June 1115, naming it
Claire Vall\u00E9e, which evolved into Clairvaux."@en .
   *1880 *<http://dbpedia.org/resource/Bambara_language> <
http://www.w3.org/2000/01/rdf-schema#comment> "Bambara, also known as
Bamana, and Bamanankan by speakers of the language, is a language spoken in
Mali, and to a lesser extent Burkina Faso, Senegal by as many as six
million people (including second language users). The Bambara language is
the language of people of the Bambara ethnic group, numbering about
4,000,000 people, but serves also as a lingua franca in Mali (it is
estimated that about 80% of the population speak it as a first or second
language)."@en .
   1881 <http://dbpedia.org/resource/Bishkek> <
http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek
and Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is
also the administrative centre of Chuy Province which surrounds the city,
even though the city itself is not part of the province but rather a
province-level unit of Kyrgyzstan. The name is thought to derive from a
Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz
national drink."@en .

Could someone explain why errors like "Broken token" appear, or whether I'm
doing something wrong? I think I followed the guide correctly. Thanks, and I
hope this information helps others who are trying to create indexes with
Apache Stanbol, which is a really great project. Nice work!

Best,
Juan.
