stanbol-dev mailing list archives

From Andrea Di Menna <andre...@inqmobile.com>
Subject Re: EntityHub Referenced Site and redirects
Date Mon, 05 Nov 2012 09:59:41 GMT
Hi Rupert,
I would be more than happy to share the indexes.
I have also created one including redirects by forcibly inserting
redirecting entities into the incoming_links.txt file.
Redirects have been assigned the same entity rank as the entities they
redirect to.
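
For anyone who wants to reproduce this, the step can be sketched roughly as follows. This is only an illustration under assumptions that are not confirmed in this thread: that each line of incoming_links.txt is "<count> <entity-local-name>" (uniq -c style), and that the redirects file is N-Triples with DBpedia resource URIs as subject and object. The function name add_redirects is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed file formats (not confirmed in the thread):
#   ranking file   : "<count> <entity-local-name>" per line
#   redirects file : "<.../resource/FROM> <predicate> <.../resource/TO> ."

# add_redirects RANKING_FILE REDIRECTS_NT   (hypothetical helper name)
# Prints the original ranking, followed by one extra line for every
# redirect whose target is ranked; the redirect gets the target's count.
add_redirects() {
    cat "$1"
    awk -v prefix='<http://dbpedia.org/resource/' '
        # First file: remember the count of every ranked entity.
        NR == FNR { rank[$2] = $1; next }
        # Second file: field 1 is the subject, field 3 the object.
        index($1, prefix) == 1 && index($3, prefix) == 1 {
            from = substr($1, length(prefix) + 1); sub(/>$/, "", from)
            to   = substr($3, length(prefix) + 1); sub(/>$/, "", to)
            # Skip redirects to unranked targets and entities already listed.
            if ((to in rank) && !(from in rank))
                print rank[to], from
        }
    ' "$1" "$2"
}
```

It could be called as `add_redirects incoming_links.txt redirects_en.nt > incoming_links_with_redirects.txt`, where the output file name is again just an example.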

Please let me know how and where to store those indexes.

Cheers

2012/11/3 Rupert Westenthaler <rupert.westenthaler@gmail.com>

> Hi,
>
> I have started to play around with indexing dbpedia 3.8 myself as well
> and I can confirm that one has to preprocess nearly all files. Because
> of that I have written a shell script that downloads, processes
> and re-compresses the RDF files:
>
> # array syntax is ({item-1} {item-2} ... {item-n})
> # names need to include the language path segment!
> files=(dbpedia_3.8.owl \
>     en/labels_en.nt \
>     {all-the-other-files-you-need} \
>     )
>
> # DBPEDIA is expected to hold the base download URL (not set in this snippet)
> for i in "${files[@]}"
> do
>     filename=$(basename "$i")
>     # skip files that were already downloaded and cleaned
>     if [ ! -f "${filename}.gz" ]
>     then
>         url="${DBPEDIA}/${i}.bz2"
>         wget -c "${url}"
>         echo "cleaning $filename ..."
>         # clean possible encoding errors and recompress using gz
>         # (gz is used because it is faster)
>         bzcat "${filename}.bz2" \
>             | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>             | gzip -c > "${filename}.gz"
>         rm -f "${filename}.bz2"
>     fi
> done
>
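
To see what the sed expression in the script above does, it can be tried on a few sample strings. This is a small sketch: the helper name clean_escapes and the sample inputs are made up for illustration and do not come from the DBpedia dumps.

```shell
#!/usr/bin/env bash
# clean_escapes is a hypothetical helper name. It wraps the exact sed
# expression from the script and filters stdin to stdout.
clean_escapes() {
    sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
}

# Made-up sample inputs:
printf '%s\n' 'a\\b'       | clean_escapes   # -> a\u005c\u005cb
printf '%s\n' 'a\qb'       | clean_escapes   # -> a\u005cqb
printf '%s\n' 'say \"hi\"' | clean_escapes   # unchanged: \" is left alone
printf '%s\n' 'caf\u00e9'  | clean_escapes   # unchanged: \u is left alone
```

The first rule rewrites every double backslash, the second every remaining backslash that is not followed by u or a double quote. Note that the second rule makes no exception for other escapes, so a sequence such as \t in the input would be rewritten as well.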
> > the SolrIndex zip file is about 3.5GB.
> > I am using a min-score=2 in minincoming.properties
> > I think the 3.7 index file from the IKS project downloads site was created
> > with min-score=10.
>
> The dbpedia 3.7 index was built by ogrisel, but I think you are right.
> 3.5 GByte for all entities with >= 2 incoming links (should be about
> 4 million entities) sounds reasonable. If you want to share your index
> with the Stanbol community I am sure we can find a server to host it.
>
>
> Note about languages:
>
> While it is easy to include labels, comments, and abstracts of additional
> languages, it is not so easy to add proper Solr field definitions for
> those languages. While there is a great wiki page that provides all the
> necessary links [1], I still find it very hard to add configurations
> for languages I do not understand. So if someone can help with that, I
> am happy to improve the Solr schemas used by the Entityhub (and the
> Entityhub Indexing tool)!
>
>
> Upgrading the default DBpedia index:
>
> After the ApacheCon I will work on replacing the default dbpedia index
> used with the Stanbol launchers with a dbpedia 3.8 based version (the
> current one is still based on 3.6). This will need some time because I
> expect that I will need to adapt a lot of unit/integration tests
> affected by data changes.
>
> [1] http://wiki.apache.org/solr/LanguageAnalysis
>
> >
> > I have indexed english resources and labels from other languages, as this
> > is what I currently need.
> >
> > Cheers
> > Andrea
> >
> > 2012/11/2 harish suvarna <hsuvarna@gmail.com>
> >
> >> Andrea,
> >> Thanks for the update. I was also trying to create the Chinese and English
> >> dbpedia 3.8 indexes, but ran out of hardware power.
> >> What is the size of the dbpedia.solr.index.zip file? It used to be 1.9 GB
> >> (zip file). But I guess that contained labels from all languages.
> >>
> >> Did you index English only?
> >>
> >> -harish
> >>
> >> On Fri, Nov 2, 2012 at 9:40 AM, Andrea Di Menna <andreadm@inqmobile.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I have created an EntityHub Solr index from dbpedia 3.8 using the default
> >> > settings for the dbpedia indexing tool.
> >> > The index was created successfully.
> >> >
> >> > Now that I am working on it, I am noticing that Wikipedia redirects are
> >> > completely missing from the EntityHub.
> >> >
> >> > I have used the fetch_prepare.sh tool to download data from DBpedia, and
> >> > among the resources there is also redirects_en.nt.bz2.
> >> > There is a rule in the mappings.txt file to map dbp-ont:wikiPageRedirects
> >> > to rdfs:seeAlso.
> >> >
> >> > From what I can see, the problem seems to be that the indexing tool is
> >> > only taking into account the resources listed in the incoming_links.txt
> >> > file.
> >> > This file is built from page_links_en.nt.bz2 and ranks entities on the
> >> > basis of their incoming links.
> >> > Page redirects will never have incoming links and hence will not be
> >> > listed in incoming_links.txt.
> >> >
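
The ranking step described above can be approximated with standard tools. The sketch below is an assumption about how such a file could be derived from the page links dump, not the indexing tool's actual code. The name rank_incoming and the "<count> <name>" output format are made up for this illustration.

```shell
#!/usr/bin/env bash
# rank_incoming PAGE_LINKS_NT   (hypothetical helper name)
# Counts how often each DBpedia resource appears as the object of a triple
# and prints "<count> <local-name>", highest count first.
rank_incoming() {
    # Keep only the local name of the object URI of each triple.
    sed -n 's|.*<http://dbpedia.org/resource/\([^>]*\)> \.$|\1|p' "$1" \
        | sort | uniq -c \
        | awk '{ print $1, $2 }' \
        | sort -k1,1nr -k2,2
}
```

An entity that never occurs as the object of a link triple gets no line at all in this output, which matches the behaviour described for redirect pages.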
> >> > Is my understanding correct or am I missing anything?
> >> > Should I forcibly insert page redirect entities into the incoming_links.txt
> >> > file to get them included in the Solr index?
> >> >
> >> > Thank you very much for your time
> >> >
> >> > --
> >> > Andrea Di Menna
> >> >
> >> >
> >> >
> >> >
> >> > This e-mail is only intended for the person(s) to whom it is addressed and
> >> > may contain CONFIDENTIAL information. Any opinions or views are personal to
> >> > the writer and do not represent those of INQ Mobile Limited, Hutchison
> >> > Whampoa Limited or its group companies. If you are not the intended
> >> > recipient, you are hereby notified that any use, retention, disclosure,
> >> > copying, printing, forwarding or dissemination of this communication is
> >> > strictly prohibited. If you have received this communication in error,
> >> > please erase all copies of the message and its attachments and notify the
> >> > sender immediately. INQ Mobile Limited is a company registered in the
> >> > British Virgin Islands. www.inqmobile.com.
> >> >
> >> >
> >>
> >>
> >> --
> >> Thanks
> >> Harish
> >>
> >
> >
> >
> >
> >
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>



-- 
Andrea Di Menna
INQ - Engineering
+393925803119
skype: ninniux
inqmobile.com
INQ¹ – Winner of the 2009 Best Handset





