From Rebecca McGuinness <rebe...@openplanetsfoundation.org>
Subject Hadoop Driven Digital Preservation hackathon, 2-4 December, Vienna
Date Thu, 21 Nov 2013 09:43:55 GMT
*Hadoop Driven Digital Preservation*

*2-4 December, Austrian National Library, Vienna*

There are just two days left to sign up for our next hackathon:
https://hadoop-driven-digital-preservation.eventbrite.co.uk.

This hackathon will focus on using Hadoop <http://hadoop.apache.org/> in
two digital preservation scenarios:
*Web-Archiving: File Format Identification/Characterisation*
A web archive usually contains a wide range of different file types. From a
curatorial perspective the question is: Do I need to be worried? Is there a
risk that means I should take adequate measures right now? The first step
is to reliably identify and characterise the content of a web archive.
Linguistic
analysis can help categorise the “text/plain” content into more precise
content types. A detailed analysis of “application/pdf” content can help
cluster properties of the files and identify characteristics that are of
special interest. Using the Hadoop framework and prepared sample projects
for processing web archive content, we will be able to perform any kind of
processing or analysis we come up with at large scale on a Hadoop cluster.
Together we will discuss the requirements for enabling this and find out
what still needs to be optimised.
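For a flavour of the kind of job involved, here is a minimal MapReduce
sketch in Java (not taken from the hackathon materials) that counts how
often each MIME type occurs in an archive. To stay self-contained it
assumes the content has already been reduced to a plain-text manifest of
"URL<TAB>mime-type" lines; real WARC processing would plug in a suitable
InputFormat instead, and the class names here are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MimeTypeCount {

      public static class MimeMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text mimeType = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Each input line: URL <TAB> mime-type
          String[] fields = value.toString().split("\t");
          if (fields.length == 2) {
            mimeType.set(fields[1].trim());
            context.write(mimeType, ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) {
            sum += v.get();
          }
          context.write(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mime type count");
        job.setJarByClass(MimeTypeCount.class);
        job.setMapperClass(MimeMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same mapper/reducer skeleton extends naturally to other per-record
characterisation tasks on a web archive.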

*Digital Books: Quality Assurance, text mining (OCR Quality)*
The digital objects of the Austrian National Library's digital book
collection consist of the aggregated book object with technical and
descriptive metadata, and the images, layout and text content for the book
pages. Due to the massive scale of digitisation in a relatively short time
period and the fact that the digitised books are from the 18th century and
older, there are different types of quality issues. Using the Hadoop
framework, we provide the means to perform any kind of large scale book
processing at the book or page level. Linguistic analysis and language
detection, for example, can help us determine the quality of the OCR
(Optical Character Recognition), and image analysis can help detect
technical or content-related issues with the book page images.
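As a rough illustration of this kind of page-level analysis (again, not
the hackathon's actual code), the mapper below scores OCR quality with a
crude stand-in heuristic: the fraction of tokens that are purely
alphabetic, since garbled OCR tends to mix letters with digits and
symbols. It assumes input lines of the form "bookId/pageId<TAB>page text";
a real job would use proper language detection or dictionary lookups
instead, and the class name is hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class OcrQualityMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

      private final Text pageId = new Text();
      private final DoubleWritable score = new DoubleWritable();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Each input line: bookId/pageId <TAB> full page text
        String[] fields = value.toString().split("\t", 2);
        if (fields.length < 2) {
          return;
        }
        String[] tokens = fields[1].split("\\s+");
        int clean = 0;
        int total = 0;
        for (String token : tokens) {
          if (token.isEmpty()) {
            continue;
          }
          total++;
          // \p{L}+ matches tokens made up entirely of letters (any script).
          if (token.matches("\\p{L}+")) {
            clean++;
          }
        }
        if (total > 0) {
          pageId.set(fields[0]);
          score.set((double) clean / total);
          context.write(pageId, score);
        }
      }
    }

Paired with a simple averaging reducer (omitted here), this could flag
pages whose score falls below a threshold for closer review.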

Take a look at the full agenda here:
http://wiki.opf-labs.org/display/SP/Agenda+-+Hadoop+Driven+Digital+Preservation

*Highlights of this hackathon include:*

   - Talks from our guest speaker, Jimmy Lin
   <http://www.umiacs.umd.edu/~jimmylin/>, University of Maryland
   - Taking part in our competition for the best idea and visualisation
   - A chance to gain hands-on experience carrying out identification and
   characterisation experiments
   - Practitioners and developers working together to address digital
   preservation challenges
   - The opportunity to share experiences and knowledge about implementing
   Hadoop

*Who should attend?*
*Practitioners* (digital librarians and archivists, digital curators,
repository managers, or anyone responsible for managing digital
collections): you will learn how Hadoop might fit your organisation and how
to write requirements to guide development, and you will gain some hands-on
experience using the tools and finding out how they work. To get the most
out of this training course you will ideally have some knowledge or
experience of digital preservation.

*Developers* of all experience levels can participate, from writing your
first Hadoop jobs to working on scalable solutions for the issues
identified in the scenarios.

We hope to see you in Vienna!

Kind Regards,

Rebecca McGuinness
Membership and Communications Manager
