lucene-general mailing list archives

From Grant Ingersoll <>
Subject Re: ApacheCon Speakers
Date Sun, 12 Jul 2009 11:46:33 GMT

On Jul 7, 2009, at 9:55 AM, Jukka Zitting wrote:

> Hi,
> On Tue, Jul 7, 2009 at 3:09 PM, Uwe Schindler <> wrote:
>> I am a little bit confused: I thought the call for papers/talks is  
>> over?
> Yes, the CFP has ended and we (currently just me and Grant) have all
> the submissions.

FWIW, all of them are already listed on the Lucene Meetup page (

), so I think we can just state them.

I will send a follow up with some scheduling ideas.  We need to get  
this done.


TRACK: Lucene (two days)

347: Building Intelligent Search Applications with the Lucene Stack
Presentation, 60 minutes by Grant Ingersoll

Apache Lucene has evolved in recent years beyond a core search library
into a top level project containing a whole suite of tools for working
with content.  Starting with Solr, which builds on the core Lucene
search library, we can add in tools like Tika, Mahout, Droids and other
open source libraries to build intelligent search applications.
This talk will focus on how to leverage the various components
of the Lucene Stack to build out intelligent search applications
that better enable users to find what they are looking for in
today's sea of content.

366: Apache Solr: Out of the Box
Presentation, 60 minutes by Chris Hostetter

Apache Solr is an HTTP-based enterprise search server built on top of
the Lucene Java search library.  In this session we will see how quick
and easy it can be to install and configure Solr to provide full-text
searching of structured data without needing to write any custom code.
We will demonstrate various built-in features such as:  loading data
from CSV files, tolerant parsing of user input, faceted searching,
highlighting matched text in results, and retrieving search results
in a variety of formats (XML, JSON, etc.).  We will also look at
using Solr's Administrative interface to understand how different
text analysis configuration options affect our results, and why
various results score the way they do against different searches.
No previous Solr experience is expected.

367: Apache Solr: Beyond the Box
Presentation, 60 minutes by Chris Hostetter

Apache Solr is an HTTP-based enterprise search server built
on top of the Lucene Java search library.  In this session we
will look at Solr's internal Java APIs and discuss how to write
various types of plugins for customizing its behavior--
as well as some real-world examples of "When" and "Why"
it makes sense to do so.

415: Implementing an Information Retrieval Framework for an
Organizational Repository
Presentation, 60 minutes by Sithu D Sudarsan

Successful Information Retrieval (IR) frameworks for large
repositories have been reported in recent times.  Invariably, all of
them have used machine readable repositories, where plain text
availability is the norm.  However, organizations with legacy
archives need to develop a framework which first converts
the non-electronic archive to an electronic archive and then
extracts machine readable text with an acceptable error rate.
The Food and Drug Administration (FDA) has electronic images
of the documents collected as part of their charter to approve
and monitor products related to health care.  These documents
date back multiple decades and have formats which range from
microfiche through early optical character recognition to recent
electronic formats.  We believe that a large knowledge base
hidden in them could be mined.  To mine this knowledge base,
we are developing a semantic mining framework using open
source tools such as Lucene, PDFBox, Solr, POI, and Java.
Challenges include determining the quality of text being extracted
and the ability to handle documents containing formatted text in part.
The text itself may contain specific vocabularies from medical,
legal, engineering and scientific domains and terminology that
evolves over time.  Careful thought needs to be given to selecting
analyzers for indexing and retrieval and implementing a framework
for heuristics useful to domain experts as well as novices.
An initial prototype is currently being evaluated with a sample size
of over 100,000 documents and 70GB of data for different extractors,
analyzers and search heuristics, with multiple indices for each
document stored in a distributed fashion.

424: Apache Mahout - Going from raw data to information
Presentation, 60 minutes by Isabel Drost

It has become very easy to create, publish, and collect data
in digital form.  The volume of structured and unstructured
data is increasing at a tremendous pace.  This has led to a whole
new set of applications that can be built if one solves
the problem of turning raw data into valuable information.
Possible applications include, but are not limited to,
discovering new trends from a stream of weblog entries and
automatic learning approaches for supplementing market research
processes for new products.

Machine learning provides tools for building such applications.
A large community of researchers has been working on the topic
of learning from data.  Although a lot of information on algorithms
and solutions to common problems is publicly available,
scaling these solutions into the range of terabytes and petabytes
is an open issue.  To scale algorithms to such dimensions it
is indispensable to distribute data as well as computation.
The mission of the Mahout project is to build a suite of scalable
machine learning algorithms that can cope with today's amount of data.
The project is built on top of Hadoop.

This talk provides a beginner-friendly introduction to the topic
of machine learning.  It presents a broad set of applications
that benefit from machine learning.  The presentation gives a high-level
overview of the project itself:  The types of tasks that can
be solved with each algorithm and the pitfalls one needs to look
out for when using it.

426: MIME Magic with Apache Tika
Presentation, 60 minutes by Jukka Zitting

Apache Tika is a Lucene subproject whose purpose is to make it
easier to extract metadata and structured text content from
all kinds of files.  Tika leverages libraries like Apache POI
and PDFBox to provide a powerful yet simple interface for parsing
dozens of document formats.  This makes Tika an ideal companion
for Apache Lucene or any other search engine that needs to be able
to index metadata and content from many different types of files.
This presentation introduces Apache Tika and shows how it's
being used in projects like Apache Solr and Apache Jackrabbit.
You will learn how to integrate Tika with your application
and how to configure and extend Tika to best suit your needs.
The presentation also summarizes the key characteristics
of the more widely used file formats and metadata standards,
and shows how Tika can help deal with that complexity.
The audience is expected to have a basic understanding of Java
programming and MIME media types.

493: Solr Flair: User Interfaces, powered by Apache Solr
Presentation, 60 minutes by Erik Hatcher

Come see Solr in a new light, with snazzy innovative user interfaces.
We'll talk about Solr's flexible capabilities for driving custom
user interfaces and how projects like SolrJS and "Solritas"
bring Solr to the front-end. We'll experience user interfaces
in a variety of front-end technologies, including PHP, Ruby
on Rails, Java, Velocity, jQuery, and SIMILE Timeline.
We'll have Ajax, clouds, maps, timelines, and set visualizations, oh my!

512: Advanced Indexing Techniques with Apache Lucene
Presentation, 60 minutes by Michael Busch

Just as in 2007 and 2008, this presentation will cover the
latest indexing and search innovations in Lucene and how to use them.
The payloads feature that was added in 2007 enabled many new
interesting use cases.  The Lucene developers continued working
on Flexible Indexing, and so far a new flexible TokenStream API,
a configurable indexing chain and pluggable indexing consumers
have been developed.  We are also working on column-stride fields,
a feature which will perform better than payloads for many use cases.
This talk will give an overview of the latest progress and demonstrate
the new features with interesting use cases.
