lucene-dev mailing list archives

From Simon Willnauer <>
Subject Re: Participating in GSoC'11 with Lucene
Date Sat, 12 Mar 2011 21:35:58 GMT

On Sat, Mar 12, 2011 at 5:32 PM, Zhijie Shen <> wrote:
> Hi developers,
> I'm a graduate student from National University of Singapore, majoring in
> Computer Science. My enthusiasm for open source and information retrieval
> drives me to participate in GSoC'11 with your community. I first got to know
> Lucene when I was a software engineering intern at IBM, working on Lotus
> Connections.

Awesome, and welcome to Lucene :)

> Now I've already checked out the source code and successfully built it
> locally. Meanwhile, I have begun reading through the Jira issues, and am more
> interested in issues 2308, 2309 and 2621, which seem to be refactoring
> tasks (please correct me if I'm wrong). My personal feeling is that these
> tasks are more appropriate for a beginner to get started with. Moreover, I think
> that with such a big project, it is more efficient to read through the
> discussion on Jira to understand the problem, and then dive into the related
> code with the problem kept in mind. What is your opinion? I'm looking
> forward to your guidance.

Apparently you survived the first steps of getting into Lucene and Solr!
Great! You also looked at JIRA, which is even better. So let me say a
few words about the issues you have listed.

LUCENE-2621 - Extend Codec to handle also stored fields and term vectors

This is a very interesting and at the same time very much needed
feature which involves API design, refactoring and an in-depth
understanding of how IndexWriter and its internals work. The API which
needs to be refactored (the Codec API) was made to consume posting lists
once an in-memory index segment is flushed to disk. Yet, to expose
stored fields to this API we need to prepare it to consume data for
every document while we build the in-memory segment. So there is a
little paradigm mismatch here which needs to be addressed.
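
To make that mismatch a bit more concrete, here is a rough sketch. All names
in it (PostingsSink, StoredFieldsSink, ToySegmentBuilder) are made up for
illustration and are not the actual Codec API; it only shows the two call
patterns, one at flush time and one per document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interfaces for illustration only; these are NOT Lucene's Codec API.

/** Flush-time style: called once per term when the in-memory segment is written out. */
interface PostingsSink {
    void addTerm(String term, List<Integer> docIds);
}

/** Per-document style: called for every document while the segment is being built. */
interface StoredFieldsSink {
    void addDocument(int docId, Map<String, String> storedFields);
}

/** Toy in-memory segment that drives both sinks at the two different points in time. */
class ToySegmentBuilder {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private final StoredFieldsSink storedSink;
    private int nextDocId = 0;

    ToySegmentBuilder(StoredFieldsSink storedSink) {
        this.storedSink = storedSink;
    }

    /** Stored fields go to their sink immediately; postings are only buffered. */
    int addDocument(List<String> terms, Map<String, String> storedFields) {
        int docId = nextDocId++;
        storedSink.addDocument(docId, storedFields);
        for (String term : terms) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Postings reach their sink only here, once, in sorted term order. */
    void flush(PostingsSink postingsSink) {
        postings.forEach(postingsSink::addTerm);
        postings.clear();
    }
}
```

In this toy version the stored-fields sink sees every document as it arrives,
while the postings sink sees nothing until flush() is called. An API built
around the second pattern has to be reworked before it can serve the first.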

LUCENE-2309 - Fully decouple IndexWriter from analyzers

This one is something I have been looking forward to for quite a while. It
would pave the way for analysis capabilities other than the ones
Lucene offers today. It seems to be more refactoring-heavy than the
others but might require less knowledge of the IndexWriter (IW)
internals than the codec one. Still, it is a very interesting
issue / project to work on and fairly self-contained.
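
As a very rough sketch of what "decoupled" could mean: the writer only
consumes a stream of terms and does not care who produced them. The names
below (TermSource, WhitespaceTermSource, ToyWriter) are hypothetical and are
not the API proposed in the issue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical names for illustration; not the API proposed in LUCENE-2309.

/** The only thing the toy writer needs per field: a stream of terms. */
interface TermSource {
    Iterator<String> terms(String fieldName, String value);
}

/** One possible producer: whitespace splitting plus lowercasing. */
class WhitespaceTermSource implements TermSource {
    @Override
    public Iterator<String> terms(String fieldName, String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return out.iterator();
    }
}

/** Toy writer that knows nothing about analysis; it only consumes a TermSource. */
class ToyWriter {
    private final TermSource source;
    private final List<String> indexedTerms = new ArrayList<>();

    ToyWriter(TermSource source) {
        this.source = source;
    }

    void addField(String fieldName, String value) {
        Iterator<String> it = source.terms(fieldName, value);
        while (it.hasNext()) {
            indexedTerms.add(fieldName + ":" + it.next());
        }
    }

    List<String> indexedTerms() {
        return indexedTerms;
    }
}
```

Any other producer can be plugged in without touching ToyWriter, for example
one that emits the whole value as a single term.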

LUCENE-2308 - Separately specify a field's type

FieldType aims on the one hand to separate field properties from the
actual value and on the other to make Field easier to extend. Both
seem equally important and far from easy to achieve. Fieldable and
Field are a core API, and changes to it need to be well thought out. Further,
this issue can easily cause drastic performance degradation if not
done right. Consider this a massive change, since fields are used
almost everywhere in Lucene and Solr.
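
A minimal sketch of the separation, assuming a made-up pair of classes
(ToyFieldType and ToyField are hypothetical and not the design from the
issue): the properties live in one immutable object, and each field holds
only a name, a value and a reference to that object.

```java
// Hypothetical classes for illustration; not the FieldType design from LUCENE-2308.

/** Immutable bundle of per-field properties, shareable across many field instances. */
final class ToyFieldType {
    final boolean indexed;
    final boolean stored;
    final boolean tokenized;

    ToyFieldType(boolean indexed, boolean stored, boolean tokenized) {
        this.indexed = indexed;
        this.stored = stored;
        this.tokenized = tokenized;
    }
}

/** A field is then just a name, a value, and a reference to its type. */
final class ToyField {
    final String name;
    final String value;
    final ToyFieldType type;

    ToyField(String name, String value, ToyFieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}
```

In this sketch one ToyFieldType instance can be reused for every field with
the same properties, so creating a field copies a reference rather than a set
of flags.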

I wrote those little summaries not to scare you away, not at all! Rather,
I tried to lay out what to expect from the issues and to make it
easier for you to pick the one you would like to
work on. I will try to update the descriptions of those issues in the
next couple of days if they are not already clear enough (LUCENE-2621
seems kind of too brief, though).

If you have any questions regarding those issues or any others, feel
free to ask here on the list or on the issue directly. You might need
a JIRA account for that; if you don't have one already, you should get one :)
Reading the JIRA issues might help you understand what they are
about, but those are usually written by core devs or long-time
contributors, so please ask any question you have and don't hesitate to
speak up if you have problems with anything.

> Regards,
> Zhijie
> --
> Zhijie Shen
> School of Computing
> National University of Singapore
