lucene-dev mailing list archives

From "Steven Parkes (JIRA)" <>
Subject [jira] Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
Date Tue, 07 Aug 2007 00:53:59 GMT


Steven Parkes updated LUCENE-971:

    Attachment: LUCENE-971.patch.txt

Okay. Here's an update to the patch.


1) EnwikiDocMaker replaces ExtractWikipedia

2) A sample algorithm is provided (and used by the build.xml file, which could be removed
if desired)

3) A bug in LineDocMaker is fixed: it was storing both the title and date in the title field
(small enough that it doesn't need its own JIRA?)

4) LineDocMaker was made derivable-from

Much of the code in LineDocMaker is useful in EnwikiDocMaker, so EnwikiDocMaker now derives
from it (it's inheritance for implementation, not abstraction, so it could be changed, of course)

5) Made LineDocMaker and WriteLineDocTask multicharacter safe

Or at least I tried to. Wikipedia has non-ASCII characters in it. To make LineDocMaker work
as a base class, I made it use an explicit FileInputStream, which is required so that SAX can
extract the encoding correctly. I made WriteLineDocTask always write UTF-8 so that I can get
non-ASCII in the output file. Seems like UTF-8 is the best encoding for line files? At the
same time, I made LineDocMaker assume UTF-8 (unless told otherwise by a derived class like
EnwikiDocMaker) so that the line files created by EnwikiDocMaker/WriteLineDocTask can be read
by LineDocMaker without loss.
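
To illustrate the encoding round trip described above, here is a minimal sketch. It is not code from the patch. The class name LineFileEncodingSketch, its method names, and the tab separator between title, date and body are assumptions made for illustration only. It shows three things: handing SAX a raw byte stream so the parser can take the encoding from the XML declaration, always writing the line file as UTF-8, and reading it back as UTF-8 with the title and date kept as separate fields.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; names and line layout are assumptions, not the patch's code.
class LineFileEncodingSketch {

    // Assumed separator between the title, date and body fields of one line.
    static final char SEP = '\t';

    // Parse XML from a byte stream (not a Reader) so the SAX parser can detect
    // the encoding itself; returns all character data found in the document.
    static String extractText(InputStream xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(xmlBytes, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    // Write one document as a single line, always encoded as UTF-8.
    static void writeLine(File file, String title, String date, String body) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            out.write(title + SEP + date + SEP + body);
            out.newLine();
        }
    }

    // Read the first line back as UTF-8 and split it into title, date and body.
    static String[] readLine(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line = in.readLine();
            if (line == null) {
                throw new IOException("empty line file: " + file);
            }
            // Limit of 3 keeps any further separators inside the body field.
            return line.split(String.valueOf(SEP), 3);
        }
    }
}
```

With this layout, a title containing non-ASCII characters that was extracted from XML in some other declared encoding should come back unchanged from readLine, and the date lands in its own field rather than being appended to the title.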

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>                 Key: LUCENE-971
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt
> Create a line per article rather than a file. Consume with indexLineFile task.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
