lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
Date Wed, 01 Aug 2007 15:21:52 GMT


Michael McCandless commented on LUCENE-971:

This looks great!

One alternative approach here would be to create a WikipediaDocMaker
(implementing the DocMaker interface) that pulls directly from the XML
file and feeds documents into the alg.

Then, to make a line file, one could create an alg that pulls docs
from WikipediaDocMaker and uses the WriteLineDoc task to create the
line-by-line file.

One benefit of this approach is that creating docs of a certain size
(10 tokens, 100 tokens, etc.) would become a one-step process (a single
alg) instead of what I think is a two-step process now (make the first
line file, then reprocess it into a second line file).  Another benefit
is that you could make Wikipedia tasks that pull directly from the XML
file and not use a line file as an intermediary at all.

Steve, do you think this would be a hard change?  I think it should be
easy, except that I'm not sure how to do this with SAX, since SAX is
"in control": the parser pushes events at its own pace, while a
DocMaker has to hand out one doc at a time when asked.  You sort of
need coroutines.  Or maybe one thread runs SAX and puts doc data into a
shared queue, and then the other thread (the normal "main" thread that
benchmark runs) pulls from this queue?
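
For illustration only, here is a minimal sketch of the two-thread idea,
not the patch itself.  Everything named here (SaxQueueSketch, Article,
readTitles, the queue size of 2) is hypothetical, and the element names
"title" and "text" are an assumption about the enwiki XML dump.  One
thread runs SAX and blocks on a bounded queue; the caller's thread pulls
articles one at a time, the way a DocMaker would.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxQueueSketch {
    /** Hypothetical holder for one parsed article. */
    static final class Article {
        final String title;
        final String text;
        Article(String title, String text) { this.title = title; this.text = text; }
    }

    /** Sentinel the parser thread enqueues when it has nothing more to offer. */
    static final Article END = new Article(null, null);

    /** SAX handler that pushes each completed article into the shared queue. */
    static final class Handler extends DefaultHandler {
        private final BlockingQueue<Article> queue;
        private final StringBuilder buf = new StringBuilder();
        private String title;

        Handler(BlockingQueue<Article> queue) { this.queue = queue; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            buf.setLength(0); // start collecting character data for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buf.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if ("title".equals(qName)) {
                title = buf.toString();
            } else if ("text".equals(qName)) {
                try {
                    queue.put(new Article(title, buf.toString())); // blocks while the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new SAXException(e);
                }
            }
        }
    }

    /** Starts the thread in which SAX stays "in control". */
    static Thread startParser(byte[] xml, BlockingQueue<Article> queue) {
        Thread t = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new ByteArrayInputStream(xml), new Handler(queue));
            } catch (Exception e) {
                e.printStackTrace(); // a real version would report this to the consumer
            } finally {
                try {
                    queue.put(END); // always unblock the consumer, even after a failure
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "sax-parser");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Consumer side: pulls articles from the queue until the sentinel arrives. */
    static List<String> readTitles(String xml) {
        BlockingQueue<Article> queue = new ArrayBlockingQueue<>(2);
        startParser(xml.getBytes(StandardCharsets.UTF_8), queue);
        List<String> titles = new ArrayList<>();
        try {
            for (Article a = queue.take(); a != END; a = queue.take()) {
                titles.add(a.title);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>Anarchism</title><text>alpha</text></page>"
                + "<page><title>Autism</title><text>beta</text></page></mediawiki>";
        System.out.println(readTitles(xml));
    }
}
```

The bounded queue keeps the parser from running far ahead of the
consumer, so memory use stays flat even on a large dump.  In a DocMaker,
the body of the loop in readTitles would instead be a single
queue.take() per requested document.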

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>                 Key: LUCENE-971
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
> Create a line per article rather than a file. Consume with indexLineFile task.
