lucene-java-user mailing list archives

From Mango <bgh...@gmail.com>
Subject parsing document title
Date Thu, 17 Jun 2010 15:28:32 GMT
I need to index documents whose Metadata fields do not contain all the
information I need. When the Title Metadata field is empty, I would
like to extract the document title from the document body instead.
In addition, many of the documents contain a table with information on
the document subject. One of the cells in the table is labelled
'Abstract:', which indicates that the topic is given in the
neighbouring cell.

I would store the title and abstract in separate fields so that they
are available when presenting search results, and also boost those
fields so that the results become more relevant. First of all, I
would like to ask whether that is a good idea, especially since I do
not yet know exactly how I would extract this information.

As it is now, the title is in the first line of the parsed text,
followed by _space_ and the contents of the next row. The same goes
for the abstract information: it is separated by _space_ from the
contents of the next row. That is, the stream looks like this:

Let's say this is the title _space_ New Line text

Abstract: This would be the paper subject _space_ new column


I suppose that I should write a custom ContentHandler or modify the
existing SAX-based BodyContentHandler? If so, a couple of lines of
code showing the direction to go would be of immense help.
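
For illustration, here is a minimal sketch of the custom-handler
direction, using only the JDK's SAX classes. It assumes that the parser
reports XHTML-style events in which each paragraph is a `p` element and
each table cell is a `td` element, that the first non-empty paragraph is
the title, and that the cell right after the one reading 'Abstract:'
holds the abstract. The class name `TitleAbstractHandler` and its
`parse` helper are hypothetical and not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: keeps the first non-empty paragraph as the title
 * and the table cell that follows an "Abstract:" cell as the abstract.
 */
class TitleAbstractHandler extends DefaultHandler {
    private final StringBuilder buf = new StringBuilder();
    private boolean nextCellIsAbstract = false;
    private String title;
    private String abstractText;

    public String getTitle() { return title; }
    public String getAbstract() { return abstractText; }

    // Works with namespace-aware parsers (localName) and plain ones (qName).
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        // Start collecting fresh text only at block boundaries, so inline
        // markup inside a paragraph or cell does not split its text.
        if ("p".equals(n) || "td".equals(n)) {
            buf.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buf.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        String text = buf.toString().trim().replaceAll("\\s+", " ");
        if ("p".equals(n)) {
            if (title == null && !text.isEmpty()) {
                title = text;               // assumption: first paragraph is the title
            }
        } else if ("td".equals(n)) {
            if (nextCellIsAbstract) {
                abstractText = text;        // the cell right after the label
                nextCellIsAbstract = false;
            } else if ("Abstract:".equals(text)) {
                nextCellIsAbstract = true;  // label cell found
            }
        }
    }

    /** Convenience for trying the handler on an XHTML string. */
    static TitleAbstractHandler parse(String xhtml) {
        TitleAbstractHandler handler = new TitleAbstractHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler;
    }
}
```

Because the handler sees element boundaries rather than the flattened
text, the title and abstract end where their elements end, and the
_space_ ambiguity in the stream above does not arise. The captured
title would only be used when the Title Metadata field is empty.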
