lucene-java-user mailing list archives

From Paul Allan Hill <>
Subject RE: Best document format / markup for text indexing?
Date Tue, 22 Nov 2011 23:21:30 GMT
> What is the best format/markup/ebook standard/document standard/other to use for easiest
> and best text search support?

The Apache Tika libraries can extract text from a wide range of formats, and that text can then
be indexed with Lucene, so I think the real question is which format is better for displaying
the document.

It seems you need to ask what a "document" is as far as Lucene is concerned. Possibly the
answer is each sentence (not the chapter), because I suspect the user fundamentally wants to
see each line and its references to other lines in this or other documents, but also to view
the whole document when needed.
So then you need:
1. A nicely viewable version of each file (chapter).
2. Table(s) (in an RDBMS) that cross-link each verse/sentence/line to the others it references.
Isn't that how cross references work? At the sentence level?
3. Table(s) (in an RDBMS) that link each sentence to its chapter, book, and work (or
alternatively some field(s) in Lucene that can be used to get to the definition of the context).
4. A Lucene index that indexes the "sentences" (the fundamental cross-referenceable subunit
of the text).
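
To make the four pieces above concrete, here is a rough plain-Java sketch of the data model. It
is only an illustration under my own assumptions, not Lucene code: the class name
SentenceIndexSketch, the "work/book/chapter/number" id scheme, and the ".html" chapter file name
are all made up, the term map merely stands in for the Lucene index, and the cross-reference
map stands in for the RDBMS tables.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the sentence index plus the link tables. */
final class SentenceIndexSketch {

    /** One indexed unit: a sentence plus the context needed to get back to its chapter. */
    static final class Sentence {
        final String id;          // assumed id scheme: work/book/chapter/number
        final String chapterFile; // assumed name of the viewable chapter file (item 1)
        final String text;

        Sentence(String work, String book, int chapter, int number, String text) {
            this.id = work + "/" + book + "/" + chapter + "/" + number;
            this.chapterFile = work + "/" + book + "/" + chapter + ".html";
            this.text = text;
        }
    }

    private final Map<String, Sentence> byId = new HashMap<>();
    // term -> ids of sentences containing it (stands in for the Lucene index, item 4)
    private final Map<String, Set<String>> postings = new HashMap<>();
    // sentence id -> ids of cross-referenced sentences (stands in for the tables, item 2)
    private final Map<String, Set<String>> crossRefs = new HashMap<>();

    /** Indexes one sentence as its own "document". */
    void add(Sentence s) {
        byId.put(s.id, s);
        for (String term : s.text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(s.id);
            }
        }
    }

    /** Records a cross reference in both directions. */
    void link(String fromId, String toId) {
        crossRefs.computeIfAbsent(fromId, k -> new LinkedHashSet<>()).add(toId);
        crossRefs.computeIfAbsent(toId, k -> new LinkedHashSet<>()).add(fromId);
    }

    /** Returns the sentences containing the given single term, case-insensitively. */
    List<Sentence> search(String term) {
        List<Sentence> hits = new ArrayList<>();
        Set<String> ids = postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<String>emptySet());
        for (String id : ids) {
            hits.add(byId.get(id));
        }
        return hits;
    }

    /** Returns the ids of the sentences cross-referenced with the given one. */
    Set<String> references(String id) {
        return crossRefs.getOrDefault(id, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        SentenceIndexSketch index = new SentenceIndexSketch();
        Sentence a = new Sentence("work", "book1", 1, 1, "The river rises in spring.");
        Sentence b = new Sentence("work", "book2", 3, 7, "Spring floods follow the river.");
        index.add(a);
        index.add(b);
        index.link(a.id, b.id);
        for (Sentence hit : index.search("river")) {
            System.out.println(hit.id + " -> " + hit.chapterFile
                    + " refs=" + index.references(hit.id));
        }
    }
}
```

In a real setup the id and chapter values would presumably be stored as fields on each Lucene
document, so that a hit on a sentence leads straight to both its cross references and the
viewable chapter.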

Maybe someone else has ideas about mapping from text in a document to a particular verse and
its cross references at query time, but that sounds like a lot of mapping to me, so I would do
the work up front and build the index of verses/sentences.
Just my beginner's two cents' worth.

