lucene-java-user mailing list archives

From "logic.cpp" <>
Subject Best document format / markup for text indexing?
Date Thu, 17 Nov 2011 20:46:20 GMT
tl;dr version:

We're converting tons (hundreds of thousands?) of books into digital text.

Which format, markup, ebook standard or document standard would give the easiest and best
text search support?


Longer version:

The following are some desired user-experience features of the project. They probably influence
how the content should be stored:

- Granular access to the text content.
Users should be able to fetch a specific phrase in a specific paragraph on a specific page
of a specific chapter of a specific book. (A 'document' may consist of a single chapter of
a book.)
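
To make the granularity concrete, here is a rough sketch in plain Java (no Lucene involved) of the kind of addressing I have in mind. The names PassageRef and PassageStore are made up for illustration, and the substring scan is only a placeholder for whatever the real search layer turns out to be:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical address of one paragraph: book -> chapter -> page -> paragraph.
record PassageRef(String bookId, int chapter, int page, int paragraph) {}

// Toy in-memory store (an assumption for illustration); a real system
// would query a search index rather than scan every paragraph.
final class PassageStore {
    private final Map<PassageRef, String> paragraphs = new LinkedHashMap<>();

    void add(PassageRef ref, String text) {
        paragraphs.put(ref, text);
    }

    // Returns the addresses of all paragraphs that contain the phrase,
    // using a case-insensitive substring match, in insertion order.
    List<PassageRef> findPhrase(String phrase) {
        String needle = phrase.toLowerCase(Locale.ROOT);
        List<PassageRef> hits = new ArrayList<>();
        for (Map.Entry<PassageRef, String> entry : paragraphs.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Whatever storage format is chosen would need to preserve enough structure to produce an address like this for every hit.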

- Cross referencing.
Users should be able to follow references to and from content that refers to, mentions, or
quotes related content in other books. This would most likely be implemented with an RDBMS.
(Similar to Wikipedia articles linking to one another.)
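
As a minimal sketch of that cross-referencing, assuming each passage has some string identifier: the class below is a hypothetical in-memory stand-in for a two-column link table in the RDBMS, and the name CrossRefTable and the identifier format are invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy stand-in for a two-column link table (from_passage, to_passage).
final class CrossRefTable {
    private final Map<String, Set<String>> outgoing = new HashMap<>();
    private final Map<String, Set<String>> incoming = new HashMap<>();

    // Records that passage `from` cites, quotes or mentions passage `to`.
    void link(String from, String to) {
        outgoing.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
        incoming.computeIfAbsent(to, k -> new LinkedHashSet<>()).add(from);
    }

    // Passages that the given passage refers to.
    Set<String> referencesFrom(String id) {
        return outgoing.getOrDefault(id, Collections.emptySet());
    }

    // Passages that refer to the given passage.
    Set<String> referencesTo(String id) {
        return incoming.getOrDefault(id, Collections.emptySet());
    }
}
```

Keeping both directions is what allows "what does this passage cite?" and "what cites this passage?" to be answered equally cheaply.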

- Full text search
This is probably where Lucene comes in.

So which format/markup/standard would let software easily fetch and cross-reference
granular pieces of text, while also being easy for Lucene to index?

Would it be better to store all the books' digital text directly in the RDBMS? If so,
can Lucene index data stored that way?

