lucene-solr-user mailing list archives

Subject Re: Indexing several parts of PDF file
Date Tue, 05 Feb 2013 14:15:08 GMT
Yes, I think the same. It is better to index each page as a separate document.
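
A rough sketch of the page-per-document idea, in case it helps. The field names (id, doc_id, page, text) and the helper functions are hypothetical and not taken from any existing schema; the group, group.field and group.limit parameters are Solr's field collapsing (result grouping) options. I am assuming doc_id is a single-valued string field, which is what grouping expects. Nothing here talks to Solr; it only builds the per-page documents and the query string.

```python
from urllib.parse import urlencode


def pages_to_docs(doc_id, pages):
    """Turn a list of page texts into one Solr-style document (dict) per page."""
    return [
        {
            "id": f"{doc_id}_p{number}",  # hypothetical unique key, one per page
            "doc_id": doc_id,             # shared by all pages of the same PDF
            "page": number,               # 1-based page number
            "text": text,                 # page text from your extraction step
        }
        for number, text in enumerate(pages, start=1)
    ]


def grouped_query(terms):
    """Build query-string parameters that group page hits by source PDF."""
    return urlencode({
        "q": f"text:({terms})",
        "group": "true",          # turn on field collapsing
        "group.field": "doc_id",  # collapse pages of the same PDF together
        "group.limit": 5,         # return up to 5 matching pages per PDF
        "fl": "id,doc_id,page",
    })


docs = pages_to_docs("manual.pdf", ["intro text", "install steps"])
print(docs[1]["id"], docs[1]["page"])
print(grouped_query("install"))
```

With this layout the page number is just a stored field on each hit, so there is no need to parse it out of a field name.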

On Tue, Feb 5, 2013 at 7:35 PM, Upayavira wrote:
> That approach means querying a separate field for every page in your
> document, which quickly becomes too many fields and will break down.
> The best way to do it is to index pages as documents. You can use field
> collapsing to group pages from the same document together.
> Upayavira
> On Tue, Feb 5, 2013, at 02:00 PM, Jorge Luis Betancourt Gonzalez wrote:
>> Hi:
>> I'm working on a search engine for several PDF documents. One of the
>> requirements is that we return not only the documents matching the
>> search criteria but also the pages that match.
>> Normally Tika only extracts the text content and does not make this
>> distinction. With a custom library this could be achieved, but my
>> question is how to structure the schema. From what I've seen, one
>> approach could be to use dynamic fields:
>> <dynamicField name="page_*" type="text" indexed="true" stored="true"/>
>> So at query time I could extract the page number from the field's name. Is
>> this the best approach? Is there any way of storing the page number in
>> an attribute instead of using dynamic fields?
>> Thanks in advance!
>> Greetings
>> --
>> "It is only in the mysterious equation of love that any
>> logical reasons can be found."
>> "Good programmers often confuse halloween (31 OCT) with
>> christmas (25 DEC)"

Thanks and Regards
Vignesh Srinivasan
