lucene-solr-user mailing list archives

From VIGNESH S <vigneshkln...@gmail.com>
Subject Re: Indexing several parts of PDF file
Date Tue, 05 Feb 2013 14:15:08 GMT
Yes, I think the same. It is better to index each page as a separate document.
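
A rough sketch of what a page-per-document schema might look like, in case it helps. The field names (`doc_id`, `page_number`, `content`) are hypothetical, and the `string` and `int` field types are assumed to be defined in your schema.xml as they are in the stock example schema:

```xml
<!-- Hypothetical layout: one Solr document per PDF page -->
<!-- Unique key for the page, e.g. "<doc_id>_<page_number>" (an assumed convention) -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- Identifier shared by all pages of the same PDF; single-valued so it can be grouped on -->
<field name="doc_id" type="string" indexed="true" stored="true"/>
<!-- Page number stored as a normal field instead of being encoded in a field name -->
<field name="page_number" type="int" indexed="true" stored="true"/>
<!-- Extracted text of this page only -->
<field name="content" type="text" indexed="true" stored="true"/>
```

With a layout like this, a query such as `q=content:foo&group=true&group.field=doc_id&group.limit=3` should return the matching pages grouped under their parent PDF, with `page_number` available on each hit.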

On Tue, Feb 5, 2013 at 7:35 PM, Upayavira <uv@odoko.co.uk> wrote:
> This would mean querying against a separate field for every page in your
> document, which will be too many fields and will break quickly.
>
> The best way to do it is to index pages as documents. You can use field
> collapsing to group pages from the same document together.
>
> Upayavira
>
> On Tue, Feb 5, 2013, at 02:00 PM, Jorge Luis Betancourt Gonzalez wrote:
>> Hi:
>>
>> I'm working on a search engine for several PDF documents. One of the
>> requirements is that we return not only the documents matching the
>> search criteria but also the pages that match. Normally Tika only
>> extracts the text content and does not make this distinction, but this
>> could be achieved with a custom library. My question is how to
>> structure the schema. From what I've seen, one approach could be to use
>> dynamic fields:
>>
>> <dynamicField name="page_*" type="text" indexed="true" stored="true"/>
>>
>> At query time I could then extract the page number from the field name.
>> Is this the best approach? Is there a way to store the page number in
>> an attribute instead of using dynamic fields?
>>
>> Thanks in advance!
>>
>> Greetings
>> --
>> "It is only in the mysterious equation of love that any
>> logical reasons can be found."
>> "Good programmers often confuse halloween (31 OCT) with
>> christmas (25 DEC)"



-- 
Thanks and Regards
Vignesh Srinivasan
9739135640
