From: Erick Erickson
To: solr-user@lucene.apache.org
Subject: Re: Retrieving Documents
Date: Mon, 19 Dec 2011 11:37:00 -0500

"A programmer has a problem. She tried to solve it with regular expressions.
Now she has two problems"....

You could *try* PatternReplaceCharFilterFactory. Note that this is
applied to the entire input string *before* tokenization. I'm thinking
you could write a clever regex that transforms everything except your
page number into "".

But I think a better thing to explore is for you to tell us how you're
getting the epub format in the first place. If you're programmatically
splitting it up (say in SolrJ), just find the interesting bit of data in
the SolrJ program and add it as a field in the document. Or, if you're
using DIH... Or....

The interesting bit will be several fold:
1> How to group the documents such that they are a single entity, i.e. a
book. Will you have one large document or one document for each page?
2> How will you handle phrase searches when the phrase begins on one page
and ends on the next?

This is one of those things that looks easy at the start but quickly gets
kinda ugly. Good luck!

Best
Erick

On Mon, Dec 19, 2011 at 10:29 AM, Dan McGinn-Combs wrote:
> I can see why you are confused. Re-reading it, I'm confused.
> Here's my dilemma.
>
> I am trying to index some one hundred or so books, all in EPUB format. The
> goal is to provide research functions, i.e. for people who need to
> reference specific quotes, pages and books for their writing.
>
> I don't know if EPUB is designed to do this by default, but each book
> is created/converted to EPUB using Calibre.
> Each page is packed into
> the EPUB file as a separate HTML file with the format
> _split_<page number>.html. So the upshot of my question is
> whether there is a way to extract the page number from the title of
> the embedded HTML page and expose that in a Solr field that I can
> subsequently display to the user?
>
> I hope that makes a bit more sense.
>
> Still looking through the Wiki because it seems to be stuffed with goodies.
>
> Dan
>
> On Sat, Dec 17, 2011 at 2:59 PM, Otis Gospodnetic
> <otis_gospodnetic@yahoo.com> wrote:
>> Hi Dan,
>>
>> I don't follow the second paragraph. Not sure what you are trying to
>> do, what you've tried, what didn't work and how...
>>
>> Otis
>> ----
>>
>> Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>>________________________________
>>> From: Dan McGinn-Combs <dgcombs@gmail.com>
>>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>> Sent: Saturday, December 17, 2011 9:30 AM
>>> Subject: Re: Retrieving Documents
>>>
>>> Good pointer. Thank you, that is exactly what I had in mind. To the
>>> second point, yes, sort of.
>>>
>>> I've managed to take apart a sample of the ePub documents (there are a
>>> finite number). Inside the ePub are single HTML documents, each of which
>>> is a single page of the overall book. It would be super to be able to
>>> parse the title (originally formed from the page number) to set up a
>>> dynamically generated document and include that as part of the
>>> results. Combing the wiki now since that's where every answer seems
>>> to be! Pointers welcome though.
>>> Thanks!
>>> --
>>> Dan McGinn-Combs
>>>
>>> On Dec 16, 2011, at 11:52 PM, Otis Gospodnetic
>>> <otis_gospodnetic@yahoo.com> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> 1) Are you looking for http://wiki.apache.org/solr/HighlightingParameters#hl.fragsize ?
>>>>
>>>> 2) Hundreds of words in a field should not be a problem for
>>>> highlighting. But it sounds like this long field may contain content
>>>> that corresponds to N different pages in a publication and you would
>>>> like to inform the searcher which page the match was on, and not just
>>>> that a match was somewhere in that big piece of text. One way to deal
>>>> with that is to break your document into N smaller documents - one
>>>> document for each page.
>>>>
>>>> Otis
>>>> ----
>>>>
>>>> Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
>>>>
>>>>> ________________________________
>>>>> From: Dan McGinn-Combs <dgcombs@gmail.com>
>>>>> To: solr-user@lucene.apache.org
>>>>> Sent: Friday, December 16, 2011 4:33 PM
>>>>> Subject: Retrieving Documents
>>>>>
>>>>> I've been doing a fair amount of reading and experimenting with Solr
>>>>> lately. I find that it does a good job of indexing very structured
>>>>> documents. However, the application I have in mind is built around
>>>>> long EPUB documents.
>>>>>
>>>>> Of course, I found the Extract components useful for indexing the
>>>>> EPUBs. However, I would like to be able to
>>>>>
>>>>> * Size the "highlight" portion of text around the query parameters
>>>>> (i.e. show 20 or 30 words) and
>>>>>
>>>>> * Retrieve a location within the document so I can display that "page"
>>>>> from the EPUB.
>>>>>
>>>>> What is common practice for these? I notice that if I have a list of
>>>>> (short) text segments in fields, they are stored without too much fuss
>>>>> and are retrievable. However, I'm talking about a field of potentially
>>>>> hundreds of words.
>>>>>
>>>>> Thanks for any pointers,
>>>>> Dan
>>>>>
>>>>> --
>>>>> Dan McGinn-Combs
>>>>> dgcombs@gmail.com
>>>>> Peachtree City, Georgia USA
>
> --
> Dan McGinn-Combs
> dgcombs@gmail.com
> Google Voice: +1 404 492 7532
> Peachtree City, Georgia USA
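
For anyone following the thread, here is a minimal sketch of the "find the interesting bit and add it as a field" approach, with one document per page. It assumes the entry names inside the EPUB end in `_split_<page number>.html` as Dan describes. The function names and the Solr field names (`book_id`, `page`, `text`) are hypothetical and not taken from the thread, and the sketch is in Python rather than SolrJ; the regex itself is plain enough that it should carry over to other regex dialects.

```python
import re
from typing import Optional

# Assumption from the thread: page files are named "..._split_<page number>.html".
_PAGE_RE = re.compile(r"_split_(\d+)\.html$")


def page_number(entry_name: str) -> Optional[int]:
    """Return the page number encoded in an EPUB entry name, or None if absent."""
    match = _PAGE_RE.search(entry_name)
    return int(match.group(1)) if match else None


def to_page_doc(book_id: str, entry_name: str, text: str) -> dict:
    """Build one per-page document; all field names here are illustrative."""
    doc = {"id": f"{book_id}/{entry_name}", "book_id": book_id, "text": text}
    page = page_number(entry_name)
    if page is not None:
        # Only set the page field when the name actually carries a number.
        doc["page"] = page
    return doc
```

Keeping a shared `book_id` on every page document is one way to address the grouping question (point 1>). It does nothing for phrases that straddle a page boundary (point 2>), which would need separate handling.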