nutch-user mailing list archives

From "Developer Developer" <devquesti...@gmail.com>
Subject Re: System.out.println(parsetext.getText()) prints non readable chars - Please help
Date Wed, 02 Jan 2008 16:12:48 GMT
It is in English. I am pretty sure it is not in another language; here is
the document URL:

http://www.irs.gov/pub/irs-pdf/f1040as1.pdf
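
The output quoted below contains "/Filter/FlateDecode" and ">>stream", which are PDF object syntax (FlateDecode is the PDF name for deflate compression). That suggests the text is raw, unparsed PDF data rather than text in another language. A minimal sketch for checking this in code follows; the class RawPdfCheck and the method looksLikeRawPdf are hypothetical names for illustration, not part of Nutch, and the pattern only covers the FlateDecode case seen here.

```java
import java.util.regex.Pattern;

public class RawPdfCheck {
    // Matches a PDF stream dictionary such as
    // "<</Length 31683/Filter/FlateDecode ...>>stream".
    // Assumption: only the FlateDecode filter is checked, as in the output above.
    private static final Pattern STREAM_DICT =
        Pattern.compile("<<[^>]*/Filter\\s*/FlateDecode[^>]*>>\\s*stream");

    // Hypothetical helper: true if the text looks like raw PDF bytes
    // instead of extracted, human-readable text.
    public static boolean looksLikeRawPdf(String text) {
        if (text == null) {
            return false;
        }
        return text.startsWith("%PDF-") || STREAM_DICT.matcher(text).find();
    }

    public static void main(String[] args) {
        String sample = "obj<</Length 31683/Filter/FlateDecode/Length1 1720"
            + "/Length2 30704/Length3 532>>stream";
        System.out.println(looksLikeRawPdf(sample));                  // prints true
        System.out.println(looksLikeRawPdf("Form 1040A Schedule 1")); // prints false
    }
}
```

In the sample code from the original message, the result of pText.getText() could be passed to such a check before printing. If it returns true, the likely place to look is how the PDF was parsed at fetch time, not the language of the document.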




On Jan 2, 2008 10:49 AM, Dennis Kubes <kubes@apache.org> wrote:

> Most likely this page is in a different language.
>
> Dennis
>
> Developer Developer wrote:
> > Hello,
> >
> > I need to access the parse text from Nutch documents. I am using NutchBean to
> > search and then access the ParseText from it. Here is the sample code:
> >
> >
> >
> > Configuration conf = NutchConfiguration.create();
> > NutchBean nb = new NutchBean(conf);
> > Hits hits = nb.search(Query.parse("irs", conf), 10);
> >
> > //get a sample hit
> > Hit hit = hits.getHit(8);
> >
> > HitDetails hitDetails = nb.getDetails(hit);
> >
> > ParseText pText = nb.getParseText(hitDetails);
> >
> > System.out.println(pText.getText());
> >
> > The System.out.println call prints unreadable characters, as follows:
> >
> > obj<</Length 31683/Filter/FlateDecode/Length1 1720/Length2 30704/Length3 532>>stream
> > H‰¤U 8Të (R)Ýåt›=*éƶ„P†Y†fÆ.'!vb )'´Ì,,fÖŒµÖ¸ÔV*—P
> > ¥›¨]QJî%'kDE…Š¨ØÏ"ê(EÚ‡JÍYklgËÉóœszæyþYÿ÷ ÿ»Þï{ßÿ_ºZ|g†•Hê
> > ÛJQ‚  1Í?õˆ Æ
> > (c) B &" 1Ù4]]k † DŠò
> > &sä0° Â
> >
> >
> > Any idea what I am missing? The document is a PDF in English.
> >
> > Thanks!
> >
>
