nutch-user mailing list archives

From Michael Coffey <>
Subject readseg dump and non-ASCII characters
Date Wed, 15 Nov 2017 01:20:27 GMT
Greetings Nutchlings,
I have been using readseg-dump successfully to retrieve content crawled by nutch, but I have
one significant problem: many non-ASCII characters appear as '???' in the dumped text file.
This happens fairly frequently in the headlines of news sites that I crawl, for things like
quotes, apostrophes, and dashes.
Am I doing something wrong, or is this a known bug? I use a Python UTF-8 decoder, so it would
be nice if everything were UTF-8.
Here is the command that I use to dump each segment (using nutch 1.12):
bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
It is so close to working perfectly!
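
One way to narrow the problem down is to check whether the '???' runs are literal question-mark bytes already present in the dump file, or replacement characters produced when the file is decoded. The following is a minimal Python sketch, not part of Nutch; the function name classify_damage and the path destPath/dump are assumptions made for illustration.

```python
import re


def classify_damage(raw: bytes) -> dict:
    """Count signs of character damage in raw bytes from a dump file.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    # Decode leniently so invalid byte sequences become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    return {
        # Runs of two or more literal '?' bytes (0x3F) already stored in the file.
        "literal_question_runs": len(re.findall(rb"\?{2,}", raw)),
        # U+FFFD after decoding: invalid UTF-8 in the file, or replacement
        # characters that were already stored in it.
        "replacement_chars": text.count("\ufffd"),
        # Non-ASCII characters that decoded cleanly.
        "non_ascii_chars": sum(
            1 for ch in text if ord(ch) > 127 and ch != "\ufffd"
        ),
    }


if __name__ == "__main__":
    import sys

    # Assumed usage: python check_dump.py destPath/dump
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as handle:
            print(classify_damage(handle.read()))
```

If literal question-mark runs dominate, the characters were most likely lost before the file was written, and no decoder on the reading side can recover them. If replacement characters dominate instead, the bytes in the file are probably in an encoding other than UTF-8.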
