[Scribus] Kerning problems with pdftotext

Fri May 25 10:10:37 CEST 2007

Hi Lucien,

 > We are using pdftotext to strip out text from pdf's to prepare for  
 > search indexing and more.  This works well except with our own pdf's  
 > (produced in Scribus) which are getting badly broken up - we suspect
 > through kerning.  The text generated is simply fragmented into  
 > meaningless chunks. It remains in sequential order and some words are  
 > fine, but generally it's not working.

 > 1. Has anybody experienced this?  Is this a pdftotext thing?
 > 2. Are there alternative pdf-to-text parsers that anyone would recommend?

1. This is a Scribus thing.
2. I haven't seen a better pdf-to-text converter than pdftotext.

As Peter has pointed out, your problem is due to the way the letters are
put into the PDF file, so you will always have this problem with
Scribus-generated PDFs.

However, if you follow the instructions at
http://wiki.scribus.net/index.php/Web_optimised_PDF , you will find that
(besides compressing the PDF files, which you did not ask for) the text
extracted by pdftotext becomes an almost perfect representation of the
original text.
I haven't investigated this in detail, and there may be encoding issues,
etc., but I found the results striking.

The way this works is that pdftops (_not_ pdf2ps) converts the PDF to PS,
then Ghostscript distills it back to PDF and apparently does an excellent
job of reassembling letters into words.
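
For indexing many files, the round trip could be scripted roughly as
follows. This is only a sketch: it assumes pdftops, gs and pdftotext are on
the PATH, the Ghostscript options are the usual pdfwrite ones rather than
the exact settings from the wiki page, and the function and file names are
made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(src_pdf, workdir="."):
    """Return the command lines for the PDF -> PS -> PDF -> text round trip.

    Hypothetical helper; the intermediate file names are arbitrary.
    """
    src = Path(src_pdf)
    work = Path(workdir)
    ps_file = work / (src.stem + ".ps")
    distilled_pdf = work / (src.stem + "_distilled.pdf")
    txt_file = work / (src.stem + ".txt")
    return [
        # Step 1: pdftops (xpdf/poppler, not Ghostscript's pdf2ps) writes PostScript.
        ["pdftops", str(src), str(ps_file)],
        # Step 2: Ghostscript distills the PostScript back to PDF.
        # These options are an assumption, not taken from the wiki page.
        ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
         "-sOutputFile=" + str(distilled_pdf), str(ps_file)],
        # Step 3: pdftotext extracts the text from the distilled PDF.
        ["pdftotext", str(distilled_pdf), str(txt_file)],
    ]


def extract_text(src_pdf, workdir="."):
    """Run the three steps in order and return the path of the text file."""
    for cmd in build_commands(src_pdf, workdir):
        # check=True stops the pipeline as soon as one tool fails.
        subprocess.run(cmd, check=True)
    return Path(workdir) / (Path(src_pdf).stem + ".txt")
```

Calling extract_text("brochure.pdf") would leave brochure.ps,
brochure_distilled.pdf and brochure.txt in the working directory; the
intermediate files can be deleted once the text has been indexed.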

W o l f g a n g