[Scribus] Producing per-page diffs of PDFs?

Tue Mar 20 14:35:12 CET 2007

Gregory Pittman wrote:
> Frank Cox wrote:
>> On Tue, 20 Mar 2007 00:56:21 -0700
>> "Brian Burger" <blurdesign at gmail.com> wrote:
>>
>>   
>>> Is it possible to produce per-page diffs of two PDFs?
>>>     
>> You could use something like pdftk to tear the files down into individual pages
>> and then use cmp to determine which pages are identical.
>>
>>   
> I think this is worth a try, not to negate any of Craig's concerns. One
> of the issues you're dealing with is the scale of the job. Using the
> 'burst' command in pdftk you can make each into individual pages, then
> see what a diff gets you. There are probably pages you can ignore since
> they may be very obviously different. If you can at least carve the job
> down to a more manageable one, you're ahead. Be prepared for 500 pages
> of pdftk output to take up a load of memory -- much more than the
> original file.
> 
> I have yet to see text stripped from a PDF come out very well -- lots
> of mistakes, spaces in unusual places.

Yep, it's often not very good at all. However, by ignoring whitespace
one might get quite a decent check for content changes. After all, it
doesn't have to look good; we're just doing it for comparison's sake.
The only concern I have is that if the extractor's attempt to handle
things like columns results in different word orderings etc, two
equivalent pages could fail to match because of incredibly tiny
formatting differences.
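
As a rough sketch of the whitespace-ignoring check (Python, purely for
illustration; the function names are made up, and the text of each page
is assumed to have been extracted already by whatever tool you prefer):

```python
from collections import Counter


def strip_whitespace(text):
    """Drop every whitespace character so stray spaces and line breaks vanish."""
    return "".join(text.split())


def same_content(a, b):
    """True if the two texts match once all whitespace is ignored."""
    return strip_whitespace(a) == strip_whitespace(b)


def same_words(a, b):
    """Looser fallback: True if both texts contain the same words,
    in any order. This tolerates column reordering, but not words
    that the extractor has split with a stray space."""
    return Counter(a.split()) == Counter(b.split())
```

same_content() copes with spaces in unusual places but not with
reordered text; same_words() is the opposite trade-off, and will also
miss a real edit that merely rearranges the same words.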

I'm extremely doubtful that a simple binary compare of the exploded PDF
pages themselves will yield a 'match' where one exists, so this is by
way of suggesting a possibly more viable alternative. I'm not at all
sure it'll work; I just give it a better chance than a straight binary
compare.
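
For what it's worth, a per-page version of the text comparison might
look something like the following. This is an untested-on-real-jobs
sketch in Python: it assumes the pdftotext command-line tool is
installed, and page_text() and differing_pages() are names I've
invented for the example.

```python
import subprocess
from itertools import zip_longest


def page_text(pdf_path, page):
    """Extract the text of one page (1-based) using pdftotext.
    '-f' and '-l' select the first and last page; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def differing_pages(pages_a, pages_b):
    """Return the 1-based numbers of pages whose text differs once
    whitespace is ignored. Pages present in only one document count
    as different."""
    changed = []
    for number, (a, b) in enumerate(zip_longest(pages_a, pages_b), start=1):
        if a is None or b is None or "".join(a.split()) != "".join(b.split()):
            changed.append(number)
    return changed
```

You would build the two lists by calling page_text() for each page of
each file, then hand them to differing_pages() to get the list of pages
worth looking at by eye.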

--
Craig Ringer