[scribus] Scribus sla to epub (export) q. (calibre does not work)

Eric Dodémont eric.dodemont at skynet.be
Thu Jun 5 09:30:18 UTC 2014


I have studied the PDF to ePub fixed layout conversion these last weeks and
wrote down my findings in a little ebook (20 pages):

A Practical Guide to Convert a PDF File to an ePub Version 3 Fixed Layout
File: With Free Open Source Tools.
https://play.google.com/store/books/details?id=1pytAwAAQBAJ

This is the beginning of the book (the rest is mainly technical material on
converting from PDF to HTML, then from HTML to ePub):

Chapter 1: Fixed Layout

Different file formats exist for fixed layout ebooks. Below is a list of the
main ones:

- PDF (Portable Document Format) [.pdf]
- DjVu (Déjà Vu) [.djvu]
- ePub (electronic Publication) [.epub]
- Apple iBooks (similar to ePub) [.ibooks]
- Amazon Kindle (similar to ePub) [.kf8]

In this book, we will focus mainly on the conversion of a PDF file to a
fixed layout ePub file. This has been possible since version 3 of the ePub
format, which now includes a fixed layout mode in addition to the
traditional flowing text mode.

This type of conversion can be very useful because page layout programs
(e.g. Scribus) typically export the final result as a PDF (optimized
for paper or online publication).

The "ePub 3.0 Fixed Layout (FXL) Format Specifications" published by the
International Digital Publishing Forum (IDPF) can be found here:

http://www.idpf.org/epub/fxl

A "Field Guide to Fixed Layout for E-Books" published by the Book Industry
Study Group (BISG) is available for free here:

http://www.bisg.org/publications/field-guide-fixed-layout-e-books

The ePub version 3 format uses all the modern Web technologies like HTML5,
CSS3, JS, SVG, XML, XHTML, WOFF, etc.

Important remarks:

1) This book is only about fixed layout ePub. Fixed layout can be used if
the book has a sophisticated layout with lots of images. Such fixed layout
books are made with desktop publishing (DTP) programs like Scribus, Adobe
InDesign, QuarkXPress, or Microsoft Publisher. For books with only text or
with few images, a flowing text ePub is more suitable and easier to produce.

2) Most PDF to ePub converters do not work for sophisticated layouts
because they convert a fixed layout PDF into a flowing text ePub, which
most of the time gives an ugly and unusable result unless the file is
heavily adapted. They just extract the text and the images from the PDF
and put them sequentially into a flowing text ePub with all the layout gone.

3) Most ePub viewers do not (yet) support fixed layout. If you try to
display a fixed layout ePub with such a viewer, the result will be ugly
and unusable. Two good ePub viewers that support fixed layout are Google
Play Books (for tablets running Google Android or Apple iOS (iPad)) and
Readium (a Google Chrome browser extension for laptops or desktops running
Microsoft Windows, Apple OS X (Mac), or GNU/Linux). Most of the time,
small screens are not suitable for fixed layout books. Such books should
be read on tablets, not on smartphones.

* Conversion Methods

There are three main methods to convert a PDF file to an ePub fixed layout
file:

1) Method 1: Bitmap image only + Hidden text

Each ePub page is a bitmap image (PNG8, possibly PNG24 or JPEG) of an exact
replica of the PDF page. This bitmap image is the result of the rendering
of the text (using vector fonts), bitmap images, and vector images. To
maintain accessibility (select text, copy/paste text, search text, text to
speech, etc.), an invisible text layer is added on top of the image. This
is also how a PDF file is converted to a DjVu file. Some PDF files are
built the same way, mainly when they result from scanning paper books (the
text layer is then produced by OCR).
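As a rough illustration of Method 1, the sketch below builds one XHTML page in which the page bitmap is displayed and the words are overlaid as transparent, absolutely positioned spans. The function name, the file name page1.png, and the (text, left, top, size) tuple format are assumptions made for this example; they are not prescribed by the ePub specification or taken from any particular tool.

```python
from xml.sax.saxutils import escape


def hidden_text_page(image_href, width, height, words):
    """Return an XHTML page: a full-page bitmap plus an invisible text layer.

    words is a list of (text, left_px, top_px, font_size_px) tuples
    (a hypothetical format chosen for this sketch).
    """
    spans = []
    for text, left, top, size in words:
        # color:transparent hides the glyphs but keeps the text selectable.
        spans.append(
            '<span style="position:absolute;left:%dpx;top:%dpx;'
            'font-size:%dpx;color:transparent">%s</span>'
            % (left, top, size, escape(text))
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        '<head><title>Page</title>\n'
        '<meta name="viewport" content="width=%d, height=%d"/></head>\n'
        '<body style="margin:0;position:relative;width:%dpx;height:%dpx">\n'
        '<img src="%s" alt="" style="position:absolute;left:0;top:0;'
        'width:%dpx;height:%dpx"/>\n'
        '%s\n'
        '</body>\n</html>\n'
        % (width, height, width, height, escape(image_href),
           width, height, "\n".join(spans))
    )


page = hidden_text_page(
    "page1.png", 600, 800,
    [("Hello", 50, 40, 16), ("world", 100, 40, 16)],
)
```

The spans are separated by a newline, so the hidden layer keeps whitespace between the words.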

2) Method 2: Image + Text

Probably the best method, though more sophisticated than the first one, is
to place on each ePub page either a bitmap image (JPEG, possibly PNG) that
combines all bitmap and vector images of the PDF page, or a mixed bitmap
and vector image (SVG). The text is not converted into a bitmap image or
inserted in the SVG file, but added to the ePub page using XHTML5 and CSS3.
The CSS uses: a) absolute positioning to put the text at exactly the same
place as in the PDF page; b) styles and fonts to make the text look exactly
the same as in the PDF page. These last two steps are challenging, because
HTML5 cannot always do what the PDF format can; many free and commercial
tools exist, but most of the time they cannot do this correctly for fixed
layout.
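To make the absolute positioning step more concrete, here is a minimal sketch of the coordinate conversion involved. PDF user space normally has its origin at the bottom left of the page, with units of 1/72 inch and y growing upwards, while CSS positions boxes from the top left. The function name and parameters are assumptions for this illustration, and placing the top of the text box one font size above the baseline is a simplification: a real converter would use the font's actual metrics.

```python
def pdf_to_css(x_pt, baseline_y_pt, font_size_pt, page_height_pt, scale=1.0):
    """Map a PDF text position to an inline CSS style string.

    Assumes the PDF origin is at the bottom left and that the top of the
    text box sits one font size above the baseline (an approximation).
    """
    left = x_pt * scale
    # Flip the y axis: CSS 'top' is measured down from the top of the page.
    top = (page_height_pt - baseline_y_pt - font_size_pt) * scale
    return "position:absolute;left:%.2fpx;top:%.2fpx;font-size:%.2fpx" % (
        left, top, font_size_pt * scale)


# A 12 pt line whose baseline is 720 pt above the bottom of a US Letter page
# (792 pt high), starting one inch (72 pt) from the left edge.
style = pdf_to_css(72, 720, 12, 792)
# style == "position:absolute;left:72.00px;top:60.00px;font-size:12.00px"
```

The scale parameter stands for whatever zoom factor is chosen between PDF points and CSS pixels.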

3) Method 3: SVG only

The bitmap images, the vector images, and the text are embedded in SVG
files (one SVG per page). The text should be rendered as true text (with
fonts), not just outlines of the glyphs (vector images). Also called: SVG
in the spine (no XHTML).
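A minimal sketch of what one such SVG page could look like, generated here from Python. The text is written as true <text> elements rather than glyph outlines (<path>). The function name, the image file name, and the font family are placeholders chosen for the example.

```python
from xml.sax.saxutils import escape


def svg_page(width, height, image_href, lines):
    """Return one SVG page: a background image plus real text elements.

    lines is a list of (text, x, baseline_y, font_size) tuples
    (a hypothetical format chosen for this sketch).
    """
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink" '
        'viewBox="0 0 %d %d">' % (width, height),
        '<image x="0" y="0" width="%d" height="%d" xlink:href="%s"/>'
        % (width, height, escape(image_href)),
    ]
    for text, x, y, size in lines:
        # In SVG, the y attribute of <text> is the baseline position.
        parts.append(
            '<text x="%d" y="%d" font-family="serif" font-size="%d">%s</text>'
            % (x, y, size, escape(text))
        )
    parts.append('</svg>')
    return "\n".join(parts)


svg = svg_page(600, 800, "page1-images.jpg",
               [("Chapter 1: Fixed Layout", 50, 80, 24)])
```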

In the rest of this book, I will focus only on the second method
(image + text).

* Conversion Tools

There are free open source and commercial tools to convert PDF to
ePub3-fxl, but some have drawbacks. For example, one of these tools gives a
very good visual result, but the text accessibility has a problem: no
spaces are present. The tool puts words at the correct positions, but does
not take care of the spaces between the words. When you copy/paste a
sentence, all the spaces are gone. Or, if you search for a word, the word
is not found (unless the word is between parentheses, for example). In
effect, each sentence is one very long word.
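The missing-space problem can be reproduced in a few lines. The sketch below assumes a viewer that searches for whole words, and shows one possible remedy: emit a space whenever the horizontal gap between two word boxes exceeds a threshold. The function names, the box format, and the threshold are illustrative assumptions, not the behaviour of any specific converter.

```python
import re


def find_word(text, word):
    """Whole-word search, as a reader's search box might do it (assumption)."""
    return re.search(r"\b%s\b" % re.escape(word), text) is not None


def join_words(boxes, gap_threshold=1.0):
    """Join word boxes from one line, adding a space where there is a gap.

    boxes is a list of (text, x_start, x_end), sorted by x_start.
    """
    out = []
    prev_end = None
    for text, x_start, x_end in boxes:
        if prev_end is not None and x_start - prev_end > gap_threshold:
            out.append(" ")
        out.append(text)
        prev_end = x_end
    return "".join(out)


boxes = [("A", 10, 18), ("practical", 22, 70), ("guide", 74, 105)]

no_spaces = "".join(text for text, _, _ in boxes)  # "Apracticalguide"
with_spaces = join_words(boxes)                    # "A practical guide"

found_without = find_word(no_spaces, "practical")         # False
found_with = find_word(with_spaces, "practical")          # True
found_in_parens = find_word("see(practical)guide", "practical")  # True
```

The last line matches the observation above: punctuation such as parentheses still creates word boundaries even when the spaces are gone.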

The tool and method I will describe below are free, and give a very good
result for both the visual aspect and the text accessibility. The tool I
will use is pdf2htmlEX, developed by Lu Wang (pseudonym: coolwanglu), a
Chinese PhD student at the Department of Computer Science and Engineering
of the Hong Kong University of Science and Technology. You can find it here:

http://coolwanglu.github.io/pdf2htmlEX

This tool, as its name suggests, converts the PDF pages to HTML pages and
does not produce an ePub file. To get an ePub3-fxl file, I will show how to
use the output of pdf2htmlEX to create the ePub3-fxl file. This mainly
means: a) removing the HTML viewer that pdf2htmlEX produces and integrates
into the result; b) creating all the files required by the ePub format and
wrapping the result into one single file.
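Step b) can be sketched roughly as follows with Python's standard zipfile module. This is a minimal illustration, not the procedure from the book: it assumes the XHTML pages have already been cleaned of the viewer, and the names OEBPS/content.opf, nav.xhtml, the identifier, and the helper functions are all choices made for this example. The points that come from the ePub 3 specifications are the uncompressed "mimetype" entry stored first, META-INF/container.xml, a navigation document, and the rendition:layout property set to pre-paginated.

```python
import io
import zipfile

CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">\n'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles>\n'
    '</container>\n'
)


def build_opf(title, identifier, modified, hrefs):
    """Package document declaring a pre-paginated (fixed layout) book."""
    items = ['<item id="nav" href="nav.xhtml" '
             'media-type="application/xhtml+xml" properties="nav"/>']
    refs = []
    for n, href in enumerate(hrefs, 1):
        items.append('<item id="page%d" href="%s" '
                     'media-type="application/xhtml+xml"/>' % (n, href))
        refs.append('<itemref idref="page%d"/>' % n)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0" '
        'unique-identifier="uid" '
        # Prefix declared explicitly, as in the IDPF fixed layout document.
        'prefix="rendition: http://www.idpf.org/vocab/rendition/#">\n'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
        '<dc:identifier id="uid">%s</dc:identifier>\n'
        '<dc:title>%s</dc:title>\n'
        '<dc:language>en</dc:language>\n'
        '<meta property="dcterms:modified">%s</meta>\n'
        '<meta property="rendition:layout">pre-paginated</meta>\n'
        '</metadata>\n'
        '<manifest>\n%s\n</manifest>\n'
        '<spine>\n%s\n</spine>\n'
        '</package>\n'
        % (identifier, title, modified, "\n".join(items), "\n".join(refs))
    )


def build_nav(hrefs):
    """Minimal navigation document listing every page."""
    links = "".join('<li><a href="%s">Page %d</a></li>' % (href, n)
                    for n, href in enumerate(hrefs, 1))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">\n'
        '<head><title>Contents</title></head>\n'
        '<body><nav epub:type="toc"><ol>%s</ol></nav></body>\n'
        '</html>\n' % links
    )


def write_epub(target, title, identifier, modified, pages):
    """Write the container; pages is a list of (href, xhtml_string)."""
    hrefs = [href for href, _ in pages]
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf",
                    build_opf(title, identifier, modified, hrefs))
        zf.writestr("OEBPS/nav.xhtml", build_nav(hrefs))
        for href, xhtml in pages:
            zf.writestr("OEBPS/" + href, xhtml)


# Example: build a one-page book in memory (placeholder page content).
buf = io.BytesIO()
write_epub(
    buf, "Demo", "urn:uuid:00000000-0000-0000-0000-000000000000",
    "2014-06-05T09:30:18Z",
    [("page1.xhtml",
      '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>1</title>'
      '<meta name="viewport" content="width=600, height=800"/></head>'
      '<body><p>Page 1</p></body></html>')],
)
```

A real book would also list the images, fonts, and CSS files in the manifest; running the result through a validator such as EpubCheck is a sensible final step.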

Best regards,

Eric Dodémont


On 5 June 2014 11:16, Peter Nermander <peter at nermander.se> wrote:

> > It doesn't fix my problem, but it helps understand why it's sufficiently
> > complex that the tool is not there, yet. The original point still stands
> > though, and this makes it clearer, (at least to me), why Scribus is the
> > right place to export the PDF, which Scribus knows how to write. Therefore
> > it would also know how to export the epub correctly as well. I think.
> >
> No, it's still not that easy. Seems I have to take an example.
>
> Imagine that on each page you have 3 pictures, each with a caption. The caption
> is next to the picture (not above or below). The picture and caption are
> separate frames.
>
> Now, the pictures alternate between being at the left side (with the
> caption to the right) and at the right side (with the caption to the
> left). When you export to epub you surely want all the captions to go
> either above or below each picture (same for all pictures). But could you
> describe the algorithm Scribus should use to decide in what order it shall
> export the pictures and the captions?
>
> Going from top left to bottom right will not work well. Note also that
> going from top left to bottom right can be done sideways first (most
> relevant for this case) or down first (more relevant for a regular 2 column
> layout).
>
> /Peter