Discussion about the back-end format for the eBook reader
This file is for the discussion of the eBook format for the eBook reader on the laptops. Many of these features also coincide with the OEPC project's needs, and are here in a shared document.
I am working on the "external data sources" component of the eBook reader as my Google Summer of Code project. The current plan is to have an API/library which searches various eBook sources (currently Project Gutenberg and ICDLbooks.org), downloads the books it finds, and converts them to a single standard format, the only format the eBook reader on the laptops will need to read. This allows for flexibility in where the laptops can grab eBooks from: for each desired eBook source, a module will be written which handles the downloading and conversion of books. The reader will then only have to worry about reading one format, rather than many.
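The one-module-per-source plan could be sketched roughly as below. This is only an illustration: the class and method names (BookSource, search, fetch, convert, Library) are hypothetical and not taken from the actual project code, and the demo source keeps its books in memory instead of contacting Project Gutenberg or ICDLbooks.org.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class BookRecord:
    """Minimal search result; the fields are illustrative only."""
    source: str
    book_id: str
    title: str


class BookSource(ABC):
    """One subclass per eBook source (hypothetical interface)."""

    name = "unnamed"

    @abstractmethod
    def search(self, query):
        """Return a list of BookRecord objects matching the query."""

    @abstractmethod
    def fetch(self, record):
        """Return the book in the source's native format, as bytes."""

    @abstractmethod
    def convert(self, raw):
        """Return the book converted to the standard back-end format."""


class InMemorySource(BookSource):
    """Stand-in source so the sketch runs without network access."""

    name = "demo"

    def __init__(self, books):
        self._books = books  # maps book_id -> (title, raw bytes)

    def search(self, query):
        q = query.lower()
        return [BookRecord(self.name, book_id, title)
                for book_id, (title, _) in self._books.items()
                if q in title.lower()]

    def fetch(self, record):
        return self._books[record.book_id][1]

    def convert(self, raw):
        # Placeholder conversion: decode to text. A real module would
        # emit whichever back-end format is eventually chosen.
        return raw.decode("utf-8")


class Library:
    """Front end used by the reader: searches every registered source."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        self._sources[source.name] = source

    def search(self, query):
        results = []
        for source in self._sources.values():
            results.extend(source.search(query))
        return results

    def get(self, record):
        # Download from the record's own source, then convert.
        source = self._sources[record.source]
        return source.convert(source.fetch(record))
```

With this shape, supporting a new eBook source would mean writing one more BookSource subclass and registering it; the reader itself would only ever see the converted output.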
This discussion is being opened to determine the "best" format to use for this standard, back-end format. Desired features will first be discussed, and then a comparison of several possible candidates will be made. If you have suggestions, please contribute.
Desired features
- File format should be easy for both humans and computers to read and write. This allows those without significant technical background to create their own books (teachers could create textbooks for use in the reader, etc).
- Formats that already have an established Python library written for them will be considered first. While this is not mandatory, the external data sources component is being written in Python, so an existing library would speed up development.
- Format needs to be paginated in order to display one page at a time on the screen. Limited resolution and font (re)formatting need to be taken into account to "resize" the pages.
- Ability to embed images. This is obviously useful.
- Ability to embed URLs/hyperlinks. Also obviously useful.
- Small file format. With limited disk space and internet access on the laptops, a small file format is preferable.
- UTF-8 support. Elaborate on this.
- Should be able to handle text formatting, such as bold, italics, font sizes/styles, etc. Possibly HTML-equivalent.
- Ability to display mathematical formulae. This is already supported in formats like LaTeX, but could also be implemented as image display in other formats without built-in formulae rendering.
- Text needs to be able to be indexed/searched easily. This rules out formats that are basically just images.
- Needless to say, it needs to be in an open, non-proprietary format.
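To make the pagination and search requirements above a little more concrete, here is a minimal sketch that works on plain text only. It assumes the screen can be described by a fixed number of characters per line and lines per page (both parameters are hypothetical); a real reader would have to measure rendered fonts instead. Changing the font size would then amount to calling paginate again with different numbers.

```python
import textwrap


def paginate(text, chars_per_line, lines_per_page):
    """Split plain text into pages of wrapped lines.

    Paragraphs are separated by blank lines in the input; one blank line
    is kept between paragraphs in the output.
    """
    lines = []
    for paragraph in text.split("\n\n"):
        # Collapse internal whitespace, then wrap to the line width.
        wrapped = textwrap.wrap(" ".join(paragraph.split()),
                                width=chars_per_line)
        if not wrapped:
            continue  # skip empty paragraphs
        if lines:
            lines.append("")  # paragraph separator
        lines.extend(wrapped)
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]


def find(pages, term):
    """Return the 1-based numbers of the pages containing the term.

    Lines of a page are joined with spaces, so a phrase wrapped across
    two lines of the same page is still found; a phrase split across
    two pages is not.
    """
    needle = term.lower()
    return [number for number, page in enumerate(pages, start=1)
            if needle in " ".join(page).lower()]
```

For example, `paginate(book_text, 40, 20)` would produce pages of at most 20 lines of at most 40 characters each, and `find(pages, "rabbit")` would list the pages to jump to.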
Candidate formats
In addition to the following formats, see http://en.wikipedia.org/wiki/Comparison_of_document_markup_languages for other possibilities, or suggest one not listed here. http://en.wikipedia.org/wiki/Ebook is also a good source.
- XML: currently my "favorite" format. See the proposed format in this repository, called "Proposed_eBook_Format_XML.xml". Also see the 'sandbox' directory to see some experiments with XML. XML seems to provide nearly everything that's needed, and is easy to work with. Several document formats are XML-based, and an example of an XML eBook implementation is FictionBook, at http://haali.cs.msu.ru/pocketpc/FictionBook_description.html . FictionBook is also a possibility; however, it doesn't appear to do much beyond plain XML.
- Open eBook: http://www.idpf.org/oebps/oebps_faq.htm , also XML-based, but perhaps beyond what's needed for the laptops. More research is needed for this format.
- LaTeX: created by Leslie Lamport on top of Donald Knuth's TeX; handles text formatting, images, math formulae rendering, output to PDF, and other things (elaborate). Potentially a good choice due to its many features, but it poses coding challenges, such as text indexing and searching.
- PDF: probably not a good format on its own, but formats such as LaTeX generate PDFs.
- HTML: this may be good for browser-based reading, and many eBook sources already provide their books in HTML formats.
- XHTML: nice blend of HTML and XML. Allows standard HTML tags plus any custom tags needed (extensible).
- Markdown + HTML -> ztxt -> display
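As a rough illustration of why an XML-based candidate is easy to work with from Python, the sketch below reads a small book with the standard library's xml.etree.ElementTree. The element names (book, title, page, p, img, a) are invented for this example and are not taken from Proposed_eBook_Format_XML.xml, FictionBook, or Open eBook.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, made up for this sketch only.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>Sample Book</title>
  <page>
    <p>First paragraph with <b>bold</b> text.</p>
    <img src="cover.png"/>
  </page>
  <page>
    <p>See <a href="http://www.gutenberg.org/">Project Gutenberg</a>.</p>
  </page>
</book>
"""


def load_book(xml_bytes):
    """Parse the hypothetical format into a plain dictionary."""
    root = ET.fromstring(xml_bytes)
    pages = []
    for page in root.findall("page"):
        pages.append({
            # itertext() flattens inline markup such as <b> or <a>,
            # which keeps the text easy to index and search.
            "text": [" ".join("".join(p.itertext()).split())
                     for p in page.findall("p")],
            "images": [img.get("src") for img in page.iter("img")],
            "links": [a.get("href") for a in page.iter("a")],
        })
    return {"title": root.findtext("title"), "pages": pages}


book = load_book(SAMPLE.encode("utf-8"))
```

Here pagination is carried by explicit page elements; whether pages should be stored in the file or computed by the reader at display time is one of the open questions for the format.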
File Compression
In order to cut down on memory usage on the laptops, books will need to be compressed in a format which allows seeking/random access. LZMA will probably be used.
LZMA features (taken from their website):
- Compressing speed: 500 KB/s on 1 GHz CPU
- Decompressing speed:
  - 8-12 MB/s on 1 GHz Intel Pentium 3 or AMD Athlon.
  - 500-1000 KB/s on 100 MHz ARM, MIPS, PowerPC or other simple RISC CPU.
- Small memory requirements for decompressing: 8-32 KB + DictionarySize
- Small code size for decompressing: 2-8 KB (depending from speed optimizations)
A Python binding for LZMA exists at http://www.joachim-bauch.de/projects/python/pylzma .
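One possible way to get seeking/random access out of LZMA is to compress the book in independent fixed-size blocks and keep a small index of where each compressed block starts. The sketch below is an assumption-laden illustration, not the planned implementation: it uses the lzma module from the Python 3 standard library rather than the pylzma binding linked above, and the block size and index layout are arbitrary choices made for the example.

```python
import lzma

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block; an arbitrary choice


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as independent LZMA blocks.

    Returns (blob, index), where index[i] is the (offset, length) of
    compressed block i inside blob.
    """
    parts, index, offset = [], [], 0
    for start in range(0, len(data), block_size):
        packed = lzma.compress(data[start:start + block_size])
        index.append((offset, len(packed)))
        parts.append(packed)
        offset += len(packed)
    return b"".join(parts), index


def read_range(blob, index, start, length, block_size=BLOCK_SIZE):
    """Return data[start:start + length], decompressing only the
    blocks that overlap the requested range."""
    if length <= 0 or not index:
        return b""
    first = start // block_size
    last = min((start + length - 1) // block_size, len(index) - 1)
    chunks = []
    for i in range(first, last + 1):
        offset, size = index[i]
        chunks.append(lzma.decompress(blob[offset:offset + size]))
    joined = b"".join(chunks)
    skip = start - first * block_size  # bytes to drop from the first block
    return joined[skip:skip + length]
```

Smaller blocks make jumping to a page cheaper but compress less well, since each block starts with an empty dictionary and carries its own container overhead; the right trade-off would have to be measured on the laptops.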
z.m.l.
i have spent considerable time and energy developing a format that i call z.m.l. -- "zen markup language" -- which could be used to great effect in your wonderful project. of the formats you have listed, z.m.l. is close to markdown, in the sense that it's a no-angle-bracket non-markup form of markup. i've proven its worthiness with a cross-platform reference implementation (as well as perl variants that could also execute server-side), morphing the raw-ascii e-texts of project gutenberg into e-books that are beautiful _and_ powerful. the proof is in the pudding.
on a more mundane note, i've also done a bit of cleanup of the project gutenberg catalog that might save you some of your time.
i didn't see a "talk" page, so i put this here, but it'd be fine to move it elsewhere if you'd like; but do please contact me: bowerbird@…
-bowerbird
