and Other PDF Solutions

by Thomas Leitner a.k.a. gettalong

Several PDF libraries already exist for Ruby

PDF generation


PDF reading


PDF inspection


PDF merging


PDF framework

(originally intended for auditing PDFs)

What do all these projects have in common?

  • Prawn

  • pdf-reader

  • pdf-inspector

  • combine_pdf

  • origami

They only implement a subset of PDF functionality.

There was no purpose-designed PDF solution



hexapdf homepage screenshot

HexaPDF is …

  • a versatile PDF creation and manipulation library for Ruby

  • a standalone application for performing the most common PDF tasks like merging files

  • designed with ease of use and performance in mind


Ruby-esque API

require 'hexapdf'

doc = HexaPDF::Document.new
canvas = doc.pages.add.canvas
canvas.font('Helvetica', size: 100)
canvas.text("Hello World!", at: [20, 400])

PDF objects are represented by native Ruby objects

require 'hexapdf'

doc = HexaPDF::Document.new
info = doc.trailer.info
info[:Title] = 'This is the PDF document title'
info[:CreationDate] = Time.now
doc.catalog[:PageLayout] = :SinglePage
doc.catalog[:NeedsRendering] = true

Automatic conversion of “special” data types like dates and binary strings

require 'hexapdf'
require 'stringio'

doc = HexaPDF::Document.new
doc.trailer.info[:CreationDate] = Time.now
doc.trailer.info[:UnknownField] = Time.now
out = StringIO.new
doc.write(out)  # serialize the document so it can be read back below

doc = HexaPDF::Document.new(io: out)
info = doc.trailer.info
p info.data.value[:CreationDate]  # => "D:20180826074144+02'00'"
p info[:CreationDate]             # => 2018-08-26 07:41:44 +0200
p info.data.value[:CreationDate]  # => 2018-08-26 07:41:44 +0200
p info[:UnknownField]             # => "D:20180826074144+02'00'"

See lib/hexapdf/dictionary_fields.rb and HexaPDF::Dictionary#[]
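
The round trip between a Ruby Time and the PDF date string shown above can be illustrated with a small stdlib-only sketch. The module name PDFDateSketch and its two methods are hypothetical and are not HexaPDF's own converter; the sketch also assumes the full "D:YYYYMMDDHHmmSS+HH'mm'" form with an explicit offset.

```ruby
# Hypothetical helper, not part of HexaPDF: converts between Time and the
# PDF date string format "D:YYYYMMDDHHmmSS+HH'mm'".
module PDFDateSketch
  PATTERN = /\AD:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'\z/

  # Time -> PDF date string
  def self.dump(time)
    offset = time.utc_offset
    sign = offset < 0 ? '-' : '+'
    hours, minutes = (offset.abs / 60).divmod(60)
    time.strftime('D:%Y%m%d%H%M%S') + format("%s%02d'%02d'", sign, hours, minutes)
  end

  # PDF date string -> Time; returns nil if the string is not in the full form.
  def self.load(string)
    m = PATTERN.match(string)
    return nil unless m
    Time.new(m[1].to_i, m[2].to_i, m[3].to_i, m[4].to_i, m[5].to_i, m[6].to_i,
             "#{m[7]}#{m[8]}:#{m[9]}")
  end
end

time = Time.new(2018, 8, 26, 7, 41, 44, "+02:00")
p PDFDateSketch.dump(time)  # => "D:20180826074144+02'00'"
```

A library can apply such a conversion only to fields it knows to be dates, which would explain why :UnknownField above stays a plain string.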

Automatic conversion of PDF types

class HexaPDF::Type::Info < Dictionary

  define_type :XXInfo
  define_field :Title,        type: String, version: '1.1'
  define_field :Author,       type: String
  define_field :Subject,      type: String, version: '1.1'
  define_field :Keywords,     type: String, version: '1.1'
  define_field :Creator,      type: String
  define_field :Producer,     type: String
  define_field :CreationDate, type: PDFDate
  define_field :ModDate,      type: PDFDate
  define_field :Trapped,      type: Symbol, version: '1.3'
end

HexaPDF::GlobalConfiguration['object.type_map'][:XXInfo] =
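
How a class-level define_field declaration can drive per-field behaviour is sketched below with a toy base class. SketchDictionary and SketchInfo are hypothetical names, and the type check on assignment is an assumption made for illustration; HexaPDF's real Dictionary class is considerably richer and may behave differently.

```ruby
# Toy field-definition DSL (hypothetical classes, not HexaPDF's implementation).
class SketchDictionary
  Field = Struct.new(:type, :required, :default)

  # Field definitions are stored per class.
  def self.fields
    @fields ||= {}
  end

  def self.define_field(name, type:, required: false, default: nil)
    fields[name] = Field.new(type, required, default)
  end

  def initialize(value = {})
    @value = value
  end

  # Returns the stored value, falling back to the declared default.
  def [](name)
    field = self.class.fields[name]
    return @value[name] unless field
    @value.fetch(name) { field.default }
  end

  # Rejects values whose class does not match the declared field type.
  def []=(name, data)
    field = self.class.fields[name]
    if field && !data.is_a?(field.type)
      raise TypeError, "#{name} expects #{field.type}, got #{data.class}"
    end
    @value[name] = data
  end
end

class SketchInfo < SketchDictionary
  define_field :Title,   type: String
  define_field :Trapped, type: Symbol, default: :Unknown
end

info = SketchInfo.new
info[:Title] = 'Report'
p info[:Trapped]  # => :Unknown
```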

Low-level API with convenience API on top

class HexaPDF::Type::PageTreeNode < Dictionary

  define_type :Pages
  define_field :Type,   type: Symbol, required: true, default: type
  define_field :Parent, type: Dictionary, indirect: true
  define_field :Kids,   type: Array, required: true, default: []
  define_field :Count,  type: Integer, required: true, default: 0

  # Convenience methods (bodies elided)
  def page_count; end
  def page(index); end
  def insert_page(index, page); end
  def add_page(page); end
  def delete_page(page); end
  def each_page; end
end
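
The layering can be illustrated with a toy page tree: the raw :Kids and :Count entries stay reachable through value, while the convenience methods keep them consistent. SketchPageTree is a hypothetical, flat (non-nested) stand-in and not HexaPDF's PageTreeNode.

```ruby
# Toy flat page tree (hypothetical class): raw dictionary entries underneath,
# convenience methods on top that keep :Kids and :Count in sync.
class SketchPageTree
  attr_reader :value

  def initialize
    @value = {Type: :Pages, Kids: [], Count: 0}
  end

  def page_count
    value[:Count]
  end

  def page(index)
    value[:Kids][index]
  end

  # Inserts the page hash at the given index (-1 appends).
  def insert_page(index, page)
    value[:Kids].insert(index, page)
    value[:Count] += 1
    page[:Parent] = self
    page
  end

  def add_page(page)
    insert_page(-1, page)
  end

  # Removes exactly the given page object (compared by identity).
  def delete_page(page)
    index = value[:Kids].index { |kid| kid.equal?(page) }
    return unless index
    value[:Kids].delete_at(index)
    value[:Count] -= 1
    page.delete(:Parent)
    page
  end
end

tree = SketchPageTree.new
tree.add_page({Type: :Page})
p tree.page_count  # => 1
```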


Validation of PDF objects

class HexaPDF::Type::Trailer
  def perform_validation
    unless value[:ID]
      msg = if value[:Encrypt]
              "ID field is required when an Encrypt dictionary is present"
            else
              "ID field should always be set"
            end
      yield(msg, true)
    end

    unless value[:Root]
      yield("A PDF document must have a Catalog dictionary", true)
      value[:Root] = document.add(Type: :Catalog)
      value[:Root].validate {|message, correctable| yield(message, correctable) }
    end

    if value[:Encrypt] && (!document.security_handler ||
      yield("Encryption key doesn't match encryption dictionary", false)
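
The yield(message, correctable) protocol used above can be tried out in isolation with a toy class. SketchTrailer, its :Size check and the placeholder ID are hypothetical; the sketch only assumes that a problem is reported first and, if correctable, fixed right afterwards.

```ruby
# Toy validation protocol (hypothetical class): every problem is reported as
# (message, correctable); correctable problems are fixed after being reported.
class SketchTrailer
  attr_reader :value

  def initialize(value = {})
    @value = value
  end

  # Returns true if no uncorrectable problem was found.
  def validate
    valid = true
    perform_validation do |message, correctable|
      yield(message, correctable) if block_given?
      valid = false unless correctable
    end
    valid
  end

  private

  def perform_validation
    unless value[:ID]
      yield("ID field should always be set", true)
      value[:ID] = ['placeholder-id', 'placeholder-id']  # illustrative fix only
    end

    unless value[:Size].is_a?(Integer)
      yield("Size field must be an integer", false)
    end
  end
end

trailer = SketchTrailer.new({Size: 3})
trailer.validate { |message, correctable| puts "#{message} (correctable: #{correctable})" }
```

Reporting through a block leaves the caller free to log, collect or abort on the messages without the validation code knowing which.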

Orthogonal design of classes

require 'hexapdf'

s = HexaPDF::Serializer.new
p s.serialize(Time.now)        # => "(D:20180826080743+02'00')"

source = HexaPDF::Filter.source_from_string('My String')
source = HexaPDF::Filter::ASCII85Decode.encoder(source)
p HexaPDF::Filter.string_from_source(source) # => "9mIj[FE2)5B)~>"

Fully tested

$ rake test
Run options: --seed 5344

# Running:

.......|SNIP 100s more dots|..............................

Finished in 2.093626s, 885.5449 runs/s, 13857.2973 assertions/s.

1854 runs, 29012 assertions, 0 failures, 0 errors, 0 skips
Coverage report generated. 9032 / 9032 LOC (100%) covered.


Optimized parsing and serializing

HexaPDF vs ? - file size optimization

optimization benchmark

Black HexaPDF, orange pdftk (GCJ), blue QPDF (C++)

Optimized text output

HexaPDF vs ? - raw text output

raw text benchmark

Black HexaPDF, orange Prawn

Avoiding work

HexaPDF vs ? - line wrapping

line wrapping benchmark

Black HexaPDF, orange Prawn, blue reportlab, green tcpdf

Low memory usage

  • Optimized code to avoid unnecessary allocations
  • Lazy loading - only load from PDF what is needed
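
Lazy loading can be sketched as a store that parses an object only on first access and caches the result. LazyObjectStore and its trivial parse step are hypothetical and stand in for real PDF object parsing.

```ruby
# Toy lazy object store (hypothetical class): serialized objects are parsed
# on first access only, and the parsed result is cached.
class LazyObjectStore
  attr_reader :loads  # number of parse operations performed so far

  # raw maps object numbers to their serialized form.
  def initialize(raw)
    @raw = raw
    @cache = {}
    @loads = 0
  end

  def object(number)
    @cache.fetch(number) do
      @loads += 1
      @cache[number] = parse(@raw.fetch(number))
    end
  end

  private

  # Stand-in for real parsing: integers become Integer, the rest stays a string.
  def parse(data)
    Integer(data, exception: false) || data
  end
end

store = LazyObjectStore.new({1 => '42', 2 => 'text', 3 => '7'})
store.object(1)
store.object(1)
p store.loads  # => 1
```

Objects 2 and 3 are never parsed in this run, which is where the time and memory savings come from when only a few objects of a large file are needed.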

optimization benchmark

Black HexaPDF, orange pdftk (GCJ), blue QPDF (C++)

Small output files

  • Generate readable (most of a PDF file is ASCII) but compact output

  • Use best compression available

  • hexapdf optimize produces smaller files than pdftk and qpdf

Demo time

hexapdf application

  • cmdparse library for command-style interface

  • Merging PDF files (comparison with pdftk)

  • Modifying a PDF file (selecting and optionally rotating pages)

  • Batch execution

Demo time

Code samples and comparisons

  • Raw text benchmark

  • Line wrapping benchmark

  • Image centering and stitching scripts

  • Complex text fitting

Future work

  • Text layout using classes like Paragraph, Table, …

  • AcroForm support

  • Document outlines (i.e. bookmarks)

  • More commands for the CLI

Summary and further information

Thank You!

What questions do you have?