Working With PDFs in Ruby

Thomas Leitner

github @gettalong
@_gettalong

Note

Shameless plug for one of my projects ;)

… but you might find it useful, too!

Current Situation

Prawn generates PDFs
pdf-reader reads PDFs
combinepdf has basic PDF merge functionality
origami is intended for auditing PDFs
Various Java libraries for JRuby (e.g. PDFBox)

Problems

No complete, integrated solution
PDF specification only partly supported
Performance/Memory consumption

What if…?

there was a library for reading and writing PDFs,
that supports most parts of the PDF specification,
supports content generation similar to Prawn,
can easily be extended if needed,
and is written in pure Ruby?

Meet HexaPDF

Goal: One library for all things PDF

Reading, modifying, writing PDFs
Encryption
Content Generation
Basic validation of PDFs
No Dependencies
Fast and memory efficient

Fully Tested

$ rake test

# Running:

...............|SNIP 100s more dots|........................

Finished in 1.949720s, 742.6706 runs/s, 14285.6380 assertions/s.

1448 runs, 27853 assertions, 0 failures, 0 errors, 0 skips
Coverage report generated. 6999 / 6999 LOC (100.0%) covered.

Read–Optimize–Write Performance

e.pdf	Time	Memory	File size
hexapdf	1.006ms	46.960KiB	21.770.465
origami	3.301ms	153.628KiB	21.796.847
pdftk	681ms	123.152KiB	21.874.883
qpdf-comp	1.517ms	65.172KiB	21.787.558
smpdf	37.539ms	647.440KiB	21.188.516

Full benchmark at https://gist.github.com/gettalong/
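
To compare the rows above, the figures can be normalised first. The following is a small sketch in plain Ruby. It assumes the dots in the table group thousands (as the file sizes suggest), so that `1.006ms` means 1006 ms. The `parse_grouped` helper and the `rows` variable are illustrative names and not part of any library.

```ruby
# Benchmark rows copied from the table above: [tool, time, memory, file size].
rows = [
  ['hexapdf',   '1.006ms',  '46.960KiB',  '21.770.465'],
  ['origami',   '3.301ms',  '153.628KiB', '21.796.847'],
  ['pdftk',     '681ms',    '123.152KiB', '21.874.883'],
  ['qpdf-comp', '1.517ms',  '65.172KiB',  '21.787.558'],
  ['smpdf',     '37.539ms', '647.440KiB', '21.188.516'],
]

# Assumption: dots group thousands, so "1.006ms" is 1006 milliseconds.
# Hypothetical helper: drops every non-digit character and returns an Integer.
def parse_grouped(value)
  value.delete('^0-9').to_i
end

# Use the hexapdf row as the baseline for both ratios.
base_time = parse_grouped(rows.first[1])
base_mem = parse_grouped(rows.first[2])

rows.each do |tool, time, memory, _size|
  time_ratio = parse_grouped(time).fdiv(base_time)
  mem_ratio = parse_grouped(memory).fdiv(base_mem)
  puts format('%-10s %5.2fx time  %5.2fx memory', tool, time_ratio, mem_ratio)
end
```

A ratio above 1 means the tool needed more time or memory than hexapdf for this file; a ratio below 1 means it needed less.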

Text Output Performance

Comparison with Python’s reportlab and Prawn,
using reportlab’s odyssey demo as benchmark.

reportlab w/ C extension	0.27s	930 pages/second
reportlab w/o C extension	0.67s	350 pages/second
Prawn	6.46s	35 pages/second
HexaPDF	0.96s	249 pages/second

Full benchmark at https://gist.github.com/gettalong/
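
The timings above are easier to read as factors relative to HexaPDF. The following is a minimal sketch in plain Ruby using only the seconds from the table; the `timings` and `factors` names are illustrative.

```ruby
# Wall-clock times in seconds, copied from the table above.
timings = {
  'reportlab w/ C extension'  => 0.27,
  'reportlab w/o C extension' => 0.67,
  'Prawn'                     => 6.46,
  'HexaPDF'                   => 0.96,
}

hexapdf = timings.fetch('HexaPDF')

# A factor above 1 means the tool took longer than HexaPDF.
factors = timings.transform_values { |seconds| (seconds / hexapdf).round(2) }

factors.each do |tool, factor|
  puts format('%-26s %5.2fx the HexaPDF time', tool, factor)
end
```

With these figures Prawn needs roughly 6.7 times as long as HexaPDF, while reportlab with its C extension needs a bit under a third of HexaPDF's time.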

Less Talk — More Code

Examples of what HexaPDF can do today

PDF Compression

HexaPDF::Document.open(ARGV.shift) do |doc|
  doc.task(:optimize, compact: true,
           object_streams: :generate, compress_pages: false)
  doc.write(ARGV.shift, validate: true)
end

Basic PDF File Merging

target = HexaPDF::Document.new
output_name = ARGV.shift

ARGV.each do |file|
  pdf = HexaPDF::Document.new(io: File.open(file, 'rb'))
  pdf.pages.each_page do |page|
    target.pages.add_page(target.import(page))
  end
end

target.trailer.info[:Title] = 'Merged document'
target.write(output_name)

Content Generation

doc = HexaPDF::Document.new
canvas = doc.pages.add_page.canvas

canvas.fill_color(255, 255, 0)
canvas.line(100, 100, 200, 200).stroke
canvas.ellipse(520, 50, a: 30, b: 15, inclination: 45).fill_stroke
canvas.image("picture.jpg", at: [300, 300], width: 100)
canvas.image("vectors.pdf", at: [500, 700], height: 200)
canvas.font("Times", size: 20, variant: :bold)
canvas.text("Works!", at: [100, 750])

doc.write("graphics.pdf")

Graphics Sample

TTF Sample

Text Sample

Text Processing

class SampleProcessor < HexaPDF::Content::Processor
  def show_text(str)
    boxes = decode_text_with_positioning(str)
    # do something with the character boxes
  end
end

doc = HexaPDF::Document.open(ARGV.shift)
processor = SampleProcessor.new
doc.pages.each_page do |page|
  puts "Processing page"
  page.process_contents(processor)
end
doc.write('output.pdf')

Text Processing Sample

So… what’s the catch?

Not yet released :-(

But in the next few weeks :-)

Thank you!

Slides available at
http://talks.gettalong.org/euruko2016/