Working With PDFs in Ruby


Thomas Leitner

github @gettalong
twitter @_gettalong

Note

Shameless plug for one of my projects ;)

… but you might find it useful, too!

Current Situation

  • Prawn generates PDFs
  • pdf-reader reads PDFs
  • combinepdf has basic PDF merge functionality
  • origami is intended for auditing PDFs
  • Various Java libraries for JRuby (e.g. PDFBox)

Problems

  • No complete, integrated solution
  • PDF specification only partly supported
  • Performance/Memory consumption

What if…?

  • there was a library for reading and writing PDFs,
  • that supports most parts of the PDF specification,
  • supports content generation similar to Prawn,
  • can easily be extended if needed,
  • and is written in pure Ruby?

Meet HexaPDF

Goal: One library for all things PDF

  • Reading, modifying, writing PDFs
  • Encryption
  • Content Generation
  • Basic validation of PDFs
  • No Dependencies
  • Fast and memory efficient

Fully Tested

$ rake test

# Running:

...............|SNIP 100s more dots|........................

Finished in 1.949720s, 742.6706 runs/s, 14285.6380 assertions/s.

1448 runs, 27853 assertions, 0 failures, 0 errors, 0 skips
Coverage report generated. 6999 / 6999 LOC (100.0%) covered.

Read–Optimize–Write Performance

e.pdf       Time       Memory       File size
hexapdf     1.006ms    46.960KiB    21.770.465
origami     3.301ms    153.628KiB   21.796.847
pdftk       681ms      123.152KiB   21.874.883
qpdf-comp   1.517ms    65.172KiB    21.787.558
smpdf       37.539ms   647.440KiB   21.188.516

Full benchmark at https://gist.github.com/gettalong/

Text Output Performance

Comparison with Python’s reportlab and Prawn,
using reportlab’s odyssey demo as benchmark.

reportlab w/ C extension    0.27s   930 pages/second
reportlab w/o C extension   0.67s   350 pages/second
Prawn                       6.46s    35 pages/second
HexaPDF                     0.96s   249 pages/second

Full benchmark at https://gist.github.com/gettalong/
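
To make the timings easier to compare, here is a small sketch that expresses each run time as a multiple of a chosen baseline. It uses only the numbers from the table above; the constant `TIMES` and the helper `relative_to` are names made up for this illustration and are not part of any of the libraries.

```ruby
# Wall-clock times in seconds, copied from the text output benchmark table.
TIMES = {
  "reportlab w/ C extension"  => 0.27,
  "reportlab w/o C extension" => 0.67,
  "Prawn"                     => 6.46,
  "HexaPDF"                   => 0.96,
}.freeze

# Hypothetical helper: returns each time divided by the baseline's time,
# rounded to two decimals (values above 1 mean "slower than the baseline").
def relative_to(baseline, times = TIMES)
  base = times.fetch(baseline)
  times.transform_values { |seconds| (seconds / base).round(2) }
end

relative_to("HexaPDF").each do |name, factor|
  puts format("%-26s %5.2fx", name, factor)
end
```

With HexaPDF as the baseline, Prawn comes out at roughly 6.7 times HexaPDF's run time, while reportlab with its C extension needs a bit less than a third of it.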

Less Talk — More Code

Examples of what HexaPDF can do today

PDF Compression

HexaPDF::Document.open(ARGV.shift) do |doc|
  doc.task(:optimize, compact: true,
           object_streams: :generate, compress_pages: false)
  doc.write(ARGV.shift, validate: true)
end

Basic PDF File Merging

target = HexaPDF::Document.new
output_name = ARGV.shift

ARGV.each do |file|
  pdf = HexaPDF::Document.new(io: File.open(file, 'rb'))
  pdf.pages.each_page do |page|
    target.pages.add_page(target.import(page))
  end
end

target.trailer.info[:Title] = 'Merged document'
target.write(output_name)

Content Generation

doc = HexaPDF::Document.new
canvas = doc.pages.add_page.canvas

canvas.fill_color(255, 255, 0)
canvas.line(100, 100, 200, 200).stroke
canvas.ellipse(520, 50, a: 30, b: 15, inclination: 45).fill_stroke
canvas.image("picture.jpg", at: [300, 300], width: 100)
canvas.image("vectors.pdf", at: [500, 700], height: 200)
canvas.font("Times", size: 20, variant: :bold)
canvas.text("Works!", at: [100, 750])

doc.write("graphics.pdf")

Graphics Sample

TTF Sample

Text Sample

Text Processing

class SampleProcessor < HexaPDF::Content::Processor
  def show_text(str)
    boxes = decode_text_with_positioning(str)
    # do something with the character boxes
  end
end

doc = HexaPDF::Document.open(ARGV.shift)
processor = SampleProcessor.new
doc.pages.each_page do |page|
  puts "Processing page"
  page.process_contents(processor)
end
doc.write('output.pdf')
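
The processor above works through callbacks: a method such as `show_text` is invoked whenever the matching operator turns up in a page's content stream. The toy dispatcher below sketches that pattern in plain Ruby. It is not HexaPDF's implementation; `ToyProcessor`, `TextCollector`, and the operator table are assumptions made for illustration. Only the operator names `Tj` (show text) and `Tf` (set font and size) come from the PDF specification.

```ruby
# Toy stand-in for a content stream processor; NOT HexaPDF's implementation.
class ToyProcessor
  # Hypothetical mapping from PDF operator names to handler method names.
  OPERATOR_TO_METHOD = { Tj: :show_text, Tf: :set_font }.freeze

  # Calls the handler for the operator if the (sub)class defines one;
  # operators without a handler are silently skipped.
  def process(operator, operands)
    handler = OPERATOR_TO_METHOD[operator]
    send(handler, *operands) if handler && respond_to?(handler, true)
  end
end

# Subclass that only cares about text-showing operators.
class TextCollector < ToyProcessor
  attr_reader :strings

  def initialize
    @strings = []
  end

  def show_text(str)
    @strings << str
  end
end

# Pre-parsed (operator, operands) pairs, as a parser might hand them over.
stream = [[:Tf, [:F1, 12]], [:Tj, ["Hello"]], [:Tj, ["World"]]]

collector = TextCollector.new
stream.each { |operator, operands| collector.process(operator, operands) }
puts collector.strings.join(" ")
```

Because `TextCollector` defines no `set_font` method, the `Tf` entry is ignored and only the two strings are collected. A subclass overrides just the callbacks it is interested in.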

Text Processing Sample

So… what’s the catch?

Not yet released :-(

But in the next few weeks :-)

Thank you!

Slides available at
http://talks.gettalong.org/euruko2016/
