PDF, PS and DjVu
This article covers software to view, edit and convert PDF, PostScript (PS), DjVu (déjà vu) and XPS files.
Engines
- DjVuLibre — Suite to create, manipulate and view DjVu documents.
- Ghostscript — Interpreter for PostScript and PDF. Provides the gs(1) command-line interface (see also /usr/share/doc/ghostscript/*/Use.htm, also available online), along with many wrapper scripts like ps2pdf and pdf2ps.
- libgxps — GObject based library for handling and rendering XPS documents.
- libspectre — Small library for rendering PostScript documents.
- MuPDF — Lightweight PDF, XPS, and EPUB viewer, consisting of a software library, command line tools, and viewers.
- Poppler — PDF rendering library based on Xpdf. For CJK (Chinese, Japanese, Korean) support with Poppler, install poppler-data.
Viewers
Framebuffer
- fbgs — Poor man's PostScript/PDF viewer for the Linux framebuffer console.
- fbpdf — Small framebuffer PDF and DjVu viewer based on MuPDF, with Vim keybindings and written in C.
- jfbview — Framebuffer PDF and image viewer. Features include Vim-like controls, zoom-to-fit, a TOC (outline) view and fast multi-threaded rendering.
Graphical
- apvlv — Lightweight document viewer with Vim keybindings using GTK libraries. Supports PDF, DjVu, EPUB, HTML and TXT.
- Atril — Simple multi-page document viewer for MATE. Supports DjVu, DVI, EPS, EPUB, PDF, PostScript, TIFF, XPS and Comicbook.
- CorePDF — Simple lightweight PDF viewer based on Qt and poppler. Part of C-Suite.
- Deepin Document Viewer — A simple PDF and DjVu reader, supporting bookmarks, highlights and annotations.
- DjView — Viewer for DjVu documents.
- Emacs — See also pdf-tools for improved PDF support (emacs-pdf-tools-gitAUR) and the djvu package for DjVu support.
- ePDFView — Lightweight PDF document viewer using the Poppler and GTK libraries. Development stopped.
- Foxit Reader — Small, fast (compared to Acrobat) proprietary PDF viewer. Releases (outside of security updates) are discontinued for Linux (November 2020).
- GNOME Document Viewer — Document viewer for GNOME using GTK. Supports DjVu, DVI, EPS, PDF, PostScript, TIFF, XPS and Comicbook. Part of gnome.
- gv — Graphical user interface for the Ghostscript interpreter for viewing and navigating PostScript and PDF documents.
- llpp — Very fast PDF reader based on MuPDF that supports continuous page scrolling, bookmarking, and text search through the whole document.
- MuPDF — Very fast EPUB, FictionBook, PDF, XPS and Comicbook viewer written in portable C. Features CJK font support and vim-like bindings.
- Okular — Universal document viewer for KDE. Supports CHM, Comicbook, DjVu, DVI, EPUB, FictionBook, Mobipocket, ODT, PDF, Plucker, PostScript, TIFF and XPS. Part of kde-graphics.
- Papers — Document viewer for GNOME using GTK. Supports DjVu, EPS, PDF, PostScript, TIFF, XPS and Comicbook.
- pdfpc — Presenter console with multi-monitor support for PDF files.
- qpdfview — Tabbed document viewer. It uses Poppler for PDF support, libspectre for PS support, DjVuLibre for DjVu support, CUPS for printing support and the Qt toolkit for its interface.
- Sioyek — Lightweight PDF viewer based on MuPDF with features designed for viewing research papers and technical books, e.g., marking, bookmarking, highlighting, searchable command palette, jumping to references, and more.
- https://sioyek.info/ || sioyekAUR
- Xpdf — Viewer that can decode LZW and read encrypted PDFs.
- Xreader — Document viewer part of the X-Apps Project. Supports DjVu, DVI, EPUB, PDF, PostScript, TIFF, XPS, Comicbook.
- Zathura — Highly customizable and functional document viewer (plugin based). Supports PDF, DjVu, PostScript and Comicbook.
Comparison
Name | PDF | PostScript | DjVu | XPS | PDF forms | PDF Annotation | Non-rectangle selection | License |
---|---|---|---|---|---|---|---|---|
Adobe Reader | Custom | – | – | – | Yes | – | Yes | proprietary |
apvlv | Poppler | – | DjVuLibre | – | No | – | No (not by default, at least) | GPLv2 |
Atril | Poppler | libspectre | DjVuLibre | libgxps | Yes | – | – | GPLv2 |
DjView | – | – | DjVuLibre | – | – | – | – | GPLv2 |
Emacs | Ghostscript¹ | Ghostscript¹ | DjVuLibre¹ | – | No | Yes | Yes | GPLv3 |
Emacs pdf-tools | Poppler | – | – | – | – | Yes | Yes | GPLv3 |
ePDFView | Poppler | – | – | – | No | – | – | GPLv2 |
Foxit Reader | Custom | – | – | – | Yes | Yes | Yes | proprietary |
GNOME Document Viewer | Poppler | libspectre | DjVuLibre | libgxps | Yes | Yes | Yes | GPLv2 |
gv | Ghostscript | Ghostscript | – | – | No | – | – | GPLv3 |
llpp | libmupdf | – | – | libmupdf | Yes | – | – | GPLv3 |
MuPDF | Custom | – | – | Custom | Yes (mupdf-gl) | Yes (mupdf-gl) | Yes (mupdf-gl) | AGPLv3 |
Okular | Poppler | libspectre | DjVuLibre | Custom | Yes | Yes | Yes | GPL, LGPL |
PDF4QT | Custom | – | – | – | No | Yes | Yes | LGPLv3 |
pdfpc | Poppler | – | – | – | No | – | – | GPLv2 |
qpdfview | Poppler | libspectre¹ | DjVuLibre¹ | – | Yes | Yes | – | GPLv2 |
Xpdf | Custom | – | – | – | No | – | – | GPLv3 |
Xreader | Poppler | libspectre¹ | DjVuLibre¹ | libgxps¹ | Yes | Yes | Yes | GPLv2 |
Zathura | libmupdf¹ / Poppler¹ | libspectre¹ | DjVuLibre¹ | libmupdf¹ | No | No | Yes | zlib |
¹ Optional dependency needs to be installed.
PDF forms
The PDF forms column in the above table refers to AcroForms support. If you do not need your input to be directly extractable from the PDF, you can also use the applications in #Graphical PDF editing to put text on top of a PDF. PDF forms can be created with LibreOffice Writer (View > Toolbars > Form Controls) and the advanced PDF editors.
The proprietary and deprecated XFA format for forms is not fully supported by Poppler[1][2]; it is only supported by Adobe Reader and Master PDF Editor.
Alternatively, web browsers such as Firefox or Chromium feature a built-in PDF viewer capable of filling out forms.
Graphical PDF editing
Editors that can import PDF files
- Scribus can import and export PDF; text is imported as polygons.[3]
- LibreOffice Draw can import and export PDF; text is imported as text; embedded fonts are substituted.[4][5]
- Inkscape can import and export PDF; text is imported as cloned glyphs or text; with the latter embedded fonts are substituted.
- Graphics editors like GIMP and Krita can also import and export PDFs at the cost of rasterization.
Basic editors
- flpsed — A PostScript and PDF annotator, only supports text boxes.
- HandyOutliner for DjVu / PDF — Makes creating bookmarks for DjVu and PDF documents easier and faster.
- jPDF Tweak — Java Swing application that can combine, split, rotate, reorder, watermark, encrypt, sign, and otherwise tweak PDF files.
- Paper Clip — PDF document metadata editor to edit the title, author, keywords and more details.
- PDF Arranger — Helps merge or split PDF documents and rotate, crop and rearrange pages. It is a maintained fork of PDF-Shuffler.
- PDF Chain — GTK front-end for PDFtk, written in C++, supporting concatenation, burst, watermarks, attaching files and more.
- PdfJumbler — Simple tool to rearrange, merge, delete and rotate pages in PDF files.
- PDF Mix Tool — Qt front-end for PoDoFo, written in C++, supports splitting, merging, rotating and mixing PDF files.
- PDFsam — Open source application, written in Java, supports merging, splitting and rotating.
- https://pdfsam.org/ || pdfsamAUR
- PDF Slicer — Simple application to extract, merge, rotate and reorder pages of PDF documents.
- PDF Tricks — Simple, efficient application for small manipulations in PDF files using Ghostscript.
Cropping tools
- briss — Java GUI to crop pages of PDF documents to one or more regions selected.
- krop — Simple graphical tool to crop the pages of PDF files.
- pdfCropMargins — Automatically crops the margins of PDF files.
- PdfHandoutCrop — Tool to crop PDF handouts with multiple pages per sheet.
Advanced editors
- Master PDF Editor — Functional proprietary PDF editor. Latest version free for non-commercial use. The -free package is outdated but lacks a watermark.
- PDF Studio — All-in-one proprietary PDF editor similar to Adobe Acrobat.
- PDF4QT — Open source PDF editor.
Comparison of advanced editors
Name | Cost (USD, lifetime) | Page Labels | Form Designer | Content Editing (Text and Images) | Optimize PDFs | Digitally Sign PDFs | License |
---|---|---|---|---|---|---|---|
Master PDF Editor | 85.34 | No | Yes | Yes | Yes | Yes | proprietary |
Qoppa PDF Studio Standard | 99 | Yes | No | No | No | No | proprietary |
Qoppa PDF Studio Pro | 139 | Yes | Yes | Yes | Yes | Yes | proprietary |
PDF tools
See also Ghostscript.
- Camelot — PDF table extraction for humans.
- Coherent PDF — Proprietary non-free command line tools to manipulate PDF files including merge, encrypt, decrypt, scale, crop, rotate, bookmarks, stamp, logos, page numbers.
- DiffPDF — Compare the text or the visual appearance of each page in two PDF files.
- mupdf-tools — Tools developed as part of MuPDF, contains mutool(1) and muraster.
- pdfcpu — Command-line tool to create and modify PDFs.
- pdf_extbook — Extract bookmarked PDF pages.
- pdfgrep — Command-line utility to search text in PDF files.
- pdfjam — Can be used to n-up, join, rotate and flip PDFs and arrange them into a format suitable for book binding.
- PDFMiner — Text extraction tool for PDF documents. Not actively maintained as of 2020.
- pdfminer.six — Community maintained fork of pdfminer.
- pdf2svg — Convert PDF files to SVG files.
- PDFtk — Simple tool for doing everyday things with PDF documents.
- QPDF — Content-preserving PDF transformation system.
- Stapler — Light alternative to PDFtk using the PyPDF2 library.
- Tabula — Tool for liberating data tables trapped inside PDF files.
- https://tabula.technology || tabulaAUR, tabula-javaAUR
- verapdf — A purpose-built, open source, file-format validator covering all PDF/A and PDF/UA parts and conformance levels.
- https://verapdf.org || verapdfAUR
Command snippets
Create a PDF from images
With GraphicsMagick:
$ gm convert 1.jpg 2.jpg 3.jpg out.pdf
With ImageMagick:
$ magick convert 1.jpg 2.jpg 3.jpg out.pdf
Note that ImageMagick's output is lossy. For lossless PDF creation from JPEG, use img2pdf.
Concatenate PDFs
With Ghostscript:
$ gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf -dBATCH 1.pdf 2.pdf 3.pdf
With PDFtk:
$ pdftk 1.pdf 2.pdf 3.pdf cat output out.pdf
With Poppler:
$ pdfunite 1.pdf 2.pdf 3.pdf out.pdf
With QPDF:
$ qpdf --empty --pages 1.pdf 2.pdf 3.pdf -- out.pdf
Extract text from PDF
With Poppler and maintaining the layout:
$ pdftotext -layout in.pdf out.txt
See also pdftotext(1).
With calibre:
$ ebook-convert in.pdf out.txt
Results vary between applications, depending on the PDF file.
Decrypt a PDF
This section lists commands to decrypt a PDF to an unencrypted file. Note that most PDF viewers also support encrypted PDFs.
With PDFtk:
$ pdftk in.pdf input_pw password output out.pdf
With Poppler to PostScript:
$ pdftops -upw password in.pdf out.ps
With QPDF:
$ qpdf --decrypt --password=password in.pdf out.pdf
Encrypt a PDF
The user password is used for encryption; the owner password restricts operations once the document is decrypted. For more information, see Wikipedia:PDF#Encryption and signatures.
With PDFtk:
$ pdftk in.pdf output out.pdf user_pw password
With PoDoFo:
$ podofoencrypt -u user_password -o owner_password in.pdf out.pdf
With QPDF:
$ qpdf --encrypt user_password owner_password key_length -- in.pdf out.pdf
where key_length can be 40, 128 or 256.
Extract images from a PDF
With Poppler, saving images as JPEG:
$ pdfimages -j infile.pdf outfileroot
Extract page range from PDF, split multipage PDF document
With Ghostscript as a single file:[6]
$ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dFirstPage=first -dLastPage=last -sOutputFile=outfile.pdf infile.pdf
With PDFtk as a single file:
$ pdftk infile.pdf cat first-last output outfile.pdf
With Poppler as separate files:
$ pdfseparate -f first -l last infile.pdf outfileroot-%d.pdf
With QPDF as a single file:
$ qpdf --empty --pages infile.pdf first-last -- outfile.pdf
With mutool as a single file:
$ mutool clean -g infile.pdf outfile.pdf first-last
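The single-range commands above can be repeated to cut a document into fixed-size chunks. The following is a minimal sketch, assuming Poppler's pdfinfo and QPDF are installed. The function names page_ranges and split_chunks and the stem-first-last.pdf output naming are made up for illustration.

```shell
#!/bin/sh
# Print consecutive page ranges ("first-last", one per line) covering
# pages 1..total in chunks of at most `size` pages.
# Hypothetical helper: $1 = total page count, $2 = chunk size.
page_ranges() (
    total=$1
    size=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + size - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first-$last"
        first=$((last + 1))
    done
)

# Split a PDF into chunk files with the QPDF invocation shown above.
# The page count is taken from the "Pages:" line printed by pdfinfo.
# Hypothetical helper: $1 = input PDF, $2 = chunk size.
split_chunks() (
    in=$1
    size=$2
    stem=${in%.pdf}
    total=$(pdfinfo "$in" | grep '^Pages:' | tr -dc '0-9')
    for range in $(page_ranges "$total" "$size"); do
        qpdf --empty --pages "$in" "$range" -- "$stem-$range.pdf"
    done
)
```

After sourcing these functions, split_chunks book.pdf 10 would write book-1-10.pdf, book-11-20.pdf and so on, next to the input file.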
Impose a PDF (nup)
PDF imposition is the process by which multiple input pages are combined into one output page, laid out in a grid of columns and rows.
It can be done with pdfjam, which takes the grid as columnsxrows (note that wrapper scripts such as pdfnup and pdfbook are deprecated):
$ pdfjam --nup columnsxrows input.pdf --outfile output.pdf
or with pdfsak:
$ pdfsak --input-file input.pdf --output output.pdf --nup rows columns
Inspect metadata
With ExifTool:
$ exiftool -All file.pdf
With Poppler:
$ pdfinfo file.pdf
Remove metadata
Using ExifTool
With ExifTool:
$ exiftool -All= -overwrite_original input.pdf
$ mv input.pdf /tmp/temp.pdf
$ qpdf --linearize /tmp/temp.pdf input.pdf
The linearize step is needed to prevent recovery of deleted metadata. See this SuperUser question and the related ExifTool forum thread.
Using pdftk
Many PDFs store document metadata using both an Info dictionary (old school) and an XMP stream (new school). This pdftk command removes the XMP stream from the PDF altogether. It does not remove the Info dictionary.
Note that objects inside the PDF might have their own, separate XMP metadata streams, and that this command does not remove those. It only removes the PDF’s document‐level XMP stream.
$ pdftk input.pdf drop_xmp output output.pdf
Reduce size of a PDF
PDF size can be reduced by setting an appropriate optimization or compression level.
With Ghostscript one of:
$ ps2pdf -dPDFSETTINGS=/screen in.pdf out.pdf
or
$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -sOutputFile=out.pdf in.pdf
For different settings see the documentation.
There is also shrinkpdfAUR, a script wrapping gs.
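Recompressing does not always help: a PDF that is already well optimized can come out larger. The following sketch wraps the gs command above and keeps the result only if it is smaller than the input. The function name shrink_if_smaller and the default /screen preset are choices made for this illustration, not part of Ghostscript.

```shell
#!/bin/sh
# Recompress a PDF with Ghostscript's pdfwrite device and keep the
# result only if it is actually smaller than the input.
# Hypothetical helper: $1 = input, $2 = output,
# $3 = PDFSETTINGS preset (defaults to /screen).
shrink_if_smaller() (
    in=$1
    out=$2
    preset=${3:-/screen}
    tmp=$(mktemp) || exit 1
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
        -dPDFSETTINGS="$preset" -sOutputFile="$tmp" "$in" \
        || { rm -f "$tmp"; exit 1; }
    # Compare byte counts; the arithmetic expansion strips any padding
    # that wc may print.
    new_size=$(($(wc -c < "$tmp")))
    old_size=$(($(wc -c < "$in")))
    if [ "$new_size" -lt "$old_size" ]; then
        mv "$tmp" "$out"
    else
        rm -f "$tmp"
        cp "$in" "$out"
        echo "no size gain, input copied unchanged" >&2
    fi
)
```

For example, shrink_if_smaller in.pdf out.pdf /printer tries the /printer preset and falls back to a plain copy when nothing is gained.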
Rasterize a PDF
These commands will convert your PDF into images.
With GraphicsMagick to convert a specific page into an image file:
$ gm convert -density dpi infile.pdf[page] outfile.jpg
With ImageMagick to convert a specific page into an image file:
$ magick convert -density dpi infile.pdf[page] outfile.jpg
With ImageMagick to convert all pages into another PDF file composed by an image file per page:
$ magick convert -density dpi infile.pdf outfile.pdf
With Poppler to convert all pages into one image file per page:
$ pdftoppm -jpeg -r dpi infile.pdf outfileroot
With Poppler to convert a specific page into an image file:
$ pdftoppm -jpeg -r dpi -f page -singlefile infile.pdf outfileroot
Split PDF pages
With mupdf-tools to split every page vertically into two pages:
$ mutool poster -y 2 in.pdf out.pdf
This can be used to undo simple imposition.
Add an image
Adding an image to any location in a PDF can be done
- with ImageMagick (convert), xvAUR and pdftk. (Wrapper script)
- with xournalAUR
- with LibreOffice
Details on these and other solutions can be found on StackExchange.
Add digital signature to PDF
jsignpdfAUR can digitally sign PDF files with X.509 certificates in GUI and CLI.
Readers such as Okular and MuPDF can sign PDFs with digital signatures. This requires a PFX certificate, which can be created with an OpenSSL command:
$ openssl req -x509 -days 365 -newkey rsa:2048 -keyout cert.pem -out cert.pem
$ openssl pkcs12 -export -in cert.pem -out cert.pfx
MuPDF users can then sign PDFs with cert.pfx using the graphical interface, or its mutool-sign tool.
Okular users must import cert.pfx into a certificate store such as the one in the default Firefox profile.[7] With Firefox this is done through Settings > Privacy & Security > View Certificates > Your Certificates > Import and selecting cert.pfx. Afterwards Okular will offer this certificate to be used when signing PDFs.
LibreOffice can also sign PDFs.[8]
Removing annotations from a PDF
With PDFtk:
$ pdftk in.pdf output - uncompress | sed '/^\/Annots/d' | pdftk - output out.pdf compress
With perl-cam-pdfAUR:
$ rewritepdf.pl -C in.pdf out.pdf
See https://superuser.com/a/1051543 for more information.
Add page numbers
With pdfsak:
$ pdfsak --input-file input.pdf --output output.pdf --text "\large \$page/\$pages" br 0.99 0.99 --latex-engine xelatex --font "Noto Regular"
Add page labels
Page labels are logical page numbers shown in the navigation bar of your PDF reader. They are useful, for example, if the first pages of the PDF are indices numbered with Roman numerals (I, II, etc.), while the page numbered "1" corresponds to a PDF page greater than 1, and you want the page number shown in the navigation bar to correspond to the page number printed on the physical page.
This should not be confused with adding page numbers into a physical page. See section 12.4.2 of PDF reference to better understand page labels.
- Using pagelabels-py, let's say we have a PDF named my_document.pdf that has 12 pages.
  - Pages 1 to 4 should be labelled Intro I to Intro IV.
  - Pages 5 to 9 should be labelled 2 to 6.
  - Pages 10 to 12 should be labelled Appendix A to Appendix C.
- We can issue the following list of commands:
$ python3 -m pagelabels --delete "my_document.pdf"
$ python3 -m pagelabels --startpage 1 --prefix "Intro " --type "roman uppercase" "my_document.pdf"
$ python3 -m pagelabels --startpage 5 --firstpagenum 2 "my_document.pdf"
$ python3 -m pagelabels --startpage 10 --prefix "Appendix " --type "letters uppercase" "my_document.pdf"
- Note: pagelabels-py will convert your file to the PDF 1.3 specification.
- Using pdftk, create a metadata.txt file with labels:
PageLabelBegin
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: Cover
PageLabelNumStyle: NoNumber
PageLabelBegin
PageLabelNewIndex: 2
PageLabelStart: 1
PageLabelPrefix: Back Cover
PageLabelNumStyle: NoNumber
PageLabelBegin
PageLabelNewIndex: 3
PageLabelStart: 1
PageLabelNumStyle: LowercaseRomanNumerals
PageLabelBegin
PageLabelNewIndex: 27
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
- Where:
  - PageLabelBegin signals that a new page label definition will follow.
  - PageLabelNewIndex is the PDF page index from which the numbering style applies, counting from one. The numbering style will continue until the next page label or, if there are no more page labels, until the end of the document.
  - PageLabelStart is the starting number. For example, if you specify 5 here, the pages will be numbered 5, 6, 7, ...
  - PageLabelPrefix is a text to put before the number in page labels.
  - PageLabelNumStyle can be DecimalArabicNumerals, UppercaseRomanNumerals, LowercaseRomanNumerals, UppercaseLetters, LowercaseLetters or NoNumber.
- Then use:
$ pdftk book.pdf update_info_utf8 metadata.txt output book-with-metadata.pdf
See this SuperUser question for more details.
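Writing the pdftk label records by hand is repetitive, so they can be generated instead. The sketch below prints records in the format described above. The function names page_label and make_labels are hypothetical, and the four calls only reproduce the Cover / Back Cover / Roman / Arabic example from this section.

```shell
#!/bin/sh
# Print one pdftk page-label record.
# $1 = PDF page index where the range begins (counting from one)
# $2 = first number of the range
# $3 = numbering style, e.g. LowercaseRomanNumerals
# $4 = optional prefix
page_label() {
    echo "PageLabelBegin"
    echo "PageLabelNewIndex: $1"
    echo "PageLabelStart: $2"
    if [ -n "${4:-}" ]; then
        echo "PageLabelPrefix: $4"
    fi
    echo "PageLabelNumStyle: $3"
}

# Emit the records of the example: two unnumbered covers, Roman
# numerals from PDF page 3, Arabic numerals from PDF page 27.
make_labels() {
    page_label 1 1 NoNumber "Cover"
    page_label 2 1 NoNumber "Back Cover"
    page_label 3 1 LowercaseRomanNumerals
    page_label 27 1 DecimalArabicNumerals
}
```

Running make_labels > metadata.txt produces a file that can be passed to the pdftk update_info_utf8 command shown in this section.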
Extract bookmarks
With pdftk:
$ pdftk file.pdf dump_data_utf8 | grep '^Bookmark'
With qpdf:
$ qpdf --json --json-key=outlines file.pdf
See https://unix.stackexchange.com/questions/143886/how-to-extract-bookmarks-from-a-pdf-file for more information.
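The raw pdftk dump is verbose, so it can help to fold it into an indented outline. This is a minimal sketch using awk. It assumes that each record lists BookmarkTitle, BookmarkLevel and BookmarkPageNumber in that order, as in the pdftk bookmark format, and the function name outline_from_dump is made up for illustration.

```shell
#!/bin/sh
# Read pdftk dump_data_utf8 output on stdin and print one line per
# bookmark, indented by two spaces per hierarchy level.
outline_from_dump() {
    awk '
        /^BookmarkTitle: /      { title = substr($0, 16) }  # text after "BookmarkTitle: "
        /^BookmarkLevel: /      { level = $2 }
        /^BookmarkPageNumber: / {
            indent = ""
            for (i = 1; i < level; i++) indent = indent "  "
            print indent title " (p. " $2 ")"
        }
    '
}
```

It would be used as pdftk file.pdf dump_data_utf8 | outline_from_dump.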
Add bookmarks
With pdftk
Create a text file bookmark_definitions.txt with bookmark definitions in the following format:
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Chapter 1.1
BookmarkLevel: 2
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: Chapter 1.2
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: Chapter 1.3
BookmarkLevel: 2
BookmarkPageNumber: 4
BookmarkBegin
BookmarkTitle: Chapter 1.3.1
BookmarkLevel: 3
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 6
Where:
- BookmarkBegin signals a new bookmark definition
- BookmarkTitle is the title of the bookmark
- BookmarkLevel is the level of the bookmark in the hierarchy
- BookmarkPageNumber is the page number the bookmark redirects to
In this example, the above file will create the following bookmark structure:
- Chapter 1
  - Chapter 1.1
  - Chapter 1.2
  - Chapter 1.3
    - Chapter 1.3.1
- Chapter 2
Apply the bookmarks with the following command:
$ pdftk input.pdf update_info_utf8 bookmark_definitions.txt output output.pdf
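For longer outlines it may be easier to keep the bookmarks in a compact table and generate the definitions file from it. The sketch below assumes a made-up input format with one bookmark per line and tab-separated fields (level, page number, title). The function name bookmarks_from_tsv is likewise hypothetical.

```shell
#!/bin/sh
# Convert tab-separated "level<TAB>page<TAB>title" lines on stdin into
# pdftk bookmark definitions. Lines with fewer than three fields are
# skipped.
bookmarks_from_tsv() {
    awk -F '\t' '
        NF >= 3 {
            print "BookmarkBegin"
            print "BookmarkTitle: " $3
            print "BookmarkLevel: " $1
            print "BookmarkPageNumber: " $2
        }
    '
}
```

With such a table saved as outline.tsv, bookmarks_from_tsv < outline.tsv > bookmark_definitions.txt writes a file suitable for the update_info_utf8 command above.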
Extract pages contained within a bookmark
To extract the pages contained within a bookmark, you can use pdf_extbook-gitAUR.
With pdf_extbook file you will be prompted for the bookmark whose pages you want to extract and where to save them. To extract all bookmarks of a given hierarchical level:
$ pdf_extbook file -a level output_file_stem
Remove blank pages
One can use the following script to remove blank pages from a PDF file (credit: SuperUser post):
#!/bin/sh
IN="$1"
filename=$(basename "${IN}")
filename="${filename%.*}"
PAGES=$(pdfinfo "$IN" | grep ^Pages: | tr -dc '0-9')

non_blank() {
    for i in $(seq 1 $PAGES); do
        # Sum the CMYK ink coverage reported by Ghostscript for page $i
        PERCENT=$(gs -o - -dFirstPage=${i} -dLastPage=${i} -sDEVICE=ink_cov "$IN" | grep CMYK | nawk 'BEGIN { sum=0; } {sum += $1 + $2 + $3 + $4;} END { printf "%.5f\n", sum } ')
        if [ $(echo "$PERCENT > 0.001" | bc) -eq 1 ]; then
            echo $i
            #echo $i 1>&2
        fi
        echo -n . 1>&2
    done | tee "$filename.tmp"
    echo 1>&2
}

set +x
pdftk "${IN}" cat $(non_blank) output "${filename}_noblanks.pdf"
Use it like pdf_remove_blank_pages input.pdf.
The script needs pdftk, nawk, bc, Ghostscript and Poppler (for pdfinfo).
Find fonts used in a PDF
The pdffonts(1) command (from Poppler) can be used to find which fonts a PDF uses and whether they have been embedded in it:
$ pdffonts file.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Times-Roman                          Type 1            Custom           no  no  no       8  0
Times-Italic                         Type 1            Standard         no  no  no       9  0
Times-Bold                           Type 1            Standard         no  no  no       7  0
Helvetica                            Type 1            Standard         no  no  no      34  0
Helvetica-Bold                       Type 1            Standard         no  no  no      35  0
When the text in a PDF does not display properly, this can be used to determine whether missing fonts or their metric-compatible equivalents need to be installed.
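To list only the fonts that are not embedded, the pdffonts output can be filtered. This is a rough sketch: it assumes font names contain no spaces and relies on the emb flag being the fifth field from the end of each row (emb, sub, uni, then the two-part object ID). The function name unembedded_fonts is made up for illustration.

```shell
#!/bin/sh
# Read pdffonts output on stdin and print the names of fonts whose
# "emb" column is "no". The first two lines (header and dashes) are
# skipped. Counting from the end of the row copes with multi-word
# font types such as "Type 1" or "CID TrueType".
unembedded_fonts() {
    awk 'NR > 2 && $(NF - 4) == "no" { print $1 }'
}
```

It would be used as pdffonts file.pdf | unembedded_fonts.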
Repair broken PDF file
With Ghostscript:
$ gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress corrupted.pdf
With Poppler:
$ pdftocairo -pdf corrupted.pdf repaired.pdf
With mupdf-tools:
$ mutool clean corrupted.pdf repaired.pdf
Reference: https://superuser.com/q/278562
Convert PDF to PDF/A standard
With Ghostscript:
$ gs -dPDFA -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=2 -sOutputFile=document_pdfa.pdf document.pdf
Reference: https://stackoverflow.com/a/56459053
Validate PDF/A compliance
Using verapdfAUR you can validate the compliance of your PDF to different flavours of the PDF/A standard:
$ verapdf --flavour 1a --format text document.pdf
DjVu tools
- DjVuLibre provides many command-line tools, like ddjvu(1) for example.
- img2djvu — Single-pass DjVu encoder based on DjVu Libre and ImageMagick.
- pdf2djvu — Creates DjVu files from PDF files.
Convert DjVu to images
Break DjVu into separate pages:
$ djvmcvt -i input.djvu /path/to/out/dir output-index.djvu
Convert DjVu pages into images:
$ ddjvu --format=tiff page.djvu page.tiff
Convert DjVu pages into PDF:
$ ddjvu --format=pdf inputfile.djvu outputfile.pdf
You can also use --page to export specific pages:
$ ddjvu --format=tiff --page=1-10 input.djvu output.tiff
This will convert pages 1 to 10 into one TIFF file.
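When one image file per page is wanted instead of a single multi-page TIFF, the --page option can be driven from a loop. This is a sketch under the assumption that djvused (part of DjVuLibre) is available to report the page count through its n command. The function name djvu_pages_to_tiff and the page-N.tiff naming are made up for illustration.

```shell
#!/bin/sh
# Export every page of a DjVu document to its own TIFF file.
# Hypothetical helper: $1 = input DjVu file, $2 = existing output directory.
djvu_pages_to_tiff() (
    in=$1
    outdir=$2
    # "djvused -e n" prints the number of pages in the document.
    pages=$(djvused -e n "$in")
    i=1
    while [ "$i" -le "$pages" ]; do
        ddjvu --format=tiff --page="$i" "$in" "$outdir/page-$i.tiff"
        i=$((i + 1))
    done
)
```

For example, djvu_pages_to_tiff input.djvu ./out writes out/page-1.tiff, out/page-2.tiff and so on.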
Processing images
You can use scantailor-advanced to:
- fix orientation
- split pages
- deskew
- crop
- adjust margins
Make DjVu from images
There is a useful script, img2djvu-gitAUR.
$ img2djvu -c1 -d600 -v1 ./out
It will create a 600 DPI out.djvu from all files in the ./out directory.
Alternatively, you can try didjvuAUR, which seems to create smaller files especially on images with well defined background.
PostScript tools
- pstotext — Converts PostScript files to text.
ps2pdf
ps2pdf is a wrapper around ghostscript to convert PostScript to PDF:
$ ps2pdf -sPAPERSIZE=a4 -dOptimize=true -dEmbedAllFonts=true YourPSFile.ps
Explanation:
- With -sPAPERSIZE=something you define the paper size. For valid PAPERSIZE values, see [10].
- -dOptimize=true lets the created PDF be optimised for loading.
- -dEmbedAllFonts=true embeds all fonts, so that they always look nice.
-sPAPERSIZE you specified, because EPS files usually do not contain paper orientation information. A workaround is creating a new paper in Ghostscript settings (call it e.g. "slide") and using it as -sPAPERSIZE=slide.
Libraries
C/C++
- libharu — C library for generating PDF documents.
- https://github.com/libharu/libharu || libharu, Lua binding: lua-hpdfAUR
- PoDoFo — A C++ library to work with the PDF file format.
Python
- borb — Library for reading, creating and manipulating PDF files in Python.
- https://borbpdf.com/, https://github.com/jorisschellekens/borb || not packaged? search in AUR
- pdfrw — A pure Python library that reads and writes PDFs.
- PyPDF — A pure-Python library built as a PDF toolkit.
- PyX — Python library for the creation of PostScript and PDF files.
- ReportLab — A proven industry-strength PDF generating solution.
Java
- iText Core — Versatile, programmable, enterprise-grade PDF library whose functionality can be embedded in your own software.
- OpenPDF — Free Java library for creating and editing PDF files, available under the LGPL and MPL open source licenses. Based on a fork of iText.
- https://github.com/LibrePDF/OpenPDF || not packaged? search in AUR