Merge PDF files

0 votes
asked Aug 9, 2010 by btibert3

I did a search and nothing really seemed to be directly related to this question. Is it possible, using Python, to merge seperate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

And I may be pushing my luck, but is it possible to exclude a page that is contained in of the PDFs (my report generation always creates an extra blank page).

6 Answers

0 votes
answered Aug 9, 2010 by second

do you have to use python? if you just need to merge your pdfs, i'd have a look at pdftk

0 votes
answered Aug 9, 2010 by gilles

Use Pypdf:

A Pure-Python library built as a PDF toolkit. It is capable of:
* splitting documents page by page,
* merging documents page by page,

(and much more)

An example of two pdf-files being merged into a single file with pyPdf:

# Loading the pyPdf Library
from pyPdf import PdfFileWriter, PdfFileReader

# Creating a routine that appends files to the output file
def append_pdf(input,output):
    [output.addPage(input.getPage(page_num)) for page_num in range(input.numPages)]

# Creating an object where pdf pages are appended to
output = PdfFileWriter()

# Appending two pdf-pages from two different files
append_pdf(PdfFileReader(open("SamplePage1.pdf","rb")),output)
append_pdf(PdfFileReader(open("SamplePage2.pdf","rb")),output)

# Writing all the collected pages to a file
output.write(open("CombinedPages.pdf","wb"))
0 votes
answered Aug 18, 2014 by mark-k

here, http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/, gives an solution.

similarly:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input,output):
    [output.addPage(input.getPage(page_num)) for page_num in range(input.numPages)]

output = PdfFileWriter()

append_pdf(PdfFileReader(file("C:\\sample.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample1.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample2.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample3.pdf","rb")),output)

    output.write(file("c:\\combined.pdf","wb"))
0 votes
answered Aug 31, 2014 by martin-thoma

Is it possible, using Python, to merge seperate PDF files?

Yes.

The following example merges all files in one folder to a single new PDF file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

    for pdffile in glob(path + os.sep + '*.pdf'):
        if pdffile == output_filename:
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)
0 votes
answered Aug 21, 2016 by paul-rooney

The newer PyPdf2 library has a PdfMerger class, which can be used like so.

example:

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(open(pdf, 'rb'))

with open('result.pdf', 'wb') as fout:
    merger.write(fout)

The append method seems to require a lazy file object. That is it doesn't read the file immediately. It seems to wait until the write method is invoked. If you use a scoped open (i.e. with) it appends blank pages to the resultant file, as the input file is closed at that point.

The easiest way to avoid this if file handle lifetime is an issue, is to pass append file name strings and allow it to handle file lifetime.

i.e.

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
0 votes
answered Sep 15, 2017 by patrick-maupin

The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. Here is an example concatenation script, and here is an example page subsetting script.

The relevant part of the concatenation script -- assumes inputs is a list of input filenames, and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:

    writer.addpages(PdfReader(inpfn).pages[:-1])

Disclaimer: I am the primary pdfrw author.

Welcome to Q&A, where you can ask questions and receive answers from other members of the community.
Website Online Counter

...