Sunday, March 23, 2014

Merging multiple large PDF files in Java


  • Merging PDFs
  • Merging PDFs using Ghostscript
  • Converting PostScript to PDF or converting PDF to PostScript
  • Merging a large collection of PDFs into a single PDF in one go in Java
  • Merging multiple PDF files into a single document


We had a requirement to merge multiple PDF documents into a single large PDF: more than 1000 PDFs into one document. We tried iText and also PDFBox. Both work great as long as the document count stays around 200 or 300; beyond that the code would error out with "java.lang.OutOfMemoryError". Needless to say, we tried increasing the memory for the program (set JAVA_OPTS="-Xms512m -Xmx1024m"). That still didn't help, as at some point the program would choke.
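Before ruling out the heap settings, it can be worth confirming that they actually reached the JVM. A minimal sketch using only the standard library; the class name HeapCheck is made up for illustration and is not part of our utility:

```java
// Hypothetical helper, not part of the original utility.
class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when it is set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the JAVA_OPTS value above is picked up, the reported number should be close to 1024; a much smaller number suggests the launcher script is not passing the options on.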
The other approach was to convert each PDF to PostScript using Ghostscript and then translate the PostScript back to PDF, which reduced the size significantly. Here is the basic utility. Using it, each individual PDF went through PDF ==> PostScript ==> converted PDF. The converted PDF was much smaller.


To convert PDF to PostScript you need to:
1. Install Ghostscript (gs9.10 or the latest) and add its bin directory to your PATH variable (on Windows or Linux).
    http://www.ghostscript.com/download/gsdnld.html
2. Download the ghost4j jars (http://www.ghost4j.org/downloads.html).
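A quick way to check the PATH step is to look for the Ghostscript executable in each PATH directory. The sketch below is only an illustration: GhostscriptLocator is a made-up name, and it looks for the command-line executable only ("gs" on Linux, "gswin64c.exe" or "gswin32c.exe" on Windows), which is an indirect check for what ghost4j needs at runtime.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of ghost4j or the original utility.
class GhostscriptLocator {

    // Usual executable names: "gs" on Linux, the other two on Windows.
    static final String[] CANDIDATES = {"gs", "gswin64c.exe", "gswin32c.exe"};

    // Returns the first candidate found in any directory listed in pathVar.
    static Optional<Path> find(String pathVar) {
        if (pathVar == null || pathVar.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathVar.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : CANDIDATES) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> gs = find(System.getenv("PATH"));
        System.out.println(gs.isPresent()
                ? "Found Ghostscript at " + gs.get()
                : "Ghostscript not found on PATH");
    }
}
```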

Now all you need is a small utility class that can convert PDF to PostScript and vice versa.
You will be amazed to see the size reduction of the PDF.



package com.ram.utils;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.ghost4j.converter.ConverterException;
import org.ghost4j.converter.PDFConverter;
import org.ghost4j.converter.PSConverter;
import org.ghost4j.document.DocumentException;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.document.PSDocument;

public class GhostScriptUtils {

 public static void main(String[] args) throws Exception {

  // CASE 1: PDF to PostScript
  byte[] pdfAsBytes = FileUtils.readFileToByteArray(new File("C:\\Test.pdf"));
  ByteArrayOutputStream postScriptAsBytes = convertPdfToPostScript(pdfAsBytes);
  FileUtils.writeByteArrayToFile(new File("C:\\input.ps"), postScriptAsBytes.toByteArray());
  postScriptAsBytes.close();
  System.out.println("Converted PDF to PS file.");
  
  // CASE 2: PostScript to PDF
  byte[] postScriptContentAsBytes = FileUtils.readFileToByteArray(new File("C:\\input.ps"));
  ByteArrayOutputStream convertedPdfBytes = convertPostScriptToPdf(postScriptContentAsBytes);
  FileUtils.writeByteArrayToFile(new File("C:\\Test_Compressed.pdf"), convertedPdfBytes.toByteArray());
  System.out.println("Generated PDF from PS, which is now smaller in size.");

 }

 // Loads the PDF bytes into a ghost4j document and renders it as PostScript.
 private static ByteArrayOutputStream convertPdfToPostScript(byte[] pdfAsBytes)
   throws IOException, ConverterException, DocumentException {
  PDFDocument pdfDocument = new PDFDocument();
  pdfDocument.load(new ByteArrayInputStream(pdfAsBytes));
  ByteArrayOutputStream os = new ByteArrayOutputStream();
  PSConverter psConverter = new PSConverter();
  psConverter.convert(pdfDocument, os);
  return os;
 }

 // Loads the PostScript bytes into a ghost4j document and renders it back to PDF.
 private static ByteArrayOutputStream convertPostScriptToPdf(
   byte[] postScriptAsBytes) throws IOException, ConverterException,
   DocumentException {

  PSDocument postScriptDocument = new PSDocument();
  postScriptDocument.load(new ByteArrayInputStream(postScriptAsBytes));
  PDFConverter converter = new PDFConverter();
  // Optional: pick a quality preset for the generated PDF.
  // converter.setPDFSettings(PDFConverter.OPTION_PDFSETTINGS_PREPRESS);
  ByteArrayOutputStream os = new ByteArrayOutputStream();
  converter.convert(postScriptDocument, os);
  return os;
 }

}
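
The utility above only shrinks each file; the merge itself still has to happen afterwards. One possible way to do that step, sketched here as an assumption rather than as what we ran, is to hand the converted PDFs to the Ghostscript command line with the pdfwrite device. GhostscriptMerge and its method names are made up for illustration, and the executable name ("gs" here) depends on your install.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ghost4j or the original utility.
class GhostscriptMerge {

    // Builds: <gs> -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=<output> <inputs...>
    static List<String> buildCommand(String gsExecutable, String output, List<String> inputs) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("no input PDFs given");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add(gsExecutable);
        cmd.add("-q");                      // suppress startup messages
        cmd.add("-dBATCH");                 // exit after the last file
        cmd.add("-dNOPAUSE");               // do not prompt between pages
        cmd.add("-sDEVICE=pdfwrite");       // write PDF output
        cmd.add("-sOutputFile=" + output);
        cmd.addAll(inputs);                 // inputs are appended in the given order
        return cmd;
    }

    // Runs Ghostscript as a separate process and returns its exit code.
    static int merge(String gsExecutable, String output, List<String> inputs)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(gsExecutable, output, inputs))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java GhostscriptMerge merged.pdf first.pdf second.pdf ...
        if (args.length < 2) {
            System.err.println("usage: GhostscriptMerge <output.pdf> <input.pdf>...");
            return;
        }
        List<String> inputs = Arrays.asList(args).subList(1, args.length);
        int exitCode = merge("gs", args[0], inputs);
        System.out.println("Ghostscript exited with code " + exitCode);
    }
}
```

Because Ghostscript runs in its own process, the merge does not use the Java heap. With 1000+ inputs the command line can get very long; Ghostscript can also read its arguments from a file passed as @filename, which may be the safer route on Windows.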