
Congratulations to Our Winning Teams

The PDF Liberation Hackathon took place in six cities between January 17 and 19. Entries were judged in Washington, DC, New York and San Francisco. The prize winners were:

Washington, DC
1st. What Word Where: https://github.com/pdfliberation/whatwordwhere
2nd. US AID Development Experience Clearinghouse: https://github.com/pdfliberation/USAID-DEC

New York
1st. Jersey City Budget PDF Liberation: https://github.com/pdfliberation/Jersey-City-Budget-PDF-Liberation
2nd. New York Economic Development Commission Newsletter: https://github.com/pdfliberation/NYCEDCprosedatascraper

San Francisco
1st. Amnesty International Data: https://github.com/pdfliberation/amnestydata
2nd. House Financial Disclosures: https://github.com/pdfliberation/housedisc

Keep on Liberating

The Sunlight Foundation and other sponsors continue the effort to liberate PDFs. The focal point for this activity will be the PDFLiberation organization on GitHub. For future updates, please check:

https://github.com/pdfliberation

Hackathon Challenges

Participants can work on a PDF extraction challenge provided by one of our sponsoring organizations, work on their own challenges, or develop enhancements to an open source PDF extraction tool.

Here are the challenges we have thus far. One or two more may be added by Friday:

Name of Challenge	Challenge Description	Folder with Samples
Comprehensive Annual Financial Reports	<< Markdown File >>	<< Dropbox Folder >>
House of Representatives Financial Disclosures (OpenSecrets.org)	<< Markdown File >>	<< Dropbox Folder >>
IRS Form 990 – Not-for-Profit Organization Reports	<< Markdown File >>	<< Dropbox Folder >>
Amnesty International Annual Reports – Torture Incident Database	<< Markdown File >>	<< Dropbox Folder >>
US Foreign Aid Reports (USAID)	<< Markdown File >>	Web Site
Federal Communications Commission Daily Releases	<< Markdown File >>	<< PDF >>
New York City Economic Development Commission Monthly Snapshot	<< PDF >> (for all 3 NYC Challenges)	Web Site
New York City Council and Community Board Documents		Web Site
New York City Environmental Impact Statements		Web Site

Also, we will be attempting to build a matrix of PDF problems and solutions during the weekend. For more, see https://github.com/pdfliberation/knowledge/blob/master/studies/dam_pdfs.md.

List of PDF Extraction Resources

Added 1/15/2014: Some commercial PDF solution vendors have agreed to offer special evaluation versions of their software to hackathon participants. While evaluation licenses are common, they often come with restrictions on the number of pages that can be processed – making them useless for the hackathon. The following vendors are providing versions of their software with high page limits or no page limits at all:

Aspose PDF for Java – Download the software from http://www.aspose.com/community/files/72/java-components/aspose.pdf-for-java/entry517431.aspx, unzip it into a folder of your choice and then place this license file in the installation folder. Instructions for placing the license file can be found at http://www.aspose.com/docs/display/pdfjava/Licensing.

PDFLib Text Extraction Tool – See their hackathon web page for an unlimited page trial of their new version.

ABBYY Cloud OCR Software Development Kit – Hackathon participants can perform Optical Character Recognition on up to 5000 pages during the hackathon weekend for free with ABBYY’s cloud-based (no installation) solution. Click this link for a product description with registration instructions. ABBYY’s cloud OCR web site: https://cloud.ocrsdk.com. Promo Code: ABBYY_HQ_TRIAL_Hackathon.

Here is a list of PDF tools that you can use at the hackathon or afterwards. I would like to keep this list complete and updated, so please use the comments to tell me about tools and technologies not listed thus far. Because this list is becoming very long, I am now denoting (with a ♦) tools that I believe participants should consider first. Selections are based on my own experiences applying the tools to a sample PDF, whether the project is still active and (for commercial tools) whether a liberal evaluation license is available.

Open source PDF technologies:

♦ Apache PDFBox – General purpose PDF library written in Java. Download page: http://pdfbox.apache.org/downloads.html. Support available through their mailing list at dev@pdfbox.apache.org (to submit questions, you must subscribe to the list at http://pdfbox.apache.org/mailinglists.html).

♦ Tabula – Open source PDF table extraction tool written in Java and Ruby by Manuel Aristarán. Makes calls to PDFBox. Download page: http://tabula.nerdpower.org. Repository: https://github.com/jazzido/tabula. Tabula-extractor repository: http://github.com/jazzido/tabula-extractor.

PDF Extraction Toolkit – Java framework built on PDFBox by Tamir Hassan for performing document analysis of PDF files and creating custom conversion methods to HTML and other formats. Download Page: http://tamirhassan.com/pdfxtk.html. Repository: https://github.com/tamirhassan/pdfxtk/.

PDFExtract – Text extraction library that extends both PDFBox and Poppler. Written in Java by Øyvind Berg, the tool is no longer under active development but may contain code that can be reused by hackathon participants. Download Page: http://elacin.github.io/PDFExtract/. Repository: https://github.com/elacin/PDFExtract.

PDF2SVG – Java tool developed by Peter Murray-Rust that converts PDFs to Scalable Vector Graphics (SVG) files that can be rendered by most modern browsers. PDF2SVG, which is based on PDFBox, is a component of the larger AMI suite of open source tools created for the purpose of liberating scientific documents. Another component, SVG2XML, converts the SVG files to HTML and is currently under heavy development. Download Page: https://bitbucket.org/petermr/pdf2svg-dev/overview. Repository: https://bitbucket.org/petermr/pdf2svg-dev/src.

♦ Poppler (pdftotext, pdfinfo, pdfimages) – Command line tools to extract text, metadata, and bitmap images from PDF files, written in C++, forked from Xpdf. Download page: http://poppler.freedesktop.org/

Ashima PDF Table Extractor – Table extraction tool built in Python and based on Poppler. Repository: https://github.com/ashima/pdf-table-extract

pdf2htmlEX – PDF to HTML converter by Coolwanglu, based on Poppler. Product Page: http://coolwanglu.github.io/pdf2htmlEX/. Repository: https://github.com/coolwanglu/pdf2htmlEX

PDF2XML – Open source converter based on XPDF library developed by Hervé Déjean. Download Page: http://sourceforge.net/projects/pdf2xml/

Xpdf (pdftotext, pdfinfo, pdfimages) – Command line tools to extract text, metadata, and bitmap images from PDF files. Also includes a page rasterizer (pdftoppm). Download Page: http://www.foolabs.com/xpdf/

MuPDF – General purpose, open source PDF toolkit written in C by Artifex, the developers of GhostScript. The mudraw component has a basic text extraction utility. Download Page: http://www.mupdf.com/. Repository: http://git.ghostscript.com/?p=mupdf.git;a=summary. Organizers wish to thank Artifex for sponsoring our hackathon.

PDFMiner – Open source PDF extraction library written in Python. Download Page: http://www.unixuser.org/~euske/python/pdfminer/. Repository: https://github.com/euske/pdfminer/.

PDFParser – Open source Python script that displays objects within a PDF. Download Page: http://blog.didierstevens.com/programs/pdf-tools/.

PDFTables – Table extraction tool based on PDFMiner and also written in Python. Repository: https://github.com/okfn/pdftables.

♦ DocHive – Open source tool based on Tesseract and ImageMagick that extracts data from scanned PDFs. Repository: https://github.com/raleighpublicrecord/dochive.

Node PDF Extract – JavaScript library that reads PDFs with embedded text as well as scanned PDFs. Built on both Poppler and Tesseract. Repository: https://github.com/nisaacson/pdf-extract.
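As a rough illustration of how the Poppler and Xpdf command line tools listed above can be scripted, here is a minimal Python sketch that wraps pdftotext. The function names and the file name budget.pdf are hypothetical; the sketch assumes pdftotext is installed and on your PATH, and it uses only the -layout, -enc, -f and -l options.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    # Request UTF-8 explicitly, since the default encoding differs between builds.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the physical layout, useful for tables
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" sends the text to stdout instead of a file
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the text (assumes pdftotext is on PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical input file; prints the command that would be run.
    print(" ".join(pdftotext_command("budget.pdf", first_page=1, last_page=2)))
```

Keeping the command construction separate from the call that runs it makes the options easy to inspect before any PDF is processed. Text extracted this way is only as good as the text embedded in the PDF; scanned documents need one of the OCR tools listed further down.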

Low-cost commercial PDF technologies:

Adobe Acrobat XI Pro – The original general purpose GUI-based PDF tool, which can convert PDFs to Excel, Word, PowerPoint and HTML. Product Page: http://www.adobe.com/products/acrobat/pdf-to-excel-xlsx-converter.html

Able2Extract – A line of tools from InvestInTech that extracts PDF content to Excel, Word, XML and other formats. GUI and Command Line tools available. Products page: http://www.investintech.com/prod_options.htm

♦ Aspose.Pdf for Java – General purpose PDF library for Java developers that has text extraction functionality. Page explaining how to use Aspose for extraction: http://www.aspose.com/docs/display/pdfjava/Extract+Text+From+All+the+Pages+of+a+PDF+Document. Download Page: http://www.aspose.com/community/files/72/java-components/aspose.pdf-for-java/entry517431.aspx

BCL Technologies – Free, online PDF to Word and PDF to HTML converters. Word conversion page: http://www.pdfonline.com/pdf-to-word-converter/

Cogniview – Extracts PDFs to Excel. Product Page: http://www.cogniview.com/

Docudesk deskUnPDF Converter – Converts PDFs to Excel, Word, XML and other formats. Trial download: http://www.docudesk.com/pdf-downloads

Microsoft Word 2013 – The most recent version of this MS Office component supports direct opening of PDFs. The contents can then be saved in DOCX or other Word-supported formats. Feature Page: http://office.microsoft.com/en-us/word-help/edit-pdf-content-in-word-HA102903948.aspx?CTT=5&origin=HA102809597

NitroPDF – General purpose GUI-based PDF tool that can extract to spreadsheets and documents. Home Page: http://www.nitropdf.com/

Nuance PDF Reader – Free PDF reader with a web service that converts PDFs to spreadsheets and documents. Home Page: http://www.nuance.com/products/pdf-reader/index.htm

Nuance PDF Converter – Product Page: http://www.nuance.com/for-business/document-imaging-and-scanning/pdf-converter/index.htm

♦ PDFLib Text Extraction Tool – Function library that makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. Product Page: http://www.pdflib.com/products/tet/. Special hackathon evaluation version: http://www.pdflib.com/products/tet/hackathon.

PDFTron – General purpose PDF manipulation library that includes text extraction capabilities. Sample Code Page: http://www.pdftron.com/pdfnet/samplecode.html

ScraperWiki Table XTract – Web based solution that returns tables extracted from uploaded PDFs. Product Page: https://scraperwiki.com/tools/tablextract

Simx Text Converter – Extract, Transform and Load (ETL) solution that enables users to create custom routines for converting PDFs and other unstructured formats to database records. Product Page: http://www.simx.com/simx/Products.stp?prm=tc

Snowtide PDFTextStream – Commercial PDF text extraction component that can be embedded in Java or .NET applications. The single-threaded version is free. Download Page: http://www.snowtide.com/downloads

Xpdf Commercial Libraries from Glyph and Cog – Including XpdfText, a PDF text extraction library [http://glyphandcog.com/XpdfText.html]; XpdfInfo, a PDF metadata extraction library [http://glyphandcog.com/XpdfInfo.html]; XpdfImageExtract, a PDF image extraction library [contact info@glyphandcog.com for details]; and XpdfRasterizer, a library which converts PDF pages to images [http://glyphandcog.com/XpdfRasterizer.html].

Big Faceless Java Library – http://bfo.com/blog/2011/11/16/pdf_text_extraction_in_java.html

iText, a Java PDF Library – http://sourceforge.net/projects/itext/reviews?source=navbar

OCR Technologies (required for scanned PDFs):

Tesseract OCR – Open source OCR library. Home Page: https://code.google.com/p/tesseract-ocr/. This tool does not work directly with PDFs, but a shell script or package can be used to convert a PDF to a TIFF which can be analyzed with Tesseract. Also, a Java interface to Tesseract is available at http://sourceforge.net/projects/tess4j/.

ABBYY FineReader – Commercial OCR tool which works directly with PDFs. Home Page: http://finereader.abbyy.com/. ABBYY also offers a cloud OCR Software Development Kit (API). See: http://ocrsdk.com/

Nuance OmniPage – Commercial OCR tool which works directly with PDFs. Product Page: http://www.nuance.com/for-individuals/by-product/omnipage/index.htm

Captricity – Web based service that uses a mixture of technology and human labor to convert uploaded documents into structured data. Product description: http://captricity.com/captricity-at-a-glance/
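Since Tesseract does not read PDFs directly, the conversion step mentioned in its entry above can be scripted. The following Python sketch is one possible way to do it, not a recommended pipeline: it assumes Ghostscript (as gs) and Tesseract are installed and on your PATH, uses Ghostscript's tiffgray device to write one TIFF per page, and then runs Tesseract on each page. The function names, the page-%03d.tif naming pattern and the 300 dpi default are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript call that writes one grayscale TIFF per page."""
    pattern = str(Path(out_dir) / "page-%03d.tif")  # %03d becomes the page number
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sDEVICE=tiffgray",
        f"-r{dpi}",
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def tesseract_command(tiff_path, lang="eng"):
    """Build a Tesseract call; Tesseract appends .txt to the output base."""
    tiff = Path(tiff_path)
    return ["tesseract", str(tiff), str(tiff.with_suffix("")), "-l", lang]


def ocr_pdf(pdf_path, out_dir, lang="eng"):
    """Rasterize a scanned PDF and OCR each page; returns one string per page."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir), check=True)
    pages = []
    for tiff in sorted(Path(out_dir).glob("page-*.tif")):
        subprocess.run(tesseract_command(tiff, lang), check=True)
        pages.append(tiff.with_suffix(".txt").read_text(encoding="utf-8"))
    return pages
```

Calling ocr_pdf("scan.pdf", "out") with a hypothetical scan.pdf would leave the TIFF and text files for each page in the out folder. On Windows the Ghostscript executable usually has a different name, so the first element of the command would need adjusting.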

Enterprise-Level (Cost > $1000) Extract, Transform, Load (ETL) Solutions that Directly Read PDFs*

Datawatch Modeler (Formerly Known as Monarch) – http://www.datawatch.com/form-page

HP KeyView – http://www.autonomy.com/products/keyview/

IDR Solutions – Online PDF to SVG and PDF to HTML5 conversions. This vendor also maintained the open source JPedal library until last year. Commercial products: http://www.idrsolutions.com. JPedal: http://sourceforge.net/projects/jpedal/

Informatica B2B Data Transformation – http://www.informatica.com/us/products/b2b-data-exchange/b2b-data-transformation/

Praedea – http://www.praedea.com/

SAP HANA – http://www.saphana.com/community/about-hana/features#/processing-engine/text-and-search

* Thanks to Chris Karnakus at IDG Computerworld and Curt Monash of Monash Research for contributing to this section.

Reviews, Listings and Comparisons:

Discussion of Tabula, Google Refine and Google Fusion Tables at SmartChicago Collaborative. http://www.smartchicagocollaborative.org/pdf-liberation-hackathon-resource-page/

Duke University’s Reporters Lab contains reviews of many of the tools listed above – http://reviews.reporterslab.org/search?q=&type=products&category=pdf-tools-2011-11-09

List of tools for extracting data from scientific papers in PDF format. http://pdfjailbreak.com/tools

Blog Post discussing software resources used at a May 2013 PDF Hackathon in Europe from Peter Murray-Rust: http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/

Comparison of iText, PDFBox and PDFTextExtractor by Madhura Oak – http://www.e-zest.net/blog/extracting-text-from-a-pdf-file/