Congratulations to Our Winning Teams
The PDF Liberation Hackathon took place in six cities between January 17 and 19. Judging was held in Washington DC, New York and San Francisco. Our prize winners were:
1st. Jersey City Budget PDF Liberation: https://github.com/pdfliberation/Jersey-City-Budget-PDF-Liberation
2nd. New York Economic Development Commission Newsletter: https://github.com/pdfliberation/NYCEDCprosedatascraper
Keep on Liberating
The Sunlight Foundation and other sponsors continue the effort to liberate PDFs. The focal point for this activity will be the PDFLiberation organization on GitHub. For future updates, please check:
Participants can work on a PDF extraction challenge provided by one of our sponsoring organizations, work on their own challenges, or develop enhancements to an open source PDF extraction tool.
Here are the challenges we have thus far. One or two more may be added by Friday:
|Name of Challenge||Challenge Description||Folder with Samples|
|Comprehensive Annual Financial Reports||<< Markdown File >>||<< Dropbox Folder >>|
|House of Representatives Financial Disclosures (OpenSecrets.org)||<< Markdown File >>||<< Dropbox Folder >>|
|IRS Form 990 – Not-for-Profit Organization Reports||<< Markdown File >>||<< Dropbox Folder >>|
|Amnesty International Annual Reports – Torture Incident Database||<< Markdown File >>||<< Dropbox Folder >>|
|US Foreign Aid Reports (USAID)||<< Markdown File >>||Web Site|
|Federal Communications Commission Daily Releases||<< Markdown File >>||<< PDF >>|
|New York City Economic Development Commission Monthly Snapshot||<< PDF >> (for all 3 NYC Challenges)|
|New York City Council and Community Board Documents||Web Site|
|New York City Environmental Impact Statements||Web Site|
Also, we will be attempting to build a matrix of PDF problems and solutions during the weekend. For more, see https://github.com/pdfliberation/knowledge/blob/master/studies/dam_pdfs.md.
List of PDF Extraction Resources
Added 1/15/2014: Some commercial PDF solution vendors have agreed to offer special evaluation versions of their software to hackathon participants. While evaluation licenses are common, they often come with restrictions on the number of pages that can be processed – making them useless for the hackathon. The following vendors are providing versions of their software with high page limits or no page limits at all:
Aspose PDF for Java – Download the software from http://www.aspose.com/community/files/72/java-components/aspose.pdf-for-java/entry517431.aspx, unzip it into a folder of your choice and then place this license file in the installation folder. Instructions for placing the license file can be found at http://www.aspose.com/docs/display/pdfjava/Licensing.
PDFLib Text Extraction Tool – See their hackathon web page for an unlimited page trial of their new version.
ABBYY Cloud OCR Software Development Kit – Hackathon participants can perform Optical Character Recognition on up to 5000 pages for free during the hackathon weekend with ABBYY’s cloud-based (no installation) solution. Click this link for a product description with registration instructions. ABBYY’s cloud OCR web site: https://cloud.ocrsdk.com. Promo Code: ABBYY_HQ_TRIAL_Hackathon.
Here is a list of PDF tools that you can use at the hackathon or afterwards. I would like to keep this list complete and updated, so please use the comments to tell me about tools and technologies not listed thus far. Because this list is becoming very long, I am now denoting (with a ♦) tools that I believe participants should consider first. Selections are based on my own experiences applying the tools to a sample PDF, whether the project is still active and (for commercial tools) whether a liberal evaluation license is available.
Open source PDF technologies:
♦ Apache PDFBox – General purpose PDF library written in Java. Download page: http://pdfbox.apache.org/downloads.html. Support available through their mailing list at email@example.com (to submit questions, you must subscribe to the list at http://pdfbox.apache.org/mailinglists.html).
♦ Tabula – Open source PDF table extraction tool written in Java and Ruby by Manuel Aristarán. Makes calls to PDFBox. Download page: http://tabula.nerdpower.org. Repository: https://github.com/jazzido/tabula. Tabula-extractor repository: http://github.com/jazzido/tabula-extractor.
PDF Extraction Toolkit – Java framework built on PDFBox by Tamir Hassan for performing document analysis of PDF files and creating custom conversion methods to HTML and other formats. Download Page: http://tamirhassan.com/pdfxtk.html. Repository: https://github.com/tamirhassan/pdfxtk/.
PDFExtract – Text extraction library that extends both PDFBox and Poppler. Written in Java by Øyvind Berg, the tool is no longer under active development but may contain code that can be reused by hackathon participants. Download Page: http://elacin.github.io/PDFExtract/. Repository: https://github.com/elacin/PDFExtract.
PDF2SVG – Java tool developed by Peter Murray-Rust that converts PDFs to Scalable Vector Graphics (SVG) files that can be rendered by most modern browsers. PDF2SVG, which is based on PDFBox, is a component of the larger AMI suite of open source tools created for the purpose of liberating scientific documents. Another component, SVG2XML, converts the SVG files to HTML and is currently under heavy development. Download Page: https://bitbucket.org/petermr/pdf2svg-dev/overview. Repository: https://bitbucket.org/petermr/pdf2svg-dev/src.
♦ Poppler (pdftotext, pdfinfo, pdfimages) – Command line tools to extract text, metadata, and bitmap images from PDF files, written in C++, forked from Xpdf. Download page: http://poppler.freedesktop.org/
Ashima PDF Table Extractor – Table extraction tool built in Python and based on Poppler. Repository: https://github.com/ashima/pdf-table-extract
PDF2XML – Open source converter based on XPDF library developed by Hervé Déjean. Download Page: http://sourceforge.net/projects/pdf2xml/
Xpdf (pdftotext, pdfinfo, pdfimages) – Command line tools to extract text, metadata, and bitmap images from PDF files. Also includes a page rasterizer (pdftoppm). Download Page: http://www.foolabs.com/xpdf/
MuPDF – General purpose, open source PDF toolkit written in C by Artifex, the developers of Ghostscript. The mudraw component has a basic text extraction utility. Download Page: http://www.mupdf.com/. Repository: http://git.ghostscript.com/?p=mupdf.git;a=summary. The organizers wish to thank Artifex for sponsoring our hackathon.
PDFParser – Open source Python script that displays objects within a PDF. Download Page: http://blog.didierstevens.com/programs/pdf-tools/.
PDFTables – Table extraction tool based on PDFMiner and also written in Python. Repository: https://github.com/okfn/pdftables.
♦ DocHive – Open source tool based on Tesseract and ImageMagick that extracts data from scanned PDFs. Repository: https://github.com/raleighpublicrecord/dochive.
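For participants who want to start with the command line tools listed above, here is a minimal Python sketch, assuming Poppler's pdftotext is installed and on the PATH. It runs pdftotext with the -layout option and then splits each line into cells wherever two or more spaces occur. The function names and the two-space heuristic are illustrative assumptions, not part of any tool on this list, and irregular tables will usually need one of the dedicated table extractors instead.

```python
import re
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command that writes layout-preserved text to stdout."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # A trailing "-" tells pdftotext to write to stdout instead of a file.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **page_range):
    """Run pdftotext and return the extracted text (requires Poppler)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, **page_range),
        check=True, capture_output=True, encoding="utf-8",
    )
    return result.stdout


def split_columns(line):
    """Split one layout-preserved line into cells on runs of 2+ whitespace
    characters (a heuristic, chosen for illustration only)."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def text_to_rows(text):
    """Turn layout text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]
```

With a hypothetical file named budget.pdf, `text_to_rows(extract_text("budget.pdf", first_page=1, last_page=1))` would return the first page as a list of rows that can be written out with Python's csv module.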
Low-cost commercial PDF technologies:
Adobe Acrobat XI Pro – The original general purpose GUI-based PDF tool that can convert PDFs to Excel, Word, PowerPoint and HTML. Product Page: http://www.adobe.com/products/acrobat/pdf-to-excel-xlsx-converter.html
Able2Extract – A line of tools from InvestInTech that extracts PDF content to Excel, Word, XML and other formats. GUI and Command Line tools available. Products page: http://www.investintech.com/prod_options.htm
♦ Aspose.Pdf for Java – General purpose PDF library for Java developers that has text extraction functionality. Page explaining how to use Aspose for extraction: http://www.aspose.com/docs/display/pdfjava/Extract+Text+From+All+the+Pages+of+a+PDF+Document. Download Page: http://www.aspose.com/community/files/72/java-components/aspose.pdf-for-java/entry517431.aspx
BCL Technologies – Free, online PDF to Word and PDF to HTML converters. Word conversion page: http://www.pdfonline.com/pdf-to-word-converter/
Cogniview – Extracts PDFs to Excel. Product Page: http://www.cogniview.com/
Docudesk deskUnPDF Converter – Converts PDFs to Excel, Word, XML and other formats. Trial download: http://www.docudesk.com/pdf-downloads
Microsoft Word 2013 – The most recent version of this MS Office component supports direct opening of PDFs. The contents can then be saved in DOCX or other Word-supported formats. Feature Page: http://office.microsoft.com/en-us/word-help/edit-pdf-content-in-word-HA102903948.aspx?CTT=5&origin=HA102809597
NitroPDF – General purpose GUI-based PDF tool that can extract to spreadsheets and documents. Home Page: http://www.nitropdf.com/
Nuance PDF Reader – Free PDF reader with a web service that converts PDFs to spreadsheets and documents. Home Page: http://www.nuance.com/products/pdf-reader/index.htm
Nuance PDF Converter – Product Page: http://www.nuance.com/for-business/document-imaging-and-scanning/pdf-converter/index.htm
♦ PDFLib Text Extraction Tool – Function library that makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. Product Page: http://www.pdflib.com/products/tet/. Special hackathon evaluation version: http://www.pdflib.com/products/tet/hackathon.
PDFTron – General purpose PDF manipulation library that includes text extraction capabilities. Sample Code Page: http://www.pdftron.com/pdfnet/samplecode.html
ScraperWiki Table XTract – Web based solution that returns tables extracted from uploaded PDFs. Product Page: https://scraperwiki.com/tools/tablextract
Simx Text Converter – Extract, Transform and Load (ETL) solution that enables users to create custom routines for converting PDFs and other unstructured formats to database records. Product Page: http://www.simx.com/simx/Products.stp?prm=tc
Snowtide PDFTextStream – Commercial PDF text extraction component that can be embedded in Java or .NET applications. Single threaded version is free. Download Page: http://www.snowtide.com/downloads
Xpdf Commercial Libraries from Glyph and Cog – Including XpdfText, a PDF text extraction library [http://glyphandcog.com/XpdfText.html]; XpdfInfo, a PDF metadata extraction library [http://glyphandcog.com/XpdfInfo.html]; XpdfImageExtract, a PDF image extraction library [contact firstname.lastname@example.org for details]; and XpdfRasterizer, a library which converts PDF pages to images [http://glyphandcog.com/XpdfRasterizer.html].
Big Faceless Java Library – http://bfo.com/blog/2011/11/16/pdf_text_extraction_in_java.html
iText, a Java PDF Library – http://sourceforge.net/projects/itext/reviews?source=navbar
OCR Technologies (required for scanned PDFs):
Tesseract OCR – Open source OCR library. Home Page: https://code.google.com/p/tesseract-ocr/. This tool does not work directly with PDFs, but a shell script or package can be used to convert a PDF to a TIFF which can be analyzed with Tesseract. Also, a Java interface to Tesseract is available at http://sourceforge.net/projects/tess4j/.
Nuance OmniPage – Commercial OCR tool which works directly with PDFs. Product Page: http://www.nuance.com/for-individuals/by-product/omnipage/index.htm
Captricity – Web based service that uses a mixture of technology and human labor to convert uploaded documents into structured data. Product description: http://captricity.com/captricity-at-a-glance/
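The PDF-to-TIFF-to-Tesseract route mentioned above can be sketched in Python as follows. This is a rough sketch under stated assumptions: it assumes Poppler's pdftoppm (built with TIFF support) and the tesseract command line program are both installed and on the PATH, and the function names, the default "page" prefix and the 300 dpi setting are illustrative choices rather than anything prescribed by either tool.

```python
import glob
import os
import subprocess


def rasterize_command(pdf_path, prefix, dpi=300):
    """Build a pdftoppm command that writes one TIFF image per page."""
    return ["pdftoppm", "-tiff", "-r", str(dpi), pdf_path, prefix]


def ocr_command(image_path, output_base, lang="eng"):
    """Build a tesseract command; tesseract appends .txt to output_base."""
    return ["tesseract", image_path, output_base, "-l", lang]


def page_number(image_path):
    """Read the page number from a name such as page-012.tif."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return int(stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, prefix="page", lang="eng"):
    """Rasterize a scanned PDF and OCR each page; returns one string per page."""
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)
    # pdftoppm names its output <prefix>-<page number>.tif
    images = sorted(glob.glob(glob.escape(prefix) + "-*.tif"), key=page_number)
    pages = []
    for image in images:
        base = os.path.splitext(image)[0]
        subprocess.run(ocr_command(image, base, lang), check=True)
        with open(base + ".txt", encoding="utf-8") as handle:
            pages.append(handle.read())
    return pages
```

Calling `ocr_pdf("scan.pdf")` on a hypothetical scanned file leaves the intermediate TIFF and text files in the working directory, which makes it easy to inspect pages where the recognition went wrong.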
Enterprise-Level (Cost > $1000) Extract, Transform, Load (ETL) Solutions that Directly Read PDFs*
Datawatch Modeler (Formerly Known as Monarch) – http://www.datawatch.com/form-page
HP KeyView – http://www.autonomy.com/products/keyview/
IDR Solutions – Online PDF to SVG and PDF to HTML5 conversions. This vendor also maintained the open source JPedal library until last year. Commercial products: http://www.idrsolutions.com. JPedal: http://sourceforge.net/projects/jpedal/
Informatica B2B Data Transformation – http://www.informatica.com/us/products/b2b-data-exchange/b2b-data-transformation/
Praedea – http://www.praedea.com/
Reviews, Listings and Comparisons:
Discussion of Tabula, Google Refine and Google Fusion Tables at SmartChicago Collaborative. http://www.smartchicagocollaborative.org/pdf-liberation-hackathon-resource-page/
Duke University’s Reporters Lab contains reviews of many of the tools listed above – http://reviews.reporterslab.org/search?q=&type=products&category=pdf-tools-2011-11-09
List of tools for extracting data from scientific papers in PDF format. http://pdfjailbreak.com/tools
Blog Post discussing software resources used at a May 2013 PDF Hackathon in Europe from Peter Murray-Rust: http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/
Comparison of iText, PDFBox and PDFTextExtractor by Madhura Oak – http://www.e-zest.net/blog/extracting-text-from-a-pdf-file/
Many Thanks to Our Sponsors and Supporting Organizations
… and to these software companies that have provided special licenses for the hackathon: