CF8 PDF Manipulation: Pulling Text Out

So, this morning a friend called me up with a problem. They had received some PDF files from their insurance company, and they needed the data in Word or Excel for manipulation. Now, they could cut and paste the information, but this was time consuming. She went to the Adobe site, trying to find info, and saw 'ColdFusion' on the homepage. This sparked her brain, because she immediately went, "Hey, Cutter does something with ColdFusion! Maybe he can help me!" Lucky for her, we now have ColdFusion 8, with it's built-in PDF support through the use of the CFPDF tag. I had to do a tiny bit of research on this, because Adobe's CF LiveDocs weren't overly clear, but I eventually found out that I could extract text with some very simple DDX processing directives. Ray did a series of posts recently about working with PDF documents. Although none of them answered my question directly, he had written one about using the DDX processing directives. This sent me searching the Adobe site for more information, which is where I came upon the Understanding DDX developer documentation. Basically, by rewriting Ray's simple example, I was able to extract all of the DocumentText from the PDF and dump it into an XML file. First I need the DDX, which is just some simple XML:
<cfsavecontent variable="myddx">
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
    <DocumentText result="OutXML">
        <PDF source="Title"/>
    </DocumentText>
</DDX>
</cfsavecontent>
<cfset myddx = trim(myddx)>
Then, I verify the validity:
<cfif isDDX(myddx)>
yes, its ddx
<cfelse>
no its not
</cfif>
Now, a little explanation. Looking at the DDX, you'll notice I've defined a result and a source. I had tried to define my file names here directly, but ColdFusion didn't like that when I hit the CFPDF tag. Apparently, when using the processddx action of the tag, you are required to define your inputfiles and outputfiles. Further study of the LiveDocs shows that ColdFusion is expecting structures for these defininitions. So, the DDX references certain structure keys (OutXML and Title) which you must define prior to processing your pdf.
<cfset inputStruct = StructNew() />
<cfset inputStruct.Title = "rptLauncher2.pdf" />

<cfset outputStruct = StructNew() />
<cfset outputStruct.OutXML = "words2.xml" />
You now have all of the necessary pieces. All that's required is your call to process your DDX directives.
<cfpdf action="processddx" ddxfile="#myddx#" name="VARIABLES.doc" inputfiles="#inputStruct#" outputfiles="#outputStruct#" />
I CFDump the VARIABLES.doc to see my success or failure, which comes out just fine. I now have a file, words2.xml, sitting in my server's folder, which contains all of the content of the PDF file. Simple and sweet.
TweetBacks
Comments
zac spitzer's Gravatar Do you know if DDX preserve whitespace when you extract text?

I was dealing with some mainframe reports (fixed width text) i had to parse and i initially thought cf8 pdf support like you but ended up using this instead http://text-mining-tool.com/
# Posted By zac spitzer | 12/3/07 8:08 PM
Josh's Gravatar Thank you! This is exactly what I have been searching for. Amazing how simple it is.
# Posted By Josh | 3/14/08 5:06 PM
Rob's Gravatar What about the other direction? populating fields in a PDF that are user enterable. WOuld this DDX procedure work in reverse?
# Posted By Rob | 4/14/08 3:08 PM
Steve 'Cutter' Blades's Gravatar @Rob -

I don't have the DDX documentation in front of me, but I would think it would be possible. That being said, I think you would be better served using the cfpdfform tags for populating a PDF Form. This is one of it's primary purposes.
# Posted By Steve 'Cutter' Blades | 4/14/08 3:58 PM
Margaret's Gravatar I have read your essay as a requirement for Dr. Stranges class at the University of South Alabama.Thank you for sharing your thoughts with everyone. I really enjoyeded the post.
# Posted By Margaret | 9/25/09 10:22 AM
Steve 'Cutter' Blades's Gravatar @Margaret -

Glad the post helped out! What class is it that Dr. Stranges teaches at South Alabama?
# Posted By Steve 'Cutter' Blades | 9/25/09 10:36 AM
BlogCFC was created by Raymond Camden. This blog is running version 5.9.3.006. Contact Blog Owner. Layout inspired by bluerobot.com., with some JQuery thrown in for fun.