ColdFusion 9.0 Resources
Extracting text from a PDF document

You can use the DocumentText DDX element to return an XML file that contains the text in one or more PDF documents. As with the PDF element, you specify a result attribute for the DocumentText element and enclose one or more PDF source elements within the start and end tags, as the following example shows:

<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="doc1"/>
</DocumentText>
</DDX>

The following code shows the CFM page that calls the DDX file. Instead of writing the output to a PDF file, you specify an XML file for the output:

<cfif IsDDX("documentText.ddx")>
<cfset ddxfile = ExpandPath("documentText.ddx")>
<cfset sourcefile1 = ExpandPath("book1.pdf")>
<cfset destinationfile = ExpandPath("textDoc.xml")>
<cffile action="read" variable="myVar" file="#ddxfile#"/>
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1="#sourcefile1#">
<cfset outputStruct=StructNew()>
<cfset outputStruct.Out1="#destinationfile#">
<cfpdf action="processddx" ddxfile="#myVar#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">
<!--- Use the cfdump tag to verify that the PDF files processed successfully. --->
<cfdump var="#ddxVar#">
</cfif>
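
Once the page above has run, the extracted text is in textDoc.xml. The following is a minimal sketch of reading that file back into ColdFusion, assuming the processddx call succeeded and wrote the file next to the CFM page. The variable names xmlText and textDoc are illustrative, and the sketch only dumps the parsed tree rather than assuming any particular element names in the output:

```cfml
<!--- Sketch only: assumes textDoc.xml was written by the processddx call above. --->
<cfset destinationfile = ExpandPath("textDoc.xml")>
<cfif FileExists(destinationfile)>
    <!--- Read the extracted text and parse it into a ColdFusion XML document object. --->
    <cffile action="read" file="#destinationfile#" variable="xmlText" charset="utf-8">
    <cfset textDoc = XmlParse(xmlText)>
    <!--- Show the root element name, then dump the tree to inspect its structure. --->
    <cfoutput>Root element: #textDoc.XmlRoot.XmlName#</cfoutput>
    <cfdump var="#textDoc#">
<cfelse>
    <cfoutput>No output file was found at #destinationfile#.</cfoutput>
</cfif>
```

Dumping the parsed document first is a quick way to see how the text is organized before writing any code that depends on specific elements.
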

The XML file conforms to a schema specified in doctext.xsd. For more information, see http://ns.adobe.com/DDX/DocText/1.0.

When you specify more than one source document, ColdFusion aggregates the pages into one file. The following example shows the DDX code for combining a subset of pages from two documents into one output file:

<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="doc1" pages="1-10"/>
<PDF source="doc2" pages="3-5"/>
</DocumentText>
</DDX>
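
A CFM page for the two-document DDX above might look like the following sketch. It assumes the DDX is saved as documentTextPages.ddx and that the second source is a file named book2.pdf; both names are hypothetical. It also passes the DDX pathname directly in the ddxfile attribute instead of reading the file into a variable first:

```cfml
<!--- Sketch only: documentTextPages.ddx and book2.pdf are hypothetical names. --->
<cfif IsDDX("documentTextPages.ddx")>
    <cfset ddxfile = ExpandPath("documentTextPages.ddx")>
    <!--- One struct key per PDF source name used in the DDX file. --->
    <cfset inputStruct = StructNew()>
    <cfset inputStruct.doc1 = ExpandPath("book1.pdf")>
    <cfset inputStruct.doc2 = ExpandPath("book2.pdf")>
    <!--- One struct key per result name; Out1 receives the aggregated text. --->
    <cfset outputStruct = StructNew()>
    <cfset outputStruct.Out1 = ExpandPath("textDoc.xml")>
    <cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">
    <!--- Inspect the per-output status returned by the processddx action. --->
    <cfdump var="#ddxVar#">
</cfif>
```

The keys of the input structure correspond to the source names in the DDX file (doc1 and doc2), and the key of the output structure corresponds to the result name (Out1).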