Just this past week, I was given a programming task: take a Microsoft Word template that had been saved as an XML file (Word Markup Language format) and auto-populate the bookmarks in the document with dynamic data from a database. The goal was to look up each bookmark by name (for example, “FirstNameLastName”) and populate only that field with the database data, leaving everything else in the document untouched and untransformed. This lets an end user manage the static text and formatting of the document without programming intervention. We can put one person from each department in charge of modifying the legal wording, or the wording to customers, and programming doesn’t have to create a new document template for every change.

Originally, we had used DDE controls to populate the bookmarks. That meant we needed a client-side control in place to handle the DDE into the Word document, and it also meant giving every single user in the company access to the templates used to create the populated documents that go to customers.

We stuck with MS Word because it is a user-friendly interface that everybody knows how to use, it was cheap to implement since we already had licenses for every user, and converting hundreds of templates into another format would have required a serious time commitment. Modifying the existing Word templates became the logical (cheap) choice, with the fastest time to production.

When I was charged with this task, I thought it would be an open-and-shut case with little effort, because WordML is technically valid XML. So I started out with a simple Perl script that used XML::Simple to read the Word document into a hash table that I could walk and modify tag attributes accordingly. I quickly learned that Microsoft did programmers no favors when they created their proprietary markup language. Depending on the formatting of the document (tabs, tables, etc.), the bookmarks could be at any level in the tree, so walking it would require XPath to find all of the instances of the bookmark… Easy enough, right? Not right. Not right at all. After modifying the script to use XML::DOM::XPath to find the location of all of the “<aml:annotation” tags, I soon realized that M$ had screwed us once again. As it turns out, the “<aml:annotation w:type="Word.Bookmark.Start"” element is a complete, self-closing node. It does not contain any data related to the value of the bookmark in the end document, and only seems to indicate that there is a bookmark somewhere nearby. So I had to manually search for where the data was being populated…

And I found it. After the “<aml:annotation w:type="Word.Bookmark.End"” node, there is a node starting with a “<w:r” element that isn’t there unless the bookmark is populated. I literally spent hours trying to find this nonexistent node… Annoying. Hop, skip, and a jump from there, right? We’ll just take the hash table that we have from XML::Simple, walk the tree until we find the bookmark end, and add the missing data… Still, no such break from Microsoft for us: Microsoft Word wouldn’t read the XML that XML::Simple wrote back out. :-(
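To make the difference concrete, here is a rough sketch of the two states as Perl strings, with a small check that tells them apart. The id, bookmark name, and text value are invented for illustration, and has_run_after_end is a hypothetical helper; a real template wraps this markup in paragraphs and other elements.

```perl
use strict;
use warnings;

# An empty bookmark: the Start and End annotations sit side by side.
# (id, name, and text below are made up for this illustration)
my $empty =
      '<aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="FirstNameLastName"/>'
    . '<aml:annotation aml:id="0" w:type="Word.Bookmark.End"/>';

# The same bookmark once populated: a <w:r> run follows the End annotation.
my $populated = $empty . '<w:r><w:t>Jane Doe</w:t></w:r>';

# Hypothetical helper: true when a run immediately follows a Bookmark.End annotation.
sub has_run_after_end {
    my ($xml) = @_;
    return $xml =~ m{w:type="Word\.Bookmark\.End"/><w:r[\s>]} ? 1 : 0;
}

printf "empty:     %d\n", has_run_after_end($empty);
printf "populated: %d\n", has_run_after_end($populated);
```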

I said screw it to convention and decided to fall back on what Perl does best: text processing. Read in the XML template as a text document, then perform some serious regex substitutions to achieve the end result. This worked.

Enough talk; the bare metal:

# Names we are allowed to populate, as a lookup table
my %wanted = map { $_ => 1 } @bms;

# Run properties (font and size) applied to the inserted text
my $rpr = '<w:rPr><w:rFonts w:ascii="Times New Roman" w:h-ansi="Times New Roman"/>'
        . '<wx:font wx:val="Times New Roman"/><w:sz-cs w:val="24"/></w:rPr>';

# Search for all occurrences of empty bookmarks -- match the bookmark end on the
# bookmark's ID (\2) so that we know we're on the same bookmark. This is done in one
# s///ge pass, because changing $content inside a while (//g) loop resets the match
# position and the loop would keep finding the first bookmark again.
# Replace only the bookmarks that we know we need to; leave the rest as they are.
$content =~ s{
    (<aml:annotation\saml:id="(\d+)"\sw:type="Word\.Bookmark\.Start"\sw:name="([A-Za-z0-9_]+)"/>
     <aml:annotation\saml:id="\2"\sw:type="Word\.Bookmark\.End"/>)
}{
    $wanted{$3} ? "$1<w:r>$rpr<w:t>$rh_dbData->{$3}</w:t></w:r>" : $1
}gex;

Here $content is the un-chomp’d contents of the XML Word template, and @bms is an array of the bookmark names that we want to populate. That way, when users create “reminder” bookmarks (who does that anyway?), we don’t try to populate them and throw a bunch of uninitialized-value warnings (we won’t have data for them anyway). One caveat: the bookmark must be completely empty in the template. If anything has been typed into it, even a single space, the regex won’t match and your resulting document won’t have the populated data. Any manipulation of the data that’s going into the bookmark has to be done in the Perl script.
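One manipulation worth doing in the script is escaping characters that are special in XML, since a raw “&” or “<” in a database value would leave the output malformed. A minimal sketch, assuming plain text values; xml_escape is a hypothetical helper and not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text = '' unless defined $text;   # missing values become an empty string
    $text =~ s/&/&amp;/g;              # must run first, or it would re-escape the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print xml_escape('Smith & Sons <Legal>'), "\n";   # prints: Smith &amp; Sons &lt;Legal&gt;
```

Each value would be passed through this before it is placed inside the “<w:t>” element.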

The trick with all of this is to make sure that the key of your dynamic data corresponds with the name of the bookmark. So, for the purposes of this example, we create the bookmark with the name of the database column. Easy as that.
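As a rough end-to-end illustration of that naming convention, the sketch below uses a plain hash as a stand-in for one database row and a two-bookmark fragment as a stand-in for a template. populate_bookmarks is a hypothetical wrapper, the sample values are invented, and the run formatting is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical wrapper: fill every empty bookmark whose name is listed in @$bms.
sub populate_bookmarks {
    my ($content, $rh_dbData, $bms) = @_;
    my %wanted = map { $_ => 1 } @$bms;

    # Match an empty Start/End pair with the same id, then append a run after it.
    $content =~ s{
        (<aml:annotation\saml:id="(\d+)"\sw:type="Word\.Bookmark\.Start"\sw:name="(\w+)"/>
         <aml:annotation\saml:id="\2"\sw:type="Word\.Bookmark\.End"/>)
    }{
        $wanted{$3} ? "$1<w:r><w:t>$rh_dbData->{$3}</w:t></w:r>" : $1
    }gex;

    return $content;
}

# Stand-in for one database row, keyed by column name (value is made up).
my $row = { FirstNameLastName => 'Jane Doe' };

# Tiny template fragment: one bookmark we fill, one "reminder" we leave alone.
my $template =
      '<aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="FirstNameLastName"/>'
    . '<aml:annotation aml:id="0" w:type="Word.Bookmark.End"/>'
    . '<aml:annotation aml:id="1" w:type="Word.Bookmark.Start" w:name="Reminder"/>'
    . '<aml:annotation aml:id="1" w:type="Word.Bookmark.End"/>';

my $out = populate_bookmarks($template, $row, ['FirstNameLastName']);
print $out, "\n";
```

Because the bookmark is named after the column, the only lookup needed is the hash key; the “Reminder” bookmark is not in the list, so it comes through unchanged.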

Email me with questions.

-dan

© 2013 Dan's Blog Suffusion theme by Sayontan Sinha