PHP SimpleXML CDATA Problem... and My Solution

by Charles Iliya Krempeaux, published on Tue Jun 14th, 2005

PHP5 has a new built-in way of handling XML. It's called SimpleXML.

Using this object to work with XML can make development a lot faster. SimpleXML parses an XML document and turns it into an object. So if we had a document like:

                
<?xml version="1.0"?>

<tvshows>
    <show>
        <name>The Simpsons</name>
    </show>

    <show>
        <name>That '70s Show</name>
    </show>

    <show>
        <name>Family Guy</name>
    </show>

    <show>
        <name>Lois &amp; Clark</name>
    </show>
</tvshows>

                
            

Then SimpleXML would give us a (PHP) object something like:

                
object(SimpleXMLElement)#1 (1) {
  ["show"]=>
  array(4) {
    [0]=>
    object(SimpleXMLElement)#2 (1) {
      ["name"]=>
      string(12) "The Simpsons"
    }
    [1]=>
    object(SimpleXMLElement)#3 (1) {
      ["name"]=>
      string(14) "That '70s Show"
    }
    [2]=>
    object(SimpleXMLElement)#4 (1) {
      ["name"]=>
      string(10) "Family Guy"
    }
    [3]=>
    object(SimpleXMLElement)#5 (1) {
      ["name"]=>
      string(12) "Lois & Clark"
    }
  }
}

                
            

(The output above is what you would get if you called var_dump() on the object. It probably looks more complex than it really is. Basically, to get at the "The Simpsons" part, we would write "$simplexml->show[0]->name".)

This is useful because: #1, we save a lot of time by not having to use the old XML parsing methods (which aren't difficult, just time-consuming); #2, we can use the object in a "foreach" loop; and #3, it's easier for newbies to learn.
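To make the "foreach" point concrete, here is a minimal sketch that loops over the shows from the sample document. The XML is inlined so the example stands alone, and the variable name $names is just for illustration:

```php
<?php
// Sample document from the article, inlined so the sketch is self-contained.
$xml = <<<XML
<?xml version="1.0"?>
<tvshows>
    <show><name>The Simpsons</name></show>
    <show><name>That '70s Show</name></show>
    <show><name>Family Guy</name></show>
    <show><name>Lois &amp; Clark</name></show>
</tvshows>
XML;

$simplexml = simplexml_load_string($xml);

// Collect every show name; the (string) cast turns each
// SimpleXMLElement into plain text.
$names = array();
foreach ($simplexml->show as $show) {
    $names[] = (string) $show->name;
}

echo implode("\n", $names), "\n";
```

Note that the "&amp;" in the last entry comes back as a plain "&" once the element is cast to a string.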

The one big problem is, SimpleXML does not handle CDATA!

(If you don't know what XML CDATA Section is, look at: http://en.wikipedia.org/wiki/CDATA_section)

Look at the last entry:

                
        <name>Lois &amp; Clark</name>
                
            

What if we used CDATA instead to represent this, and had:

                
        <name><![CDATA[Lois & Clark]]></name>
                
            

Well then too bad! SimpleXML just skips all that. It just pretends that it wasn't even there! (Note that when we put the text in the CDATA block, we were able to change the "&amp;" to a "&".)

So in other words, if we had:

                
<?xml version="1.0"?>

<tvshows>
    <show>
        <name>The Simpsons</name>
    </show>

    <show>
        <name>That '70s Show</name>
    </show>

    <show>
        <name>Family Guy</name>
    </show>

    <show>
        <name><![CDATA[Lois & Clark]]></name>
    </show>
</tvshows>
                
            

Then we'd get:

                
object(SimpleXMLElement)#1 (1) {
  ["show"]=>
  array(4) {
    [0]=>
    object(SimpleXMLElement)#2 (1) {
      ["name"]=>
      string(12) "The Simpsons"
    }
    [1]=>
    object(SimpleXMLElement)#3 (1) {
      ["name"]=>
      string(14) "That '70s Show"
    }
    [2]=>
    object(SimpleXMLElement)#4 (1) {
      ["name"]=>
      string(10) "Family Guy"
    }
    [3]=>
    object(SimpleXMLElement)#5 (1) {
      ["name"]=>
      object(SimpleXMLElement)#6 (0) {
      }
    }
  }
}
                
            

Note that the "Lois & Clark" part isn't even there!
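Whether the text is really gone, or just not shown by var_dump(), may depend on the PHP and libxml versions in use. As a hedged aside (this is not the approach taken in the rest of this article), later PHP releases accept a LIBXML_NOCDATA option as the third argument of simplexml_load_string(), which asks libxml to merge CDATA sections into ordinary text. The sketch below assumes PHP 5.1 or later:

```php
<?php
// Assumption: PHP 5.1+ with libxml 2.6+, where the $options argument
// and the LIBXML_NOCDATA constant are available.
$xml = '<tvshows><show><name><![CDATA[Lois & Clark]]></name></show></tvshows>';

// LIBXML_NOCDATA tells the parser to treat CDATA sections as text nodes.
$simplexml = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

// Cast to string to read the element's text.
$name = (string) $simplexml->show[0]->name;

echo $name, "\n";
```

If that option is not available in your setup, the conversion described next does not rely on it.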

So, what's the solution? Well, we can turn the CDATA into XML "escaped" text before giving the XML data to SimpleXML. In other words, take the text inside each CDATA section and do the following conversions...

  • &    becomes    &amp;
  • "    becomes    &quot;
  • <    becomes    &lt;
  • >    becomes    &gt;

(And of course, drop the "<![CDATA[" and "]]>" too.)
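The four conversions above can be sketched as a small helper. The name escape_cdata_text() is made up for this example; it is not part of PHP or of the function later in this article:

```php
<?php
// Hypothetical helper: escape the text found inside one CDATA section.
function escape_cdata_text($text)
{
    // strtr() with an array makes a single pass over the input, so the
    // "&" in an entity it has just produced (like "&lt;") is never
    // escaped a second time.
    return strtr($text, array(
        '&' => '&amp;',
        '"' => '&quot;',
        '<' => '&lt;',
        '>' => '&gt;',
    ));
}

echo escape_cdata_text('Lois & Clark <"pilot">'), "\n";
// prints: Lois &amp; Clark &lt;&quot;pilot&quot;&gt;
```

If you chain separate replacements instead, the "&" replacement has to run first, otherwise the ampersands of the other entities get escaped again.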

I tried doing this with regular expressions but just couldn't figure out the proper way to represent "not a string". (I tried it with both POSIX and Perl-compatible regular expressions, but couldn't get anything to work.) So, eventually I just decided to write a function for it. (Which is tedious.) So, here it is. Hopefully it will help everyone else to not get frustrated with SimpleXML being too simple:

                
    function uncdata($xml)
    {
        // States:
        //
        //     'out'
        //     '<'
        //     '<!'
        //     '<!['
        //     '<![C'
        //     '<![CD'
        //     '<![CDA'
        //     '<![CDAT'
        //     '<![CDATA'
        //     'in'
        //     ']'
        //     ']]'
        //
        // (Yes, the states are represented by strings.)
        //

        $state = 'out';

        $a = str_split($xml);

        $new_xml = '';

        foreach ($a AS $k => $v) {

            // Deal with "state".
            switch ( $state ) {
                case 'out':
                    if ( '<' == $v ) {
                        $state = $v;
                    } else {
                        $new_xml .= $v;
                    }
                break;

                case '<':
                    if ( '!' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<!':
                    if ( '[' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<![':
                    if ( 'C' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<![C':
                    if ( 'D' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<![CD':
                    if ( 'A' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<![CDA':
                    if ( 'T' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<![CDAT':
                    if ( 'A' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case '<![CDATA':
                    if ( '[' == $v  ) {


                        $cdata = '';
                        $state = 'in';
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;

                case 'in':
                    if ( ']' == $v ) {
                        $state = $v;
                    } else {
                        $cdata .= $v;
                    }
                break;

                case ']':
                    if (  ']' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $cdata .= $state . $v;
                        $state = 'in';
                    }
                break;

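                case ']]':
                    if (  '>' == $v  ) {
                        // End of the CDATA section: escape its text and emit it.
                        // '&' is replaced first so the other entities are not double-escaped.
                        $new_xml .= str_replace('>','&gt;',
                                    str_replace('<','&lt;',
                                    str_replace('"','&quot;',
                                    str_replace('&','&amp;',
                                    $cdata))));
                        $state = 'out';
                    } elseif ( ']' == $v ) {
                        // A run like "]]]>": the oldest ']' is text, the last two may still close the section.
                        $cdata .= $v;
                    } else {
                        $cdata .= $state . $v;
                        $state = 'in';
                    }
                break;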
            } // switch

        }

        //
        // Return.
        //
            return $new_xml;

    }
                
            

So to use this, you'd do something like:

                
    // Get the XML data, with possible CDATA sections in it.
    $xml_data = file_get_contents('http://changelog.ca/feed/rss/');


    // Convert the CDATA sections using the un-cdata function.
    $xml_data = uncdata($xml_data);

    // Create the SimpleXML object (not having to worry about losing info due to CDATA)
    $simplexml = simplexml_load_string($xml_data);
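For comparison, here is one way the regular-expression route mentioned earlier might look. This is only a sketch, not the function above: the name uncdata_regex() is made up, it assumes PHP 5.3 or later (for the anonymous function), and it leans on a non-greedy match to stop at the first "]]>" rather than trying to express "not a string":

```php
<?php
// Hypothetical regex-based variant of the un-cdata idea.
function uncdata_regex($xml)
{
    return preg_replace_callback(
        // (.*?) is non-greedy, so it stops at the FIRST "]]>".
        // The "s" modifier lets "." match newlines inside the section.
        '/<!\[CDATA\[(.*?)\]\]>/s',
        function ($match) {
            // $match[1] is the text between "<![CDATA[" and "]]>".
            return strtr($match[1], array(
                '&' => '&amp;',
                '"' => '&quot;',
                '<' => '&lt;',
                '>' => '&gt;',
            ));
        },
        $xml
    );
}

echo uncdata_regex('<name><![CDATA[Lois & Clark]]></name>'), "\n";
// prints: <name>Lois &amp; Clark</name>
```

Following the same call pattern as above, uncdata_regex($xml_data) would stand in for uncdata($xml_data). One thing to watch: preg_replace_callback() returns NULL if PCRE gives up (for example on a backtrack limit with a very large section), so the result is worth checking before it goes to SimpleXML.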
                
            

Just an extra note. I'm not sure how efficient this is, given the use of the str_split() function (to turn a string into an array of characters). But if you are using SimpleXML, you're probably not really worried about that. (Or at least not at that stage of development.)
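If the character array ever does become a concern, one possible alternative is to jump between the CDATA markers with strpos() instead of walking the string one character at a time. The sketch below is an assumption about how that could be done, not a measured improvement, and the name uncdata_strpos() is made up:

```php
<?php
// Hypothetical variant that scans for the CDATA markers with strpos()
// rather than splitting the input into an array of characters.
function uncdata_strpos($xml)
{
    $open  = '<![CDATA[';
    $close = ']]>';
    $out   = '';
    $pos   = 0;

    while (($start = strpos($xml, $open, $pos)) !== false) {
        $text_start = $start + strlen($open);
        $end = strpos($xml, $close, $text_start);
        if ($end === false) {
            break; // Unterminated section: leave the rest untouched.
        }

        // Copy everything before the section as-is.
        $out .= substr($xml, $pos, $start - $pos);

        // Escape the section's text and drop the markers.
        $out .= strtr(substr($xml, $text_start, $end - $text_start), array(
            '&' => '&amp;',
            '"' => '&quot;',
            '<' => '&lt;',
            '>' => '&gt;',
        ));

        $pos = $end + strlen($close);
    }

    // Append whatever follows the last section.
    return $out . substr($xml, $pos);
}

echo uncdata_strpos('<name><![CDATA[Lois & Clark]]></name>'), "\n";
// prints: <name>Lois &amp; Clark</name>
```

Whether this is actually faster would need to be timed on real feeds.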

Hopefully someone will find this useful. (If you find any errors or bugs with it, send me an e-mail and let me know.)

