Last updated: Fri, 17 May 2013

The DOMDocument class

(PHP 5)

Introduction

Represents an entire HTML or XML document; serves as the root of the document tree.

Class synopsis

DOMDocument extends DOMNode {
/* Properties */
readonly public string $actualEncoding ;
readonly public DOMConfiguration $config ;
readonly public DOMDocumentType $doctype ;
readonly public DOMElement $documentElement ;
public string $documentURI ;
public string $encoding ;
public bool $formatOutput ;
public bool $preserveWhiteSpace = true ;
public bool $recover ;
public bool $resolveExternals ;
public bool $standalone ;
public bool $strictErrorChecking = true ;
public bool $substituteEntities ;
public bool $validateOnParse = false ;
public string $version ;
readonly public string $xmlEncoding ;
public bool $xmlStandalone ;
public string $xmlVersion ;
/* Methods */
__construct ([ string $version [, string $encoding ]] )
DOMAttr createAttribute ( string $name )
DOMAttr createAttributeNS ( string $namespaceURI , string $qualifiedName )
DOMCDATASection createCDATASection ( string $data )
DOMComment createComment ( string $data )
DOMDocumentFragment createDocumentFragment ( void )
DOMElement createElement ( string $name [, string $value ] )
DOMElement createElementNS ( string $namespaceURI , string $qualifiedName [, string $value ] )
DOMEntityReference createEntityReference ( string $name )
DOMProcessingInstruction createProcessingInstruction ( string $target [, string $data ] )
DOMText createTextNode ( string $content )
DOMElement getElementById ( string $elementId )
DOMNodeList getElementsByTagName ( string $name )
DOMNodeList getElementsByTagNameNS ( string $namespaceURI , string $localName )
DOMNode importNode ( DOMNode $importedNode [, bool $deep ] )
mixed load ( string $filename [, int $options = 0 ] )
bool loadHTML ( string $source )
bool loadHTMLFile ( string $filename )
mixed loadXML ( string $source [, int $options = 0 ] )
void normalizeDocument ( void )
bool registerNodeClass ( string $baseclass , string $extendedclass )
bool relaxNGValidate ( string $filename )
bool relaxNGValidateSource ( string $source )
int save ( string $filename [, int $options ] )
string saveHTML ([ DOMNode $node = NULL ] )
int saveHTMLFile ( string $filename )
string saveXML ([ DOMNode $node [, int $options ]] )
bool schemaValidate ( string $filename )
bool schemaValidateSource ( string $source )
bool validate ( void )
int xinclude ([ int $options ] )
/* Inherited methods */
public DOMNode DOMNode::appendChild ( DOMNode $newnode )
public string DOMNode::C14N ([ bool $exclusive [, bool $with_comments [, array $xpath [, array $ns_prefixes ]]]] )
public int DOMNode::C14NFile ( string $uri [, bool $exclusive [, bool $with_comments [, array $xpath [, array $ns_prefixes ]]]] )
public DOMNode DOMNode::cloneNode ([ bool $deep ] )
public int DOMNode::getLineNo ( void )
public string DOMNode::getNodePath ( void )
public bool DOMNode::hasAttributes ( void )
public bool DOMNode::hasChildNodes ( void )
public DOMNode DOMNode::insertBefore ( DOMNode $newnode [, DOMNode $refnode ] )
public bool DOMNode::isDefaultNamespace ( string $namespaceURI )
public bool DOMNode::isSameNode ( DOMNode $node )
public bool DOMNode::isSupported ( string $feature , string $version )
public string DOMNode::lookupNamespaceURI ( string $prefix )
public string DOMNode::lookupPrefix ( string $namespaceURI )
public void DOMNode::normalize ( void )
public DOMNode DOMNode::removeChild ( DOMNode $oldnode )
public DOMNode DOMNode::replaceChild ( DOMNode $newnode , DOMNode $oldnode )
}
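
As a rough illustration of how a few of the methods in the synopsis fit together, here is a minimal sketch. The library/book sample data is invented for this example; loadXML() parses a string, getElementsByTagName() returns a DOMNodeList, and getAttribute() is a DOMElement method.

```php
<?php
// Hypothetical sample data, used only for this illustration.
$source = '<library>'
        . '<book isbn="0345342968"><title>Fahrenheit 451</title></book>'
        . '<book isbn="0451524934"><title>1984</title></book>'
        . '</library>';

$doc = new DOMDocument();
$doc->loadXML($source);

// Collect isbn => title pairs from every <book> element.
$titles = array();
foreach ($doc->getElementsByTagName('book') as $book) {
    $title = $book->getElementsByTagName('title')->item(0);
    $titles[$book->getAttribute('isbn')] = $title->nodeValue;
}

print_r($titles);
```

The same loop works on any node list, because DOMNodeList can be iterated with foreach.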

Properties

actualEncoding

Deprecated. Actual encoding of the document; a read-only equivalent of encoding.

config

Deprecated. Configuration used when DOMDocument::normalizeDocument() is invoked.

doctype

The document type declaration associated with this document.

documentElement

A convenience attribute that allows direct access to the child node that is the document element of the document.

documentURI

The location of the document, or NULL if undefined.

encoding

Encoding of the document, as specified by the XML declaration. This attribute is not present in the DOM Level 3 specification, but it is the only way to manipulate the encoding of an XML document in this implementation.

formatOutput

Formats the output with indentation and extra space.

implementation

The DOMImplementation object that handles this document.

preserveWhiteSpace

Do not remove redundant white space. Defaults to TRUE.

recover

Proprietary. Enables recovery mode, i.e. trying to parse documents that are not well-formed. This attribute is not part of the DOM specification and is specific to libxml.

resolveExternals

Set it to TRUE to load external entities from a doctype declaration. This is useful for including character entities in your XML document.

standalone

Deprecated. Whether or not the document is standalone, as specified by the XML declaration; corresponds to xmlStandalone.

strictErrorChecking

Throws DOMException on errors. Defaults to TRUE.

substituteEntities

Proprietary. Whether or not to substitute entities. This attribute is not part of the DOM specification and is specific to libxml.

validateOnParse

Loads and validates against the DTD. Defaults to FALSE.

version

Deprecated. Version of XML; corresponds to xmlVersion.

xmlEncoding

An attribute specifying, as part of the XML declaration, the encoding of this document. It is NULL when unspecified or unknown, such as when the document was created in memory.

xmlStandalone

An attribute specifying, as part of the XML declaration, whether this document is standalone. It is FALSE when unspecified.

xmlVersion

An attribute specifying, as part of the XML declaration, the version number of this document. If there is no declaration and this document supports the "XML" feature, the value is "1.0".
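
A short sketch of how some of these properties interact in practice. The input string is an assumption made up for this example; the point is that preserveWhiteSpace is set before loading, while formatOutput only affects serialization.

```php
<?php
// Hypothetical input; the whitespace between elements is insignificant.
$source = "<root>\n<item>a</item>\n\n<item>b</item>\n</root>";

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // drop whitespace-only text nodes while parsing
$doc->formatOutput = true;        // indent the output when serializing
$doc->loadXML($source);

$rootName = $doc->documentElement->nodeName; // direct access to the root element
$version  = $doc->xmlVersion;                // the source has no XML declaration

$pretty = $doc->saveXML();
echo $pretty;
```

If preserveWhiteSpace were left at its default of TRUE, the whitespace-only text nodes from the input would stay in the tree and the serializer would generally leave the layout as it was.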

Notes

Note:

The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding, or Iconv for other encodings.
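
A minimal sketch of the conversion step this note describes, assuming the input text arrives as ISO-8859-1. It uses iconv() rather than utf8_encode(), since utf8_encode() and utf8_decode() are deprecated as of PHP 8.2; the element name "word" is made up for the example.

```php
<?php
// "café" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert to UTF-8 before handing the text to DOM.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Declare UTF-8 as the output encoding so saveXML() writes the bytes as-is.
$doc = new DOMDocument('1.0', 'UTF-8');
$word = $doc->appendChild($doc->createElement('word'));
$word->appendChild($doc->createTextNode($utf8));

$xml = $doc->saveXML();
echo $xml;
```

For other source encodings, only the first argument to iconv() would change.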

User Contributed Notes
sites.sitesbr.net
3 months ago
How to objectify a DOMDocument with a hierarchy like:
<root>
    <item>
          <prop1>info1</prop1>
          <prop2>info2</prop2>
          <prop3>info3</prop3>
     </item>
    <item>
          <prop1>info1</prop1>
          <prop2>info2</prop2>
          <prop3>info3</prop3>
     </item>
</root>

Each item can then be read in object style, for example:

<?php
$theNodeValue = $aitem->prop1;
?>

Here is the code: one class and two functions.

<?php

# Child and attribute values are stored as dynamic properties. The attribute
# below avoids the PHP 8.2+ deprecation notice; PHP 7 and earlier read it as a comment.
#[\AllowDynamicProperties]
class ArrayNode {
    public $nodeName, $nodeValue;
}

# Return only the element children of a node (skips text, comments, etc.).
function getChildNodeElements($domNode) {
    $nodes = array();
    for ($i = 0; $i < $domNode->childNodes->length; $i++) {
        $cn = $domNode->childNodes->item($i);
        if ($cn->nodeType == XML_ELEMENT_NODE) {
            $nodes[] = $cn;
        }
    }
    return $nodes;
}

function getArrayNodes($domDoc) {
    $res = array();
    $baseItemTagName = null;

    for ($i = 0; $i < $domDoc->childNodes->length; $i++) {
        $cn = $domDoc->childNodes->item($i);
        # The first element is the root tag...
        if ($cn->nodeType == XML_ELEMENT_NODE) {
            # ...but we want its child elements.
            $sub_cn = getChildNodeElements($cn);
            # Found the tag name (only if the root has element children):
            if (count($sub_cn) > 0) {
                $baseItemTagName = $sub_cn[0]->nodeName;
            }
            break;
        }
    }

    # Nothing to convert if the root has no element children.
    if ($baseItemTagName === null) {
        return $res;
    }

    $dnl = $domDoc->getElementsByTagName($baseItemTagName);

    for ($i = 0; $i < $dnl->length; $i++) {
        $item = $dnl->item($i);
        $arrayNode = new ArrayNode();

        # Summary
        $arrayNode->nodeName = $item->nodeName;
        $arrayNode->nodeValue = $item->nodeValue;

        # Child nodes (whitespace-only text nodes are skipped)
        $cn = $item->childNodes;
        for ($k = 0; $k < $cn->length; $k++) {
            $child = $cn->item($k);
            if ($child->nodeName == "#text" && trim($child->nodeValue) == "") continue;
            $arrayNode->{$child->nodeName} = $child->nodeValue;
        }

        # Attributes (check for null before reading the length)
        $attr = $item->attributes;
        if (!is_null($attr)) {
            for ($k = 0; $k < $attr->length; $k++) {
                $arrayNode->{$attr->item($k)->nodeName} = $attr->item($k)->nodeValue;
            }
        }

        $res[] = $arrayNode;
    }

    return $res;
}
?>

To use it:

<?php

# First, load the XML into a DOMDocument variable.
$url = "/path/to/yourxmlfile.xml";
$domSrc = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadXML($domSrc);

# Then get the ArrayNodes from the DOMDocument.
$ans = getArrayNodes($dom);

for ($i = 0; $i < count($ans); $i++) {
    $cn = $ans[$i];

    $info1 = $cn->prop1;
    $info2 = $cn->prop2;
    $info3 = $cn->prop3;

    // ...
}

?>
evert at er dot nl
2 years ago
A nice and simple node 2 array I wrote, worth a try ;)

<?php
function getArray($node)
{
    // Start from an empty array, so a node without attributes or
    // children yields array() (the original version returned false).
    $array = array();

    if ($node->hasAttributes())
    {
        foreach ($node->attributes as $attr)
        {
            $array[$attr->nodeName] = $attr->nodeValue;
        }
    }

    if ($node->hasChildNodes())
    {
        if ($node->childNodes->length == 1)
        {
            $array[$node->firstChild->nodeName] = $node->firstChild->nodeValue;
        }
        else
        {
            foreach ($node->childNodes as $childNode)
            {
                if ($childNode->nodeType != XML_TEXT_NODE)
                {
                    // Plain function, so recurse directly (no $this here).
                    $array[$childNode->nodeName][] = getArray($childNode);
                }
            }
        }
    }

    return $array;
}
?>
fcartegnie
3 years ago
Be careful with formatOutput.

Creating an empty node like this:
createElement('foo','')
instead of
createElement('foo')
will break formatOutput.
PhilipWayneRollins at gmail dot com
3 years ago
If you want to use DOMDocument to create xHTML documents, here is a simple class.

Note that this is designed for creating xHTML documents from scratch, but it could easily be extended to work with existing xHTML documents. Also, this is for xHTML, not XML.

<?php
class Document
{
    public $doctype;
    public $head;
    public $title = 'Sensei Ninja';
    public $body;
    private $styles;
    private $metas;
    private $scripts;
    private $document;

    function __construct()
    {
        $this->document = new DOMDocument();
        // The single-space value gives each element a text child, so it is
        // not serialized as a self-closing tag.
        $this->head = $this->document->createElement('head', ' ');
        $this->body = $this->document->createElement('body', ' ');
    }

    public function addStyleSheet($url, $media = 'all')
    {
        $element = $this->document->createElement('link');
        // rel="stylesheet" is required for browsers to apply the sheet.
        $element->setAttribute('rel', 'stylesheet');
        $element->setAttribute('type', 'text/css');
        $element->setAttribute('href', $url);
        $element->setAttribute('media', $media);
        $this->styles[] = $element;
    }

    public function addScript($url)
    {
        $element = $this->document->createElement('script', ' ');
        $element->setAttribute('type', 'text/javascript');
        $element->setAttribute('src', $url);
        $this->scripts[] = $element;
    }

    public function addMetaTag($name, $content)
    {
        $element = $this->document->createElement('meta');
        $element->setAttribute('name', $name);
        $element->setAttribute('content', $content);
        $this->metas[] = $element;
    }

    public function setDescription($dec)
    {
        $this->addMetaTag('description', $dec);
    }

    public function setKeywords($keywords)
    {
        $this->addMetaTag('keywords', $keywords);
    }

    public function createElement($nodeName, $nodeValue = null)
    {
        return $this->document->createElement($nodeName, $nodeValue);
    }

    public function assemble()
    {
        // Doctype creation (XHTML 1.0 Transitional public identifier)
        $doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';

        // Create the title element
        $title = $this->document->createElement('title', $this->title);

        // Add stylesheets if needed
        if (is_array($this->styles))
            foreach ($this->styles as $element)
                $this->head->appendChild($element);

        // Add scripts if needed
        if (is_array($this->scripts))
            foreach ($this->scripts as $element)
                $this->head->appendChild($element);

        // Add meta tags if needed
        if (is_array($this->metas))
            foreach ($this->metas as $element)
                $this->head->appendChild($element);

        $this->head->appendChild($title);

        // Create the document
        $html = $this->document->createElement('html');
        $html->setAttribute('xmlns', 'http://www.w3.org/1999/xhtml');
        $html->setAttribute('xml:lang', 'en');
        $html->setAttribute('lang', 'en');
        $html->appendChild($this->head);
        $html->appendChild($this->body);

        $this->document->appendChild($html);

        // Serialize only the <html> element, so no XML declaration
        // ends up after the doctype.
        return $doctype . "\n" . $this->document->saveXML($html);
    }
}
?>

Small example

<?php
$document = new Document();
$document->title = 'Hello';
$document->addStyleSheet('StyleSheets/main.css');
$div = $document->createElement('div');
$div->nodeValue = 'Hello, world!';
$div->setAttribute('style', 'color: red;');
$document->body->appendChild($div);
printf('%s', $document->assemble());
?>
cmyk777 at gmail dot com
3 years ago
This function may help to debug the current DOM element:

<?php
function dom_dump($obj) {
    // get_class() only accepts objects, so check first.
    if (is_object($obj)) {
        $classname = get_class($obj);
        $retval = "Instance of $classname, node list: \n";
        switch (true) {
            case ($obj instanceof DOMDocument):
                $retval .= "XPath: {$obj->getNodePath()}\n" . $obj->saveXML($obj);
                break;
            case ($obj instanceof DOMElement):
                $retval .= "XPath: {$obj->getNodePath()}\n" . $obj->ownerDocument->saveXML($obj);
                break;
            case ($obj instanceof DOMAttr):
                $retval .= "XPath: {$obj->getNodePath()}\n" . $obj->ownerDocument->saveXML($obj);
                break;
            case ($obj instanceof DOMNodeList):
                for ($i = 0; $i < $obj->length; $i++) {
                    $retval .= "Item #$i, XPath: {$obj->item($i)->getNodePath()}\n" .
                        "{$obj->item($i)->ownerDocument->saveXML($obj->item($i))}\n";
                }
                break;
            default:
                return "Instance of unknown class";
        }
    } else {
        return 'no elements...';
    }
    // Escape the markup so it can be shown inside an HTML page.
    return htmlspecialchars($retval);
}
?>

Example usage:

<?php
$dom = new DOMDocument();
$dom->load('test.xml');
$body = $dom->documentElement->getElementsByTagName('book');
echo '<pre>' . dom_dump($body) . '</pre>';
?>

Output:

Instance of DOMNodeList, node list:
Item #0, XPath: /library/book[1]
<book isbn="0345342968">
<title>Fahrenheit 451</title>
<author>R. Bradbury</author>
<publisher>Del Rey</publisher>
</book>
Item #1, XPath: /library/book[2]
<book isbn="0048231398">
<title>The Silmarillion</title>
<author>J.R.R. Tolkien</author>
<publisher>G. Allen &amp; Unwin</publisher>
</book>
Item #2, XPath: /library/book[3]
<book isbn="0451524934">
<title>1984</title>
<author>G. Orwell</author>
<publisher>Signet</publisher>
</book>
Item #3, XPath: /library/book[4]
<book isbn="031219126X">
<title>Frankenstein</title>
<author>M. Shelley</author>
<publisher>Bedford</publisher>
</book>
Item #4, XPath: /library/book[5]
<book isbn="0312863551">
<title>The Moon Is a Harsh Mistress</title>
<author>R. A. Heinlein</author>
<publisher>Orb</publisher>
</book>
Fernando H
5 years ago
Here is a quick example of how to use this class, so that new users can get started without having to figure it all out by themselves. (At the time of posting, this documentation had just been added and was lacking examples.)

<?php

// Set the content type to be XML, so that the browser will recognise it as XML.
header( "content-type: application/xml; charset=ISO-8859-15" );

// "Create" the document.
$xml = new DOMDocument( "1.0", "ISO-8859-15" );

// Create some elements.
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track", "The ninth symphony" );

// Set the attributes.
$xml_track->setAttribute( "length", "0:01:15" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );

// Create another element, just to show you can add any (realistic to computer) number of sublevels.
$xml_note = $xml->createElement( "Note", "The last symphony composed by Ludwig van Beethoven." );

// Append the whole bunch.
$xml_track->appendChild( $xml_note );
$xml_album->appendChild( $xml_track );

// Repeat the above with some different values..
$xml_track = $xml->createElement( "Track", "Highway Blues" );

$xml_track->setAttribute( "length", "0:01:33" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );
$xml_album->appendChild( $xml_track );

$xml->appendChild( $xml_album );

// Output the XML.
print $xml->saveXML();

?>

Output:
<Album>
  <Track length="0:01:15" bitrate="64kb/s" channels="2">
    The ninth symphony
    <Note>
      The last symphony composed by Ludwig van Beethoven.
    </Note>
  </Track>
  <Track length="0:01:33" bitrate="64kb/s" channels="2">Highway Blues</Track>
</Album>

If you want your PHP->DOM code to run under the .xml extension, you should set up your web server to run files with the .xml extension through PHP (refer to the PHP installation/configuration documentation on how to do this).

Note that this:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>

is NOT the same as this:
<?php
// Will NOT work.
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml_track = new DOMElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>

although this will work:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml->appendChild( $xml_album );
?>
admin at beerpla dot net
3 years ago
After seeing many complaints about certain DOMDocument shortcomings, such as bad handling of encodings and always saving HTML fragments with <html>, <head>, and DOCTYPE, I decided that a better solution is needed.

So here it is: SmartDOMDocument. You can find it at http://beerpla.net/projects/smartdomdocument/

Currently, the main highlights are:

- SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

- saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want - it saves HTML without adding that extra garbage that DOMDocument does.

- encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you - just use loadHTML() as you would normally.

- SmartDOMDocument Object As String - you can use a SmartDOMDocument object as a string which will print out its contents.
For example:
<?php
echo "Here is the HTML: $smart_dom_doc";
?>

I'm going to maintain this code and try to fix bugs as they come in.

Enjoy.
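
On PHP versions where saveHTML() accepts a node argument (the class synopsis above lists it as saveHTML([DOMNode $node = NULL])), a fragment can also be serialized without the added wrapper by saving only the children of <body>. A minimal sketch: the helper name saveHtmlFragment is made up for this example and is not part of SmartDOMDocument.

```php
<?php
// Hypothetical helper: serialize only what was inside the fragment,
// skipping the <html>/<body> wrapper that loadHTML() adds.
function saveHtmlFragment(DOMDocument $doc) {
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Hello <b>world</b></p><p>Second</p>');

$fragment = saveHtmlFragment($doc);
echo $fragment;
```

This does not address the encoding issue mentioned above; it only avoids the extra DOCTYPE, <html> and <body> in the output.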
jay at jaygilford dot com
3 years ago
Here's a small function I wrote to get all page links using DOMDocument, which will hopefully be of use to others.

<?php
/**
 * @author Jay Gilford
 */

/**
 * get_links()
 *
 * @param string $url
 * @return array
 */
function get_links($url) {

    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the url's contents into the DOM
    $xml->loadHTMLFile($url);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> tag in the dom and add it to the link array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
    }

    // Return the links
    return $links;
}
?>
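
Real-world pages are often not well-formed, and the HTML loader then reports a warning for each problem it finds. Here is a hedged variant of the same idea that works on an HTML string and collects parser errors through libxml_use_internal_errors() instead of emitting warnings. The function name get_links_from_html and the sample markup are assumptions for illustration.

```php
<?php
function get_links_from_html($html) {
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $links[] = array(
            'url'  => $link->getAttribute('href'),
            'text' => $link->nodeValue,
        );
    }
    return $links;
}

// Sample markup with an unclosed <p>, made up for this example.
$links = get_links_from_html('<p><a href="/docs">Docs</a> and <a href="/faq">FAQ</a>');
print_r($links);
```

To inspect the problems rather than discard them, libxml_get_errors() can be called before libxml_clear_errors().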
tloach at gmail dot com
3 years ago
For anyone else who has been having issues with formatOutput not working, here is a workaround.

Rather than just doing something like:

<?php
$outXML = $xml->saveXML();
?>

force it to reload the XML from scratch, and then it will format correctly:

<?php
$outXML = $xml->saveXML();
$xml = new DOMDocument();
$xml->preserveWhiteSpace = false;
$xml->formatOutput = true;
$xml->loadXML($outXML);
$outXML = $xml->saveXML();
?>
Nick M
1 year ago
You may need to save all or part of a DOMDocument as an XHTML-friendly string, something compliant with both XML and HTML 4. Here's the DOMDocument class extended with a saveXHTML method:

<?php

/**
 * XHTML Document
 *
 * Represents an entire XHTML DOM document; serves as the root of the document tree.
 */
class XHTMLDocument extends DOMDocument {

  /**
   * These tags must always self-terminate. Anything else must never self-terminate.
   *
   * @var array
   */
  public $selfTerminate = array(
    'area','base','basefont','br','col','frame','hr','img','input','link','meta','param'
  );

  /**
   * saveXHTML
   *
   * Dumps the internal XML tree back into an XHTML-friendly string.
   *
   * @param DOMNode $node
   *         Use this parameter to output only a specific node rather than the entire document.
   */
  public function saveXHTML(DOMNode $node = null) {

    if (!$node) $node = $this->firstChild;

    // Text, comments and other non-element nodes need no special
    // handling, so serialize them directly.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
      return $node->ownerDocument->saveXML($node);
    }

    // Work on a shallow copy of the element in a scratch document.
    $doc = new DOMDocument('1.0');
    $clone = $doc->importNode($node->cloneNode(false), true);
    $term = in_array(strtolower($clone->nodeName), $this->selfTerminate);
    $inner = '';

    if (!$term) {
      // An empty text child forces an explicit closing tag.
      $clone->appendChild(new DOMText(''));
      if ($node->childNodes) foreach ($node->childNodes as $child) {
        $inner .= $this->saveXHTML($child);
      }
    }

    $doc->appendChild($clone);
    $out = $doc->saveXML($clone);

    return $term ? substr($out, 0, -2) . ' />' : str_replace('><', ">$inner<", $out);
  }

}

?>

This hasn't been benchmarked, but is probably significantly slower than saveXML or saveHTML and should be used sparingly.
nathan at crause dot name
2 years ago
Be careful with any assumptions you may have about this library. Unlike XML parsers in other languages (such as Java), when parsing an XML Schema constrained XML document, defaults you may define (such as for attributes) will NOT be correctly represented in the DOMDocument.
hutch one two zero at gmail dot com
3 years ago
If you're using AJAX, make sure you put a header directive like the following at the top of your php file.

<?php header( "content-type: application/xml; charset=ISO-8859-15" ); ?>

If you don't add this header then your javascript downloadURL function will not work because the content type will not be recognised to be XML.
me at richardsnazell dot com
3 years ago
After running into a problem with DomDocument and a UTF-8 encoded input string, a simple work-around was to convert the input string to ISO-8859-1, run my DOM queries, then convert back to UTF-8:

<?php
$xhtml = 'a UTF-8 encoded string';
$xhtml = utf8_decode($xhtml); // convert UTF-8 string to ISO-8859-1

// doing some DOM manipulation here on $xhtml

$xhtml = utf8_encode($xhtml); // convert ISO-8859-1 string back to UTF-8
?>

Not ideal - but it worked for me.
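
Another commonly used workaround, in the same spirit as the charset meta tag trick described in another note on this page, is to tell the HTML parser the encoding explicitly before loading. A sketch, assuming the input really is UTF-8 and is an HTML fragment without its own <head>:

```php
<?php
// "café" as UTF-8: the "é" is the two bytes 0xC3 0xA9.
$fragment = "<p>caf\xC3\xA9</p>";

// Wrap the fragment with a charset declaration, so the parser
// does not have to guess the encoding.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body>' . $fragment . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Node values come back from DOM as UTF-8.
$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;
echo $text;
```

With this approach the text never leaves UTF-8, so characters outside ISO-8859-1 are not lost on the way in.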
e dot sand at elisand dot com
3 years ago
It should be pointed out that DOMDocument extends DOMNode in every way... that means that you even have access to the DOMNode properties (even though the documentation here does not mention them as being inherited).

I used to use an XPath query to access nodes from a DOMDocument (when getElementById or getElementsByTagName weren't usable), as I believed this to be the only way.  However, since DOMDocument fully extends DOMNode, you can use DOMDocument->firstChild for example to get the first child node.

This simplifies things quite a bit when using an XPath query may seem a bit excessive to get access to something as simple as the child nodes.
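
A small sketch of the point made here, using a made-up document. Note that firstChild on the document is not necessarily the root element (here it is a comment), whereas documentElement always is.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<!-- generated --><root><a/><b/></root>');

// Inherited DOMNode members work directly on the document.
$first = $doc->firstChild;      // the comment, not <root>
$root  = $doc->documentElement; // the root element

// Walk the root's children without any XPath query.
$names = array();
foreach ($root->childNodes as $child) {
    $names[] = $child->nodeName;
}

echo $first->nodeName, ' ', $root->nodeName, ' ', implode(',', $names);
```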
Atanas Markov (dreamer79bg at gmail dot com)
4 years ago
Here is a simple web scraping example using the PHP DOM that tries to get the largest text body of an HTML document. I needed it for a spider that had to show a short description for a page. It assumes that the document annotation is the largest <div>, <td> or <p> element in the page.
The example also shows a way to work around a bug in the DOM extension, which sometimes fails to recognize the HTML encoding. It seems to work if you put a charset meta tag right after the head tag of the document.

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, '...put url here...');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'set sth...');
curl_setopt($ch, CURLOPT_REFERER, '...set sth...'); // just a fake referer
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 0);
// Follow redirects, at most 20 of them
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS, 20);

$html = curl_exec($ch);
$html1 = curl_getinfo($ch);

// Try to get the page encoding as it was sent by the server
$csethdr = false;
if (!empty($html1['content_type'])) {
    $arr = explode('charset=', $html1['content_type']);
    if (isset($arr[1])) {
        $csethdr = strtolower(trim($arr[1]));
    }
}

$cset = false;
$arr = array();

// This is meant to replace the page's charset meta tag with utf-8,
// but it doesn't actually help (see the bug info).
if (preg_match_all(
    '/(<meta\s*http-equiv="Content-Type"\s*content="[^;]*;\s*charset=([^"]*?)(?:"|\;)[^>]*>)/',
    $html, $arr, PREG_PATTERN_ORDER)) {
    $cset = strtolower(trim($arr[2][0]));
    if ($cset != 'utf-8' || $cset != $csethdr) {
        $new = str_replace($arr[2][0], 'utf-8', $arr[1][0]);
        $html = str_replace($arr[1][0], $new, $html);
        $cset = $csethdr;
    } else {
        $cset = false;
    }

    if ($cset == 'utf-8') {
        $cset = false;
    }
}
unset($arr);
if ($cset) {
    $html = iconv($cset, 'utf-8', $html);
}
unset($cset);

// Work around the DOM bug: put a charset meta tag right after <head>.
// The pattern requires whitespace or ">" after "head" so that it does
// not match <header>, and only the first match is replaced.
$html = preg_replace('/<head(\s[^>]*)?>/i', '<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
', $html, 1);

$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;

// Collect all <div>, <td> and <p> descendants of a node, in that order.
function getTextContainers($node) {
    $found = array();
    foreach (array('div', 'td', 'p') as $tag) {
        foreach ($node->getElementsByTagName($tag) as $element) {
            $found[] = $element;
        }
    }
    return $found;
}

function getMaxTextBody($dom) {
    $maxlen = 0;
    $result = '';
    foreach (getTextContainers($dom) as $item) {
        $str = $item->nodeValue;
        if (strlen($str) > $maxlen) {
            $contentnew = getTextContainers($item);

            if (count($contentnew) == 0) {
                // No nested containers: this element's text is the candidate.
                $result = $str;
            } else {
                foreach ($contentnew as $value) {
                    $str1 = getMaxTextBody($value);
                    $str2 = $value->nodeValue;
                    // Let's say the largest body has more than 50% of the text in its parent
                    if (strlen($str1) * 2 < strlen($str2)) {
                        $str1 = $str2;
                    }
                    if (strlen($str1) * 2 > strlen($str) && strlen($str1) > $maxlen) {
                        $result = $str1;
                    } elseif (strlen($str1) > $maxlen) {
                        $result = $str1;
                    }
                    $maxlen = strlen($result);
                }
            }
            $maxlen = strlen($result);
        }
    }

    return $result;
}
print getMaxTextBody($dom);

?>
Jochem Blok
5 years ago
To pretty-print XML with indentation, I use:

<?php
$sXML = '<root><element><key>a</key><value>b</value></element></root>';
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput   = true;
$doc->loadXML($sXML);
echo $doc->saveXML();
?>
Yarg Dahc
3 years ago
Here is a child class of DOMDocument which has a toArray() method. Enjoy and/or improve.
<?php
class MyDOMDocument extends DOMDocument
{
    public function toArray(DOMNode $oDomNode = null)
    {
        // return empty array if dom is blank
        if (is_null($oDomNode) && !$this->hasChildNodes()) {
            return array();
        }
        $oDomNode = (is_null($oDomNode)) ? $this->documentElement : $oDomNode;
        if (!$oDomNode->hasChildNodes()) {
            $mResult = $oDomNode->nodeValue;
        } else {
            $mResult = array();
            foreach ($oDomNode->childNodes as $oChildNode) {
                // how many of these child nodes do we have?
                // this will give us a clue as to what the result structure should be
                $oChildNodeList = $oDomNode->getElementsByTagName($oChildNode->nodeName);
                $iChildCount = 0;
                // there are x number of children in this node that have the same tag name
                // however, we are only interested in the # of siblings with the same tag name
                foreach ($oChildNodeList as $oNode) {
                    if ($oNode->parentNode->isSameNode($oChildNode->parentNode)) {
                        $iChildCount++;
                    }
                }
                $mValue = $this->toArray($oChildNode);
                // node names such as "#text" get the numeric key 0
                $sKey   = ($oChildNode->nodeName[0] == '#') ? 0 : $oChildNode->nodeName;
                $mValue = is_array($mValue) ? $mValue[$oChildNode->nodeName] : $mValue;
                // how many of these child nodes do we have?
                if ($iChildCount > 1) {  // more than 1 child - make numeric array
                    $mResult[$sKey][] = $mValue;
                } else {
                    $mResult[$sKey] = $mValue;
                }
            }
            // if the child is <foo>bar</foo>, the result will be array(bar)
            // make the result just 'bar'
            if (count($mResult) == 1 && isset($mResult[0]) && !is_array($mResult[0])) {
                $mResult = $mResult[0];
            }
        }
        // get our attributes if we have any
        $arAttributes = array();
        if ($oDomNode->hasAttributes()) {
            foreach ($oDomNode->attributes as $sAttrName => $oAttrNode) {
                // retain namespace prefixes
                $arAttributes["@{$oAttrNode->nodeName}"] = $oAttrNode->nodeValue;
            }
        }
        // check for namespace attribute - Namespaces will not show up in the attributes list
        if ($oDomNode instanceof DOMElement && $oDomNode->getAttribute('xmlns')) {
            $arAttributes["@xmlns"] = $oDomNode->getAttribute('xmlns');
        }
        if (count($arAttributes)) {
            if (!is_array($mResult)) {
                $mResult = (trim($mResult)) ? array($mResult) : array();
            }
            $mResult = array_merge($mResult, $arAttributes);
        }
        $arResult = array($oDomNode->nodeName => $mResult);
        return $arResult;
    }
}

$sXml = <<<XML
<nodes>
    <node>text</node>
    <node>
        <field>hello</field>
        <field>world</field>
    </node>
</nodes>
XML;
$dom = new MyDOMDocument;
// drop whitespace-only text nodes, which would otherwise show up in the result
$dom->preserveWhiteSpace = false;
$dom->loadXML($sXml);
var_dump($dom->toArray());
?>
Output:

array (
    "nodes" => array (
        "node" => array (
            0 => "text",
            1 => array (
                "field" => array (
                    0 => "hello",
                    1 => "world"
                )
            )
        )
    )
)

 