Hello,
Recently, I discovered a problem with encoding in XML files.
The setup is:
- Beyond Compare 3.3.12
- Alternative to the built-in file format: XML Tidied with attributes sorted
This works brilliantly, except when it comes to special characters, e.g. a "non breaking space": &#160;
When you switch from the built-in file format "xml" to "xml tidied with attributes sorted", the program "HTMLTidy" returns an invalid XML file because the encoding breaks. Here is a simple example XML file to test this behaviour:
Code:
<?xml version="1.0" encoding="utf-8"?>
<foo>
  <bar>bar&#160;baz</bar>
</foo>
This wrong output of HtmlTidy seems to be caused by the character encoding setting in the config file (first line): "char-encoding: raw". The manpage says:
raw: output values above 127 without conversion to entities
So if I understand the behaviour of HtmlTidy correctly, it parses the input file and won't translate the byte 0xA0 back to an entity. The output then presumably contains a lone 0xA0 byte, which is not valid UTF-8 on its own.
Fortunately, all our source XML files are encoded in UTF-8, so I can avoid this problem by changing the encoding in the config file to utf8. But that might not work for other encodings.
Would it be complicated to parse the encoding of the XML file and set the parameter in the HtmlTidy call accordingly? Or something similar to that? Or are there perhaps other ideas?
Kind regards,
Iso
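P.S. To make the idea of reading the encoding from the XML file a bit more concrete, here is a rough Python sketch. It is only an illustration under some assumptions: I do not know how Beyond Compare actually invokes HtmlTidy, the config file name "xml_tidy.cfg" and the helper names are made up, and I am assuming that a "--char-encoding" option given on the command line takes precedence over the value in the config file. It also only handles ASCII-compatible encodings, since it looks for the XML declaration as plain bytes.

```python
import re

# Map XML encoding names (lowercased) to HTML Tidy "char-encoding" values.
# Only a few common cases are listed; anything else falls back to "raw".
TIDY_ENCODINGS = {
    "utf-8": "utf8",
    "iso-8859-1": "latin1",
    "iso-8859-15": "latin0",
    "windows-1252": "win1252",
    "us-ascii": "ascii",
}

# Matches the encoding pseudo-attribute of an XML declaration at the file start.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(head: bytes) -> str:
    """Return the encoding named in the XML declaration (default: utf-8)."""
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        head = head[3:]
    match = _DECL.match(head)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def tidy_args(head: bytes, config: str = "xml_tidy.cfg",
              source: str = "input.xml") -> list:
    """Build a hypothetical tidy command line with a matching char-encoding."""
    encoding = TIDY_ENCODINGS.get(declared_encoding(head), "raw")
    return ["tidy", "-config", config, "--char-encoding", encoding, source]
```

For the example file above, tidy_args() would be given the first bytes of the file and would pick "utf8" instead of "raw"; for an encoding that is not in the table it keeps today's behaviour.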