UrlEncode vs. HtmlEncode

Posted on June 01, 2004

Posted in Development

13 comments

While adding support for TrackBacks to my blog I ran into a weird issue. I was quite confused, and after going through MSDN I was confused even more. When I tried posting a text-only TrackBack, everything worked fine. But as soon as the "excerpt" (the notification text itself) contained an HTML tag, quotes, etc., the excerpt would get cut off right at the offending character.

TrackBack Background

When you post a TrackBack, you need to provide 4 textual values:

  • title - the title of the entry
  • excerpt - notification text itself
  • url - a permanent link to your blog entry
  • blog_name - the name of your blog where you posted an entry

Encoding Values Of Form Fields

These four parameters are assembled into a string and sent to a server that accepts TrackBacks. Here's a sample from movabletype.org:

POST http://www.foo.com/mt-tb.cgi/5
Content-Type: application/x-www-form-urlencoded

title=Foo+Bar&url=http://www.bar.com/&
excerpt=My+Excerpt&blog_name=Foo

It goes without saying that you need to encode the parameters before you concatenate them. The natural choice seems to be HttpUtility.HtmlEncode. After all, MSDN describes it as follows:

HTML-encodes a string and returns the encoded string...

URL encoding ensures that all browsers will correctly transmit text in URL strings. Characters such as ?, &,/, and spaces may be truncated or corrupted by some browsers so those characters must be encoded in <A> tags or in query strings where the strings may be re-sent by a browser in a request string.

It also provides an example:

string TestString = "This is a <Test String>.";
string EncodedString = Server.HtmlEncode(TestString);

which is supposed to yield "This+is+a+%3cTest+String%3e.". Well, if you run this code in the debugger it yields "This is a &lt;Test String&gt;." instead! The resulting string is safe to embed in HTML, but it's not good enough to be POSTed. Every "special character" turns into an entity that starts with an ampersand (&lt; for <, &quot; for quotes, etc.), and the Request.Form collection will split the string on the ampersands. This is exactly what I observed while sending myself test TrackBacks. Instead of those 4 parameters I would end up with 6 or more.
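The splitting can be reproduced without a web server. The sketch below is only an approximation: it assumes HttpUtility.ParseQueryString splits a form body on ampersands in much the same way Request.Form does, and the FormSplitDemo class and its two helper methods are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

public static class FormSplitDemo
{
    // Builds a two-field form body, encoding the excerpt with the supplied encoder.
    public static string BuildBody(string excerpt, Func<string, string> encode)
    {
        return "title=Foo+Bar&excerpt=" + encode(excerpt);
    }

    // Parses the body as name=value pairs separated by '&' and returns the excerpt field.
    // Assumption: this stands in for what Request.Form does on the receiving page.
    public static string ReadExcerpt(string body)
    {
        NameValueCollection fields = HttpUtility.ParseQueryString(body);
        return fields["excerpt"];
    }

    public static void Main()
    {
        string excerpt = "This is a <Test String>.";

        string htmlBody = BuildBody(excerpt, s => HttpUtility.HtmlEncode(s));
        string urlBody = BuildBody(excerpt, s => HttpUtility.UrlEncode(s));

        Console.WriteLine(htmlBody);
        Console.WriteLine(ReadExcerpt(htmlBody)); // cut off before the first entity
        Console.WriteLine(urlBody);
        Console.WriteLine(ReadExcerpt(urlBody));  // the original excerpt comes back intact
    }
}
```

With the HTML-encoded body, the ampersand that opens each entity is read as a field separator, so the excerpt ends there and the leftovers show up as extra fields.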

This is where HttpUtility.UrlEncode comes to the rescue. Suspiciously enough, it has almost the exact same wording and even the same code sample. A string encoded with UrlEncode can be safely POSTed to another page.
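As a minimal sketch of assembling the four TrackBack fields with UrlEncode: the TrackBackBody class and its Build method are hypothetical names, not part of any library, and the sample values are invented. Only the body is built here; sending it is left out.

```csharp
using System;
using System.Web;

public static class TrackBackBody
{
    // Concatenates the four TrackBack fields into an
    // application/x-www-form-urlencoded body, URL-encoding every value.
    public static string Build(string title, string excerpt, string url, string blogName)
    {
        return "title=" + HttpUtility.UrlEncode(title)
             + "&url=" + HttpUtility.UrlEncode(url)
             + "&excerpt=" + HttpUtility.UrlEncode(excerpt)
             + "&blog_name=" + HttpUtility.UrlEncode(blogName);
    }

    public static void Main()
    {
        string body = Build(
            "Foo & Bar",
            "An excerpt with <b>markup</b> and \"quotes\"",
            "http://www.bar.com/",
            "Foo");

        // The only raw ampersands left are the three field separators.
        Console.WriteLine(body);
    }
}
```

Note that this sketch encodes the url value as well, which the Movable Type sample above leaves as-is; encoding it is the safer choice if the permalink could ever carry a query string of its own.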

On the receiving end you need to decode the string and, again, oddly enough, both HtmlDecode and UrlDecode produced the same result.
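The two decoders are not interchangeable in general, though. One plausible explanation for the identical results (an assumption, not something verified here) is that Request.Form had already URL-decoded the value, which leaves nothing for either decoder to do. The sketch below uses a hypothetical DecodeDemo class to compare the decoders on each kind of input.

```csharp
using System;
using System.Web;

public static class DecodeDemo
{
    public static void Main()
    {
        string urlEncoded = "This+is+a+%3cTest+String%3e.";
        string htmlEncoded = "This is a &lt;Test String&gt;.";
        string plain = "Plain text without any escapes";

        // Each decoder only undoes its own kind of encoding.
        Console.WriteLine(HttpUtility.UrlDecode(urlEncoded));   // This is a <Test String>.
        Console.WriteLine(HttpUtility.HtmlDecode(urlEncoded));  // unchanged: no entities to resolve
        Console.WriteLine(HttpUtility.HtmlDecode(htmlEncoded)); // This is a <Test String>.
        Console.WriteLine(HttpUtility.UrlDecode(htmlEncoded));  // unchanged: no '+' or '%' escapes

        // On text that is already decoded, both calls are no-ops and so agree.
        Console.WriteLine(HttpUtility.UrlDecode(plain) == HttpUtility.HtmlDecode(plain)); // True
    }
}
```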

The Lesson I Learned

Clearly, MSDN is lying about HtmlEncode. There's no sign of the promised conversion. It does make the string safer to embed in XML, I think: the < and > are converted to entities, so the result can sit inside an XML element. Also, I looked at its code in Reflector and got confirmation that MSDN is lying. For those who are curious, here's the code of this method:

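// Decompiled with Reflector. Needs: using System.Globalization; using System.IO;
public static void HtmlEncode(string s, TextWriter output)
{
    char ch1;
    char ch2;
    int num3;

    if (s == null)
        return;

    int num1 = s.Length;   // length of the input
    int num2 = 0;          // current index
    while (num2 < num1)
    {
        ch1 = s[num2];     // C# indexer syntax for the Chars property
        ch2 = ch1;
        if (ch2 != '\"')
        {
            if (ch2 == '&') goto Label_0064;

            switch (ch2 - '<')
            {
                case 0: output.Write("&lt;"); goto Label_00AE;   // '<'
                case 1: goto Label_0071;                          // '='
                case 2: output.Write("&gt;"); goto Label_00AE;   // '>'
            }
            goto Label_0071;
        }

        output.Write("&quot;");
        goto Label_00AE;

    Label_0064:
        output.Write("&amp;");
        goto Label_00AE;

    Label_0071:
        // U+00A0 through U+00FF become numeric entities; everything else passes through.
        // The lower bound is the non-breaking space, not an ordinary space; otherwise
        // plain letters would be turned into entities too.
        if ((ch1 >= '\u00A0') && (ch1 < '\u0100'))
        {
            num3 = ch1;
            output.Write(string.Concat("&#",
                num3.ToString(NumberFormatInfo.InvariantInfo), ";"));
        }
        else
            output.Write(ch1);

    Label_00AE:
        num2 += 1;
    }
}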

No trace of converting a string "the URL way".
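The goto-heavy decompiler output is hard to follow, so here is a more readable sketch of what appears to be the same logic. It is an illustration, not the framework's source: the HtmlEncodeSketch class is a hypothetical name, and it assumes the numeric-entity range starts at U+00A0 (the non-breaking space), which matches the debugger output quoted earlier, where ordinary spaces and letters were left alone.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HtmlEncodeSketch
{
    // Replaces the four markup-significant characters with named entities and
    // characters in the range U+00A0..U+00FF with numeric entities.
    public static string Encode(string s)
    {
        if (s == null)
            return null;

        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char ch in s)
        {
            switch (ch)
            {
                case '"': sb.Append("&quot;"); break;
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                default:
                    if (ch >= '\u00A0' && ch < '\u0100')
                    {
                        // Assumed lower bound: the non-breaking space, U+00A0.
                        sb.Append("&#");
                        sb.Append(((int)ch).ToString(NumberFormatInfo.InvariantInfo));
                        sb.Append(';');
                    }
                    else
                    {
                        sb.Append(ch);
                    }
                    break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Encode("This is a <Test String>.")); // This is a &lt;Test String&gt;.
    }
}
```

Nothing in it touches spaces, slashes, or question marks, which is why the result is of no help inside a form body.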

I'd Like To Hear From You

Please feel free to share your opinions on which situations HtmlEncode and UrlEncode are each best suited for.

13 comments

Shannon J Hager
on May 31, 2004

URL encoding and HTML encoding are not the same thing. If you want to encode for use in a URL, you use URL encoding. If you want to encode for display on an HTML page (converting angle brackets to "& lt ;" for example), you HTML encode it. The docs are wrong; the verbiage for URLEncode is used in the HTMLEncode documentation you link to above.


Kiliman
on June 2, 2004

I agree with Shannon. I'll just add one other "rule of thumb".

The reason you "encode" data is to prevent certain characters in your data from being misinterpreted by the receiver.

HtmlEncode converts the angle brackets, quotes, ampersands, etc. to their entity values to prevent the HTML parser from confusing them with markup.

UrlEncode converts spaces to "+" and non-alphanumeric characters to their hex-encoded values. Again, this is to prevent the URL parser from misinterpreting an embedded ?, & or other values.

If you're wondering which one you should use in an HTTP POST, well just think of POST data as an extremely long query string. So naturally you will need to use UrlEncode.

Kiliman


Kiliman
on June 2, 2004

I was curious if Microsoft had fixed that documentation error.

I went to the Longhorn SDK site, and it is still showing the wrong information.

HtmlEncode doc

I don't know where you send documentation bug reports, but you should let them know.


CraigD
on July 6, 2004

This post helped me with my own ExtendedHtmlUtility.

So far I've found it useful in two situations: (1) resolving HTML entities in pages in a search engine spider and (2) outputting entire Chinese, Korean and Japanese pages using entities to represent all 'double byte' characters within the iso-8859-1 charset. Why? A shared Apache server had been set-up to ONLY send the HTTP Content-Type: iso-8859-1 - meaning that browsers could not successfully display the page without the user manually selecting the encoding...

I haven't touched on UrlEncoding (or decoding) but I guess it'd follow the same pattern, with a different encoded form...


Ryan Walters
on June 16, 2005

There seems to be some confusion over the difference between POST and GET. POST submits a form without appending parameters to the URL. GET is the method that appends the form fields to the URL.


Danny
on April 23, 2006

Hello,

'xmsdnbug@microsoft.com' is an alias you can use to report MSDN bugs. I came across your site today and reported the bug via an internal Microsoft alias, so there is no need for you to report it at this time.

I'd imagine it will take time as it probably has a few layers of approval to go through as well as localization into different languages.

Please note that while I work at Microsoft I have no direct ties to MSDN; I'm in a different product group entirely. :)


Milan Negovan
on April 24, 2006

Thank you, Danny. I noticed that these days it's always quicker to find a product group blog to contact people within Microsoft directly.


Disillusioned
on October 3, 2008

And yet again a polished site and the Internet's long memory results in confusion — luckily I caught the individual who had found this link before he took it on board.

Pity the author didn't actually take the time to learn to read things properly so they could understand that there are two methods being discussed and what the difference between the two methods was so they didn't write this rubbish.

Anyone who reads this comment before reading the article and who wants to learn about HTMLencode and URLencode: for your own sanity, please GOOGLE again!


Milan Negovan
on October 6, 2008

Pray, share what the mighty Google hast revealed on the matter. :)


Peter
on March 11, 2009

HTML encoding and URL encoding are not the same thing. Also note that URL encoding with JavaScript's escape(...) is not correct. Try dropping some text with a "+" sign in here: http://www.urlencoder.net . Then compare it to javascript:document.write(escape('mytext +++')); The JavaScript version does not encode +, but treats it as a valid char.


Sandeep Dhillon
on April 22, 2009

I was making an application that needed to POST an HTML document containing French/German characters. I tried this operation: URLEncode(HTMLEncode(myText)). But for some reason, perhaps Microsoft's different implementation of HTMLEncode or a lack of support for non-ASCII characters, I would get a PROTOCOL ERROR in the Response object.

The same application worked just fine with all-English characters. But as soon as a French/German character was introduced, it started failing.

My workaround:
Encode the whole HTML document to a Base64 string and then use URLEncode.

byte[] bytesToEncode = Encoding.UTF8.GetBytes(myHTMLDocAsString);
string encodedText = Convert.ToBase64String(bytesToEncode);

and then URLEncode(encodedText)

This could be successfully transferred over HTTP as a POST without errors!


davitz38
on January 13, 2010

Hi guys,
in case you are looking for an online tool, here's a url encoder and a html encoder
Use them to compare url encode vs html encode :)
David


Erik Eckhardt
on March 18, 2010

Almost 4 years later and Microsoft still hasn't fixed the docs. What a joke!