Javascript URL Encoding and Decoding

Update 6/8/2007

Phillip has discovered and squashed, in one go, a hidden bug. At least it was hidden to me, although obviously and thankfully not to Phillip!

The bug concerns multiple % characters appearing in sequence; its description and fix can be found here.

Update 9/11/2007

Although technically not a bug from the point of view of encoding characters for their inclusion in a URL, a number of people have mentioned that the code as it existed previously did not encode and decode all characters as PHP would, so in that regard, it was a serious bug.

Although a number of people contacted me about the problem, there were two in particular I would like to give credit to directly within the article because not only did they identify the problem, they both provided working solutions although their solutions were different.

For an explanation of their respective solutions, take a look at A Tale of Two Bug Fixes.

Much ado about nothing?

I first started this project with a specific problem at hand. I needed byte for byte output compatibility between Javascript's and PHP's URL encoding processing. Output compatibility in that the output of each process not only had to be decodable by the other back into the same originating strings but, more so, the outputs had to be the same even in their encoded states.

I know you are saying to yourself, "Self, Javascript doesn't have a URL encoding function!" and you would be right but Javascript's escape() function is used often enough to attempt to perform URL encoding so as to practically be the same thing in most people's eyes.

escape() and urlencode()

Javascript's escape()/unescape() and PHP's urlencode()/urldecode() functions do pretty well at being able to decode each other's output and get out pretty much what went in on the other side but for a specific application, that wasn't good enough. And, now that I have this function, I use it all the time even in cases where Javascript's escape() and unescape() might suffice.

I say "pretty well" because in one specific case, which will be shown, their decoding of the other's encoding gets it totally wrong.

To see how PHP and Javascript deal with encoding characters using urlencode() and escape(), respectively, refer to the following tables.

Original  PHP   Javascript
(space)   +     %20
!         %21   %21
"         %22   %22
#         %23   %23
$         %24   %24
%         %25   %25
&         %26   %26
'         %27   %27
(         %28   %28
)         %29   %29
*         %2A   *
+         %2B   +
,         %2C   %2C
/         %2F   /
:         %3A   %3A
;         %3B   %3B
<         %3C   %3C
=         %3D   %3D
>         %3E   %3E
?         %3F   %3F
@         %40   @
[         %5B   %5B
\         %5C   %5C
]         %5D   %5D
^         %5E   %5E
`         %60   %60
{         %7B   %7B
|         %7C   %7C
}         %7D   %7D
~         %7E   %7E
_         _     _
.         .     .

Incompatible encodings

Did you notice something? PHP's urlencode() encodes a space as a '+' sign while Javascript's escape() encodes a space as '%20'. That wouldn't be a problem if in their decodings of each other's output, they somehow automagically came up with the right answer, in this case a space, but unfortunately it would have to be magical because it ends up being impossible.

They not only encode the space differently but making it worse, they encode the actual '+' sign differently as well and, in the worst possible way! PHP's urlencode() encodes the '+' sign as a %-string value, '%2B' and Javascript's escape() as a '+'!

The problem immediately becomes insurmountable. PHP will say '+' meaning a space and Javascript will think 'plus sign'; Javascript will say '+' meaning a plus sign and PHP will think 'space'!
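To make the clash concrete, here is a small sketch using Javascript's own escape() and unescape(). The PHP output is not computed here; it is hardcoded from the table above, and the variable names are only for illustration.

```javascript
var original = 'a b+c';

// Javascript side: the space becomes %20, the '+' passes through untouched.
var fromJavascript = escape(original);

// PHP side, hardcoded from the table above (an assumption of this sketch):
// urlencode() turns the space into '+' and the '+' into %2B.
var fromPhp = 'a+b%2Bc';

// unescape() leaves '+' alone, so PHP's encoded space comes back as a
// plus sign and can no longer be told apart from the real one.
var phpDecodedByJavascript = unescape(fromPhp);

console.log(fromJavascript);          // a%20b+c
console.log(phpDecodedByJavascript);  // a+b+c
```

Going the other way, PHP's urldecode() would read the untouched '+' in 'a%20b+c' as a space, so the original plus sign is lost in that direction too.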

But even beyond the relative usefulness of the functions I created for the Javascript encoding and decoding, the process I used seems to be relatively novel, although straightforward and simple enough compared to most of the methods one commonly finds available.

Hunting mosquitos with a bazooka

Virtually all the methods I have seen for Javascript encoding of strings into some type of URL encoded output start out the same, as does mine. One way or another, with some ways being better than others, step through the input string searching for characters needing to be encoded. It's what happens afterward where things start getting interesting!

There seem to be three schools of thought as to what to do next. The first, create an array of all possible hex values, 0 - FF, and then use the character code of the target as an offset to index into the array. True, it works, but...

The second, a bit more interesting and at first seemingly a more intelligent approach, create a short array of the possible hex characters, 0 - F, and then bit shift and bitwise AND the decimal character code to achieve an offset into the shorter array.

You wouldn't want to bet on which one is actually more efficient though.
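For reference, the second approach might look something like the following. This is only a sketch of the general idea, with a hypothetical function name and a string standing in for the short array, not code taken from any particular implementation.

```javascript
// The sixteen possible hex characters, indexed by their value.
var HEX_DIGITS = '0123456789ABCDEF';

// Hypothetical helper: convert a character code (0 - 255) to two hex digits.
function byteToHexByShifting(code) {
  var highNibble = (code >> 4) & 0x0F; // shift the top four bits down, then mask
  var lowNibble = code & 0x0F;         // mask off the bottom four bits
  return HEX_DIGITS.charAt(highNibble) + HEX_DIGITS.charAt(lowNibble);
}

console.log(byteToHexByShifting(43)); // 2B
console.log(byteToHexByShifting(9));  // 09
```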

Then there is the third approach, mine. Simply use a single Javascript object method that does the decimal to hex conversion in one step.

Dec to Hex with toString()

That's right, the very function that converts a number into a string that everyone likes to "ingeniously" avoid by simply adding an empty string to a number using '+'.

Anyone else see the irony of the '+' sign showing up again?

In Javascript, adding an empty string to a number implicitly converts the number into a string. The toString() method, however, can be used not only for the mundane process of converting numbers to strings but, much more usefully, to convert a number into its string representation in another base (any radix from 2 to 36).

One can understand why some may have avoided using toString(), as the functioning of other object types' toString() methods totally changed, although not for object type Number, between Javascript v1.2 and v1.3.

But even more so, before v1.2, the general Javascript Object method of toString() wasn't widely supported. However, considering that Javascript has been fairly stable for a while now and the toString() method for object type Number has been stable since Javascript v1.1, using toString() should be relatively safe.

I won't mention though that v1.3 came out 10 years ago. Oh, sorry, I just did.

In any event, according to the standards, each Javascript Object type, of which 'Number' is an example, has an object method 'toString()' which is overloaded depending on the Object type.

Although a discussion of all the Object types and exactly how the toString() method is overloaded for each one is outside the scope of this discussion, suffice it to say that it is overloaded for object type Number exactly as we need it to be.

Using Radix to indicate output number base

For the Javascript Number object, the toString() method accepts one parameter, with that parameter being 'radix', or number base. If you want to convert a number into a binary string, base 2, the radix parameter would be '2'. Likewise, and more useful for this discussion, to convert to a hexadecimal string representation, base 16, the toString() parameter would be '16'.

Of course this doesn't actually convert the number from one base to another but instead, only converts the string representation of the Number object. However, since a string is what we want on output anyway it is for our purposes exactly what we need.
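A quick illustration of the radix parameter, using the character code of the '+' sign (43) as the example value. The variable names are just for this sketch.

```javascript
var code = 43;                  // character code of '+'

var asDecimal = code.toString();   // no radix given, so base 10
var asBinary = code.toString(2);   // base 2
var asHex = code.toString(16);     // base 16, lowercase digits

console.log(asDecimal); // 43
console.log(asBinary);  // 101011
console.log(asHex);     // 2b
```

Note that toString(16) produces lowercase hex digits, so matching PHP's uppercase output would need an extra toUpperCase() call.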

There, and back?

Now that we have these characters encoded into %-Hex strings, how do we get them back out again, i.e. back into decimal numbers?

It would be nice if there was a fromString() method to complement the toString() method but there isn't. On the other hand, if we found the answer to number-to-string conversions in a string related process, wouldn't it make sense to look to a string-to-number conversion in a number related process? After all, it is one's hope that standards specifications make sense although sometimes that can only ever remain a hope.

parseInt() to the rescue!

The parseInt() function is not only more likely to have been used by more web coders, due to its usefulness in 'cleansing' user input of numerical values as well as stripping the 'px' units from DOM element measurements but more so, just like toString() gives us exactly what we need in the 'radix' parameter, parseInt() also allows us to specify a number base for the required string to number conversion.

Of course, more often than not, parseInt() is used without a radix parameter and so assumes a decimal conversion, just as toString(), if used at all, is more than likely used sans radix parameter. But now that we have a reason to use more of the power of these methods and functions, there's no reason not to.

Apples and oranges but still good together

It probably should have been mentioned before this but there is a significant difference between toString() and parseInt().

While toString() is a Method of a Javascript Object, parseInt() is itself actually a Javascript Function. The difference, beyond simple words, being that while the toString() Method is called on a Number Object, i.e. :

var myNumber = 10;
var myHexNumber = myNumber.toString(16); // the string 'a'

the parseInt() Function takes a string as one of its parameters, i.e.

var myHexString = '0x2F';
var myDecimalNumber = parseInt(myHexString, 16); // radix 16 for hex input; gives the number 47
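Putting the two together, a round trip from character to hex string and back might look like this. It is a minimal sketch with illustrative variable names.

```javascript
var plusCode = '+'.charCodeAt(0);                  // character to number
var plusHex = plusCode.toString(16);               // number to hex string (method on Number)
var backToCode = parseInt(plusHex, 16);            // hex string to number (global function)
var backToChar = String.fromCharCode(backToCode);  // number to character

console.log(plusCode, plusHex, backToCode, backToChar); // 43 2b 43 +
```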

How did we get here?

If you remember, we skipped over the 'simple' part, finding characters to encode, to get to the more interesting part, how to encode them but actually, the finding of the characters is a bit interesting as well.

What is more often done is to step through the string to be encoded character by character, encoding each in turn as needed. Or use String.replace(), which ends up doing exactly the same thing, except you don't see the Wizard behind the curtain.

But, what if there are hundreds of characters not requiring encoding and only one solitary character in need of encoding stuck in the middle?

Worse yet, what if there are hundreds of characters with actually no characters in need of encoding? How would you like to be fed this entire page one character at a time only to find out at the end, that there was nothing on the page of interest? If you don't know what that is like, just wait, the end of this page isn't all that far, or maybe it is.

Wouldn't it make much more sense to group all the characters together that we can safely ignore to be able to more quickly get at the characters actually in need of encoding? Well, even if it doesn't make more sense, that's the way we are going to do it here.

To array or not to array?

One obvious way to search for what is in need of encoding is put what you want to search for in an array and then iterate through the array doing a search and replace along the way. While that obviously works, PHP encodes almost 30 characters and doing a full string search 30 times does not sound like fun.

Another way to do it, although often shied away from, is the use of Regular Expressions, regex. There is a reason that it is shied away from though. When the coders who came up with Regular Expressions got together, they decided to create something seemingly so complicated and hard to understand that it would ensure their respective futures in the coding business because only they would understand how to use it.

Actually, that last part may not necessarily be true, I hope, although it sure seems like it sometimes. And, although I half jokingly thought that for a significant period of time, an application called RegexBuddy totally changed that for me.

I don't normally give a plug to any specific product or service other than those that provide information and even more rarely those that cost money but RegexBuddy is such a powerful product, it deserves it.

In any event, Regular Expressions is what we are going to use so on with an explanation of how they will be used.

A simple regex expression

Although the use of regex is decided, how it is to be used still is not. Again we are faced with choices: search for all instances of characters in need of encoding, let's call it the "brute force" method, or eliminate all characters not needing encoding, the 'elimination game'. In many cases it is a toss up which way to go, six of one, half a dozen of the other, but in this case, the latter turns out to be the more efficient, as well as being more straightforward.

Let's talk about that last part first, being more straightforward. Consider the case of the percent sign, '%'. We can't very well create an all-inclusive regular expression which involves the '%' sign because as soon as we encode the first character, the '%' sign in the encoded strings will start showing up as needing encoding again. Infinite loops are not my idea of fun, and trying to come up with some way around this problem that does not in itself cause even more problems would not be fun either.

On the other hand, going the other possible route, eliminating all characters not in need of encoding has a side benefit of, in this case, being more efficient.

Regarding the efficiency, a 'brute force' regex used to search for specific instances of target characters might look like this:

/([\s!"#$%&'()*+\/:;<=>?`{|}~\]\[\^\\])/

Don't worry though. Since we aren't going to use it, you don't have to understand much about it. Ignore the regex opening and closing '/' characters, the Capture Group opening and closing '(' and ')', the Class opening and closing '[' and ']', and the numerous '\' characters used to "escape" the following character. The rest of the characters are basically compared, one by one, against each character in the string being searched.

With the almost 30 characters in that expression, the 'space', !, ", #, $, %, &, (, ), *, +, /, :, ;, <, =, >, ?, `, {, |, }, ~, ], [, ^ and finally the \, almost 30 base comparisons would be required for each character searched. Were there only a few, instead of close to 30, I'd be tempted to go that route, but since those 30 base comparisons end up being multiplied by the total number of characters being searched, and since there is a much better way, this method gets a pass.

The common implementation of regular expressions

Regular Expressions support the specifying of not only specific characters or strings of characters that must be matched, as shown just previously but also, and more useful in our project, specifying a range of characters to be matched.

Although it may seem like searching through a string based on a range of characters is not much different than searching through a string for individual characters contained in an array, regex implementations can make one very important optimization.

Instead of a regex process performing one comparison for each possible comparison pair, such as:

var targetString = 'String being searched.';
var searchCharacters = ['a', 'b', 'c', 'd']; // ....and so on, one entry per character
for (var x = 0; x < targetString.length; x++) {
  // Compare this character against every entry in the list.
  for (var y = 0; y < searchCharacters.length; y++) {
    if (targetString.charAt(x) == searchCharacters[y]) {
      console.log('GOTCHA!!! ' + targetString.charAt(x) + ' at position ' + x);
    }
  }
}

A regex expression can be written and used that instead will perform a comparison only once on each character being searched, against a range of characters being searched for, using something like this:

var targetString = 'String being searched.';
var startCode = 'a'.charCodeAt(0); // lower bound of the range
var endCode = 'z'.charCodeAt(0);   // upper bound of the range
for (var x = 0; x < targetString.length; x++) {
  var code = targetString.charCodeAt(x);
  // One range check per character instead of one check per list entry.
  if (startCode <= code && endCode >= code) {
    console.log('GOTCHA!!! ' + targetString.charAt(x) + ' at position ' + x);
  }
}

The Elimination game

Besides searching for specific instances of a given character, regex can also be used to search for occurrences of a range of characters specified in a character Class, e.g. find anything between the lower case letters 'a' and 'z', find anything between the upper case letters 'A' and 'Z' or find any numeral between the digits '0' and '9'.

Such a regex definition could look something like this:

/([a-zA-Z0-9])/

The core of what that means is exactly what was described in the example just above this regex. Find anything between 'a' to 'z', 'A' to 'Z' or '0' to '9'. The rest of it, the opening and closing expression delimiters '/', the opening and closing Capture Group characters '(' and ')' and the open and closing Character Class characters '[' and ']' are important but won't be gone into any great detail yet.

However, that expression has one serious problem, it would return only a single character each time essentially putting us back at square -1 having to search through each and every character.

Returning consecutive matches

Fortunately regex has a way to solve our little problem by returning as many consecutive characters as fit the conditions. It is the '*' character, which goes after whatever it is that needs to return more than a single character, like this:

/([a-zA-Z0-9]*)/

Since the '*' is outside the Class definition but inside the Capture Group, it will include as many characters fitting the Class definition into any Capture Group it returns. A "Capture Group" is basically a list/string of all the characters matching the query parameters.
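The difference the '*' makes can be seen by running both expressions against the same string with exec(). The sample string and variable names below are only for illustration.

```javascript
var sample = 'abc def';

// Without '*': the Capture Group holds a single matching character.
var single = /([a-zA-Z0-9])/.exec(sample);

// With '*': the Capture Group holds the whole run of matching characters.
var run = /([a-zA-Z0-9]*)/.exec(sample);

console.log(single[1]); // a
console.log(run[1]);    // abc
```

In both cases exec() returns an array whose first member is the full match and whose second member is the contents of the Capture Group.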

We are now 99.99999% ready to start encoding strings!! Unfortunately our regex is still not quite right. That last little bit, were it not dealt with, would find us in an infinite loop as it continually matches and returns all occurrences matching the required conditions. So close and yet so far.

But regex comes to our rescue again by providing a means to limit what is matched. We can alter the latest regex so that it will only return a match if it occurs at the beginning of the string being searched. How this helps us is that if we find a match and remove it, the next character will be a character that we need to encode. If we then 'remove' that character, we can cycle through the rest of the string removing non-encoding characters and characters needing to be encoded as we come across them until we run out of string to process.

We won't actually be removing any part of the string but instead, shifting a pseudo pointer as we process/eliminate characters needing to be processed.

Match only the beginning of a string

The addition needed to be made to the latest regex is called the 'caret'. I don't know why it is called a 'caret' although my suspicion is that it has something, like Regular Expressions themselves, to do with long careers and high salaries but don't quote me on that. Personally, I prefer to call it the 'Upward Pointing Thingy', UPT for short, but that seems to lack industry acceptance.

In any event, our final regex, for the encoding side now looks like this:

/(^[a-zA-Z0-9]*)/

Reducing required processing

Since as you will notice there are some characters that PHP passes through and doesn't encode, we can add them to the regex we are using to capture character strings for which encoding is not required. These characters are the underscore, '_' and period, '.' so our final final encoding regex, this time I promise, becomes:

/(^[a-zA-Z0-9_.]*)/

What that regex means is, in human understandable terms, capture a group of consecutive characters beginning at the start of the string that are within the ranges of lower case 'a' to 'z', upper case 'A' to 'Z', a number between '0' and '9', an underscore '_' or a period '.'.
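Here is how that final expression behaves in the two situations the encoder cares about. This is a sketch with made-up sample strings.

```javascript
var safeRun = /(^[a-zA-Z0-9_.]*)/;

// The string starts with characters that need no encoding:
// the Capture Group holds the whole leading run.
var leadingRun = safeRun.exec('file_name.txt?x=1');

// The string starts with a character that does need encoding:
// the match still succeeds, but the Capture Group is an empty string.
var nothingSafe = safeRun.exec('?x=1');

console.log(leadingRun[1]);         // file_name.txt
console.log(nothingSafe[1] === ''); // true
```

The empty string in the second case is the signal that the very next character is one that has to be encoded.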

Now, all that is left is to use charCodeAt() to extract the character code value of the character to be encoded, use toString(Radix) to convert the character code to hex, use pad(2, "0", 0) to make sure that the hex value contains at least two characters, adding a leading "0" if not, slap a '%' sign on the front of the hex value and put the result back into the string.

There, that wasn't too hard. Now we can put together everything we've covered so far to come up with a way to encode any given string. But first, let's finish up the last little bit needed to cover the decoding side as well before showing the process walk-through for both.

Decoding what we have encoded

Most of what we need has already been covered, namely the conversion of hex values back into decimal, which can then be converted into characters. The only major thing needing to be covered is how to extract the %-hex strings needing conversion in the first place.

Personally, I find this portion exceptionally elegant, well, at least the most elegant of all the parts of this project, if indeed any of it even comes close to 'elegant' and doesn't instead hover dangerously close to downright ugly.

Instead of having to search for and extract many different separate characters, we can simply look for any '%' occurring in the string to be decoded and capture the two characters following it, which will be the hex value we attached the '%' onto the front of in the encoding process, and we have what we need to work with.

We will again use regex to do the dirty work for us but this time the expression will be a bit different, hopefully still interesting but different.

The decoding regex

As mentioned just previously, we need to look for the '%' character and then use the two characters following it and since the two following characters will be in hex, the possible characters we need to look for are within the ranges of 0-9, a-f and A-F. Look familiar?

But, since any '%' character we find will have been put there by the encoding process, it is safer just to grab the following two characters no matter what they are since if they are not within the range we actually need, there has been an error somewhere and it would be good to know at that point.

So, first we look for and Capture the '%' character:

/(%)/

You could think of the '/'s as indicating what one is looking for and the '(' and ')' as indicating what you want 'Captured' and returned but in any event, that will give us the '%' only.

Since we really don't care, at this point, what the two following characters are, we look for any character, which regex indicates with a period, '.'. That gives us this:

/(%.)/

In words, look for and return any occurrence of a '%' followed by any one character, except a line break. That of course gives us only one of the two characters we are after, so we need to add a little bit. We need to add a regex Quantifier to indicate the number of repetitions of the any-character we want returned, which in this case is 2 and exactly 2. We then end up with:

/(%.{2})/

But here is where the bug Phillip found comes into play...

What Phillip found was that by assuming that any two characters following a '%' are part of an encoding, the case where serial '%' characters exist, which would not be a URL encoding, is not covered, causing the function to attempt to decode characters erroneously.

To take into account the possibility of serial '%' characters, we can simply check that any '%' we find is not followed immediately by another '%' and then capture the following two characters, which will then give us:

/(%[^%]{2})/

The [^%] matches any character that is not another '%', and what is contained within the '{' and '}' indicates the number of times that match is to occur, so neither of the two characters following the first '%' may be a '%'.

On a side note, if we wanted a variable number of repetitions, we could use something like '{0,2}' to indicate that we wanted between 0 and 2 repetitions but in our case, we want 2 and exactly 2.
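A short sketch of the decoding expression at work, first on a normal encoding and then on a string with serial '%' characters. The sample strings are made up for illustration.

```javascript
var encodedChar = /(%[^%]{2})/;

// A plain encoding: the '%' and the two characters after it are captured.
var simple = encodedChar.exec('a%2Bb');

// Serial '%' characters: the first '%' is skipped because the character
// right after it is another '%', and the match starts at the second one.
var serial = encodedChar.exec('a%%2Bb');

console.log(simple[1], simple.index); // %2B 1
console.log(serial[1], serial.index); // %2B 2
```

Keep in mind that [^%] accepts any character other than '%', not just hex digits, so whether the two captured characters really form a hex value still has to be settled when they are converted.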

There, was that simple or what? Now we are ready to walk through how it all works.

The encoding walkthrough

  1. Declare initial variables including the Regular Expression we will be using.
  2. Begin a loop that will continue processing until we have run out of string to process.
    1. Execute the regular expression against the input string and capture any matches.
    2. Check the result of the regex, an array, and test that the array is not null, is of two members as it should be and that the second member is not an empty string. We do it in this order, left to right because testing the contents of the second member of a broken array will lead to an error and the testing of the length of an array that doesn't exist may do so also.
    3. Since the characters returned, if any, do not need encoding, add them to a string container initialized at the beginning for return on process completion.
    4. Increment the input string position counter by the number of characters returned.
    5. If the array returned contains no matched characters, the next character in line must be a character needing encoding.
    6. Check to see if the character needing encoding is a space and if so, add a '+' sign to the string container.
    7. If the character needing encoding is not a space, use inputString.charCodeAt(current position) to get the decimal value of the character.
    8. Use Number.toString(16) to get the hex value of the character code and add a '%' to the front of it.
    9. Add the encoded string to the string container.
    10. Increment the input string position counter by one, the length of the single character needing encoding.
  3. Repeat loop as necessary.
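The steps above can be sketched roughly as follows. This is my reading of the walkthrough rather than the finished function: the name phpStyleUrlEncode is hypothetical, the leading "0" padding comes from the 9/11/2007 update, and the toUpperCase() call is an assumption added so the hex digits match the uppercase values in the PHP column of the table. Characters with codes above 255 are not considered.

```javascript
// Hypothetical name; a sketch of the encoding walkthrough.
function phpStyleUrlEncode(input) {
  var safeRun = /(^[a-zA-Z0-9_.]*)/; // leading run of characters needing no encoding
  var output = '';                   // string container returned at the end
  var position = 0;                  // pseudo pointer into the input string

  while (position < input.length) {
    var match = safeRun.exec(input.slice(position));
    // Test left to right: array exists, has two members, second is non-empty.
    if (match != null && match.length > 1 && match[1] != '') {
      output += match[1];
      position += match[1].length;
    } else if (input.charAt(position) == ' ') {
      output += '+';                 // a space becomes '+', as in PHP
      position++;
    } else {
      // Assumption: uppercase the hex digits to match PHP's output.
      var hex = input.charCodeAt(position).toString(16).toUpperCase();
      if (hex.length < 2) {
        hex = '0' + hex;             // pad codes below h10 to two digits
      }
      output += '%' + hex;
      position++;
    }
  }
  return output;
}

console.log(phpStyleUrlEncode('a b+c')); // a+b%2Bc
```

The result for 'a b+c' matches what the table gives for PHP: the space becomes '+' and the '+' becomes '%2B'.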

The decoding walkthrough

  1. Initialize variables including the regex.
  2. Begin a loop that will continue processing as long as the regex process returns a result that is valid and non-empty.
    1. Using parseInt() on what was matched in the looping check, with a 1 character offset to ignore the '%' and a Radix parameter of 16, convert the matched hex value to a number.
    2. Use String.fromCharCode(number) to convert the result of the previous step into a character.
    3. Use String.replace() to replace all occurrences of the results of the regex match with the corresponding decoded character.
  3. Repeat loop as necessary making sure to shake and not stir.
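A sketch of the decoding side follows, with two departures from the walkthrough that should be stated plainly. First, instead of looping and replacing one match at a time, it uses a single global replace() with a callback, so text that has already been decoded (a '%' produced from '%25', for example) is never scanned a second time. Second, it turns '+' back into a space before anything else, which is an assumption mirroring the space handling on the encoding side. The name phpStyleUrlDecode is hypothetical.

```javascript
// Hypothetical name; a sketch of the decoding walkthrough using one
// global replace() with a callback instead of a match-and-replace loop.
function phpStyleUrlDecode(input) {
  // Assumption: undo the encoder's space-to-'+' step first. Real plus
  // signs are still safely encoded as %2B at this point.
  var spaced = input.replace(/\+/g, ' ');

  // The article's decoding regex, with the Capture Group moved so that
  // only the two characters after the '%' are captured.
  return spaced.replace(/%([^%]{2})/g, function (whole, hexPair) {
    var code = parseInt(hexPair, 16);          // hex string to number
    if (isNaN(code)) {
      return whole;                            // not hex at all: leave untouched
    }
    return String.fromCharCode(code);          // number to character
  });
}

console.log(phpStyleUrlDecode('a+b%2Bc')); // a b+c
```

Because parseInt() stops at the first character it cannot use, a pair such as '4z' would still be read as 4, so this sketch does not strictly validate its input.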

On to the coding

You are waiting for what, exactly?

A Tale of Two Bug Fixes

The problem that both Peter Newman of New Zealand Kiwiparty and Mikkel Hansen of greenman.dk identified was that some characters, specifically control characters not normally found in a URL, have hex character code values of less than h10.

Values less than h10 are of course single digits and so would result in only one character after the % sign. During decoding, the character following the encoded value would then be assumed to be part of the encoding, so not only would the decoding result in the wrong character but a character that should never have been part of the encoding would be lost.

One solution was sent to me by Mikkel and makes use of JSFromHell.com pad(). It is an elegant solution and makes the code look cleaner but turns out to be overkill. In this case, we don't have a variable length string of padding required and we don't need to support various character strings being used for padding, we only need to check for there being two characters and if there isn't, prepend a "0".

Also, I am not a big fan of libraries because more often than not, one will use a 20k library only to use a single function from the library. If one is going to write their own code, or use code from others, it seems more efficient to use the code that does what and only what you want instead of trying to be everything to everyone.

On the other hand, were I to need padding in numerous functions or even numerous times in a single function, I'd likely use a library type method but when the need for padding doesn't require numerous instances of the process nor variable lengths of padding or variable padding parameters, I don't really see the benefit.

Peter's solution looks very much like something I would have come up with, cheap and dirty. It fulfills the requirements exactly, does exactly what it is supposed to do and performs reliably and predictably.

Peter's solution is actually exactly what I outlined above: if the length of the hex character code is less than two, slap a "0" on the front of it.
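In code, that check might look something like this. It is a sketch of the idea only, with a hypothetical function name, not Peter's actual submission.

```javascript
// Hypothetical helper: hex value of a character code, at least two digits long.
function toPaddedHex(code) {
  var hex = code.toString(16);
  if (hex.length < 2) {
    hex = '0' + hex; // codes below h10 get a leading "0"
  }
  return hex;
}

// A tab (code 9) followed by the letter 'A':
var unpadded = '%' + (9).toString(16) + 'A'; // the '9' and the 'A' run together
var padded = '%' + toPaddedHex(9) + 'A';

console.log(unpadded); // %9A
console.log(padded);   // %09A
```

Without the padding, a decoder sees '%9A' as one encoded character and the 'A' disappears into it, which is exactly the bug described above.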

Often times one can find elegance in simplicity.

Now that the explanation of the latest update is finished, you can return to the discussion at Much ado about nothing? or if you have already read the discussion, continue on to the code explanation here.