Original article:
http://www.skorks.com/2010/05/what-every-developer-should-know-about-urls/
I have recently written about the value of fundamentals in software development. I am still firmly of the opinion that you need to have your fundamentals down solid, if you want to be a decent developer. However, several people made a valid point in response to that post, in that it is often difficult to know what the fundamentals actually are (be they macro or micro level). So, I thought it would be a good idea to do an ongoing series of posts on some of the things that I consider to be fundamental – this post is the first instalment.
Being a developer this day and age, it would be almost impossible for you to avoid doing some kind of web-related work at some point in your career. That means you will inevitably have to deal with URLs at one time or another. We all know what URLs are about, but there is a difference between knowing URLs like a user and knowing them like a developer should know them.
As a web developer you really have no excuse for not knowing everything there is to know about URLs; there is just not that much to them. But I have found that even experienced developers often have some glaring holes in their knowledge of URLs. So, I thought I would do a quick tour of everything that every developer should know about URLs. Strap yourself in – this won't take long :).
This is easy, starts with HTTP and ends with .com right :)? Most URLs have the same general syntax, made up of the following nine parts:
<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>
Most URLs won't contain all of the parts. The most common components, as you undoubtedly know, are the scheme, host and path. Let's have a look at each of these in turn:
ftp://some_user@blah.com/
ftp://some_user:some_path@blah.com/
If you don't supply the username and password and the URL you're trying to access requires one, the application you're using (e.g. browser) will supply some defaults.
http://www.blah.com/some;param1=foo/crazy;param2=bar/path.html
The URL above is perfectly valid, although this ability of path segments to hold parameters is almost never used (I've never seen it personally).
http://www.blah.com/some/crazy/path.html;param1=foo;param2=bar
As I said, they are not very common.
http://www.blah.com/some/crazy/path.html?param1=foo&param2=bar
http://www.blah.com/some/crazy/path.html?param1=foo;param2=bar
That's it, all you need to know about the structure of a URL. From now on you no longer have any excuse for calling the fragment – "that hash link thingy to go to a particular part of the html file".
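To see the nine parts in practice, here is a minimal sketch that pulls a URL apart with Python's standard urllib.parse module. The example URL, the some_pass password, the port and the url_parts helper are all made up for illustration; they are not from any real site. Note that urlparse only splits parameters off the last path segment, so parameters attached to earlier segments stay inside the path.

```python
from urllib.parse import urlparse


def url_parts(url: str) -> dict:
    """Return the nine URL components described above (hypothetical helper)."""
    p = urlparse(url)
    return {
        "scheme": p.scheme,
        "username": p.username,
        "password": p.password,
        "host": p.hostname,
        "port": p.port,          # int, or None when the URL has no port
        "path": p.path,
        "parameters": p.params,  # only taken from the last path segment
        "query": p.query,
        "fragment": p.fragment,
    }


# Made-up URL that happens to use every component.
example = (
    "http://some_user:some_pass@www.blah.com:8080"
    "/some/crazy/path.html;param1=foo?param2=bar#section1"
)

for name, value in url_parts(example).items():
    print(f"{name:>10}: {value}")
```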
There is a lot of confusion regarding which characters are safe to use in a URL and which are not, as well as how a URL should be properly encoded. Developers often try to infer this stuff from general knowledge (i.e. the / and : characters should obviously be encoded since they have special meaning in a URL). This is not necessary, you should know this stuff solid – it's simple. Here is the low down.
There are several sets of characters you need to be aware of when it comes to URLs. Firstly, the characters that have special meaning within a URL are known as reserved characters, these are:
";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
What this means is that these characters are normally used in a URL as-is and are meaningful within a URL context (i.e. separate components from each other etc.). If a part of a URL (such as a query parameter), is likely to contain one of these characters, it should be escaped before being included in the URL. I have spoken about URL encoding before, check it out, we will revisit it shortly.
The second set of characters to be aware of is the unreserved set. It is made up of the following characters
"-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
The characters can be included as-is in any part of the URL (note that they may not be allowed as part of a particular component of a URL). This basically means you don't need to encode/escape these characters when including them as part of a URL. You CAN escape them without changing the semantics of a URL, but it is not recommended.
The third set to be aware of is the 'unwise' set, i.e. it is unwise to use these characters as part of a URL. It is made up of the following characters
"{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
These characters are considered unwise to use in a URL because gateways are known to sometimes modify such characters, or they are used as delimiters. That doesn't mean that these characters will always be modified by a gateway, but it can happen. So, if you include these as part of a URL without escaping them, you do this at your own risk. What it really means is you should always escape these characters if a part of your URL (i.e. like a query param) is likely to contain them.
The last set of characters is the excluded set. It is made up of all ASCII control characters, the space character, as well as the following characters (known as delimiters)
"<" | ">" | "#" | "%" | '"'
The control characters are non-printable US-ASCII characters (i.e. hexadecimal 00-1F as well as 7F). These characters must always be escaped if they are included in a component of a URL. Some, such as # (hash) and % (percent), have special meaning within the context of a URL (they can really be considered equivalent to the reserved characters). Other characters in this set have no printable representation and therefore escaping them is the only way to represent them. The <, > and " characters should be escaped since these characters are often used to delimit URLs in text.
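The four sets above can be written down as a small lookup. This is only a sketch: the classify function and the set names are my own, and the "alphanumeric" and "other" labels are added for characters that fall outside the four sets listed in this post.

```python
import string

# Character sets exactly as listed above.
RESERVED = set(";/?:@&=+$,")
UNRESERVED_MARKS = set("-_.!~*'()")
UNWISE = set("{}|\\^[]`")
DELIMS = set('<>#%"')
ALPHANUMERIC = set(string.ascii_letters + string.digits)


def classify(ch: str) -> str:
    """Name the set a single character belongs to (hypothetical helper)."""
    if ch in ALPHANUMERIC:
        return "alphanumeric"
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED_MARKS:
        return "unreserved"
    if ch in UNWISE:
        return "unwise"
    # Excluded: delimiters, space, and control characters 00-1F and 7F.
    if ch in DELIMS or ch == " " or ord(ch) <= 0x1F or ord(ch) == 0x7F:
        return "excluded"
    return "other"  # e.g. non-ASCII, not covered by the sets above


for ch in "?~{ #a":
    print(repr(ch), "->", classify(ch))
```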
To URL encode/escape a character we simply append its 2 character ASCII hexadecimal value to the % character. So, the URL encoding of a space character is %20 – we have all seen that one. The % character itself is encoded as %25.
That's all you need to know about various special characters in URLs. Of course aside from those characters, alpha-numerics are allowed and don't need to be encoded :).
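The escaping rule can be sketched in a couple of lines. percent_encode_char spells out the rule by hand for a single ASCII character, while encode_component leans on the standard library's urllib.parse.quote; both function names are made up for this example. Passing safe="" makes quote escape "/" as well, which is what you want for a single component. Be aware that quote follows a newer specification than the one this post is based on, so it also escapes a few of the unreserved marks listed above, such as ! and *.

```python
from urllib.parse import quote


def percent_encode_char(ch: str) -> str:
    """Escape one ASCII character as '%' plus its two-digit hex value."""
    return f"%{ord(ch):02X}"


def encode_component(value: str) -> str:
    """Escape a single URL component (hypothetical helper)."""
    # safe="" means even "/" is escaped, since it has no special
    # meaning inside one component such as a query parameter value.
    return quote(value, safe="")


print(percent_encode_char(" "))         # %20
print(percent_encode_char("%"))         # %25
print(encode_component("foo&bar=baz"))  # foo%26bar%3Dbaz
```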
A few things you have to remember. A URL should always be in its encoded form. The only time you should decode parts of the URL is when you're pulling the URL apart (for whatever reason). Each part of the URL must be encoded separately. This should be pretty obvious: you don't want to try encoding an already constructed URL, since there is no way to distinguish when reserved characters are used for their reserved purpose (they shouldn't be encoded) and when they are part of a URL component (which means they should be encoded). Lastly, you should never try to double encode/decode a URL. Consider that if you encode a URL once but try to decode it twice, and one of the URL components contains the % character, you can destroy your URL, e.g.:
http://blah.com/yadda.html?param1=abc%613
When encoded it will look like this:
http://blah.com/yadda.html?param1=abc%25613
If you try to decode it twice you will get:
http://blah.com/yadda.html?param1=abc%613 (after the first decode: correct)
http://blah.com/yadda.html?param1=abca3 (after the second decode: stuffed)
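The same mishap is easy to reproduce. The sketch below uses Python's urllib.parse; the variable names are mine, and the URL is the one from the example above. It also shows the "encode each part separately" rule: only the parameter value goes through the encoder, not the finished URL.

```python
from urllib.parse import quote, unquote, urlencode

raw_value = "abc%613"  # the real parameter value, containing a literal %

# Encode the component on its own, then assemble the URL.
encoded_value = quote(raw_value, safe="")
url = "http://blah.com/yadda.html?" + urlencode({"param1": raw_value})

decoded_once = unquote(encoded_value)  # back to the original value
decoded_twice = unquote(decoded_once)  # "%61" is now misread as "a"

print(url)            # http://blah.com/yadda.html?param1=abc%25613
print(decoded_once)   # abc%613  (correct)
print(decoded_twice)  # abca3    (stuffed)
```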
By the way I am not just pulling this stuff out of thin air. It is all defined in RFC 2396, you can go and check it out if you like, although it is by no means the most entertaining thing you can read, I'd like to hope my post is somewhat less dry :).
The last thing that every developer should know is the difference between an absolute and relative URL as well as how to turn a relative URL into its absolute form.
The first part of that is pretty easy, if a URL contains a scheme (such as http), then it can be considered an absolute URL. Relative URLs are a little bit more complicated.
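The "does it have a scheme" test is a one-liner if you let a parser find the scheme for you. A minimal sketch, using the standard library's urlsplit and a made-up is_absolute helper:

```python
from urllib.parse import urlsplit


def is_absolute(url: str) -> bool:
    """Treat a URL as absolute when it carries a scheme (hypothetical helper)."""
    return bool(urlsplit(url).scheme)


for candidate in ("http://www.blah.com/yadda1", "/rel1", "../rel1", "rel1"):
    print(candidate, "->", is_absolute(candidate))
```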
A relative URL is always interpreted relative to another URL (hence the name :)), this other URL is known as the base URL. To convert a relative URL into its absolute form we firstly need to figure out the base URL, and then, depending on the syntax of our relative URL we combine it with the base to form its absolute form.
We normally see a relative URL inside an html document. In this case there are two ways to find out what the base is.
Once we have a base URL, we can try and turn our relative URL into an absolute one. First, we need to try and break our relative URL into components (i.e. scheme, authority (host, port), path, query string, fragment). Once this is done, there are several special cases to be aware of, all of which mean that our relative URL wasn't really relative.
If none of those special cases occurred then we have a real relative URL on our hands. Now we need to proceed as follows.
At this point we simply append any query string or fragment that our relative URL may have contained to our URL using appropriate separators and we have finished turning our relative URL into an absolute one.
Here are some examples of applying the above algorithm:
1) base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
   relative: rel1
   final absolute: http://www.blah.com/yadda1/yadda2/rel1
2) base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
   relative: /rel1
   final absolute: http://www.blah.com/rel1
3) base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
   relative: ../rel1
   final absolute: http://www.blah.com/yadda1/rel1
4) base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
   relative: ./rel1?param2=baz#bar2
   final absolute: http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2
5) base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
   relative: ..
   final absolute: http://www.blah.com/yadda1/
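If you would rather not do the resolution by hand, the five examples above can be checked with urljoin from Python's standard library. This is a sketch rather than a reimplementation of the algorithm: urljoin does the work, and the BASE, EXAMPLES and resolve names are mine.

```python
from urllib.parse import urljoin

BASE = "http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar"

# (relative URL, expected absolute URL) pairs from the examples above.
EXAMPLES = [
    ("rel1", "http://www.blah.com/yadda1/yadda2/rel1"),
    ("/rel1", "http://www.blah.com/rel1"),
    ("../rel1", "http://www.blah.com/yadda1/rel1"),
    ("./rel1?param2=baz#bar2",
     "http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2"),
    ("..", "http://www.blah.com/yadda1/"),
]


def resolve(relative: str, base: str = BASE) -> str:
    """Turn a relative URL into its absolute form against a base URL."""
    return urljoin(base, relative)


for relative, expected in EXAMPLES:
    print(f"{relative!r:28} -> {resolve(relative)}")
```

Notice that the query string and fragment of the base never survive into the result; only the ones on the relative URL do.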
Now you should be able to confidently turn any relative URL into an absolute one, as well as know when to use the different forms of relative URL and what the implications will be. For me this has come in handy time and time again in my web development endeavours.
There you go, that's really all there is to know about URLs. It's all relatively simple (forgive the pun :)), so no excuse for being unsure about some of this stuff next time. Talking about next time, one of the most common things you need to do when it comes to URLs is recognise if a piece of text is in fact a URL, so next time I will show you how to do this using regular expressions (as well as show you how to pull URLs out of text). It should be pretty easy to construct a decent regex now that we've got the structure and special characters down. Stay tuned.