Clean URLs for a better search engine ranking
by Pascal Opitz on February 28 2006, 04:22
Search engines are often key to the successful promotion and running of your website, with high traffic making or breaking your online
business. To maximise the visibility of your site in the organic listings of the biggest search engines it is important to strategically work
out how keywords are used.
While link building (placing links to the site or from the site) and, most importantly, writing useful content form the foundation of good search engine rankings,
some careful attention to how your site treats URLs can influence its ranking massively.
URLs
The messy ones
Most big websites are rendered out of a database, and it is very rare to find systems that generate the pages statically onto a webserver
to save processing power. Most small to mid-range CMSs render pages on the fly, and the same applies to most of the tailor-made
dynamic sites I've seen so far.
The most common ways of passing information between these dynamic pages are:
- In a cookie
- In a session
- In the request body (POST)
- In the URL as a querystring (GET)
The last one mentioned is by far the most common. It's also the only way that variables passed to an application can be bookmarked and sent by email to other people, since cookies and sessions are bound to a specific computer and browser. But let's have a look at how a URL works:
protocol://myserver/folder/file.ext?queryvariable=value#anchorname
Historically, search engines were not able to spider links with querystring parameters because of page rendering speeds and so-called
spider traps. Today, most of the big search engine spiders will follow these untidy links, doing their best to strip out the portions that
can cause them trouble. Forcing them to do this, however, makes one of the most common and easy techniques, the GET method and the use
of the GET array in various scripting languages, one of the worst coding choices when it comes to search engine rankings.
A URL like this is not ideal for most search engines:
http://myserver/folder/file.php?pageid=230
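In PHP, reading such a parameter usually boils down to something like the following minimal sketch. The parameter name and the echoed output are only illustrative, not code from any particular application:
<?php
// which page to render is picked straight out of the GET array
// ('pageid' is just an example parameter name)
$page_id = isset($_GET['pageid']) ? (int) $_GET['pageid'] : 1;
// a real application would now fetch this page from the database and render it
echo 'Rendering page ' . $page_id;
?>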
The clean ones
Therefore the first step to improving your URLs would be to move the information needed to trigger the page rendering into another part of the URL... Something similar to these:
http://myserver/folder/230/file.php
http://myserver/folder/230.php
http://myserver/GUID_whatever_230.php
The meaningful ones
But this still is not the ideal URL. Not for people who have to type it in, nor for search engine rankings, since it doesn't contain any meaningful keywords. An ideal example would look more like this:
http://myserver/this/url/is/stuffed/with/keywords/index.htm
As you can see, this is more legible than any kind of cryptic ID. It is far easier for human visitors to remember and
it is keyword rich for search engines as well.
Google pays especially close attention to the keywords within the URL, and, if they match what can be found in the content,
they can drastically improve the ranking. I suggest you try to think of a system that logically makes sense and that represents the path to your page,
similar to a breadcrumb navigation perhaps.
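One simple way to get such keyword-rich segments is to derive a slug from the page title. Here is a minimal sketch; the function name and the choice of separator are my own, so pick whatever fits your scheme:
<?php
// turn a page title into a lowercase, dash-separated URL segment
// (make_slug is a hypothetical helper, not part of any framework)
function make_slug($title)
{
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug); // collapse everything else into dashes
    return trim($slug, '-');
}

echo make_slug('Clean URLs for a better search engine ranking');
// prints: clean-urls-for-a-better-search-engine-ranking
?>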
A nice article about dirty and clean URLs can be found on the website of Port80 Software.
Technique
How to rewrite URLs
Now that we have worked out how a good URL should look we can actually rethink the way our web-application renders pages.
It's obvious that we need to point the new URLs, which now carry the information, at the same script
that previously dealt with the query string.
There are a couple of ways to do this, Apache's ForceType directive for example,
and there are equivalents for ASP and .NET, but with PHP and Apache the most common technique for rewriting URLs is the
Apache module mod_rewrite.
What is mod_rewrite?
Basically, mod_rewrite is a module for Apache that provides an engine that is able to rewrite URLs to other locations using regular expressions.
It is not activated in Apache by default though, and if you run your website on a shared hosting server you might have to ask your hosting
provider to get it up and running for you.
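If PHP runs as an Apache module you can do a quick check yourself: apache_get_modules() lists the modules that are loaded. It is not available when PHP runs as CGI or FastCGI, in which case you really do have to ask your provider. A small sketch:
<?php
// list the loaded Apache modules and look for mod_rewrite
if (function_exists('apache_get_modules')) {
    echo in_array('mod_rewrite', apache_get_modules())
        ? 'mod_rewrite is loaded'
        : 'mod_rewrite is NOT loaded';
} else {
    echo 'apache_get_modules() is not available under this SAPI';
}
?>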
To get yourself into the syntax of URL rewriting I recommend reading A Beginner's Guide to URL Rewriting
on sitepoint.com and the URL Rewriting Guide written by Ralf S. Engelschall,
the guy who wrote the module.
Rewrite rules and htaccess files
Usually you would define your rewrite rules in an htaccess file placed in the root folder of your site. I'm just giving a little example here rather than going into too much detail. Please check the comments to see what each line does.
# activate mod_rewrite
RewriteEngine On
# if the request is inside the admin folder ...
RewriteCond %{REQUEST_URI} ^/admin.* [OR]
# ... or points to an existing file ...
RewriteCond %{REQUEST_FILENAME} -f [OR]
# ... or to an existing directory ...
RewriteCond %{REQUEST_FILENAME} -d
# ... then leave the URL as it is
RewriteRule ^(.+) - [PT,L]
# everything else gets rewritten to myscript.php
RewriteRule ^(.*) myscript.php
A more detailed introduction to rewrite rules can be found on the manual pages of mod_rewrite. Even a quick look will show you that mod_rewrite offers a sophisticated toolkit for rewriting URLs, including searching for files in multiple locations and even time-dependent rewriting. Clean URLs are only one of many reasons to get familiar with mod_rewrite.
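To give just one flavour of this, the time-dependent rewriting works roughly along these lines, adapted from the kind of example shown in the rewriting guide. The file names are placeholders:
# serve a "day" version of a page between 07:00 and 19:00, a "night" version otherwise
# (foo.html, foo.day.html and foo.night.html are placeholder names)
RewriteEngine On
RewriteCond %{TIME_HOUR}%{TIME_MIN} >0700
RewriteCond %{TIME_HOUR}%{TIME_MIN} <1900
RewriteRule ^foo\.html$ foo.day.html [L]
RewriteRule ^foo\.html$ foo.night.html [L]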
Fancy an example now?
Enough of the theory. Now that we've found out how to rewrite URLs to a specific file, I want to give a quick and very simple example of how I tweaked old code quickly and efficiently using mod_rewrite and a bit of code. Afterwards my PHP application was capable of handling clean URLs instead of GET parameters... and the whole thing took me just half an hour.
The old URL
In the existing application the page rendering was triggered by the GET parameter "page_id":
http://server/index.php?page_id=100
The new URL
The pattern I worked out for a quick tweak uses the fixed prefix "page", then the page_id (which before was found in the GET parameter) and finally a modified title slug to improve the indexing.
http://server/page/100/here+are+my+keywords
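On the Apache side, a rule along the lines of the earlier htaccess example is enough to send these requests to the existing script. This is a sketch of one possible setup, not necessarily the exact rules used on this site:
RewriteEngine On
# leave existing files and directories untouched
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# hand everything under /page/ to the old index.php
RewriteRule ^page/ index.php [L]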
Three lines of code
All I needed to do was to read the page_id from the URL and assign it to the GET array. In this case I used a simple regular expression, but you could use explode or any other technique.
<?php
preg_match("/\/page\/(\d+)\//", $_SERVER['REQUEST_URI'], $match);
if (!empty($match[1])) $_GET['page_id'] = $match[1];
?>
Security
Always bear in mind that you should never trust the URL. As with all form inputs and GET parameters, you need to escape or validate variables taken out of the REQUEST_URI before you use them in your script, otherwise you're inviting script kiddies to hack your application. This is particularly important for scripts that use eval() or write values into databases, store files or do anything else that could cause serious damage.
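As a sketch of what that means in practice: force the value into the type you expect and keep it out of raw SQL. Here it is done with PDO; the connection details, table and column names are of course just examples:
<?php
// never trust what came in via the URL: cast to int before using it
$page_id = isset($_GET['page_id']) ? (int) $_GET['page_id'] : 0;

// use a prepared statement instead of pasting the value into the query
// (database credentials, table and column names are placeholders)
$pdo  = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');
$stmt = $pdo->prepare('SELECT title, body FROM pages WHERE id = ?');
$stmt->execute(array($page_id));
$page = $stmt->fetch(PDO::FETCH_ASSOC);
?>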
Conclusion
Using clean URLs improves both your site and its search engine rankings.
It's more likely that people will be able to remember certain locations within the site.
Your PageRank in Google is likely to go up and you stand a better chance of turning up in search results.
With mod_rewrite and a couple of small tweaks existing applications
can usually be coaxed into using clean URLs.
Comments
I actually demonstrated mod_rewrite during one of the Linux courses I’m delivering and the students were amazed with how easy it is to use mod_rewrite.
by Marco on October 6 2005, 10:15 #
Do you have that code for ASP.Net too?
by Henrik on October 6 2005, 10:31 #
I’ve personally had quite a bit of luck using a different, PHP based, method too. Another Sitepoint plug in fact – using the Pathvars class from The PHP Anthology book.
I had a rambling stab (www.morethanseven.net/weblog/40) at writing some of it down, but mine was, let’s say, less clear and concise, and more, err, error prone than this article.
by Gareth on October 6 2005, 12:59 #
by Pascal Opitz on October 6 2005, 17:04 #
Keywords in URLs are given a LOT of weight by the current Google algorithm so this is a great technique for improving organic search engine listings… You just need to look at the URLs of the top 10 sites next time you try to buy some consumer electronics online to see proof of this.
by Mike Stenhouse on October 7 2005, 06:01 #
A while back I wrote something on the same subject, but with a different approach.
by Valentin Agachi on October 10 2005, 13:39 #
Just to give you a quick example: imagine a CMS that, in order to save server load, could export its file output. You could actually run both static files and dynamic output seamlessly together by using a condition like the one sketched below.
All requests that don’t point to an existing file get rewritten to the CMS instead of throwing a 404.
This obviously goes far beyond the possibilities of ForceType and opens up completely new approaches to how an application can be structured.
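A sketch of that kind of condition, with myscript.php standing in for whatever renders the CMS output:
RewriteEngine On
# only requests that don't match an existing file or directory reach the CMS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ myscript.php [L]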
by Pascal Opitz on October 11 2005, 04:31 #
Just one point: does the `id` in the URL add any value? I think not, so if the slug is a field in the database, why use it to select the row (article)?
By the way, this very page is accessible through more than one URL.
by choan on October 11 2005, 11:07 #
The keywords in the URL are actually only there for SEO, nothing else. For the application logic they don’t matter at all.
Actually the article will be reachable with the GET parameter as well, in order to keep the old links to the site working.
So try the old-style URL and you’ll get here as well, which isn’t a bad thing, is it?
by Pascal Opitz on October 11 2005, 16:05 #
All it takes is someone to link to you with the “wrong” URL… unless I’m missing something? Using robots.txt to exclude old URL patterns perhaps?
by Mark on October 14 2005, 03:28 #
by Pascal Opitz on October 14 2005, 08:03 #
I’m not sure that they are black-and-white strict in that sense, and in the case of URL rewriting, where’s the benefit for yourself in setting up this duplicate? Your clicks and links get divided between two identical pages, and so would your search engine ranking. That’s penalty enough, no?
In this extensive search engine ranking doc duplication is mentioned a couple of times, in our case with moderate importance. But think about what else dilutes your “unique” page: leaving out the “www.” gives you a whole duplicate set of your website (Pascal just fixed that in the RSS), and session keys offer millions of duplicates.
This very obvious problem makes me think that a penalty is only given when the duplication is a more severe problem.
I believe that when it comes to a decision like this, the user should come before the search engine. Let them have two URLs, or as many as they’d like to have. What’s left to do on our side is to redirect them to the desired URL.
by Matthias on October 17 2005, 09:03 #
The www.domain.com vs domain.com issue is also a good point, but a) it’s easy to redirect one to the other and b) search engines would have an easier time catering for this example of duplication than, say, a URL like /events/id/32 vs /events.asp?id=32 vs /events/my_event_title.html.
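For the record, that redirect in a) is usually only a couple of lines of mod_rewrite. A sketch, with example.com as a placeholder and a 301 so the engines treat the move as permanent:
RewriteEngine On
# send requests for the bare domain to the www. version (example.com is a placeholder)
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]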
I guess the upshot is: tough luck, unless you’re going to robots.txt the URL formats you don’t want spiders to pick up on – which, as I understand it, means listing every URL (including parameters) in robots.txt.
by Mark on October 17 2005, 11:23 #
by Pascal Opitz on October 18 2005, 09:57 #
by Son Nguyen on October 31 2005, 22:30 #
A module as common as mod_rewrite should be available at most hosting companies, and if you move to servers where you can do the configuration yourself it’s even easier to set up.
The effort to support accessible and legible URL schemes rather than cryptic querystrings is a minor price to pay, I reckon.
Another thing to say is that “should be able” implies that they aren’t. Just like “all browsers should be standards compliant”, it has nothing to do with solving the actual problem.
by Pascal Opitz on November 1 2005, 05:06 #
by Keyword Rankings on November 14 2005, 15:08 #
by kkaa on November 28 2005, 14:28 #
by Marko on December 27 2005, 18:59 #
by Pascal Opitz on December 29 2005, 18:49 #
In fact, look at this page that shows blogs and their number of google pages indexed. http://blognetworklist.com/bgooglepages.php
Why so many zeros? Did blog network blogs just get penalized?
by Dave on January 16 2006, 01:11 #
by Doug on November 28 2006, 07:45 #
Mod rewrites on Apache servers are CPU intensive because you are pattern matching against every single file and directory request, including .gif, .jpg, .html, .php and so on, and if you have LARGE volumes of traffic you will eventually run into performance problems because of all this parsing.
Word to the wise: KISS. Keep your .htaccess files as simple and short as possible. Less is better!
Know that if your website is running on a shared hosting server, your service provider may eventually tell you that you have to get off the shared server and go dedicated, because you are taking up too much CPU and slowing down the HTML publishing of the 100 or more shared-hosting websites also hosted on your server.
But in terms of “Bang for the Buck”, in the spirit of better SEO and Google PageRank – Apache mod_rewrite and CLEAN, HUMAN-readable, keyword-rich URLs are worth it!
Looking forward to your next article!
JTMcNaught
by JTMcNaught on May 17 2007, 11:28 #
by Pascal Opitz on January 16 2008, 12:40 #
by Tercüme bürosu on January 16 2008, 04:11 #