Canonicalization
Canonicalization is the process that determines how various equivalent forms of a name are resolved to a single standard name. The single standard name is also known as the canonical name. For example, most people would consider these the same URLs: www.example.com; https://example.com/; https://www.example.com/; and https://www.example.com/index.html — but technically, all of these URLs are different, so a web server could return completely different content for each of these page URLs.
When a search engine "canonicalizes" a URL, it tries to pick the one that seems like the best representative from that set. Some people also believe that certain algorithms penalize these multiple homepage URLs as duplicate content.
Another problem is that some ranking algorithms, specifically Google's PageRank, end up spreading the PageRank across all URL versions of a homepage. So when your site should actually enjoy a PageRank of five, it ends up ranked as a PR 3 because the ranking points are split among multiple URLs.
Further contributing to a lower PageRank is the fact that people link to your site in different ways. Some use the 'www,' some don't. Some stop at the '.com,' while others use the full '.com/index.html.' This means that each of these URLs receives its own inbound link credits, rather than all of that link juice pointing to one single URL.
Many older sites built their homepage as an 'index.htm.' Then a couple of years later, when they updated and redesigned the site, they changed over to 'index.html.' A few years after that, they redesigned again using a content management system (CMS), and now their homepage is named 'index.php.' If the old versions were never removed, there are now up to three different homepage file types (htm, html and php) on the server, and each of those can have six or more URL variations. You may even be displaying different homepages to different people if you've forgotten to take down the old pages!
How do you know if your homepage is resolving to multiple URLs? Open a browser, go to your homepage and try each variation of the URL: with and without the 'http' or 'https' prefix, with and without the 'www,' and with and without the 'index.htm' (or whichever extension you're using). If the page opens each time and the address bar shows the URL exactly as you typed it, rather than switching to one single URL every time, then you need to fix this problem.
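The manual check above can also be scripted. The following is a minimal sketch, not a finished tool: the function names are made up for illustration, and it assumes the homepage file is called index.html unless you pass another name. It builds the list of URL variations and, if you call the network-dependent part, reports whether they all end up at one final URL.

```python
from itertools import product
from urllib.request import urlopen


def homepage_variants(domain, index_file="index.html"):
    """Build the homepage URL variations for a bare domain such as 'example.com'."""
    schemes = ("http://", "https://")
    hosts = (domain, "www." + domain)
    paths = ("/", "/" + index_file)
    return [s + h + p for s, h, p in product(schemes, hosts, paths)]


def final_url(url, timeout=10):
    """Follow any redirects and return the URL the server finally answers from."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def resolves_to_one_url(domain, index_file="index.html"):
    """True when every variation lands on the same final URL (needs network access)."""
    finals = {final_url(u) for u in homepage_variants(domain, index_file)}
    return len(finals) == 1
```

Calling homepage_variants("example.com") gives eight URLs to try. If resolves_to_one_url returns False for your own domain, the variations are not being consolidated.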
How do you fix this? By setting up proper, permanent 301 redirects in your .htaccess file:
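RewriteEngine On
# Use a local path here: a full URL makes the server redirect, so the visitor never receives the 404 status
ErrorDocument 404 /error.html
RewriteBase /
# Send any request whose host is not www.yourwebsite.com to the canonical host
RewriteCond %{HTTP_HOST} !^www\.yourwebsite\.com$
RewriteRule ^(.*)$ https://www.yourwebsite.com/$1 [R=301,L]
# Send direct requests for index.html (in any folder) to the folder URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.html\ HTTP/
RewriteRule index\.html$ https://www.yourwebsite.com/%1 [R=301,L]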
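In our example, the first 'condition' catches every request whose host name is not exactly 'www.yourwebsite.com,' such as 'yourwebsite.com' or 'https://yourwebsite.com,' and redirects it to the URL you specify; in this case, 'https://www.yourwebsite.com.' The second 'condition' catches direct requests for 'index.html' and redirects them to the same address without the file name. Note that these rules only look at the host name and the file name, so a plain 'http://www.yourwebsite.com' request is not switched to 'https' by them.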
Be sure to change "yourwebsite.com" to your own domain name.
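To make the effect of the two redirect rules concrete, here is a rough Python model of them. It is only an illustration of the outcome, not how the server works internally. The function names are invented for this sketch, it uses the placeholder domain from the example, and it ignores query strings.

```python
from urllib.parse import urlsplit

CANONICAL_HOST = "www.yourwebsite.com"  # placeholder domain from the example


def apply_redirect_rules(url):
    """Return where one pass of the two rules would send an absolute URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    # Rule 1: any host other than the canonical one goes to the canonical host
    if host != CANONICAL_HOST:
        return "https://" + CANONICAL_HOST + path
    # Rule 2: a request ending in index.html goes to the folder URL
    if path.endswith("/index.html"):
        return "https://" + CANONICAL_HOST + path[: -len("index.html")]
    # Neither rule applies, so the URL is served as requested
    return parts.scheme + "://" + host + path


def canonical_url(url):
    """Follow the modeled redirects until the URL stops changing."""
    while True:
        next_url = apply_redirect_rules(url)
        if next_url == url:
            return url
        url = next_url
```

For instance, canonical_url("https://yourwebsite.com/index.html") passes through both rules and settles on the single 'www' homepage URL.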
Check Your Links
Dead links are no good for customers, potential customers or search engine spiders. On almost every site I review I find that 10 percent or more of the internal links are dead.
It is so easy to accidentally mistype a URL in a link. You must check your links on a regular basis! If you have a small site with relatively few links then you can test this manually. If you have a large site then you can use automated link checkers found online or purchase software to check your links.
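If you would rather script a basic check yourself, the sketch below shows the general idea using only Python's standard library. It is a simplified illustration, not a replacement for a real link checker: the names are hypothetical, it looks only at ordinary web links in anchor tags, and it treats any fetch failure as a dead link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute http(s) URLs for the links found in a page's HTML."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.links]
    # Skip mailto:, javascript: and other non-web links
    return [u for u in absolute if u.startswith(("http://", "https://"))]


def is_dead(url, timeout=10):
    """True if the URL cannot be fetched or answers with an HTTP error (needs network)."""
    try:
        with urlopen(url, timeout=timeout):
            return False
    except OSError:  # covers HTTP errors, unreachable hosts and timeouts
        return True
```

A typical run would fetch each page of your site, pass its HTML to extract_links and then report every URL for which is_dead returns True.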
Orphan Pages
An orphan page is one that has absolutely no site navigation on it. I see this many times on sites that offer a larger view of a photo or a snippet of a video. They pop these open in a new page and the picture or video is the only object on the page.
If folks don't realize you've popped them onto a new page, they will try the back button to return to the previous page, which of course will not work, and then they're lost. Or people send links to those picture or video snippet pages to their buddies, and when their friends get there and like what they see, they don't know how to find your homepage to get more of your content.
This isn't good at all, so make sure to have your site navigation links on ALL web pages. At the very least, add a link back to the page they clicked away from and/or your sitemap.
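One rough way to spot such pages is to look for HTML files that contain no links at all. The sketch below assumes that "no navigation" can be approximated as "no anchor tag with an href," which is a simplification, and the function name is made up for illustration.

```python
from html.parser import HTMLParser


class _LinkCounter(HTMLParser):
    """Count the <a> tags that actually carry an href."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.count += 1


def is_orphan_page(html):
    """True when the page offers the visitor no link to click away on."""
    counter = _LinkCounter()
    counter.feed(html)
    return counter.count == 0
```

Running is_orphan_page over the HTML of your photo and video pages gives a quick list of candidates to review by hand.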
Custom Error Pages
Bad links are everywhere. Some point to old pages that are no longer on your server, and mistyped links occur constantly. This is why you must have custom error pages. They don't have to be anything fancy; often, the simpler they are the better.
Make sure your site navigation is included on the error page, and add a simple "Sorry, we can't find that page" message, along with a little guidance such as "click here to go to our homepage or here to go to our sitemap to find what you're looking for."
In our sample .htaccess code, you'll notice that the ErrorDocument line tells the server to send visitors who hit a missing page to your custom error page.
Subfolder Index Pages
You probably use subfolders for images, scripts, ads and many other files, but you may not have an index file in each of those folders. This means that anyone can type in the base URL of a subfolder such as 'www.yoursite.com/images/' and the server will look for an index page in that folder. If no index file is there, many servers will instead put up an automatically generated page with all of the folder's contents clearly listed.
If you want to keep prying eyes out of those folders, an index page is needed. You can make a simple "you don't belong here" type of page or even redirect that index to another page of your choosing, such as your homepage.
In our next installment, we’ll cover maximizing universal and image search results; setting up a robots.txt file; preventing someone from sending spam that appears to be from your domain; using the primary sitemap files; Google, Yahoo and MSN webmaster tools that can improve your site’s search engine performance; and analyzing your traffic sources.