The concept of ‘duplicate content’ is one of the most misunderstood topics in the SEO community. There is a lot of confusion surrounding what duplicate content actually is, how Google handles duplicate content, and how harmful duplicate content can be for a website.
There are a lot of myths floating around about duplicate content and duplicate content penalties. In reality, the vast majority of cases of duplicate content are innocuous mistakes made by webmasters who have multiple pages with the same (or very similar) content. Unintentional duplicate content is not malicious or deceitful in any way; it will not result in a Google penalty.
What is Duplicate Content?
According to Google, duplicate content refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Put simply, duplicate content is content that appears on the internet in more than one place, and it usually occurs within the same domain.
What is Duplicate Content in SEO?
Google is committed to delivering rich, unique, relevant and informative content in their search results. Over the years, the quality of search results has increased dramatically as Google’s algorithm evolved to become more sophisticated. Google is well aware of content that is duplicated on more than one page and will often attempt to filter pages out to show the correct content.
Content that is repeated on several pages can be confusing for Google. Unless signals are put in place to direct the search engine to the master page, it has a difficult time working out which page to rank. In terms of SEO, Google will not impose a penalty for duplicate content alone. However, Google will filter identical content, which can cause a loss of rankings for web pages.
Is Duplicate Content Damaging for SEO?
Yes, but not always through Google penalties or manual actions. The vast majority of cases of duplicate content are created unintentionally by webmasters.
While technically not a penalty, duplicate content can still have an impact on search engine rankings. When there is more than one piece of content that is very similar, search engines have a difficult time deciding which version is best.
Duplicate content is a nuisance for search engines because it confuses them; they don’t know which version to include in or exclude from their index. The search engines don’t know whether to direct the link metrics to one page or keep them separated between multiple versions. And finally, the search engines don’t know which version to rank for relevant search queries.
When duplicate content occurs, webmasters may suffer rankings and traffic losses. Two main issues usually cause these losses:
- Search engines aim to provide the best search experience for users. For this reason, they will very rarely show multiple versions of content that are very similar. Search engines have to choose which version is most likely to be the best result, which dilutes the visibility of each duplicate.
- When two pages have the same content, link equity is diluted because other sites have to choose which page to link to. Instead of all links pointing to one page, they are spread across the variations of the duplicate content. Given that inbound links are a known ranking factor, this can further reduce the search visibility of the content.
Again, to summarise:
- Duplicate content itself is not grounds for a site penalty or manual action
- Google chooses to consolidate matching content and only show one version – this is because they know that users want authenticity, diversity and unique content
- Duplicate content is not enough for manual action unless it is intended to manipulate search results
- The worst-case scenario from Google filtering content is that the less desirable page is the one shown in search results
How to Identify Duplicate Content
To find duplicate content, it’s crucial to understand how duplicate content issues occur. Most of the time, website owners and publishers aren’t purposely creating duplicate content. Even so, duplicate content is said to make up almost 30% of the web.
Duplicate Content on Same Domain
Here are some common ways that duplicate content unintentionally occurs on the same domain:
1. HTTP vs. HTTPS & www vs. non-www pages
If a website has an SSL certificate installed, there will be two versions of the website: https:// and http://. Most sites will also have ‘www’ and non-www variations that are live and visible to search engines.
This means there are potentially four separate versions of each page competing for the same search visibility and rankings: the http:// and https:// versions, each with and without the ‘www’ prefix.
If all of these pages are separate, live, and are providing identical content, the website will run into issues with duplicate content.
The same applies to the trailing slash: if pages use a trailing slash, the versions without it will automatically exist as separate pages, and vice versa. For example, the same page address can be reached both with and without a final slash.
Both URLs may look as though they are the same page and have identical content; however, without a 301 redirect pointing from the less desired page to the preferred page, Google will treat these as two separate pages. The trailing slash therefore creates a duplicate content problem.
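The protocol, ‘www’ and trailing-slash variations described above can be collapsed in a few lines of Python. This is only a minimal sketch: the policy it encodes (https, no ‘www’, trailing slash) is an assumption chosen for illustration, `preferred_url` is a hypothetical helper name, and example.com stands in for a real domain.

```python
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url: str) -> str:
    """Map any protocol / www / trailing-slash variant to one preferred form.

    Assumed policy for this sketch: https, no 'www', trailing slash.
    File-like paths (for example /guide.pdf) are not treated specially.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if not path.endswith("/"):
        path += "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(("https", host, path, parts.query, ""))


variants = [
    "http://example.com/page",
    "http://www.example.com/page/",
    "https://example.com/page/",
    "https://www.example.com/page",
]

# All four variants collapse to a single preferred address.
unique_targets = {preferred_url(u) for u in variants}
print(unique_targets)
```

Running a list of crawled URLs through a function like this is one way to see how many live variations point at the same page.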
2. URL Variations
URL parameters, such as click tracking and analytics code, can cause URL variations that may lead to duplicate content issues.
URL parameters are added parts of a URL, often visible after a question mark (?), ampersand (&), or equals (=) sign. Commonly found on eCommerce sites, URL parameters are used to serve content based on specific searches or product groups. While they are helpful to users trying to narrow down to specific results, they have the potential to create multiple identical pages and cause duplicate content issues.
Some content management systems rely on session IDs in the URL. A session is a brief history of what a visitor has done on a website, and a session ID is a way to maintain that session for the next time the visitor comes back. A good example is an eCommerce shopping cart: when a user adds items to the cart, the items are stored in a session, so the user can come back and make the purchase at a later date with the items still in the cart.
When session IDs are stored in URLs, every internal link on the website gets that session ID added to its URL. Because each session ID is unique to the session, it creates a new URL (with the same content on the page) and can therefore cause duplicate content.
Another example of a URL variation for the same page is a printer-friendly page. Printer-friendly versions of content that are also indexed can cause duplicate content problems.
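To illustrate how tracking parameters and session IDs multiply URLs, the sketch below strips them out and groups a list of crawled URLs by what remains. The parameter names (`sessionid`, `sid`, `gclid`, the `utm_` prefix) are assumptions; substitute the ones your own site and analytics setup actually use. The function names are hypothetical as well.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of parameters that do not change the page content.
NOISE_PARAMS = {"sessionid", "sid", "gclid"}
NOISE_PREFIXES = ("utm_",)


def strip_noise(url: str) -> str:
    """Remove tracking and session parameters, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
        and not key.lower().startswith(NOISE_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Return only the cleaned URLs that more than one crawled URL maps to."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_noise(url)].append(url)
    return {page: found for page, found in groups.items() if len(found) > 1}


crawled = [
    "https://example.com/shoes?colour=red",
    "https://example.com/shoes?colour=red&sessionid=abc123",
    "https://example.com/shoes?utm_source=newsletter&colour=red",
    "https://example.com/shoes?colour=blue",
]

duplicates = group_duplicates(crawled)
for page, found in duplicates.items():
    print(page, "is reachable through", len(found), "URLs")
```

Here the three "red" URLs serve the same page, while the "blue" URL is left alone because its remaining parameter genuinely changes the content.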
Duplicate Content vs. Copied Content
The other side of the duplicate content coin is copied content. Copied content is different from duplicate content: it refers to a situation in which someone has deliberately taken material from the original source and repurposed it as their own. Anyone who copies and uses someone else’s content without authorisation is at risk of violating both Google’s guidelines and copyright law.
If Google classifies your duplicate content as thin content, spun content, or copied content, you will be in breach of Google’s guidelines. The types of copied content described in the 2017 Google Search Quality Evaluator Guidelines include:
- Content copied precisely from an identifiable source. Sometimes an entire page is copied, and sometimes just parts of the page are copied. Sometimes multiple pages are copied and then pasted together into a single page. Text that has been copied word for word is the easiest type of copied content to identify.
- Content that is copied, but modified somewhat from the original. This type of copying makes it challenging to find the exact original source. Sometimes just a few words are substituted, or whole sentences are altered, or a “find and replace” modification is made, where one word is replaced with another throughout the text. These types of adjustments are intentionally done to make it tough to detect the original source of the content. We call this kind of content “copied with minimal alteration.”
- Content copied from a changing source, such as a search results page or news feed. You often will not be able to find an exact matching original source if it is a copy of “dynamic” content (content that changes regularly). However, we will still consider this to be copied content.
Semantics aside, duplicate content is treated differently to copied content. The difference between duplicated and copied content is that the intent and nature of duplicated text is not malicious; copied content is the act of stealing content from another source and trying to pass it as your own. Where duplicated content may result in loss of search visibility, copied content can be penalised algorithmically or manually by Google.
Identifying Duplicate Content
Your website can contain duplicate content issues without you realising. One of the easiest ways to spot duplicate content is to use Google to your advantage, as there are several search operators that help. If you want to see all of the URLs on your site that contain “Keyword X” in the title, you can use:
site:example.com intitle:"Keyword X"
Google will then list all of the pages on your domain with that keyword in the title. You can use the same approach to find duplicate content across the entire web. If the title of the suspected duplicated article is “Keyword X – What You Should Know”, you can search:
intitle:"Keyword X – What You Should Know"
Google will provide all the sites that match that title, including any websites that have scraped or copied content from your article.
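If you need to run these checks for many titles, the operator strings can be generated rather than typed by hand. The sketch below only builds the query text and a search URL to open in a browser; `title_query` and `search_url` are hypothetical helper names. Note that the operators use straight quotes and no space after the colon.

```python
from typing import Optional
from urllib.parse import quote_plus


def title_query(title: str, site: Optional[str] = None) -> str:
    """Build an intitle: query, optionally restricted to one site."""
    query = f'intitle:"{title}"'
    if site:
        query = f"site:{site} {query}"
    return query


def search_url(query: str) -> str:
    """Turn the query into a Google search URL that can be opened in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


on_site = title_query("Keyword X", site="example.com")
whole_web = title_query("Keyword X – What You Should Know")

print(on_site)
print(search_url(on_site))
print(whole_web)
```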
It is also possible to utilise an automated tool that checks your webpage text for duplicate content found on other domains. If the text on a page has been scraped, copied, or even spun, the offending URL should show up with tools such as Copyscape.
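To give a feel for how text can be flagged as “appreciably similar”, here is a minimal sketch that compares two passages by their overlapping three-word sequences (often called shingles) and scores them with Jaccard similarity. This is a generic technique chosen for illustration; it is not a description of how Copyscape or Google work internally, and the sample sentences are made up.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


original = "Duplicate content appears on the internet in more than one place."
copied = "Duplicate content appears on the web in more than one place."
unrelated = "A print style sheet controls how a page looks on paper."

print(similarity(original, copied))     # one swapped word still leaves a high score
print(similarity(original, unrelated))  # no shared three-word sequences
```

In this toy example a single substituted word still leaves half of the shingles shared, which is why lightly altered copies remain detectable.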
Solutions for Duplicate Content
Given there are several different ways for duplicate content to occur, the right fix depends on the situation. Once you’ve identified the source of duplicate content, here are some practical solutions to implement:
1. Canonical Tags
A canonical tag is a snippet of HTML code that defines the main version of a set of duplicate pages. Adding a canonical link element to the duplicate page lets the search engines know that a specific URL represents the master copy of a page. Canonical tags consolidate signals and help search engines pick the preferred page.
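As a rough illustration, the canonical link element is a single `<link rel="canonical">` tag placed in the `<head>` of the duplicate page. The Python sketch below generates it for a given master URL; `canonical_link` is a hypothetical helper and the URL is a placeholder.

```python
from html import escape


def canonical_link(master_url: str) -> str:
    """Return the canonical link element pointing at the master copy."""
    # escape() keeps characters such as & valid inside the HTML attribute.
    return f'<link rel="canonical" href="{escape(master_url, quote=True)}">'


tag = canonical_link("https://example.com/shoes?colour=red&size=9")
print(tag)
# <link rel="canonical" href="https://example.com/shoes?colour=red&amp;size=9">
```

Every duplicate variation of the page would carry the same tag, so that all of them point at one master URL.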
2. 301 Redirects
Pick the desired page and set up 301 redirects to it from the other URLs. This includes any URL variations, https and http pages, and all old duplicated content URLs, each of which should redirect to the proper canonical URL. 301 redirects prevent most duplication issues because the duplicate pages simply redirect to the master page.
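Redirects are usually configured in the web server or CMS, but the logic is easy to see in a small sketch. The example below is a WSGI middleware, written with the Python standard library only, that answers any request on a non-preferred protocol or host with a 301 to the master address. The preferred scheme and host are assumptions, and `redirect_to_master` is a hypothetical name.

```python
PREFERRED_SCHEME = "https"      # assumption for this sketch
PREFERRED_HOST = "example.com"  # assumption for this sketch


def redirect_to_master(app):
    """Wrap a WSGI app so that protocol and host variants get a 301 redirect."""
    def middleware(environ, start_response):
        # Behind a proxy the real scheme may arrive in a forwarded header instead.
        scheme = environ.get("wsgi.url_scheme", "http")
        host = environ.get("HTTP_HOST", "").lower()
        if scheme != PREFERRED_SCHEME or host != PREFERRED_HOST:
            path = environ.get("PATH_INFO") or "/"
            query = environ.get("QUERY_STRING", "")
            target = f"{PREFERRED_SCHEME}://{PREFERRED_HOST}{path}"
            if query:
                target += "?" + query
            start_response(
                "301 Moved Permanently",
                [("Location", target), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)
    return middleware


def page(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"master page"]


app = redirect_to_master(page)
```

With this wrapper in place, a request for the http or ‘www’ variation of a page receives a permanent redirect to the https, non-www address with the same path and query string, while requests for the master address are served normally.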
3. Manage URLs
To avoid duplicate content, manage unnecessary URL variations. It is possible to build a script that always puts URL parameters in the same order; this is often referred to as a URL factory. Telling Google how to handle your URL parameters signals what they are for, rather than leaving Google to figure it out.
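A URL factory can be very small. The sketch below is one possible shape for it, assuming every internal link on the site is generated through a single function; `build_url` is a hypothetical name. It drops empty parameters and sorts the rest by name, so the same set of parameters always yields the same URL.

```python
from urllib.parse import urlencode


def build_url(base: str, params: dict) -> str:
    """Build a URL whose query parameters always appear in the same order."""
    # Drop parameters with no value so they cannot create extra variations.
    cleaned = {key: value for key, value in params.items() if value not in (None, "")}
    if not cleaned:
        return base
    return base + "?" + urlencode(sorted(cleaned.items()))


a = build_url("https://example.com/shoes", {"size": 9, "colour": "red"})
b = build_url("https://example.com/shoes", {"colour": "red", "size": 9})

print(a)
print(a == b)  # the order in which parameters were supplied no longer matters
```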
Duplicated printer-friendly pages are entirely unnecessary; webmasters can alternatively use a print style sheet.
If there are session IDs in your URLs, these can be disabled in your content management system’s settings.
If your duplicate content issues stem purely from one particular URL issue, it is entirely possible to fix the problem at the source and manage your URL structure better without causing any more damage. By managing URL variations neatly and consistently, webmasters can avoid having multiple unnecessary pages for the same content.
4. Meta Robots Noindex Tag
One meta tag that can be particularly helpful when dealing with duplicate content on the same domain is meta robots. When the values “noindex, follow” are used in the HTML head, they let Google know that it can exclude the page from its index.
The meta robots tag allows the page to be crawled without being indexed. This means the search engines can go through the content of the duplicate while only indexing the master page. Google explicitly cautions against restricting crawl access to duplicate content on your website, so meta robots is a good solution for sites that purely want to limit indexing of duplicate pages. The meta robots tag is particularly useful for duplicate content issues caused by pagination.
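To check whether a duplicate page actually carries the tag, its HTML can be parsed for a `<meta name="robots">` element. The sketch below uses Python’s built-in `html.parser`; the class and function names are hypothetical, and the sample page is made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


duplicate_page = """<html><head>
<title>Page 2</title>
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>"""

found = robots_directives(duplicate_page)
print("noindex" in found and "follow" in found)
```

A page set up as described above should report both `noindex` and `follow`; a master page should report neither.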
Historically, search engine experts were under the impression that duplicate content was a major red flag when Google was crawling your site. While duplicate content is confusing, messy, and can produce separate pages competing for the same keywords, duplicate content alone is not grounds for a Google penalty. We can finally bust the myth that duplicate content will result in a site being penalised.
Although duplicate content is not as dangerous as initially thought, it is entirely fixable and should be fixed. Help search engines out by signalling which pages you want indexed and creating less confusion. Stop your pages from competing with each other by cleaning up URL variations, pointing to the correct pages, and maintaining consistency.
For the people in the back: duplicate content alone is not a penalty threat. However, taking the time to avoid duplicate content can go a long way towards achieving high search visibility and increased organic traffic.