From what word count could identical text between two pages be considered duplicate content by search engines?
Let’s set the scene: you have a very basic website, but two of the pages presenting your company have more or less the same content. At least at first sight.
When you count the words, you realise that page A has more words than page B. But in their semantic and grammatical composition, the two texts seem to be identical in whole or in part.
Let’s say that 80% of the text on page B is a replica of the text on page A.
Can we deduce from this that the content of page B is duplicate content of page A, from a search engine’s point of view?
It is difficult to answer this question because, in my opinion and from what I have learned on the subject, no concrete answer has ever been given.
Duplicate content is difficult to quantify
When I say concrete answer, I mean that Google and the other search engines have never published clear proportions or figures. Google’s recommendations on duplicate content are very vague:
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content in the same language or are appreciably similar.

Source: https://developers.google.com/search/docs/advanced/guidelines/duplicate-content
Google then gives some examples that clearly correspond to duplicated pages rather than to duplicated passages of text, which is what we usually have to deal with.
How much text then is considered duplicate content?
To answer this question, let’s put our logic into practice. If two pages have exactly the same content, then it is certain that it is duplicate content. The same is true if the content on these same pages has many similarities.
If you present tourist tours in Paris and Bordeaux, there is a good chance that the two pages use the same vocabulary. The difference will lie in the entities used: the names of monuments, places, etc.
On the other hand, the notion of duplicate content is more subtle if two pages have only one identical paragraph.
I also consider that particular attention should be paid to the type of page. A data sheet for one nut will hardly differ from a data sheet for another nut. The duplicated content will be “logical” because it is expected.
This principle of logic is, for me, the best indicator of duplicated content. It forces us to treat the text we write as something valuable for the Internet users we are addressing.
In fact, Google gives us a bit of a heads-up on this subject by explaining that:
If you have many pages that are similar, consider expanding each page or consolidating the pages into one.

Source: https://developers.google.com/search/docs/advanced/guidelines/duplicate-content
To conclude this article with perhaps clearer answers, I will give you my estimate of duplicate content. I tend to think that if 70% of a text is identical to another, then it should be modified.
Of course, this ratio is personal and depends on the volume of words in a text. A page with 200 words of duplicated text will have to be rewritten in its entirety. If these same 200 identical words are drowned out by 1000 others, I think that there is only a semantic readjustment to be made.
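The 70% rule of thumb above can be turned into a rough, do-it-yourself check. The sketch below is my own illustration, not a method search engines are known to use: it compares two texts using word-trigram “shingles” and Jaccard similarity, and the shingle size, the tokeniser, and the 0.7 threshold are all assumptions.

```python
# Rough sketch: estimate textual overlap between two pages using
# word-trigram shingles and Jaccard similarity. The 0.7 threshold
# mirrors the personal rule of thumb above; it is NOT a published
# search-engine figure.
import re

def shingles(text: str, n: int = 3) -> set:
    """Return the set of n-word sequences (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two shingle sets (0.0 = disjoint, 1.0 = identical)."""
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical page texts, purely for illustration.
page_a = "Guided tours of Paris: visit the Eiffel Tower, the Louvre and Montmartre with a local guide."
page_b = "Guided tours of Paris: visit the Eiffel Tower, the Louvre and Montmartre with an expert local guide."

ratio = overlap_ratio(page_a, page_b)
print(f"Overlap: {ratio:.0%}")
if ratio >= 0.7:
    print("Likely duplicate content: consider rewriting or consolidating.")
```

A real comparison would also need to account for text volume, as noted above: 200 identical words out of 1,200 is a very different situation from 200 out of 200, even at the same shingle overlap.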