The debate over exactly what is meant by duplicate content, and whether duplicate content is a problem at all, has been underway for some time now and shows no sign of going away. So exactly how is duplicate content defined, and is it really a problem?
The general and widely accepted view is that duplicate content does indeed matter. One highly respected search engine optimization expert recently wrote an article opposing this view, but even a quick look at the mass of material written on this topic in recent weeks clearly shows that this is very much a minority opinion.
If we accept that duplicate content really does matter, then how do we define duplicate content? If I write an article for an article directory and then alter that same article for submission to a second directory, how will the search engines evaluate these two articles and decide whether they contain duplicate content? The fact of the matter is that we don't know. However, here are one webmaster's thoughts.
When duplicate content checking was first introduced by the search engines, it was very much a case of comparing one web page as a whole against another, and no attempt was made to cut the two pages up and compare individual page elements. At that time you could take identical content and simply add an introduction and conclusion to one of the two pages, and that would be enough to avoid the problem of duplicate content. Sadly for many writers, those days have long since disappeared.
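As a rough illustration of why whole-page comparison was so easy to sidestep, here is a minimal Python sketch. It assumes the check amounts to comparing one fingerprint per page; nobody outside the search engines knows how they actually did it, and the function names and sample text are made up for the example.

```python
import hashlib

def page_fingerprint(text: str) -> str:
    """Hash the whole page after collapsing case and whitespace (an assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_whole_page_duplicate(page_a: str, page_b: str) -> bool:
    """Treat two pages as duplicates only if their whole-page fingerprints match."""
    return page_fingerprint(page_a) == page_fingerprint(page_b)

article = "Roses need regular pruning. Water them early in the day."
# The same article with a new introduction and conclusion bolted on.
padded = "A short introduction. " + article + " A short conclusion."

print(is_whole_page_duplicate(article, article.upper()))  # True
print(is_whole_page_duplicate(article, padded))           # False
```

Under this assumption, any change at all to the page produces a different fingerprint, which is why adding an introduction and conclusion would have been enough.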
The search engines now divide up the two pages so that they can examine individual elements, and it is this which is at the core of the present discussion. Most people agree that attention is now focused upon the central content of a page rather than the structure of the page. Many site owners build their pages from templates, which define the structure of each page, including such things as navigation menus, headers and footers. This is widely believed to be acceptable, and the search engines do not class it as duplicate content. What the search engines are checking is the actual content contained in the body of the page. But just how do they check this page content?
Some people contend that this examination is done at 'block' level (that is to say, at the level of individual sentences or paragraphs), while others think that filtering looks for phrases or possibly even for individual words. None of us really knows, of course, but it seems reasonable to assume that the likeliest basis of checking would be either sentence or phrase matching.
Sentence matching is fairly straightforward and merely involves breaking both pages down into chunks following the punctuation on the page. For example, look at this sentence:
It is relatively simple to get a good deal on a shower unit, as long as you know what to look for.
This would be seen as either a single sentence or two sentences, depending on whether you use the traditional definition of a full stop as marking the end of a sentence or choose a more flexible approach and break on other punctuation marks, such as commas.
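The two ways of chunking that sentence can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of any search engine's method; the function name `split_chunks` and the `flexible` switch are my own inventions.

```python
import re

def split_chunks(text: str, flexible: bool = False) -> list[str]:
    """Break text into chunks at punctuation.

    Strict mode breaks only at sentence-ending marks (. ! ?).
    Flexible mode also breaks at commas, semicolons and colons.
    """
    pattern = r"[.!?,;:]+" if flexible else r"[.!?]+"
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]

sentence = ("It is relatively simple to get a good deal on a shower unit, "
            "as long as you know what to look for.")

print(len(split_chunks(sentence)))                 # 1
print(len(split_chunks(sentence, flexible=True)))  # 2
```

With the strict rule the whole sentence is one chunk to be matched; with the flexible rule each half could be matched separately against another page.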
Matching based on phrases is a little bit more difficult. What constitutes a phrase? Should it be made up of 2 or 3 or 4 or 20 words?
Just for the moment let us say that a phrase is 3 words. If this were the case then the following phrases would be seen as duplicate content if they were to appear on two pages which were being examined:
Did you know
The answer is
In those days
In the end
Day to day
At that time
You can get
One way to
These eight phrases are all standard day to day phrases which could appear on pages about rose gardening, flying kites, search engine optimization or anything else you can think of. Now some people contend that the search engines do examine pages at this level. In fact, when I questioned the support staff of one popular content checker (DupeCop) about the manner in which their system checked for duplicate content, they replied saying:
"DupeCop compares both individual words and 3-word phrases. It also ignores all punctuation and scans across sentences"
I was not surprised therefore that when I Your guess would be as good as mine.
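To see what 3-word phrase matching would mean in practice, here is a small Python sketch. It follows the description quoted above (punctuation ignored, phrases scanned across sentence boundaries), but it is only my guess at the general approach and not DupeCop's or any search engine's actual code; the function names and the two sample pages are hypothetical.

```python
import re

def word_list(text: str) -> list[str]:
    """Lowercase the text and keep only the words, so punctuation is ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrases(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Every run of `size` consecutive words, scanning straight across sentences."""
    words = word_list(text)
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def shared_phrases(page_a: str, page_b: str, size: int = 3) -> set[tuple[str, ...]]:
    """Phrases that appear on both pages."""
    return phrases(page_a, size) & phrases(page_b, size)

roses = "Did you know that roses need pruning? The answer is yes."
kites = "Did you know kites fly best in spring? In the end, the answer is practice."

common = shared_phrases(roses, kites)
print(sorted(" ".join(p) for p in common))
# ['did you know', 'the answer is']
```

Two pages on completely unrelated subjects still share stock phrases, which is exactly why filtering at this level would flag a great deal of perfectly original writing.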
Over the years I have written literally hundreds of articles and have closely monitored the results in terms of duplicate content penalties, as far as it is possible for anybody to do this. On the basis of my own experience I would say that filtering is not conducted down to the level of 3 or 4 word phrases but is far more likely to stop at sentence level. So, provided you rewrite your content down to this level, you should have no problem in escaping the filters. In fact, even if one or two of your sentences are duplicated, you ought still to be fine.