Duplicate content is one of the most perplexing problems in SEO. In this post I outline 15 things about how Google handles duplicate content, leaning heavily on interviews with Vanessa Fox and Adam Lasnik. If I leave something out, just let me know, and I will add it to this post.
1. Google’s standard response is to filter out duplicate pages, and only show one page with a given set of content in its search results.
2. I have seen evidence in the SERPs that large media companies seem to be able to show copies of press releases without being filtered out.
3. Google rarely penalizes sites for duplicate content. Their view is that it is usually inadvertent.
4. There are cases where Google does penalize. This takes some egregious act, or the implementation of a site that is seen as having little end user value. I have seen instances of algorithmically applied penalties for sites with large amounts of duplicate content.
5. An example of a site that adds little value is a thin affiliate site, which is a site that uses copies of third party content for the great majority of its content, and exists to get search traffic and promote affiliate programs. If this is your site, Google may well seek to penalize you.
6. Google does a good job of handling foreign-language versions of sites. They will most likely not see a Spanish-language version and an English-language version of a site as duplicates of one another.
7. A tougher problem is US and UK variants of sites (“color” vs. “colour”). The best way to handle this is with in-country hosting, which makes it easier for Google to detect the difference.
8. Google recommends that you use Noindex metatags or robots.txt to help identify duplicate pages you don’t want indexed. For example, you might use this with “Print” versions of pages you have on your site.
9. Vanessa Fox indicated in her Duplicate Content Summit at SMX that Google will not punish a site for implementing NoFollow links to a large number of internal site links. However, the recommendation is still that you should use robots.txt or NoIndex metatags.
10. When Google comes to your site, they have in mind a number of pages that they are going to crawl. One of the costs of duplicate content is that when the crawler loads a duplicate page, one that they are not going to index, they have loaded that page instead of a page that they might index. This is a big downside to duplicate content if your site is not (more) fully indexed as a result.
11. I also believe that duplicate content pages cause internal bleeding of PageRank. In other words, link juice passed to duplicate pages is wasted, and would be better passed on to other pages.
12. Google finds it easy to detect certain types of duplicate content, such as print pages, archive pages in blogs, and thin affiliates. These are usually recognized as being inadvertent.
13. They are still working on RSS feeds and the best way to keep them from showing up as duplicate content. The acquisition of FeedBurner will likely speed the resolution of that issue.
14. One key thing they use as a signal for which page to select from a group of duplicates is which page is linked to the most.
15. Lastly, if you are doing a search and you DO want to see duplicate content results, just do your search, get the results, append the “&filter=0” parameter to the end of the search results URL, and refresh the page.
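As a rough illustration of item 15, here is a minimal Python sketch that appends the filter=0 parameter to a search results URL. The function name with_filter_off and the sample URL are my own assumptions for the example; only the parameter itself comes from the tip above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filter_off(search_url: str) -> str:
    """Return the search URL with filter=0 set (hypothetical helper)."""
    parts = urlsplit(search_url)
    # Keep every existing parameter except an earlier "filter" value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "filter"
    ]
    params.append(("filter", "0"))
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Sample URL is an assumption; paste your own results URL instead.
    print(with_filter_off("https://www.google.com/search?q=duplicate+content"))
```

Doing it by hand in the address bar works just as well; the helper only saves you from doubling up the parameter if it is already there.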
Here is a summary of Ways to Create Duplicate Content, and Adam Lasnik’s post on Deftly Dealing with Duplicate Content that explains how you handle this problem on your site.
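To make the robots.txt suggestion from item 8 a little more concrete, here is a small Python sketch that checks whether a crawler would be allowed to fetch a page under a given set of rules. The /print/ path, the example.com URLs, and the helper name is_crawlable are assumptions for illustration, not something taken from Google's guidance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps "Print" versions out of the crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

# The per-page alternative is a robots meta tag in the page's <head>.
NOINDEX_META = '<meta name="robots" content="noindex">'


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules in robots_txt allow agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/article-1",
                "https://example.com/print/article-1"):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Running a check like this against your own rules before you publish them is a cheap way to confirm you are blocking the print pages and not the originals.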
by Eric Enge - Source: stonetemple.com
15 Things About How Google Handles Duplicate Content
August 7, 2007, 9:29 am
EU OKs German Online Search-Engine Grant
August 6, 2007, 9:58 am
The European Union on Thursday authorized Germany to give $165 million for research on Internet search-engine technologies that could someday challenge U.S. search giant Google Inc.
The Theseus research project - the German arm of what the French call Quaero - is aiming to develop the world's most advanced multimedia search engine for the next-generation Internet. It would translate, identify and index images, audio and text.
Initially, the German government would pay several "icebreaker" companies - Siemens AG, SAP AG, Deutsche Thomson oHG and EMPOLIS GmbH, owned by Bertelsmann AG - to kick-start research. Later, the German funding would be spread out to small and medium businesses for them to build on the earlier research. France is still discussing a similar subsidy plan with the European Commission, aiming to give $112 million to research led by French video-technology company Thomson.
EU regulators said they could allow Germany to subsidize the Theseus project until 2011 because the government grants were made in a way that would prevent giving any one company an unfair advantage over others. The grants also are expected to help industry work more closely with scientists, making the research more efficient.
Fragmented European research efforts are one of the reasons blamed for the region lagging behind the United States in information technology. European companies in general spend far less on research than those based in other parts of the world, and the EU said the project should help change that.
Mountain View, Calif.-based Google Inc. isn't standing still. It continues to introduce new software and has been aggressively seeking ways to import offline media, such as books and television shows, into its search engine.
Google, founded nearly nine years ago, is now one of the world's best known - and most valuable - companies. On Thursday, it posted a second-quarter profit of $925.1 million - a 28 percent jump from the same time last year.
Copyright 2007 Associated Press. All rights reserved. This material may not be published, broadcast, rewritten, or redistributed.
Source: Forbes.com
Google's cookie cut may not be enough for EU
August 6, 2007, 9:56 am
A member of an influential European Union privacy group has said it will meet to discuss whether Google has gone far enough in reducing the amount of time the Google cookie stays on computers.
Alexander Dix, Berlin's security and privacy representative, told CNET News.com sister site ZDNet UK that the Article 29 Data Protection Working Party, a group of European privacy experts, welcomed Google reducing its cookie time to two years, but said the group would discuss whether Google has gone far enough.
"It's certainly an improvement, but we will have to discuss whether this is enough," Dix said. "It's a good thing that Google has addressed the question of a cookie time limit."
Cookies are small files stored on a computer so that it can be recognized when it revisits Web sites, enabling the site to remember the user's preferences for things like e-commerce, and sites that require a log-in.
Dix said that Google renewing the cookie every time a person used either Google or a site using Google applications, such as Google Analytics, was not a major privacy concern, as users could control cookies by configuring their browser.
"People can influence cookies by configuring their browser--they can just accept one session. Users have more choice than with their log profiles," he said.
Even so, the privacy expert said that cookies were still a concern for the data watchdog, especially cookies that users have accepted or rejected without knowing they have done so. However, Dix said that a bigger concern was the anonymization of server log data, and that the only major search company to disclose its server log data-retention policy had been Google, which anonymizes server logs after 18 to 24 months. Major search players such as Microsoft and Yahoo have yet to disclose their server log data-retention policy, Dix said.
"Certainly Microsoft and Yahoo have not discussed server log profile retention so far. Google has, and we would welcome it if Yahoo and Microsoft did the same," Dix said.
Server log data shows how a computer has been used to search, and can be mined to provide information. Dix said that the major search players had not disclosed how they intended to use that information.
"Our main concern about all search engine providers is that they are transparent about what they intend to do with the information--a concern Microsoft hasn't addressed so far. Maybe they have a privacy-friendly policy--I don't know. They should certainly tell users if they have one," said Dix.
A senior representative for Yahoo Europe said the company will make an announcement on data retention policies "in a matter of weeks."
"Our policies reflect the fact that our users' trust is one of Yahoo's most valuable assets. Maintaining that trust and protecting our users' privacy is paramount to us. Our data retention practices vary according to the diverse nature of our services. We don't break out that information currently as we view it to be commercially sensitive," said the representative.
"We only keep data as long as is required by law and is useful for our business purposes. In some cases, that is as short (a period) as a few weeks. This data is used to benefit our users in many ways. That includes protection against fraud, personalized content, product innovations based on what we learn about how users interact with our site, and best-in-class free services paid for by targeted advertising," the representative added.
Microsoft declined to comment.
Source: news.com.com





