Imagine, for a moment, that every Web search gave only accurate, verified information. Imagine that questions concerning real facts about the real world returned lists of websites ordered by how well those sites' facts matched the real world.
Search for "Barack Obama's nationality," and websites claiming "Kenya" would be banished to the 32nd page of the list. Search for "measles and autism" and you'd have to scroll down for 10 minutes before you found a page claiming they were linked.
Imagine a world in which information on the Web had to be accurate. Well, you don't have to imagine very hard, because that world now seems entirely possible.
In today's world, Web searches rank sites based on their popularity — in terms of links made from other sites to the site in question — as well as the "quality" of those links. Recently, however, researchers at Google published a remarkable paper demonstrating how rankings in a Web search can be driven by something entirely different: the veracity of the facts the sites contain.
The new paper instantly caused a stir. Advocates and detractors argued over what constitutes a "fact" as they pondered the far-reaching consequences of a truth-based Internet.
Before we consider those consequences, however, it's important to see the research itself as a fascinating measure of how powerful the growing field of "data science" has become. As the authors state in their introduction:
"In this paper, we address the fundamental question of estimating how trustworthy a given web source is. Informally, we define the trustworthiness or accuracy of a web source as the probability that it contains the correct value for a fact (such as Barack Obama's nationality), assuming that it mentions any value for that fact ..."
To achieve their goal, the researchers devised a "knowledge-based trust" evaluation algorithm to define any site's accuracy. They write:
"We extract a plurality of facts from many pages using information extraction techniques. We then jointly estimate the correctness of these facts and the accuracy of the sources using inference in a probabilistic model. Inference is an iterative process, since we believe a source is accurate if its facts are correct, and we believe the facts are correct if they are extracted from an accurate source. We leverage the redundancy of information on the web to break the symmetry. Furthermore, we show how to initialize our estimate of the accuracy of sources based on authoritative information, in order to ensure that this iterative process converges to a good solution."
Thus, the whole process is iterative, drilling down with each pass to an ever-better link between claims and verifiable knowledge about the world. It works via the researchers' use of Google's giant Knowledge Graph and Knowledge Vault projects, which have been using the Web to build links between facts and reference works for an insane amount of information. "Facts" are actually represented in the study as "knowledge triples" such as (New York, capital, Albany) or (Barack Obama, nationality, American). By comparing a specific knowledge triple found in any given Web page against the giant databases, the algorithm can determine any website's accuracy in relation to established facts.
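For readers who want to see the circular logic in action, here is a minimal sketch in Python. It is not the researchers' actual model, which is a far more elaborate probabilistic system; it is a toy version of the same idea, in which a claim is believed to the extent that accurate sources make it, and a source is judged accurate to the extent that its claims are believed. The site names, the starting accuracy of 0.8 and the function name estimate_trust are all assumptions made up for illustration, and every source starts with the same score rather than being seeded from authoritative information as the paper describes.

```python
from collections import defaultdict

def estimate_trust(claims, n_iters=20, prior=0.8):
    """Toy joint estimate of claim correctness and source accuracy.

    claims: list of (source, subject, predicate, value) tuples.
    prior: assumed starting accuracy given to every source (illustrative).
    """
    accuracy = {src: prior for src, _, _, _ in claims}
    belief = {}
    for _ in range(n_iters):
        # Step 1: each candidate value collects votes weighted by the
        # current accuracy of the sources asserting it.
        votes = defaultdict(lambda: defaultdict(float))
        for src, subj, pred, val in claims:
            votes[(subj, pred)][val] += accuracy[src]

        # Normalize the votes so the beliefs for one (subject, predicate)
        # pair sum to 1.
        belief = {}
        for item, values in votes.items():
            total = sum(values.values())
            for val, weight in values.items():
                belief[(item, val)] = weight / total

        # Step 2: a source's accuracy becomes the average belief in
        # the claims it makes.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for src, subj, pred, val in claims:
            sums[src] += belief[((subj, pred), val)]
            counts[src] += 1
        accuracy = {src: sums[src] / counts[src] for src in sums}
    return accuracy, belief

# Hypothetical sites and claims, invented for this example.
claims = [
    ("site_a", "Barack Obama", "nationality", "American"),
    ("site_a", "New York", "capital", "Albany"),
    ("site_b", "Barack Obama", "nationality", "American"),
    ("site_b", "New York", "capital", "Albany"),
    ("site_c", "Barack Obama", "nationality", "Kenyan"),
    ("site_c", "New York", "capital", "New York City"),
]

accuracy, belief = estimate_trust(claims)
for site in sorted(accuracy):
    print(site, round(accuracy[site], 3))
```

In this toy setup, the two sites that agree with each other pull ahead of the lone dissenter on every pass, which is the "redundancy of information" the authors say they rely on to break the symmetry. It also shows the obvious weakness of relying on agreement alone: if most sources repeated the same mistake, the sketch would happily believe them, which is presumably why the researchers anchor their starting estimates in authoritative information.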
It's a powerful and clever approach, but more to the point, it appears to work. Applying the method to 2.8 billion facts, the researchers were able to evaluate the trustworthiness of more than 119 million Web pages. Thus it appears that, yes, accuracy can be used as a criterion for ranking Web searches.
But to understand what this means as compared to the old link-based rankings, consider this from The Washington Post's Caitlin Dewey:
"In one trial with a random sampling of pages, researchers found that only 20 of 85 factually correct sites were ranked highly under Google's current scheme. A switch could, theoretically, put better and more reliable information in the path of the millions of people who use Google every day. And in that regard, it could have implications not only for [search engine design] — but for civil society and media literacy."
Google has been explicit that this is only research and there are no plans to implement the system anytime soon. Still, the reaction to the paper makes it clear that, for some, even the ideas in the paper present significant problems.
As Anthony Watts, who runs a popular climate skeptic website, told Fox News, "I worry about this issue greatly. ... My site gets a significant portion of its daily traffic from Google." He added, "It is a very slippery and dangerous slope because there's no arguing with a machine."
If such accuracy-based Web searches were ever to become the norm, it seems clear that the Internet — as a public space for information distribution — would be fundamentally changed. And with that possibility, we can see how even the discussion around Google's research raises two fundamental questions for society: Do we believe there are actual facts about the world? Do we believe there are ways to judge them to be so?
There is a lot riding on our answers.
Adam Frank is a co-founder of the 13.7 blog, an astrophysics professor at the University of Rochester, a book author and a self-described "evangelist of science." You can keep up with more of what Adam is thinking on Facebook and Twitter: @adamfrank4.