Google recently announced that they are researching the use of estimates of website trustworthiness to help rank search results. This excites people who hope that pseudoscience and crazy conspiracy theories will get less attention, but it upsets people who worry that the results will be biased:
“I worry about this issue greatly,” said Anthony Watts, founder of climate denying website “Watts Up With That,” in an interview with FoxNews.com. “My site gets a significant portion of its daily traffic from Google… It is a very slippery and dangerous slope because there’s no arguing with a machine.”
That’s from a Salon article by Joanna Rothkopf, who seems to think that keeping people away from Watts’s site is a great idea:
[…] some anti-science advocates are upset about the potential development, likely because their websites will become buried under content that is, well, true.
That’s not really fair. Most of the people who are concerned about Google’s research aren’t afraid they will lose out to the truth. They think they have the truth, and what worries them is that they will lose out to ignorance, confusion, and bias. Or they worry that the system is rigged to allow their ideological opponents to keep lying:
But others who follow media bias note that even the media watchdogs – let alone the sites used by the Google researchers like Wikipedia – are often biased.
“They’re very good at debunking myths if they upset liberals, but if it’s a liberal or left-wing falsehood, the fact-checkers don’t seem as excited about debunking it,” Rich Noyes, research director at the Media Research Center, told FoxNews.com.
I think this concern originates from a misunderstanding of what Knowledge Vault really is and how it works.
“Google should be commended for taking on the great task of fighting against propaganda and misinformation,” Nomiki Konst, executive director of The Accountability Project, told FoxNews.com.
“Hopefully Google will work closely with the FCC and journalism watchdogs in setting up standards to validate what is factual and who represents themselves as journalists,” Konst said.
Actually, Google will do nothing of the sort, because Konst’s statement is based on imagined capabilities that Knowledge Vault just doesn’t have. This is what you get when you interview people about something that they haven’t had time to learn about.
Meanwhile, Jack Marshall at Ethics Alarms is also concerned about where this could lead:
Can you see Google reducing the rank of websites that are consistently deceitful and misleading, like those claiming the women make only 77% as much as men because of gender bias, or that one in four women who go to college are raped, or that Mike Brown had his hands up […]? I can’t.
I think probably not, but because of technical limitations, not ideological bias.
Will websites that assert religious beliefs be judged “untrue”? How about sites that assert that Islam is a violent and revolutionary religion? Determining which sites get the most traffic and links can be determined objectively; deciding what is true and factual requires complex and debatable distinctions between opinion and fact, metaphor, hyperbole, ideology, skepticism, and deceit […]. Will just facts be at issue, or deceitful arguments made specifically to make readers believe what isn’t true?
From a quick glance at the research papers, I’m pretty sure Google’s Knowledge Vault doesn’t come anywhere close to being able to untangle problems like those.
One of the things people are sometimes surprised to learn about search engines like Google is that they have very little understanding of what the text on a page actually means. (I’m talking here about mainstream public search engines, not highly experimental research projects, which may do slightly better.) At their most basic, search engines are all about recognizing words.
Search engines break down documents — web pages — into compact statistical descriptions of the words they contain. Mathematically, these descriptions define a vector space, and each document’s location in that space depends on which words it contains, how unusual the words are (“the” is pretty much ignored, “coprolalia” gets lots of attention), how many times the words appear in the document, where they appear in the document, how large the document is, and so on. As a result of the way these document vectors are constructed, the distance between two documents in vector space is inversely related to their similarity: documents in the same region of the document space probably have similar topics.
When a user types a query into a search engine, it is essentially treated as a very short document, its vector is constructed, and the search engine finds its location in the document space. All the nearby documents are then returned as the result of the search, with the nearest documents at the top of the list.
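To make this concrete, here’s a minimal sketch of the vector-space model in Python. Everything in it is a toy of my own devising: the tokenizer just splits on whitespace, the corpus is three sentences, and real engines use far more sophisticated weighting. But the shape of the idea, TF-IDF weights compared by cosine similarity, is the classic one.

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus):
    """Weight each word by how often it appears in this document (TF)
    and how rare it is across the whole corpus (IDF)."""
    counts = Counter(doc.lower().split())
    n_docs = len(corpus)
    vec = {}
    for word, count in counts.items():
        doc_freq = sum(1 for d in corpus if word in d.lower().split())
        if doc_freq == 0:
            continue  # word never seen in the corpus; ignore it
        idf = math.log(n_docs / doc_freq)  # "the" is in every doc -> weight 0
        vec[word] = count * idf
    return vec

def cosine_similarity(a, b):
    """Nearness in document space: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = [
    "the avengers movie stars the hulk and thor",
    "the senate passed the budget bill today",
    "thor and the hulk fight ultron in the new avengers film",
]
vectors = [tfidf_vector(d, corpus) for d in corpus]

# A query is treated as a very short document in the same space.
query = tfidf_vector("avengers hulk", corpus)
for i in sorted(range(len(corpus)),
                key=lambda i: cosine_similarity(query, vectors[i]),
                reverse=True):
    print(round(cosine_similarity(query, vectors[i]), 3), corpus[i])
```

Run it and the two Avengers sentences come out on top while the Senate story scores zero. Notice also that “the” appears in every document, so its IDF weight is zero and it drops out of every comparison.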
Well, more or less. I’m giving an oversimplified description of how search engines match documents. There are actually a lot of refinements to the search process. For example, before document vectors are constructed, all the words have to be stemmed, meaning that related words like “walk”, “walks”, “walked”, and “walking” are all mapped into “walk” so that a search for “walk” is able to locate a document that only ever uses “walking.” The engine might also have a dictionary of synonyms, so a search for “teacher” will also find documents that only use “educator”, “professor”, or “schoolmarm”.
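Here’s a crude sketch of that normalization step, with an invented suffix-stripper standing in for a real stemmer (Porter’s algorithm is the classic) and a hand-made synonym table:

```python
def crude_stem(word):
    """Toy stemmer: strip common suffixes so "walks", "walked", and
    "walking" all collapse to "walk". Real stemmers (e.g. Porter's)
    are far more careful about exceptions."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# An invented synonym table; a real engine's would be enormous.
SYNONYMS = {"educator": "teacher", "professor": "teacher", "schoolmarm": "teacher"}

def normalize(word):
    word = SYNONYMS.get(word.lower(), word.lower())
    return crude_stem(word)

print([normalize(w) for w in ["Walking", "walked", "walks"]])  # all "walk"
print(normalize("schoolmarm"))                                 # "teacher"
```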
In addition, once documents are matched to the query, there’s more to prioritizing the results than just the distance metric in document space. For example, the document space is not uniform, and some types of documents will tend to group together because they are about the same thing. News articles about the upcoming Avengers movie, for instance, will all tend to use the same groups of unusual words and phrases — “avengers”, “tony stark”, “hulk”, “thor”, “hawkeye”, “nick fury”, “ultron”. When searching for documents near a query vector, a search engine might identify nearby clusters of documents and give a boost to the rank of representative documents from the cluster, on the theory that documents about popular subjects are more likely to be relevant.
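One way such a boost might be implemented is sketched below. The threshold and bonus values are invented, and real clustering is considerably more elaborate, but it captures the idea of rewarding documents that sit in a dense neighborhood of the result set:

```python
def boost_dense_neighborhoods(results, similarity, threshold=0.5, bonus=0.05):
    """Give a small rank bonus to documents with many near neighbors
    among the results, on the theory that documents about popular
    subjects are more likely to be relevant. `results` is a list of
    (doc_id, score) pairs; `similarity(a, b)` compares two documents."""
    rescored = []
    for doc_id, score in results:
        neighbors = sum(
            1 for other, _ in results
            if other != doc_id and similarity(doc_id, other) > threshold
        )
        rescored.append((doc_id, score + bonus * neighbors))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```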
Another priority adjustment is the one that made Google famous: Page Rank. The folks who created Google realized that the World Wide Web offered them more information about a document collection than just the content. It offered links. This was important because links were created by humans, and the human ability to read and understand documents is the gold standard. So if a large number of humans thought a document on the web was important enough to link to, Google gave it a more prominent position in the search results. This was a revolution in web search engines, and it made Google the preferred starting point on the web. (And launched the search engine optimization industry.)
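The core of Page Rank fits in a few lines. This is the textbook power-iteration version, not whatever refinements Google actually runs today:

```python
def page_rank(links, damping=0.85, iterations=50):
    """Textbook power iteration: a page is important if important
    pages link to it. `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:  # a page with no links spreads its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Everyone links to "c", so "c" ends up with the highest rank.
print(page_rank({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}))
```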
Google watchers are also pretty sure that Google makes other adjustments to results. It seems to prioritize sites that have been around a while, presumably on the assumption that brand-new sites are suspect. On the other hand, Google seems to love new content on established sites, probably for its novelty and as a sign that the site is being maintained and kept up-to-date. Google also penalizes sites that break its rules, such as by selling links, and it seems to have ways of spotting link farms, networks of websites created solely for the purpose of jacking up Page Rank.
This latest idea, rating pages according to an estimate of trustworthiness, is just another attempt to refine the search results. It works by attempting a more sophisticated understanding of the content of web pages than just recognizing words. Knowledge Vault uses modern natural language processing algorithms to extract some small amount of meaning from the text in the form of relations, which are three-part tuples consisting of a subject, the name of a property of that subject, and an object. For example:
<Illinois,subdivision of,United States>
<Illinois,capital,Springfield>
<Barack Obama,Senator of,Illinois>
<Barack Obama,birthplace,Honolulu>
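In code, a relation is just a three-field record. Here’s how the examples above might be represented; the representation is mine, not Google’s:

```python
from typing import NamedTuple

class Relation(NamedTuple):
    """A <subject, property, object> triple."""
    subject: str
    predicate: str
    obj: str  # "object" is a Python builtin name, so "obj"

trusted_facts = {
    Relation("Illinois", "subdivision of", "United States"),
    Relation("Illinois", "capital", "Springfield"),
    Relation("Barack Obama", "Senator of", "Illinois"),
    Relation("Barack Obama", "birthplace", "Honolulu"),
}
```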
(I don’t think Google’s Knowledge Vault is available online, but if you want to get an idea of how well relation extraction works on real documents, you can try out a demo of the AlchemyLanguage relation extraction API, which is part of IBM’s Watson project. Just copy and paste a block of text or feed it the URL of a web page and you can explore what it figures out. I find the Entities, Concepts, and Relations tabs pretty interesting. It’s nowhere near human quality, but it’s better than I would have thought.)
Once the Knowledge Vault has a collection of these relations, it needs a way to figure out which ones are true. A simple way to do that is to start with collections of relations from a trustworthy source. Google starts with a collection of curated databases which are believed to be fairly reliable, such as Freebase and the various Wikimedia projects.
(To get an idea what kind of data is stored in these knowledge collections, check out the entry for former President Bill Clinton at the open source Freebase and Wikidata databases and at the commercial WolframAlpha database.)
Google can then compare relations extracted from web pages against relations in the trusted databases, and do some analysis to estimate the trustworthiness of the web pages: Pages which get a lot of known facts wrong would be given a low trustworthiness score, and they could be pushed down in the search results, relative to more trustworthy pages.
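Reusing the Relation triples from the sketch above, the comparison might look roughly like this. I’m assuming, purely for illustration, that each (subject, property) pair has a single correct object and that strings match exactly; the real system is statistical and much fuzzier:

```python
def trust_score(page_relations, facts):
    """Fraction of a page's *checkable* claims that the trusted
    database confirms. Claims the database knows nothing about
    are neither confirmed nor contradicted."""
    known = {(r.subject, r.predicate): r.obj for r in facts}
    confirmed = contradicted = 0
    for r in page_relations:
        expected = known.get((r.subject, r.predicate))
        if expected is None:
            continue  # a brand-new claim; the database can't judge it
        if r.obj == expected:
            confirmed += 1
        else:
            contradicted += 1
    checked = confirmed + contradicted
    return confirmed / checked if checked else 0.5  # 0.5 = no evidence

my_page = [Relation("Barack Obama", "birthplace", "Kenya"),
           Relation("Barack Obama", "feelings toward America", "hates")]
print(trust_score(my_page, trusted_facts))  # 0.0: its one checkable claim is false
```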
A lot of the relations extracted from web pages will be new relations which are neither proved nor disproved by the trusted data. However, if the Knowledge Vault keeps finding the same new relations on pages it has ranked as trustworthy based on the relations it does know about, then it can start to rank those new relations as true as well. Then it can begin to use them in the trustworthiness evaluation process for other new pages. Basically, trustworthy pages can be used to identify true facts, and true facts can be used to identify trustworthy pages, and this process can be repeated over and over to expand the fact repository while keeping it anchored to a few million trusted facts drawn from curated databases.
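That bootstrapping loop is the heart of the idea, and it can be sketched on top of the trust_score function above. Every threshold here is invented, and the actual research uses probabilistic inference rather than hard cutoffs:

```python
from collections import Counter

def expand_facts(pages, seed_facts, rounds=5, trust_cutoff=0.8, votes_needed=3):
    """Alternate between (1) scoring every page against the current fact
    set and (2) promoting new relations that keep showing up on pages
    that score well. `pages` maps a page name to its extracted relations."""
    facts = set(seed_facts)
    for _ in range(rounds):
        scores = {name: trust_score(rels, facts) for name, rels in pages.items()}
        votes = Counter()
        for name, rels in pages.items():
            if scores[name] >= trust_cutoff:
                for r in rels:
                    if r not in facts:
                        votes[r] += 1
        # A new relation becomes a "fact" once enough trusted pages assert it.
        facts |= {r for r, count in votes.items() if count >= votes_needed}
    return facts
```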
The result will be clusters in the document space of known trustworthy pages. When Google’s search engine maps a query into that space, rather than taking the strictly nearest documents, it can reach out into one of the trustworthy clusters in search of a better result.
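In ranking terms, that presumably amounts to blending the trust estimate into the relevance score, something like the sketch below. The linear blend and the weight are pure guesses about how such signals might be mixed:

```python
def blended_score(relevance, trust, weight=0.3):
    """Let a slightly less relevant but much more trustworthy page
    outrank a nearer but shadier one."""
    return (1.0 - weight) * relevance + weight * trust

# A closer-but-dubious page vs. a slightly-farther-but-trusted one:
print(blended_score(relevance=0.85, trust=0.10))  # 0.625
print(blended_score(relevance=0.80, trust=0.95))  # 0.845
```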
Now we can begin to consider some of the concerns raised above. As you can see from the simplicity of the relations and the ranking system, Google is not going to make “complex and debatable distinctions between opinion and fact, metaphor, hyperbole, ideology, skepticism, and deceit.” Nor will they “work closely with the FCC and journalism watchdogs” to set all this up. The process (assuming Google decides to use it) is far too simple and mechanistic for any of that.
Let me give you an example of how it might work. Lately my Twitter timeline has been filled with inane arguments about whether President Obama loves America. If these were web pages, how would Google decide whether to boost the “Obama hates America” pages or the “Obama loves America” pages?
Well, if your web page says, “Obama was born in Honolulu and he loves America,” and my web page says “Obama was born in Kenya and he hates America,” Google would see two relations from each of us about the entity Obama: one about his birthplace, and one about his feelings toward America. Since Obama’s birthplace is in the baseline database, Google would recognize that your website has one true fact about Obama and mine has one false fact, which would make Google trust your page more than mine. That trust would also carry over into the other relation on your page, and Google would ever so slightly begin to believe that Obama loves America.
(Actually, the Knowledge Vault algorithm has to find multiple verifiable facts on a page or website before it will render a judgement on its trustworthiness, but I’m simplifying.)
If hundreds or thousands of websites weighed in on this debate, and if the pages asserting that Obama loves America had significantly more true facts and fewer false ones than the pages asserting that Obama hates America, the Knowledge Vault would eventually start to think of <Obama,loves,America> as a true fact and <Obama,hates,America> as a false one. Soon any page from which <Obama,loves,America> can be extracted would be ranked higher than an otherwise equal page from which <Obama,hates,America> can be extracted.
Does that seem crazy to you? That Google would find a page more trustworthy because of what is pretty clearly an opinion? I think it actually makes some sense when you remember three important things. First, Google’s trust adjustments are a statistical inference from data: For whatever unknown and unknowable reason, web pages expressing that opinion have had more checkable facts correct than web pages expressing the opposite opinion, so it seems reasonable to assume that the uncheckable facts are also more likely to be correct.
So for many of the statements of fact that Jack asks about (“…women make only 77% as much as men because of gender bias…one in four women who go to college are raped…Mike Brown had his hands up…Islam is a violent and revolutionary religion”), the answer appears to be that Google will tend to judge these statements as true if and only if Knowledge Vault tends to find them on pages that have other true statements. Honest pages are trusted to contain honest information.
The second thing to keep in mind is that Knowledge Vault’s fact database would be only one of several factors that determine a page’s search engine result placement. Google is secretive about its algorithms, but the search engine certainly looks at similarity scoring from the document vector space and Google Page Rank. Google watchers also believe that Google scores pages for load speed, technical correctness, layout clarity, security, and an especially secret method for detecting black hat search engine optimization tricks. If Google does add Knowledge Vault trustworthiness scores to the mix, it will likely only adjust the results computed by other methods.
The third and final thing to remember is that Google’s search engine has one overriding goal: To return results relevant to the user. By definition, true relevance can only be evaluated by humans, so before Google rolls out a search algorithm that uses Knowledge Vault, they will first make it available to their search quality raters — a rotating pool of several thousand part-time workers all over the world — who will compare its results side-by-side to the results produced by the current algorithm.
If it does a better job and returns results they consider relevant, Google will next feed results from the new algorithm to a small percentage of live search users and analyze how they click on links. If the change appears to have a positive effect, an engineering team will make the final decision on whether to roll out the new algorithm with its new trustworthiness metric.
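I have no idea how Google actually buckets its live experiments, but the standard pattern in the industry looks something like this: hash each user into the experiment deterministically, then compare click behavior between the two groups:

```python
import hashlib

def in_experiment(user_id, fraction=0.01):
    """Deterministically route a small, stable slice of users
    to the new ranking algorithm."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(fraction * 10_000)

def click_through_rate(clicks, impressions):
    return clicks / impressions if impressions else 0.0

# If the experimental group clicks top results more often, that's a
# (noisy) sign the new algorithm is returning more relevant pages.
```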
Knowledge Vault doesn’t have to be right all of the time, or even most of the time. It doesn’t have to be perfect; it just has to be right often enough to make Google search better.
Finally, Jack Marshall asks this question about Knowledge Vault:
We just were informed that “cholesterol is not as bad for you as we once thought,” after years of being told that consuming eggs, milk and steak would kill us for sure. There were nutritional and economic consequences of that “fact.” Would Google’s new search methods have buried the assertions of contrarian scientists, who were claiming this years ago, as liars?
I dunno. Maybe. Remember that Knowledge Vault assumes that truths imply trustworthiness and trustworthiness implies truths. If the websites expressing the contrarian opinion on cholesterol appeared trustworthy in all other ways, if they contained statements just as likely to be rated truthful as statements on websites that demonized cholesterol, then Google would probably give them equal weight.
The Knowledge Vault trustworthiness estimation algorithm is partially circular and self-referential — trustworthy web sites contain true facts, and true facts are those contained on trustworthy websites — so it would likely have a tendency to reinforce orthodoxy. (The Page Rank and clustering algorithms have similar tendencies for similar reasons.) It would be nice to have information retrieval technology that didn’t have these limitations…but then again, it would also be nice to have human beings that didn’t have these limitations.
After all, the cholesterol contrarians Jack’s talking about weren’t marginalized by a search engine, they were marginalized by the consensus of doctors and dieticians and nutritionists, perhaps for reasons that seemed good at the time. It’s not realistic to expect Google’s search engine to pore through the mass of scientific publications and pick out truths that have gone unrecognized by almost the entirety of the scientific community. Artificial intelligence that powerful is still in the realm of science fiction.
Besides, if you want to find out about the cholesterol controversy, you only have to Google “cholesterol controversy.”