Over on her Facebook page, writer Jennifer Abel is getting pissed off at some of the stuff Facebook is recommending for her:
If this were England, I would sue Facebook for libel; I am THAT offended by the pages they suggest I “like.” Seriously: what the hell did I EVER post, here or anyplace else, to make anybody think I’m a bigot who would support any of those vile organizations designed specifically to deny full human rights to gay people? Hey, Facebook: why not recommend that I “like” Stormfront and the Klan, too? After getting a swastika tastefully tattooed on my ass, of course.
This is why I like Jennifer so much.
Later, in a comment, she elaborates:
…Seriously: for all the stuff I’ve posted on Facebook — including things like “Aww, how sweet, this same-sex elderly couple is getting married” — what the hell makes them think “Oh, yeah, Jennifer is a GREAT candidate to join one of those hateful anti-gay groups with the word ‘family’ in the title”? And given all the anti-TSA stuff I’ve done, what the fuckity-fuck makes them think I want to get a degree in “Homeland Security”?
You don’t have to spend all your free time on Facebook to know what she’s talking about. Facebook can recommend some strange stuff.
I don’t know how Facebook chooses recommendations, but I know a little bit about data mining and searching document collections, and I think I can make some educated guesses. I’m assuming that the algorithms used by Facebook for finding recommendations are related to the algorithms Amazon uses to make product recommendations and (to a lesser extent) the algorithms Google uses for document search. If I’m right, several mechanisms seem likely to be the culprits behind Facebook’s strange recommendations.
We should start with a fact that some people find surprising: No matter what it seems like, neither Facebook nor Google nor Amazon has any idea what you’re talking about. Computers understand a lot of artificial languages — Java, C#, PHP, HTML, CSS, Python — because they are constructed according to rigorous sets of simple rules and talk about a limited set of concepts. When it comes to understanding natural languages such as English, however, random 3rd-graders have much better reading comprehension than even the most advanced software. A service like Google only appears to understand our language because it uses some very clever shortcuts and a lot of processing power.
Early search engines worked entirely off of the individual words in a piece of text, ignoring context completely. On any given web page, rare words scored high and common words scored low. Extremely common words like “and” and “the” were ignored entirely. So if someone searched for several unusual keywords, and your web page happened to have those words, it was likely to be returned near the top of the list of results.
(This is why it’s hard to get to the top of the list for keywords like “criminal lawyer” — the word combination is not very rare — but it’s slightly easier to get to the top for “New York criminal lawyer” and much easier to get to the top of “Muncie Indiana criminal lawyer free consultation”.)
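To make the "rare words score high" idea concrete, here's a minimal sketch in Python. I'm assuming a TF-IDF-style weighting, which is a common textbook approach and not necessarily what any real search engine used; the stop-word list, the sample pages, and the function names are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical stop list: extremely common words are ignored entirely.
STOP_WORDS = {"and", "the", "a", "in", "of", "for"}

def tokenize(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def idf_weights(docs):
    """Weight each word by rarity: log(total pages / pages containing it)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {w: math.log(len(docs) / df) for w, df in doc_freq.items()}

def score(query, doc, idf):
    """Add up the rarity weights of the query words that appear in the page."""
    doc_words = set(tokenize(doc))
    return sum(idf.get(w, 0.0) for w in set(tokenize(query)) if w in doc_words)

# Made-up pages standing in for a tiny web.
PAGES = [
    "criminal lawyer in new york",
    "criminal lawyer in muncie indiana free consultation",
    "criminal lawyer in chicago",
    "the history of muncie indiana",
]
IDF = idf_weights(PAGES)

for page in PAGES:
    print(round(score("muncie indiana criminal lawyer free consultation", page, IDF), 2), page)
```

In this toy collection, "criminal" and "lawyer" appear on three of the four pages, so they carry little weight and every lawyer page ties on the query "criminal lawyer". The longer query includes rarer words, and the one page that contains them all stands out.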
Search technology has gotten better, but to get an idea of how primitive it remains, you only have to look at one of the most well-known natural language applications in the world, Apple’s Siri. The voice recognition system is pretty good at figuring out the words (compared to earlier voice systems) but once it gets the words, Siri still has trouble making sense of what you’re trying to say. Ask it “How far away is Moscow?” and it shows you Moscow on a map. It completely missed the question and fell back on matching the keyword “Moscow”.
(Impressively, WolframAlpha gets the answer right — guessing at my location from my IP address — but that’s exactly the kind of question it was designed to answer. You can stump it easily enough with other questions. Siri, by the way, knows about WolframAlpha, but it wasn’t smart enough to recognize my question as the kind of query it should refer to WolframAlpha.)
If Facebook looks at the text of posts to make recommendations — and I’m not sure that it does — it probably can’t understand the text in a post any better than Siri. If you rant about anti-gay discrimination in your timeline — or “like” a page that opposes anti-gay discrimination — Facebook’s computers may pick up on the words and phrases you use, such as “gay” or “family” or “God”, but they won’t have a clue why you’re using those words, or how much you disagree with religious objections to gay marriage. Organizational Facebook pages that support gay marriage and those that oppose it probably seem very similar to a keyword-oriented matching algorithm — they’re talking about the same thing from two different points of view, after all — and if you keep ranting about the Department of Homeland Security, Facebook will assume you want a job there.
Facebook adds to the confusion because it’s always talking about things for you to “like,” but the traditional goal of search engine technology was not to find things you like, but to find things that are relevant. When Facebook tries to find stuff for you to “like,” it essentially treats content you create as a giant query in a search engine. So if you like 10 pages that talk about gay marriage and you write about gay marriage in your timeline, Facebook will recommend other pages and people that talk about gay marriage, but it can’t understand if you support or oppose gay marriage.
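Here's a rough illustration of why keyword matching can't tell support from opposition. It's a sketch that assumes a plain bag-of-words comparison using cosine similarity, which is a standard text-matching technique and only my guess at the general flavor of what Facebook might do. The post and the three pages are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count the words in a text, ignoring order and meaning."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0 = nothing shared)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented examples: one user post and three candidate pages.
user_post = bag_of_words("i support gay marriage and every family deserves equal rights")
page_for = bag_of_words("we support gay marriage and equal rights for every family")
page_against = bag_of_words("we oppose gay marriage and defend family rights")
page_unrelated = bag_of_words("weekly recipes for sourdough bread and pizza")

sim_for = cosine(user_post, page_for)
sim_against = cosine(user_post, page_against)
sim_unrelated = cosine(user_post, page_unrelated)
print(sim_for, sim_against, sim_unrelated)
```

The page that opposes gay marriage shares most of its vocabulary with the post, so it scores far closer to the post than the unrelated recipe page does. The single word "oppose" is just one more token, and nothing in the arithmetic knows that it flips the meaning.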
If you think of Facebook as recommending relevant things rather than likeable things, then its suggestions to Jennifer were dead-on: She may not like them, but they matter to her, and they spurred her to write about them. (And, in the time it took me to write this, she has gone on to write a Daily Dot article on the subject.)
Facebook’s algorithm for finding suggestions probably depends on data drawn from three basic sources. First, there’s stuff you like, post on your timeline, or otherwise interact with. Second, there’s stuff your friends like, post on timelines (yours or theirs), or otherwise interact with. Then, given those two collections of stuff, Facebook’s algorithm can find other people who have shown an interest in the same things. From that collection of people, the algorithm derives its third data set, consisting of things those other people like, post on their timelines, and otherwise interact with.
[Update: Gideon reminds me in a tweet that there's a fourth source of data: Other sites you visit, and the things that you do there, provided those sites load Facebook content, even if you don't click on it. Facebook would be able to use this to increase the number of people used to build the third data set above.]
This last mechanism is similar to how Amazon can look at the products you buy and recommend other items you might like. It’s based on finding other customers who view and buy the same things as you and then looking at what else those people tend to view and buy. With large data sets — Google, Amazon, and Facebook are all about “big data” — these algorithms can be very effective. (I find that Amazon in particular makes some eerily accurate guesses.) So if you “like” something wildly popular like Dr. Who, Facebook’s computers will find tons of people with similar interests and notice that they also share interests in shows like Star Trek or Farscape, which Facebook will probably recommend to you.
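The "people who like this also like that" idea can be sketched in a few lines. This is a deliberately naive co-occurrence count with made-up users and likes; I'm assuming nothing about how Facebook or Amazon actually implement it, and the `recommend` function is hypothetical.

```python
from collections import Counter

# Made-up users and the pages they "like".
LIKES = {
    "ann": {"Dr. Who", "Star Trek", "Farscape"},
    "bob": {"Dr. Who", "Star Trek"},
    "cat": {"Dr. Who", "Farscape", "Gardening"},
    "dan": {"Gardening", "Cooking"},
}

def recommend(my_likes, likes_by_user, top_n=3):
    """Suggest what people who share at least one of my interests also like."""
    counts = Counter()
    for other_likes in likes_by_user.values():
        if my_likes & other_likes:                  # we have something in common
            counts.update(other_likes - my_likes)   # count what I don't already like
    # Most frequently co-liked first; ties broken alphabetically.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:top_n]

print(recommend({"Dr. Who"}, LIKES))
```

With plenty of overlapping fans, the counts for Star Trek and Farscape pile up and the one-off Gardening fan barely registers. Dan shares nothing with a Dr. Who fan, so his likes never enter the count at all.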
However, when the data is very sparse, the queries can return highly variable results of little significance. It’s similar to how the accuracy of a survey falls off when the sample size is small: Ask 10,000 people about their vote and you can predict the outcome of the next election; ask 5 people about their vote and your result is random nonsense. Since Amazon and Facebook are essentially surveying other people with similar interests in order to predict your interests, if your interests are obscure and unusual then there won’t be many other people from whom to get data, which can lead to strange results.
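For a feel of how quickly small samples fall apart, here's the standard margin-of-error formula for a surveyed proportion, using the usual normal approximation at roughly 95% confidence. The approximation is crude for a sample as small as 5, so treat that number as illustrative only.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

print(margin_of_error(10_000))  # about 0.0098, i.e. roughly one percentage point
print(margin_of_error(5))       # about 0.44, i.e. roughly 44 percentage points
```

A result of "50% plus or minus 44 points" tells you almost nothing, and a recommendation drawn from a handful of similar users is in the same position.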
Suppose you search Amazon to find something obscure, maybe a little-known French translation of an old Turkish book. If it’s esoteric enough, perhaps only one other person has also bought that book in recent times. And then let’s assume that maybe a month later they had to buy a toy for their daughter and settled on a My Little Pony play set. Now, when you visit the page for your obscure book, Amazon’s algorithms are going to look for all other people who bought that product and then at all the other products they bought. And in this hypothetical case with only one other buyer, Amazon is going to see your French translation and offer you Twilight Sparkle.
A similar effect occurs when something really big hits Amazon: So many people buy it that no matter what product you search for, some of the people who bought your product also bought the hugely popular thing. So when you search for a new camcorder, Amazon recommends the new Twilight novel.
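Both failure modes, the lone buyer and the blockbuster, fall out of the same naive "customers who bought this also bought" count. The order data and the `also_bought` helper below are invented for illustration; real systems presumably normalize for popularity and sample size in ways this sketch doesn't.

```python
from collections import Counter

# Invented purchase histories.
ORDERS = {
    "buyer1": {"obscure French translation", "My Little Pony play set"},
    "buyer2": {"camcorder", "tripod", "blockbuster novel"},
    "buyer3": {"camcorder", "blockbuster novel"},
    "buyer4": {"camcorder", "memory card", "blockbuster novel"},
    "buyer5": {"blockbuster novel"},
}

def also_bought(product, orders):
    """Count the other items bought by everyone who bought this product."""
    counts = Counter()
    for basket in orders.values():
        if product in basket:
            counts.update(basket - {product})
    return counts

# Sparse case: a single co-buyer decides the whole recommendation.
print(also_bought("obscure French translation", ORDERS))

# Blockbuster case: the hugely popular item outranks the relevant accessories.
print(also_bought("camcorder", ORDERS).most_common(1))
```

In the first case the only evidence is one person's toy purchase, so that's what gets recommended. In the second, the novel wins simply because everyone bought it, not because it has anything to do with camcorders.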
I’m pretty sure Facebook is not immune to this problem and may even make it worse because it gives extra weight to people near you in the social graph, effectively narrowing its dataset. If you “like” a little-known performance artist that almost no one else has heard of, then when Facebook’s algorithm searches for other people who like that artist, it may only find one person anywhere near you in the social graph who likes that artist. And if that person also likes Stormfront and the KKK, guess what Facebook’s algorithm is going to suggest?
Another big issue is that a search engine like Google is engineered to produce stable result sets. Given the same query, it should return the same result set every time. You might not always see it that way, e.g. if the queries go to two different servers using document indexes with different update schedules, but the design intent is to return the best result, which should always be the same for the same incarnation of the document database.
That wouldn’t work on Facebook. You’d quickly grow tired of receiving the same recommendations over and over, no matter how much you liked them. So Facebook’s servers probably try to mix things up a bit, randomly pulling in suggestions from much farther afield than if they used a purely mathematically optimal set.
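If Facebook does mix things up this way, it might look something like the following sketch, which occasionally swaps a top-ranked suggestion for a random one from further down the list. This is my guess at the general idea, loosely in the spirit of "explore versus exploit" strategies; the function, the `explore_rate` parameter, and its default value are all hypothetical.

```python
import random

def pick_suggestions(ranked, count=3, explore_rate=0.3, rng=random):
    """Take the top `count` candidates, sometimes swapping in a lower-ranked one.

    `ranked` is a list of candidates ordered best first.
    """
    top = list(ranked[:count])
    rest = list(ranked[count:])
    picks = []
    for best in top:
        if rest and rng.random() < explore_rate:
            # Explore: pull in a random candidate from farther afield.
            picks.append(rest.pop(rng.randrange(len(rest))))
        else:
            # Exploit: keep the mathematically best suggestion.
            picks.append(best)
    return picks

ranked = ["Star Trek", "Farscape", "Firefly", "Gardening", "Homeland Security degree", "Monster trucks"]
print(pick_suggestions(ranked))
```

With `explore_rate` at zero you get the same three suggestions every time; the higher it goes, the more often something from the bottom of the list sneaks in.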
It’s possible too that Facebook pays attention to what you click on, and if you ignore its best guesses too often, it demotes those suggestions and lets something else into the recommendations list. If you keep ignoring its suggestions, more and more of the weird low-ranked pages will bubble up and be recommended.
Finally, think about how you respond on those rare occasions that Facebook suggests something you actually like: You look at it, and you click “like”, and it joins the collection of all the other things you already “like”. And now Facebook has no need to ever recommend it to you again. Between the things you “like” when you set up your account, and the things you “like” along the way, after a little while all of the good suggestions get used up, and all that’s left is the weird stuff.
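Here's a small sketch of how those last two effects could combine to leave only the weird stuff. It assumes, purely for illustration, that each ignored suggestion has its score halved and that anything you already "like" is never suggested again. The scores, the penalty, and the `next_suggestions` function are all made up.

```python
def next_suggestions(scores, liked, ignored_counts, count=2, penalty=0.5):
    """Rank pages by relevance, skipping liked pages and demoting ignored ones."""
    effective = {
        page: score * (penalty ** ignored_counts.get(page, 0))
        for page, score in scores.items()
        if page not in liked                      # already liked: never suggest again
    }
    # Highest effective score first; ties broken alphabetically.
    return sorted(effective, key=lambda p: (-effective[p], p))[:count]

# Invented relevance scores.
SCORES = {"Star Trek": 0.9, "Farscape": 0.8, "Weird page A": 0.2, "Weird page B": 0.1}

# Fresh account: the good suggestions come first.
print(next_suggestions(SCORES, liked=set(), ignored_counts={}))

# Later: Star Trek has been liked, and Farscape has been ignored four times.
print(next_suggestions(SCORES, liked={"Star Trek"}, ignored_counts={"Farscape": 4}))
```

Once the best match is used up and the second best has been demoted, the low-scoring pages are all that's left to show.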
To summarize, let me first remind you that this is just guesswork on my part — I don’t know anything definitive about Facebook’s algorithms for recommendations — but here are some of the factors that I think contribute to the screwiness of Facebook’s recommendations:
- Facebook doesn’t understand natural languages, so it doesn’t understand what you and your friends are writing about.
- Keyword-based matching finds text that uses similar words, which may or may not express similar ideas.
- Text search and data mining algorithms are intended to find stuff that is relevant or interesting to you, which may not mean it’s stuff you really “like.”
- Facebook’s recommendations to you are influenced by your friends’ activities and interests.
- Facebook’s recommendations are also influenced by other people who show the same interests as you and your friends.
- [Update: Facebook also learns about you from your activities on affiliated sites.]
- Topics that are very popular can overwhelm the algorithms and show up everywhere.
- Activities related to rare or unusual topics can have the effect of reducing the amount of data available for data mining, which increases the variability and reduces the significance of the results.
- Bottlenecks in the social graph can also reduce the amount of data available for mining, which increases the variability and reduces the significance of the results.
- In the quest for clicks, Facebook intentionally offers you unusual opportunities.
- By ignoring suggestions for things you do like, you may be encouraging Facebook to show you other things.
- You use up all the good stuff early, so the stuff you get later tends to be crap.
As I said at the beginning, most of this is guesswork, but I think that if even half of my guesses are correct, it’s not hard to see why Facebook recommendations are so strange.