Raimon Bosch, Departament de Tecnologies de la Informació i les Comunicacions (DTIC), Pompeu Fabra University
This study analyzes the different state-of-the-art techniques used to generate personalized search results. We focus on how users' interactions in social networks are being used to improve the user experience. We also investigate whether sentiment analysis is starting to be used as a new factor in building personalized search rankings.
Personalization is the next step for search engines seeking to improve quality. If a search engine can analyze your search history, the time you spend on each website, or even your activity in social networks, it is more likely to suggest good documents. As information retrieval has evolved, personalization has had to evolve with it. The architecture of Google in 1998  was very different from today's. Today, Google's search results are intelligent enough to recommend a restaurant your friends have visited, predict the weather, or even finish your queries for you.
It is clear that the search ranking algorithms used today bear little resemblance to the original PageRank formula . Search engine optimization has become a discipline that tries to predict which changes to a website will place it first in the rankings, and the ranking formula now takes hundreds of variables that were not included in the first algorithm. Personalization also brings us into the field of privacy: some users may feel spied upon and start to wonder whether there are search engines that only provide contextual information.
The analysis of the information shared in social networks is crucial for search engines. The old idea of word-of-mouth recommendations can now be accessed easily through the digital footprint we leave on the internet across the social networks we participate in. This information can be used for a personalized search experience, but it can also be used to build recommendations for users who are not our friends, who may not live in our country or be our age, but who share some interests with us.
2.1. Personalization vs. Contextualization.
These are two trends within information retrieval. By personalization we mean any process that uses personal information to alter search results and improve the user's experience. Contextualization is about providing information related to a topic, as close to the user's current context as possible. For example, if you are building a website about video game reviews and you are showing a review of a football game, it is good for the user if you also show other football game reviews, or even some basketball game reviews.
But personalization can be used to produce good contextualizations. How do we know that a recommendation for a basketball game can be useful while the user is reading a football game review? Because we know that a percentage of the users interested in this particular review have also been interested in that basketball game, perhaps because the release dates were similar, because the games were made by the same company, or simply because it is a very good game. Sometimes contextualization is not possible without the data gathered through personalization.
2.2. The social graph
Social networks are an open door to a person's social connections. Through them we can learn who a user's closest friends are, or analyze which people are especially relevant for a given topic. Lately, we have seen data recommended by people in our circles being incorporated into Google's search results. We have also seen the rise of products like Facebook's Timeline, a tool created to track our activity across Facebook applications. A new generation of graph algorithms is arriving to transform social connections into useful information for search engines.
2.3. Social Rank
Social networks are also becoming a collector of valuable information, not only for search personalization but also for improving search contextualization. Facebook's share button is a new indicator for detecting content that is highly relevant to people; consider virality, as when a video is shared by millions of people because it is especially emotional. Similarly, Twitter is used more every day as a public stream for criticizing brands, sometimes encouraged by the anonymity of fake profiles. Using sentiment analysis techniques we can build new reputation algorithms based on real opinions expressed across social networks, yielding search rankings where companies with certain values appear first.
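As a minimal illustration of this idea, the sketch below ranks brands by the average sentiment polarity of their social mentions. The data and brand names are invented, and the polarity scores in [-1, 1] are assumed to come from an upstream sentiment-analysis classifier; this is not a description of any deployed system.

```python
def reputation_rank(mentions_by_brand):
    """Rank brands by average sentiment polarity of their social mentions.

    `mentions_by_brand` maps a brand name to a list of polarity scores
    in [-1, 1], assumed to come from a sentiment classifier.
    """
    def avg(scores):
        return sum(scores) / len(scores) if scores else 0.0
    return sorted(mentions_by_brand,
                  key=lambda b: avg(mentions_by_brand[b]), reverse=True)

# Hypothetical polarity scores extracted from tweets and shares:
mentions = {
    "brand_a": [0.8, 0.6, 0.9],    # mostly praised
    "brand_b": [-0.7, -0.2, 0.1],  # mostly criticized
}
ranking = reputation_rank(mentions)  # ["brand_a", "brand_b"]
```

A real reputation algorithm would also weigh mention volume, recency, and the trustworthiness of each profile (to discount fake accounts).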
2.4. Place-based networks
Some social networks are also a way of obtaining information about places. Restaurants, night clubs, and relevant tourist spots gather information from the people who visit them. Through this information we have a collection of opinions about these places that can be useful when choosing somewhere to have dinner, have a cocktail, or go out at night. This kind of information is also useful for detecting new places that are becoming relevant. If, for instance, a place has been heavily visited lately because it is offering a new menu, this can be a real-time indicator of a change in the place's relevance. Such trends can also be aggregated to build reputation ranks for large consumer chains.
3. State of the art
3.1. Personalization on search engines.
Back in 2006, Steve Wedig and Omid Madani  published a very extensive paper on personalization opportunities in search engines. They looked for patterns in the click history of Yahoo's search engine and discovered that the click rate differs when users face a topic of special interest to them. Through those "expert users", they detected relevant documents for a specific topic and improved the ranking algorithm. In 2008, Shengliang Xu et al.  applied the concept of folksonomy (social and collaborative tagging) to search personalization: by giving users the ability to tag documents with keywords, they built a search engine that outperformed older solutions. In 2009, Mariam Daoud et al.  also outperformed older search solutions by using graph-based query profiles, which proved very useful for detecting correlations between topics. If, for instance, a user who searches for "databases" also searches for "software" during the same session, it is likely that the two topics are related.
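The session-level topic correlation described above can be sketched very simply: count how often two topics co-occur in the same session. The sketch below assumes queries have already been mapped to topic labels; the session data is invented.

```python
from collections import Counter
from itertools import combinations

def topic_cooccurrence(sessions):
    """Count how often each pair of query topics appears in the same session."""
    pair_counts = Counter()
    for session in sessions:
        # deduplicate and sort so each unordered pair is counted once
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical sessions, each already reduced to a list of topic labels:
sessions = [
    ["databases", "software", "sql"],
    ["databases", "software"],
    ["cooking", "recipes"],
]
counts = topic_cooccurrence(sessions)  # counts[("databases", "software")] == 2
```

High co-occurrence counts would then feed the graph-based query profile, with an edge between two topics weighted by how often they appear together.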
Focusing on weblog data, Bennett et al. in  compared how short-term behavior interacts with long-term behavior in search engines, and how the two behaviours can be used in isolation or in combination to optimally improve relevance through search personalization. They showed that the information provided within a session (short-term) is not very useful at first: until a user has refined a query three or four times it does not produce good recommendations, and it is better to recommend from historical data, mainly because first queries tend to be more ambiguous. There are also semantic approaches. Leung et al.  presented a framework that uses ontologies to keep several topics per user, introducing the term Concept User Personalization (CUP). If a user searches for "apple" and usually clicks on documents about fruit rather than the Apple brand, the concept "fruit" becomes part of that user's personalization. To discover these matches between concepts and documents they use click-through data. The ontology obtained from click-throughs maps document preferences into user concept preferences, which are used to train a Ranking SVM (RSVM) to produce a vector for each user. They adopt pointwise mutual information (PMI) to establish semantic similarity between concepts.
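PMI compares how often two concepts are clicked together in a session against how often they would co-occur by chance. The sketch below estimates it from invented click-through sessions; it illustrates the general PMI formula, not the exact estimator used by Leung et al.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(click_sessions):
    """Pointwise mutual information between concepts co-clicked in sessions.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with probabilities
    estimated as session frequencies.
    """
    n = len(click_sessions)
    single, pair = Counter(), Counter()
    for concepts in click_sessions:
        for c in concepts:
            single[c] += 1
        for a, b in combinations(sorted(concepts), 2):
            pair[(a, b)] += 1
    return {
        (a, b): math.log((c_ab / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), c_ab in pair.items()
    }

# Invented click-through data: the set of concepts clicked in one session.
clicks = [
    {"apple", "fruit"},
    {"apple", "fruit"},
    {"apple", "iphone"},
    {"iphone", "computer"},
]
scores = pmi_scores(clicks)
# Here "apple" associates more strongly with "fruit" than with "iphone".
```

Positive PMI indicates two concepts co-occur more than chance would predict; negative PMI indicates they rarely appear together.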
Ben Steichen et al.  focused on comparing Personalized Information Retrieval (PIR) with Adaptive Hypermedia (AH) techniques. PIR typically aims to bias search results towards more personally relevant information by modifying document ranking algorithms; examples include personalised relevance feedback, topic-sensitive PageRank, and standard keyword frequency measures. The study indicates that PIR techniques have so far been more successful because they are based on statistical analysis of historical usage (which is already available) and tend to be more efficient. By contrast, AH has addressed the challenge of biasing content retrieval by adapting to personalisation "dimensions" such as user goals or prior knowledge; examples include link adaptation, concept-level adaptation, and fuzzy ontologies. AH solutions produce less noise because they are built around structured conceptual models rather than search history. Today this is not yet an advantage, because it creates a dependency on manual annotations to refine those conceptual models, but the rise of semantic technologies is making this kind of solution easier to apply.
3.2 Recommendation systems, reputation and trust
Chapter 4.2 of the Corporate Semantic Web Report, 2011  analyzes the techniques used in classical and state-of-the-art user recommendation: collaborative filtering, content-based filtering, and knowledge-based filtering are the highlighted techniques (see Annex A for more information). It also identifies very well the different sources of information needed to build recommendations: interests/preferences, knowledge, background, goals, context, user platform, physical context, and the human dimension. One of the future trends for recommendation remarked on in this report is the Semantic Web, which will provide recommender systems with a more precise understanding of the application domain and a richer representation of user-related information.
Frank E. Walter et al.  propose one of the first graph models to introduce social relationships into content recommendation. They explore the concept of transitivity for trust propagation: if A trusts B and B trusts C, it is likely that A has some affinity with C. The model is based on agents holding lists of document recommendations, so when an agent searches for a new document, it asks neighboring agents for a relevant one.
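The transitivity idea can be sketched as a walk over a directed trust graph, multiplying trust values along each path and keeping the best path. This is a simplified illustration of trust propagation in general, not a reimplementation of Walter et al.'s agent model; the graph and values are invented.

```python
def propagate_trust(direct, source, target, max_hops=3):
    """Best multiplicative trust from source to target via depth-first search.

    `direct` maps each node to a dict of directly trusted neighbors,
    e.g. direct["A"]["B"] = 0.9 means A trusts B with strength 0.9.
    """
    best = direct.get(source, {}).get(target, 0.0)

    def dfs(node, score, hops, visited):
        nonlocal best
        if hops > max_hops:
            return
        for neighbor, trust in direct.get(node, {}).items():
            if neighbor in visited:
                continue
            if neighbor == target:
                best = max(best, score * trust)
            else:
                dfs(neighbor, score * trust, hops + 1, visited | {neighbor})

    dfs(source, 1.0, 1, {source})
    return best

# A trusts B (0.9) and B trusts C (0.8), so A's inferred trust in C is 0.72.
direct = {"A": {"B": 0.9}, "B": {"C": 0.8}}
inferred = propagate_trust(direct, "A", "C")
```

Multiplying along the path makes inferred trust decay with distance, which matches the intuition that a friend of a friend deserves less confidence than a direct friend.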
In chapter 7 of , Linyuan Lü et al. discuss social filtering and how this variable is introduced into similarity-based recommendation algorithms. The concept of trust is also analyzed as a way to control user behaviour, so that bad behaviour makes a user less important (for instance, when you get bad reviews on eBay after selling a product). Trust can also be used to detect noisy ratings by comparing them with data from a set of implicitly trusted users only. The chapter also shows some models based on rumour dynamics, where a user's approval or disapproval of an item is shown to all of that user's followers.
3.3. Social signals on search engines.
Bao et al. in 2007  wrote one of the first papers exploring the idea of a social PageRank. They contrasted social annotation with web-creator annotation (traditional links), showing how the text a user provides when sharing a document in a social network can be used to determine topic similarity during search. They also observed that some documents with high social PageRank have no PageRank at all, and vice versa, so this can be a useful way to detect new documents that would otherwise not be included in a search index. Finally, they ran an experiment that included this social rank information and outperformed a solution based on traditional PageRank alone.
Kang-Pyo Lee et al.  discuss the idea of a social inverted index based on social tagging. By social tags we mean the words users select when sharing a document in a social network; examples of tag-based web search engines are Delicious, Flickr, and BibSonomy. The study compares classical search engines with this new approach and presents a solution for building this kind of index. The approach fully supports the social dimension of social tagging by adding a user sublist to each resource. Although this information is expensive in space, it facilitates dynamic calculation of resource relevance.
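The core structure can be sketched as a two-level index: each tag maps to the resources carrying it, and each resource keeps the sublist of users who applied the tag. This is a minimal illustration of the idea, with invented data, not the paper's actual index layout.

```python
from collections import defaultdict

class SocialInvertedIndex:
    """tag -> resource -> set of users who applied that tag to the resource."""

    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(set))

    def add(self, user, resource, tags):
        for tag in tags:
            self.index[tag][resource].add(user)

    def search(self, tag):
        """Rank resources for a tag by the number of distinct taggers."""
        postings = self.index.get(tag, {})
        return sorted(postings, key=lambda r: len(postings[r]), reverse=True)

idx = SocialInvertedIndex()
idx.add("alice", "doc1", ["python", "tutorial"])
idx.add("bob", "doc1", ["python"])
idx.add("carol", "doc2", ["python"])
results = idx.search("python")  # doc1 first: tagged by two distinct users
```

Keeping the user sublist is what makes the index "social": relevance can be recomputed per query, for example by counting only taggers inside the searcher's own circle.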
In , Muralidharan et al. used eye tracking to study how users interact with social annotations. They saw that the reading pattern was not very different: users mainly still read the title and the URL to decide whether a document is relevant. This could be due to inattentional blindness: a user facing so many stimuli is unable to perceive new ones. Furthermore, they noted that the field is not sufficiently developed, so most annotations were not useful because they came from strangers or unfamiliar friends with uncertain expertise. To make social annotations successful, it is crucial to work on algorithms that decide which friends are relevant for a specific topic. Also analyzing social annotations, Patrick Pantel et al.  showed that close social friends or topic experts provide the most utility, while distant connections provide less (by a factor of over 50%). They applied a binary supervised learning algorithm to determine whether a social annotation is useful for a given query. To build this model they used social-connection aspects like circle, affinity, topic expertise, geographical distance, and interest valence, mixed with classical query and content aspects (see Annex B for more information).
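A toy version of such a usefulness predictor can be written as a linear score over the social-connection aspects. The weights below are hand-picked for illustration only; Pantel et al. learn theirs with a supervised classifier, and the feature names here are our own simplification of the aspects listed in Annex B.

```python
def annotation_utility(features, threshold=0.5):
    """Predict whether a social annotation is useful for a query.

    `features` holds social-connection aspects scaled to [0, 1].
    The weights are illustrative, not learned from data.
    """
    weights = {"affinity": 0.4, "expertise": 0.4,
               "same_circle": 0.1, "geo_proximity": 0.1}
    score = sum(w * features.get(k, 0.0) for k, w in weights.items())
    return score >= threshold

# Annotation from a close friend who is an expert on the topic:
useful = annotation_utility({"affinity": 0.9, "expertise": 0.8,
                             "same_circle": 1.0})
# Annotation from a distant stranger with no expertise:
noisy = annotation_utility({"affinity": 0.1, "expertise": 0.2})
```

Note how affinity and expertise dominate the score, matching the finding that close friends and topic experts provide most of the utility.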
As we have seen in the state-of-the-art chapter, one of the most important things in personalization is gathering good data. Researchers have taken data from many sources to study personalization. The cheapest and most effective source is weblog data, because from it we can access click-through data and detect patterns in user behaviour. More ambitious approaches like  have taken information from mouse scroll velocity to detect patterns and used them to recommend documents in subsequent searches. We can also use eye tracking, especially to analyze the effectiveness of a search interface. Another good source of data is social connections, but we have to be careful not to introduce noise: only social connections that provide some expertise on a topic, or that are very close to the user, should be taken. The results presented in  and  show that selecting the right social connections is the key factor for improving user engagement.
Once all this data is collected, the big questions arrive: how do we build a model that captures this information, and, more importantly, how do we connect this new information to search rankings? The probability of ending up with noise, or with worse rankings, is high if we do not build a good model. This leads us to another hot topic in search personalization and social recommendation in general: semantic technologies. We have seen in  how ontologies can be built using knowledge models. By matching graph-based query profiles with topic ontologies we can provide user recommendations for a specific topic. As semantic technologies evolve, these knowledge models will grow, cover more topics, and offer better search results.
In recent years another type of search engine has emerged, especially for searching images or link bookmarks: tag-based web search. In these systems users choose a set of words every time they add a document, so folksonomy concepts can be applied. This new model of search engine has been a perfect field for experimenting with social and semantic graph algorithms. The strong relationship between users, words, and documents requires new kinds of indexes  and rankings computed with different algorithms .
The field of document recommendation converges with personalization, which is why we have included some state-of-the-art solutions in this survey. Graph closeness between two users can be a factor in offering recommendations, although some privacy issues must be considered (as Machanavajjhala et al. showed in ): if a user has only one friend, you cannot offer that friend's entire search history as recommendations. Some researchers have investigated word-of-mouth dynamics in a network (also called rumours in social groups) to build these models .
Search personalization is a wide research field where multiple techniques and ideas have been tested. If you are thinking of adding personalization to your search engine, the cheapest and easiest approach is to use historical weblog data and extract information from it. From this data we can discover topics of special interest for each user, documents relevant to a user (and, by transitivity, to a topic), or more generic things like query intent and user goals.
Using social data is a harder approach: first, because building a model that provides accurate recommendations needs data that is not in the weblog; and second, because you might fail by relying on a user's friends who are not actually close to them. We have to remember that the print we leave in social networks is just that: a print. Our real friends, or the real experts we rely on, will often not appear in this print (especially if your application does not have full access to it). On the other hand, including some relevant social data in a search engine can be a very interesting way to improve engagement. We have seen in  that some documents marked as relevant in social networks are not relevant in search engines, so adding some kind of boost for those documents is a reliable step to take.
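One simple way to apply such a boost, sketched below with an invented formula: multiply the base relevance score by a factor that grows with the logarithm of the share count and is capped, so virality can lift a document without letting it drown out textual relevance. This is an illustration of the general idea, not a formula from any of the cited papers.

```python
import math

def boosted_score(base_score, shares, boost=0.1, cap=2.0):
    """Multiply a base relevance score by a capped social-share factor.

    log1p dampens huge share counts; `cap` bounds the total boost so
    social signals cannot dominate textual relevance entirely.
    """
    factor = min(1.0 + boost * math.log1p(shares), cap)
    return base_score * factor

plain = boosted_score(1.0, 0)       # no shares: score unchanged
viral = boosted_score(1.0, 10**9)   # heavily shared: boost hits the cap
```

The log-plus-cap shape also limits the damage from inflated share counts produced by fake profiles.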
The area of sentiment analysis for search personalization has not been much exploited yet. Social data is not always positive: we may speak negatively about a brand or a specific product on our Facebook wall, and this is not considered in rankings. Furthermore, I think sentiment analysis is interesting enough to use on traditional documents as well, for example by looking at comments. From comments (once spammy messages are cleaned out) we can collect relevant information about the sentiment and topic orientation of a document, as we have seen in , where P. A. Gloor et al. developed a model able to detect trends and demonstrated that stock-market fluctuations were correlated with user comments on Yahoo! Finance. I also think that the immediate future of this research field lies in recommendations based on social data. The rumour dynamics of a network  and the idea of an agent-based model where your friends provide you with relevant documents  are very interesting approaches. I believe that using Twitter data will be the quickest path to seeing this kind of network dynamics in action.
Annex A. Recommendation techniques:
A.1. Collaborative filtering.
Collaborative filtering is usually based on a collection of user ratings. The idea is to use past ratings from expert users to generate recommendations for new users, typically measuring similarity between users as the cosine of the angle between their rating vectors. This kind of recommendation has a weak point with new objects that have not been rated yet. Also, a user with unusual tastes can easily introduce spam into the system.
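A minimal sketch of user-based collaborative filtering with cosine similarity, using invented ratings: find the user most similar to the target and recommend the items that user rated which the target has not seen.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    du = math.sqrt(sum(x * x for x in u.values()))
    dv = math.sqrt(sum(x * x for x in v.values()))
    return num / (du * dv) if du and dv else 0.0

def recommend(target, ratings, k=1):
    """Recommend items rated by the k most similar users but unseen by target."""
    sims = sorted(((cosine(ratings[target], r), u)
                   for u, r in ratings.items() if u != target), reverse=True)
    recs = []
    for _, u in sims[:k]:
        recs += [i for i in ratings[u] if i not in ratings[target]]
    return recs

ratings = {
    "alice": {"film1": 5, "film2": 4},
    "bob":   {"film1": 5, "film2": 5, "film3": 4},
    "carol": {"film4": 2},
}
recs = recommend("alice", ratings)  # bob is most similar, so film3
```

The new-item weakness is visible here: film4 can never be recommended to alice unless someone similar to her rates it first.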
A.2. Content-based filtering.
Content-based filtering uses the words inside documents to generate item vectors, and in the same way uses features from the documents a user has visited to generate the user's vector. This erases the weak point with new documents, since you can directly produce "more like this" recommendations from the content of the item you are showing to the user. This kind of solution can lead to the problem of overspecialization, where you always get the same recommendations for the same documents; this can be mitigated by adding some randomness.
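A minimal "more like this" sketch, with an invented toy corpus: build bag-of-words term vectors and rank documents by cosine similarity to the current item. A real system would use TF-IDF weighting and stemming, omitted here for brevity.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector for a text."""
    return Counter(text.lower().split())

def cosine(u, v):
    common = set(u) & set(v)
    num = sum(u[t] * v[t] for t in common)
    du = math.sqrt(sum(c * c for c in u.values()))
    dv = math.sqrt(sum(c * c for c in v.values()))
    return num / (du * dv) if du and dv else 0.0

def more_like_this(doc_text, corpus):
    """Rank corpus doc ids by term-vector similarity to `doc_text`."""
    q = term_vector(doc_text)
    return sorted(corpus, key=lambda d: cosine(q, term_vector(corpus[d])),
                  reverse=True)

corpus = {
    "d1": "football match review goals",
    "d2": "basketball game review points",
    "d3": "cooking pasta recipe",
}
ranked = more_like_this("football review", corpus)  # d1 first
```

Because similarity depends only on item content, brand-new documents are recommendable immediately, which is exactly the weak point of collaborative filtering that this approach erases.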
A.3. Knowledge-based filtering.
Knowledge-based recommenders are not based only on item information; they also take personal information from the user to build a complex model. The main challenge of these systems is defining and maintaining the knowledge base that will be matched against the personal information gathered or obtained from the user.
Annex B. Social connection aspects:
B.1. Circle.
We talk about a circle when a group of people share something in common: two people in the same workplace or the same college are in the same circle.
B.2. Affinity.
Affinity refers to the degree of closeness between two members of a network. It can be approached from a binary perspective, where a friend is either close or distant, or we can talk about topic affinity: a friend may be close when the topic is rock music but not when we talk about cars.
B.3. Topic expertise.
A member of a social network can be an expert in some topic. For instance, if you have a friend who posts articles about politics all day, he is an expert in this topic, and it might be interesting to pick this user when building recommendations about it.
B.4. Geographical distance.
Obviously, when two people live in the same city it is more likely that they share interests, especially when they share information about places.
B.5. Interest valence.
This is what Facebook calls share and like, and what Twitter calls favorite and retweet. Each operation has different implications and must be considered when building a model.
B.6. Content Aspects (non-social).
The concept of a review is an example of a content aspect: if a user rewards an article with 5 stars, he is somehow marking the document as relevant. Other non-social aspects might be considered as well, such as word frequency, keyword density, etc.
B.7. Query Aspects (non-social).
A query has an intention; when a query is merely informational, social information is not that useful. Determining the topic of a query is also useful: when the topic changes, the query's rules change.
 B. Steichen, H. Ashman, and V. Wade, “A comparative survey of Personalised Information Retrieval and Adaptive Hypermedia techniques,” Information Processing & Management, 2012.
 K. W.-T. Leung, D. L. Lee, W. Ng, and H. Y. Fung, “A framework for personalizing web search with concept-based user profiles,” ACM Transactions on Internet Technology (TOIT), vol. 11, no. 4, p. 17, 2012.
 S. Wedig and O. Madani, “A large-scale analysis of query logs for assessing personalization opportunities,” in Conference on Knowledge Discovery in Data: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, vol. 20, pp. 742–747.
 F. E. Walter, S. Battiston, and F. Schweitzer, “A model of a trust-based recommendation system on a social network,” Autonomous Agents and Multi-Agent Systems, vol. 16, no. 1, pp. 57–74, 2008.
 M. Daoud, L. Tamine-Lechani, M. Boughanem, and B. Chebaro, “A session based personalized search using an ontological user profile,” in Proceedings of the 2009 ACM symposium on Applied Computing, 2009, pp. 1732–1736.
 K.-P. Lee, H.-G. Kim, and H.-J. Kim, “A social inverted index for social-tagging-based information retrieval,” Journal of Information Science, vol. 38, no. 4, pp. 313–332, 2012.
 A. Paschke, G. Coskun, R. Heese, R. Oldakowski, M. Rothe, R. Schäfermeier, O. Streibel, K. Teymourian, and A. Todor, “Corporate Semantic Web Report IV,” 2011.
 D. Barbagallo, C. Cappiello, C. Francalanci, and M. Matera, “Enhancing the Selection of Web Sources: A Reputation Based Approach,” Enterprise Information Systems, pp. 464–476, 2011.
 S. Xu, S. Bao, B. Fei, Z. Su, and Y. Yu, “Exploring folksonomy for personalized search,” in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008, pp. 155–162.
 H. R. Rezaei, M. N. Dehkordi, and R. A. Moghadam, “Improving Performance of Search Engines Based on Fuzzy Classification,” Indian Journal of Science and Technology, vol. 5, no. 11, 2012.
 P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui, “Modeling the impact of short-and long-term behavior on search personalization,” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, 2012, pp. 185–194.
 S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su, “Optimizing web search using social annotations,” in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 501–510.
 A. Machanavajjhala, A. Korolova, and A. D. Sarma, “Personalized social recommendations: accurate or private,” Proceedings of the VLDB Endowment, vol. 4, no. 7, pp. 440–450, 2011.
 L. Lü, M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou, “Recommender systems,” Physics Reports, 2012.
 A. Muralidharan, Z. Gyongyi, and E. Chi, “Social annotations in web search,” in Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems, 2012, pp. 1085–1094.
 P. Pantel, M. Gamon, O. Alonso, and K. Haas, “Social annotations: utility and prediction modeling,” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, 2012, pp. 285–294.
 S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer networks and ISDN systems, vol. 30, no. 1, pp. 107–117, 1998.
 L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: bringing order to the web.,” 1999.
 K. Thirunarayan and P. Anantharam, “Trust networks: Interpersonal, sensor, and social,” in Collaboration Technologies and Systems (CTS), 2011 International Conference on, 2011, pp. 13–21.
 P. A. Gloor, J. Krauss, S. Nann, K. Fischbach, and D. Schoder, “Web science 2.0: Identifying trends through semantic social network analysis,” in Computational Science and Engineering, 2009. CSE’09. International Conference on, 2009, vol. 4, pp. 215–222.