Limiting large-scale crawls of social networking sites

 
 

What is the motivation?


Large-scale data-aggregating crawlers are a problem both for users of online social networking sites (OSNs) and for the OSNs themselves. OSNs such as Facebook, Twitter, and Orkut hold data about millions of users. They allow users to browse the profiles of other users in the network, making it easy to connect, communicate, and share content. This core functionality, however, can be exploited by crawlers to aggregate data about large numbers of OSN users for republication [1] or for other, more nefarious purposes [2] that violate users’ privacy.


Crawlers present a significant problem not only for OSN users but also for OSN site operators. First, many OSNs view user data as a valuable asset that could be leveraged to generate revenue in the future, for example via targeted advertisements, so they have an incentive to prevent third-party crawlers from accessing it. Second, while OSN operators can ensure that data is used according to the privacy policies specified on their sites, they cannot make any guarantees about how crawlers will use that data. A third party that crawls an OSN can do anything with the data (e.g., republish it or infer private information [2]). Yet if the third-party crawler does something nefarious, the OSN operator is likely to be held responsible, at least in the court of public opinion. For example, Facebook was widely blamed in the popular press for allowing a crawler to gather the public profiles of a large number of users [1].


Aren't social networking sites doing anything about it?


They are, but their techniques have little effect on Sybil crawlers. Today, OSN operators employ various rate-limiting techniques to restrict a crawler’s ability to scrape the network. These techniques typically limit the number of user profiles that a single user account or IP address can view in a given period of time [3]. Unfortunately, such schemes are easily circumvented by a Sybil attack, in which the crawler creates a large number of fake user accounts and/or hires a botnet to gain access to multiple IP addresses.
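The weakness of per-account limits can be illustrated with a minimal sketch. The class and function names, the limit of 100 views, and the one-hour window below are all hypothetical choices made for illustration; they do not describe any real OSN's rate limiter.

```python
from collections import defaultdict, deque


class PerAccountRateLimiter:
    """Hypothetical limiter: at most `limit` profile views per account
    within any `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.views = defaultdict(deque)  # account -> timestamps of granted views

    def allow(self, account, now):
        recent = self.views[account]
        # Forget views that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


def profiles_crawled(limiter, accounts, attempts_per_account, now=0.0):
    """Count the views granted when every account in `accounts` makes
    `attempts_per_account` requests at time `now`."""
    return sum(
        limiter.allow(account, now)
        for account in accounts
        for _ in range(attempts_per_account)
    )


# One account is capped at the limit ...
single = profiles_crawled(PerAccountRateLimiter(100, 3600), ["crawler"], 1000)

# ... but 50 fake accounts multiply the cap by 50.
sybils = [f"sybil{i}" for i in range(50)]
many = profiles_crawled(PerAccountRateLimiter(100, 3600), sybils, 1000)
```

In this sketch the number of profiles a crawler can view grows linearly with the number of accounts it controls, which is exactly what a Sybil attack exploits. The same reasoning applies if the key is an IP address instead of an account.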


What is the solution?


In this work, we propose “Genie”, a system that OSN operators can deploy to limit Sybil crawlers. Genie relies on a key assumption about OSNs, namely, that it is hard for crawlers to establish an arbitrarily large number of links to users in an OSN. Intuitively, this assumption rests on the observation that forming a new link between two users requires a certain amount of familiarity between them. Genie leverages this insight to limit large-scale crawls: it ties a user's ability to crawl the OSN to the number of links she establishes with the rest of the network.
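The intuition of tying crawl ability to links can be sketched as follows. This is not Genie's actual mechanism, which this page does not describe; it is only a simplified illustration under stated assumptions. The graph, the function names, and the `views_per_link` parameter are hypothetical, and the set of crawler accounts is passed in explicitly purely to show the accounting.

```python
# Hypothetical undirected social graph: three honest users and three
# Sybil accounts that are densely linked to each other but have only
# one link (s1 - alice) to the rest of the network.
EXAMPLE_GRAPH = {
    "alice": {"bob", "carol", "s1"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "s1": {"alice", "s2", "s3"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2"},
}


def external_links(graph, accounts):
    """Count links from `accounts` to users outside `accounts`.
    Links among the accounts themselves are ignored."""
    members = set(accounts)
    return sum(
        1
        for account in members
        for neighbor in graph.get(account, ())
        if neighbor not in members
    )


def crawl_budget(graph, accounts, views_per_link):
    """Illustrative budget: profile views allowed to a group of accounts,
    proportional to the group's links to the rest of the network."""
    return views_per_link * external_links(graph, accounts)


sybil_budget = crawl_budget(EXAMPLE_GRAPH, ["s1", "s2", "s3"], views_per_link=10)
alice_budget = crawl_budget(EXAMPLE_GRAPH, ["alice"], views_per_link=10)
```

Under these assumptions the three Sybil accounts share a budget determined by their single link to an honest user, so creating more fake accounts, or more links among them, does not raise it. A well-connected honest user such as `alice` receives a larger budget than the whole Sybil group.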



REFERENCES


[1] http://tcrn.ch/9JvvmU.


[2] http://bit.ly/jlarLI.


[3] T. Stein, E. Chen, and K. Mangla. Facebook Immune System. In SNS’11.



People


        Students               Postdoc                Faculty                Alumni

        Mainack Mondal         Allen Clement          Peter Druschel         Ansley Post
        Bimal Viswanath                               Krishna P. Gummadi     Alan Mislove


Contact:


        Please send your queries, concerns, or suggestions to: mainack {at} mpi-sws.org