Computer Science Building, Room 151
The goal of my work is to detect implicit social ties or closely linked entities within a data set. In data consisting of people (or other entities) and their affiliations or discrete attributes, we identify unusually similar pairs of people, and we pose the question: Can their similarity be explained by chance, or is it due to a direct ("copying") relationship between the people?
This question has applications in social network analysis, for example in fraud detection, where it may be valuable to identify such relationships or other types of coordinated activity. The question also arises in situations where we must decide whether we have observed two distinct people or the same person twice.
The thesis will explore how to assess this question, and in particular how one's judgments and confidence depend not only on the two people in question but also on properties of the entire data set. I will develop methods for solving this problem and experiment with them across multiple synthetic and real-world data sets. My approach requires a model of the copying relationship, a model of independent people, and a method for distinguishing between them. I will focus on two aspects of the problem: (1) choosing background models to fit arbitrary, correlated affiliation data, and (2) understanding how the ability to detect copies is affected by factors like data sparsity and the numbers of people and affiliations, independent of the fit of the models. Finally, I will apply these methods to several real-world data sets and evaluate them using domain-specific information.
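As a rough illustration of the chance-versus-copying question, and not the models developed in the thesis, the sketch below scores one pair of people under a deliberately simple background model. It assumes each person chooses their affiliations uniformly at random and independently of everyone else, so the number of shared affiliations follows a hypergeometric distribution. The function names and the example affiliations are hypothetical. Real affiliation data is typically correlated, which is exactly why this uniform model is only a baseline.

```python
from math import comb


def shared_count(affils_a, affils_b):
    """Number of affiliations the two people have in common."""
    return len(set(affils_a) & set(affils_b))


def overlap_tail_probability(n_affiliations, size_a, size_b, shared):
    """P(overlap >= shared) under an assumed uniform, independent null.

    Person A holds size_a of the n_affiliations possible affiliations;
    person B draws size_b of them uniformly at random without replacement.
    The overlap is then hypergeometric, and we sum its upper tail.
    """
    total = comb(n_affiliations, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, i) * comb(n_affiliations - size_a, size_b - i)
        for i in range(shared, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical data: 50 possible affiliations, two people with 6 each.
    alice = {"a1", "a2", "a3", "a4", "a5", "a6"}
    bob = {"a1", "a2", "a3", "a4", "a9", "a10"}
    k = shared_count(alice, bob)
    p = overlap_tail_probability(50, len(alice), len(bob), k)
    print(f"shared={k}, chance probability of at least this overlap={p:.2e}")
```

A small tail probability says only that the overlap is surprising under this particular null. The same overlap can be unremarkable in a data set with many people or a few very popular affiliations, which is why the judgment depends on the whole data set and not just on the pair.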
Advisor: David Jensen