Can we learn to locate objects in images only from the list of objects those images contain? Or the sentiment of a phrase in a review from the overall score? Can we tell who voted for Obama in 2012? Or which population strata are more likely to be infected by Ebola, looking only at geographical incidence and census data? Are large corporations able to infer sensitive traits of their customers, such as sexual preferences, unemployment or ethnicity, based only on state-level statistics?
Conversely, how can we publicly release data containing personal information to the research community while guaranteeing that individuals’ sensitive information will not be compromised? How realistic is the idea of outsourcing machine-learning tasks by sharing not the datasets themselves, but only a few statistics sufficient for training?
Despite their diversity, solutions to those problems can be surprisingly alike, as they all play with the same elements: variables without a clear one-to-one mapping, and models and statistics sufficient to recover the relevant variables, which one side searches for and the other protects against.
Aggregate statistics and obfuscated data are abundant, as they are released much more frequently than plain individual-level information; the latter is often too sensitive, because of privacy constraints or business value, or too expensive to collect. Learning in those scenarios has been formalized, for example, as multiple instance learning [1,2,3], learning from label proportions [4,5,6,7,8,9], and learning from noisy labels [10,11]. It is common in a variety of application fields, such as computer vision [13,14], sentiment analysis [15] and bioinformatics [1], whenever labels for single image patches, sentences or proteins are unknown, while higher-level supervision is available.
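As a concrete illustration of one of these settings, the sketch below fits a logistic model from label proportions alone: instances are grouped into bags, and the only supervision is the fraction of positives per bag. It is a minimal proportion-matching toy on synthetic data, assuming nothing beyond numpy, and is not a particular algorithm from the references above.

```python
import numpy as np

# Minimal sketch of learning from label proportions (LLP): instances come
# in bags, and the only supervision is the fraction of positive labels per
# bag. A logistic model is fit by matching predicted bag-level proportions
# to the observed ones. Illustrative toy, not a method from the literature.

rng = np.random.default_rng(0)

n_bags, bag_size, n_features = 50, 40, 5
true_w = rng.normal(size=n_features)
X = rng.normal(size=(n_bags, bag_size, n_features))
y = (X @ true_w + 0.5 * rng.normal(size=(n_bags, bag_size))) > 0
proportions = y.mean(axis=1)                    # bag-level supervision only

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(n_features)
lr, n_iter = 10.0, 2000
for _ in range(n_iter):
    p = sigmoid(X @ w)                          # instance-level predictions
    residual = p.mean(axis=1) - proportions     # proportion mismatch per bag
    # Gradient of the squared proportion-matching loss with respect to w.
    grad = np.einsum('b,bif->f', residual,
                     (p * (1.0 - p))[..., None] * X) / (n_bags * bag_size)
    w -= lr * grad

# Instance-level accuracy, although no instance labels were used for training.
accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"instance-level accuracy: {accuracy:.2f}")
```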
The setting is not limited to machine learning, though. The problem of inferring individual-level behavior from aggregate-level information is routinely faced in the natural, social and medical disciplines, where statistical solutions to what are commonly referred to as aggregation bias [16] or the ecological fallacy [17] are important tools for quantitative reasoning. Techniques for making valid inferences from aggregated data include ecological inference in political science [18,19,20], econometrics [21] and epidemiology [22], and methods for the modifiable areal unit problem in spatial statistics [23].
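To make the aggregation problem concrete, a common starting point in ecological inference is the accounting identity for a two-group, two-outcome table; the notation here is a standard illustration rather than a formula taken from the cited works. For each district $i$,

$$
T_i = \beta_i^{b}\, X_i + \beta_i^{w}\,(1 - X_i),
$$

where $T_i$ is the observed aggregate outcome rate (for instance, the fraction voting for a candidate), $X_i$ is the observed fraction of one group, and $\beta_i^{b}$, $\beta_i^{w}$ are the unobserved group-specific rates to be inferred. Each observed pair $(X_i, T_i)$ constrains $(\beta_i^{b}, \beta_i^{w})$ only to a line, so deterministic bounds, regression assumptions or hierarchical models are needed to identify individual-level behavior from the aggregates.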
But as those approaches prove effective in practice, to the point that the available statistics reveal sensitive attributes with high accuracy, the question is turned around into a search for privacy guarantees. Traditional statistics has studied the problem of confidential data release [24]. Research on k-anonymity, l-diversity [25,26,27,28] and, more recently, differential privacy [29,30,31,32,33] has proposed procedures to mask data in ways that trade off protection against usefulness for statistical analysis.
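On the protection side, the simplest building block of differential privacy makes this trade-off tangible: a query answer is released with noise calibrated to the query's sensitivity and to a privacy budget epsilon. The sketch below applies the standard Laplace mechanism to a counting query; the function name and the synthetic data are illustrative choices, not part of any specific system from the references.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(1)

# Example: privately release a count of records with a sensitive 0/1 attribute.
# Adding or removing one individual changes the count by at most 1, so the
# sensitivity of a counting query is 1.
data = rng.integers(0, 2, size=1000)
true_count = int(data.sum())

for epsilon in (0.1, 1.0, 10.0):
    released = laplace_mechanism(true_count, sensitivity=1.0,
                                 epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon:>4}: true count={true_count}, released={released:.1f}")
```

Smaller values of epsilon add more noise, making the released count better protected but less useful, which is exactly the trade-off described above.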
Read the CFP.