Symposium on Pattern Discovery and PhD Defense of Michael Mampaey

You are cordially invited to the public defense of my PhD dissertation entitled "Summarizing Data with Informative Patterns" on Friday the 21st of October 2011. During the day a Symposium on Pattern Discovery will be held. If you wish to attend the defense or the symposium, please complete the registration form.

Programme

  • 10:00h Welcome coffee
  • 10:30h Talk by Toon Calders (Eindhoven University of Technology)
    "Online Discovery of Top-k Similar Motifs in Time Series Data"

    In recent years there has been a growing interest in the study and analysis of so-called data streams. Typical examples of such streams include Internet traffic data and continuous sensor readings. Traditional data mining approaches are not suitable for mining such streams, because they assume static data stored in a database, whereas streams are continuous, high-speed, and unbounded. Therefore, streams must be analyzed as they are produced, and high-quality online results need to be guaranteed. In my presentation I will give a short overview of recent results in stream mining, after which I will concentrate on our own recent work on motif discovery from streams. A motif is a pair of non-overlapping sequences with very similar shapes in a time series. We studied the online top-k most similar motif discovery problem. A special case of this problem, corresponding to k=1, was investigated in the literature by Mueen and Keogh. We generalized the problem to any k and proposed space-efficient algorithms for solving it. I will show that our algorithms are optimal in terms of space. In the particular case when k=1, our algorithms achieve better performance in terms of both space and time consumption than the algorithm of Mueen and Keogh. The results are demonstrated by theoretical analysis and by extensive experiments with both synthetic and real-life data.
  • 11:15h Talk by Arno Siebes (Utrecht University)
    "Association Rules That Compress"

    In the past six years I have been involved in finding interesting sets of item sets. One could argue, however, that item sets are never interesting. Or, slightly more accurately, that association rules are more interesting than item sets.
    Hence, a natural question is: can we extend all our work on MDL-based item set discovery to association rules? In this talk I will discuss ongoing work on this, which is done together with Rene Kersten. The conclusion is: yes we can, but it ain't easy.
  • 12:00h Lunch break (on your own)
    Many places to have lunch are close by in the city centre of Antwerp.
  • 13:30h Talk by Geoff Webb (Monash University)
    "Finding Interesting Itemsets"

    Association discovery is classically cast in terms of discovering association rules, but often the division of associated items into an antecedent and consequent is redundant and serves only to generate many representations of each underlying association. This talk presents my approach to discovery of potentially interesting itemsets and the manner in which it addresses the problem of the extreme numbers of associations generated by most techniques.
  • 14:15h Talk by Floris Geerts (University of Antwerp)
    "Data Quality: Research Opportunities in Data Mining"

    Real-life data are often dirty: inconsistent, inaccurate, incomplete, stale and/or duplicated. The prevalent use of the Internet has been increasing the risks, on an unprecedented scale, of creating and propagating dirty data. This highlights the need for the study of data quality. This talk provides an overview of recent advances in the area of data quality. We present a conditional dependency theory for capturing data inconsistencies, matching dependencies for data deduplication, and currency constraints to identify stale data. Currently, the task of specifying such data quality rules is left to the user or domain expert. In this talk, the need for new data mining techniques to discover such rules in an automated way is advocated.
  • 15:00h Coffee break
  • 16:00h Public PhD defense by Michael Mampaey
    "Summarizing Data with Informative Patterns"

    The increasingly low cost and relative ease of data acquisition due to technological advances have allowed us to create vast repositories of data. The field of data mining is concerned with the discovery of useful and non-trivial information from such databases, in order to gain novel insight from them. Specifically, explorative data mining techniques such as pattern mining form a popular area of research, offering descriptions of data in the form of local patterns. In practice, however, these techniques tend to overwhelm the user with countless patterns that are potentially interesting, but many of which are highly redundant. This gives rise to the need for succinct, informative summarization methods.
    In this dissertation we explore how local patterns can be used to summarize data. That is, we investigate how we can automatically present a concise, yet informative and interpretable piece of information to the user, accurately summarizing a given dataset in terms of its most important patterns. We consider binary and categorical data, and use itemsets and attribute sets as patterns. We propose several methods. First, we find strong dependencies between sets of attributes, using information-theoretic properties to avoid reporting redundant results. A second approach creates high-level overviews of categorical data, by clustering attributes into correlated groups. Each group comes with a description of the corresponding part of the data, thus providing an intuitive grasp of the prevailing structures. Third, we focus on integrating background knowledge into the data mining process in order to avoid redundancy. This is done by constructing a maximum entropy model based on simple statistics, against which itemsets are ranked. Finally, we present a method that builds a probabilistic model of the data with a collection of itemsets, employing the Minimum Description Length principle to find the best summary. Extensive experimentation shows that the proposed methods produce simple yet high-quality, insightful data summaries, and that their construction can be achieved efficiently.
  • 18:00h Reception

Registration

Registration is closed.

Coordinates

Date:
Friday the 21st of October 2011
Time:
10:00h (Symposium)
16:00h (Public PhD defense)
Location:
Promotiezaal, klooster van de Grauwzusters
Stadscampus, Universiteit Antwerpen
Lange Sint-Annastraat 7
2000 Antwerpen



Jury members