Prof. Dr. Martin Theobald


Biographical Sketch

In 2012, I joined the ADReM research group at the University of Antwerp as an Associate Professor, where I teach courses on Databases and Information Retrieval. Between 2008 and 2012, I worked as a Senior Researcher at the Max Planck Institute for Informatics in Saarbrücken, with which I still collaborate closely. Before that, I spent two years as a postdoc at the Stanford InfoLab, where I worked on the Trio probabilistic database system as well as on the BioAct and WebBase projects. I received my doctoral degree from Saarland University in 2006 for my work on the TopX search engine for the ranked retrieval of XML data. I'm currently an Area Editor for Elsevier's Information Systems, and I have served on the program committees of, and as a reviewer for, numerous international journals, conferences and workshops, including TODS, TKDE, VLDB-J, PVLDB, SIGMOD, SIGIR, ICDE, WSDM and WWW. And, by the way, I like cycling.

Research Interests

  1. Probabilistic and Temporal Databases

  2. Information Retrieval and Ranking in Databases

  3. Scalable RDF Data Management

  4. Information Extraction


PhD Students

  1. Hernan Blanco (Distributed Probabilistic Databases)

  2. Maximilian Dylla (Probabilistic & Temporal Databases, Top-k, Learning)

  3. Sairam Gurajada (Distributed RDF Indexing & Querying)

  4. Dat Ba Nguyen (On-the-Fly Named Entity Disambiguation)


Recent selected papers:

  1. TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing. SIGMOD 2014

  2. A Temporal-Probabilistic Database Model for Information Extraction. PVLDB (6) 2014

  3. Querying and Learning in Probabilistic Databases. Reasoning Web 2014

  4. Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013

See also:

  1. MPII Database

  2. Google Scholar

  3. DBLP Trier

Tutorials & Summer School Lectures

  1. "Querying and Learning in Probabilistic Databases" at the Reasoning Web 2014 Summer School, Athens, September 2014

  2. "Scalable RDF Data Management" at the 2nd Workshop on Open Data (WOD), LIP6, Paris, May 2013

  3. "Database Foundations for Scalable RDF Processing" at the Reasoning Web 2011 Summer School, Galway, August 2011

  4. "Semantic Knowledge Bases from Web Sources" at IJCAI 2011, Barcelona, July 2011

  5. "Harvesting Knowledge from Web Data and Text" at CIKM 2010, Toronto, October 2010

  6. "From Information to Knowledge: Harvesting Entities and Relationships from Web Sources" at PODS 2010, Indianapolis, June 2010


Awards

  1. Google Focused Research Award, December 2010

  2. ACM SIGMOD Dissertation Award Honorable Mention '06, SIGMOD Conference, Beijing, June 2007

  3. GI-DBIS Dissertation Award '06/'07, BTW Conference, Aachen, March 2007

  4. Otto Hahn Medal of the Max Planck Society '06, Annual Gathering of the Max Planck Society, Kiel, June 2007

Talks & Posters

  1. 10 Years of Probabilistic Querying - What Next? Keynote at ADBIS 2013, Genoa, September 2013

  2. Scalable RDF Data Management & SPARQL Query Processing, Open-Data-Days Tutorial at LIP6, Paris, May 2013

  3. URDF: Query-Time Reasoning in Uncertain RDF Knowledge Bases, VLDS 2012, Istanbul, August 2012

  4. D2R2: Disk-Oriented Deductive Reasoning in a (RISC-style) RDF Engine, RuleML 2011, Fort Lauderdale, November 2011

  5. Yago-QA: Answering Questions by Structured Knowledge Queries, ICSC 2011, Stanford, September 2011

  6. Interactive Reasoning in Large and Uncertain RDF Knowledge Bases, Free University of Bozen-Bolzano, December 2010

  7. LIVE - A lineage-supported, versioned DBMS, SSDBM 2010, Heidelberg, June 2010

  8. From Information to Knowledge - Harvesting Entities and Relationships From Web Sources, PODS Tutorial, Indianapolis, June 2010

  9. TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency Tracks, INEX Workshop, Brisbane, December 2009

  10. TopX 2.0 at the INEX 2008 Efficiency Track, INEX Workshop, Schloss Dagstuhl, December 2008

  11. Overview of the INEX 2008 Efficiency Track, INEX Workshop, Schloss Dagstuhl, December 2008

  12. SpotSigs - Robust and Efficient Near Duplicate Detection in Large Web Collections, SIGIR, Singapore, July 2008

  13. TopX 2.0, Dagstuhl Seminar on Ranked XML Retrieval, Schloss Dagstuhl, March 2008

  14. Trio - A System for Integrated Management of Data, Lineage, and Uncertainty, USI Lugano, March 2008

  15. TopX (basically a compilation of previous talks with lots of animations, given at various occasions)

  16. TopX - Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data, BTW, Aachen, March 2007

  17. An Efficient and Versatile Query Engine for TopX Search, VLDB, Trondheim, September 2005

  18. Efficient and Self-Tuning Incremental Query Expansion for Top-k Query Processing, SIGIR, Salvador de Bahia, August 2005

  19. Efficient Top-k Query Processing for Text, Semistructured, and Structured Data, MPI-IMPRS, May 2005

  20. Probabilistic Top-k Query Processing [poster], MPI-IMPRS, September 2004

  21. Top-k Query Processing with Probabilistic Guarantees, VLDB, Toronto, September 2004

  22. BINGO! and Daffodil: Personalized Exploration of Digital Libraries and Web Sources, RIAO, Avignon, April 2004

  23. Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data, WebDB 2003, San Diego, June 2003

Projects & Open-Source Releases

  1. AIDA-light
    High-Throughput Named Entity Disambiguation.

  2. URDF
    A framework of tools and techniques for scalable inference in uncertain RDF knowledge bases.

  3. INEX
    INitiative for the Evaluation of XML Retrieval (see Linked Data and Efficiency tracks).

  4. TopX 2.0
    The current version of TopX with customized index structures and a C++ query processor.

  6. Stanford Trio Project
    A system for the integrated management of data, uncertainty, and lineage.

  7. SpotSigs
    Java source code accompanying our 2008 SIGIR paper "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections". SpotSigs needs the Java Colt package for some basic hashing operations. The SpotSigs package also contains efficient Java implementations of Locality Sensitive Hashing (LSH, incl. Min-Hashing) and the I-MATCH algorithm for near-duplicate document detection. Our Gold Set contains the 2,160 manually selected near-duplicate news articles we used as the reference set in the paper.

  8. JNI_SVM-light-6.01-64bit
    Java Native Interface (JNI) for Thorsten Joachims' original SVM-light v6.01 in a compact Java API. The precompiled libraries support both 32-bit and 64-bit Windows and Linux environments. The interface supports the full functionality of SVM-light, such as classification, regression, and full Java-side parameterization. All sources can easily be recompiled for more eccentric operating systems. See also the JavaDoc and test class for more details.

  9. TopX
    Efficient and Versatile Top-k Query Processing for Text, Semistructured, and Structured Data (original Java version).

  10. BINGO! Focused Crawler
    Bookmark-Induced Gathering of Information with Adaptive Classification into Personalized Ontologies.

photo courtesy of Hector Garcia-Molina