Workshop on Big Graph Analysis Systems – Københavns Universitet

Videresend til en ven Resize Print kalender-ikon Bookmark and Share

Datalogisk Institut, DIKU > Begivenhedsmappen > Begivenheder 2017 > Workshop on Big Graph ...

There is an increasing amount of data that takes the form of complex graphs in various applications, such as social network, linked data, telecommunication networks, chemistry, life science, etc. The analysis of such data should not only focus on the attributes attached to the nodes or edges, but also on the way how the nodes are interconnected.

To address the challenges of the analysis of big graph data and develop supporting system technologies for data scientists, it requires joint efforts from various research areas, including but not limited to graph data management, linked data management, datalog, high-performance computing, distributed/parallel systems, etc.

The purpose of this workshop is to get together experts from several relevant communities that are actively addressing the problems related to big graph data analysis to identify commonalities and synergies and to stimulate cross-collaboration that can be expected but yet to be explored.

The workshop will host invited talks and will feature plenty of free time for discussions.

Organizers

  • Yongluan Zhou, University of Copenhagen
  • Amol Deshpande, University of Maryland at College Park
  • Marcos António Vaz Salles, University of Copenhagen

Workshop Program (Tentative)

21 August

09.00 - 09.15 Welcome with a short introduction to the university and the department
09.15 - 10.00 Yannis Papakonstantinou, University of California San Diego

The GQ-Fast accelerator for queries over typed graphs


Abstract:


The GQ-Fast system processes (potentially extended) SQL queries over relational data. It is specialized for the case where the relational data capture an encoding of an annotated graph and the queries correspond to graph analytics. In the current version, these queries involve aggregation, join, semijoin, intersection and selection thus presenting a wide superset of fixed-length graph reachability queries and of tree pattern queries. Extensions to variable-length reachability are in process.

We present challenging real-world interactive (OLAP) graph analytics scenarios, and show that row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. GQ-Fast's internal structure corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (a) aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access.
We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2–4 orders of magnitude and MonetDB, Vertica and Neo4j by 1–3 orders of magnitude when all of them are running on RAM. We discuss the origins of these performance benefits.

Finally, we present an under-development extension where the GQ-Fast memory-resident database/processor is in charge of the computationally demanding part of processing graph analytics, while a polystore accesses multiple sources, mirrors in GQ-Fast the graph structure and decomposes graph analytics queries into parts executed at GQ-Fast and parts executed at the original sources. 


Speaker:


Yannis Papakonstantinou is a Professor of Computer Science and Engineering at the University of California, San Diego. His research is in the intersection of data management technologies and the web, where he has published over one hundred research articles that have received more than 14,000 citations, according to Google Scholar. A common theme of his research is the extension of database platforms and query processors beyond centralized relational databases and into semistructured & graph databases, integrated views of distributed databases and web services, textual data and queries involving keyword search, and most recently spatiotemporal sensor data. He has given multiple tutorials and invited talks, has served on journal editorial boards and has chaired and participated in program committees for many international conferences and workshops. He is a co-director and teaches for UCSD's Master of Advanced Studies in Data Science.

Yannis enjoys to commercialize his research and to inform his research accordingly. He was the CEO and Chief Scientist of Enosys Software, which built and commercialized an early Enterprise Information Integration platform for structured and semistructured data. The Enosys Software was OEM'd and sold under the BEA Liquid Data and BEA Aqualogic brand names, eventually acquired in 2003 by BEA Systems. He has also consulted for Amazon Web Services and multiple startups.

Yannis holds a Diploma of Electrical Engineering from the National Technical University of Athens, MS and Ph.D. in Computer Science from Stanford University (1997) and an NSF CAREER award for his work on data integration.

10.00 - 10.45 Amol Deshpande, University of Maryland

GraphGen: Adaptive Graph Extraction and Analytics over Relational Databases


Abstract:


Graph querying and analytics are becoming an increasingly important component of the arsenal of tools for extracting different kinds of insights from data. However, graphs are not the primary representation choice for most data today, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. This cumbersome and costly process not only raises barriers in using graph analytics, but also makes it hard to explore and identify hidden or implicit graphs in the data.

In this talk, I will present our ongoing work on an end-to-end graph analysis framework, called GraphGen, that sits atop an RDBMS and enables users to declaratively specify graph extraction tasks, visually explore the extracted graphs, and write and execute graph algorithms over them, either directly or using existing graph libraries like the widely used NetworkX Python library. GraphGen has a fundamentally different goal from recent work on using relational databases to store graph data through "shredding". Instead, GraphGen is intended to analyze graphs that are present in existing relational databases. GraphGen attempts to utilize the underlying relational database to the full extent possible by pushing down computation, uses a novel condensed representation to handle graphs that may be too large to extract in their entirety, allows writing programs using a general subgraph-centric API, and features several optimizations for efficient extraction and querying of large graphs.


Speaker:


Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF Career award, and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences.

10.45 - 11.15 Coffee break
11.15 - 12.00 Dan Olteanu, University of Oxford

TBA

Abstract:


Speaker:


Dan Olteanu is a Computer Science professor at Oxford and a computer scientist at LogicBlox. He has also taught at the universities of California Berkeley, Munich, Saarland, and Heidelberg. He received his PhD in Computer Science from University of Munich in 2005. His research interests are in databases and adjacent areas. Dan contributed to XML query processing, incomplete information and probabilistic databases, and more recently to factorized databases, in-database analytics, and the LogicBlox commercial system. He co-authored the book "Probabilistic Databases" (2011). He has served as associate editor for PVLDB'13 and IEEE TKDE, track chair for ICDE'15, group leader for SIGMOD'15, and vice chair for SIGMOD'17. His current research is supported by an ERC consolidator grant and awards from Google, LogicBlox, and Ordnance Survey.

12.00 - 12.15 Pimin Konstantin Kefaloukos (Industrial Talk)

Reverse engineering a social network from web logs


Abstract:


Social networks contain a lot of information that is valuable for advertisers, such as the products purchased, interests, family relations and socio-demographic brackets of the network users. But what do you do if you don't own a social network? Every day advertising networks send billions of web requests to AudienceProject servers, where an individual log lines may contains information about device identifiers, URLs, IP addresses and more. This talks shows how AudienceProject has used novel and existing graph algorithms to reverse engineered a vast social and knowledge graph from web logs and used it to aid adversiters and media agencies. The focus of the talk will be on graph data in the advertisement world as well as some of the graph algorithms innovated or utilised by AudienceProject in this domain.


Speaker:


Pimin Konstantin Kefaloukos works in the ad-tech industry for AudienceProject in Copenhagen. He has a Ph.D. in Computer Science from the University of Copenhagen.







12.15 - 13.30 Lunch
13.30 - 14.15 Hai Jin, Huazhong University of Science and Technology

TBA


Abstract:


TBA


Speaker:


Hai Jin is a Cheung Kung Scholars Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. Jin received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. Jin worked at The University of Hong Kong between 1998 and 2000, and as a visiting scholar at the University of Southern California between 1999 and 2000. He was awarded Excellent Youth Award from the National Science Foundation of China in 2001. Jin is the chief scientist of ChinaGrid, the largest grid computing project in China, and the chief scientists of National 973 Basic Research Program Project of Virtualization Technology of Computing System, and Cloud Security.

Jin is a Fellow of CCF, senior member of the IEEE and a member of the ACM. He has co-authored 22 books and published over 800 research papers. His research interests include computer architecture, virtualization technology, cluster computing and cloud computing, peer-to-peer computing, network storage, and network security.

14.15 - 15.00 Hassan Chafi, Oracle Labs

TBA

TBA

15.00 - 15.30 Coffee break
15.30 - 16.15 Jun Huan, University of Kansas

TBA


Speaker:


Dr. Jun (Luke) Huan is a professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Data Science and Computational Life Sciences Laboratory at KU. Dr. Huan holds courtesy appointments at the KU Bioinformatics Center and the KU Bioengineering Program.

Dr. Huan is an internationally recognized investigator in data science. He has published more than 120 peer-reviewed papers in leading conferences and journals and has graduated more than 10 graduate students including seven Ph.D.s. He was a recipient of the National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work appeared at mass media including Science Daily, R&D magazine, and EurekAlert. Dr. Huan’s editorial memberships have included Springer Journal of Big Data, Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics among others. He regularly serves the program committee of top-tier international conferences on Machine Learning, Data Mining, Big Data, and Bioinformatics.

Since 2015 Dr. Huan has served as a Program Director in the Information and Intelligent Systems division at the US National Science Foundation. At NSF he manages programs such as IIS core, Big Data, and Partnerships for International Research and Education.

16.15 - 17.00 George Fletcher, Eindhoven University of Technology

Schema-driven generation of synthetic graphs and queries with gMark


Abstract:


Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this talk, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.

This is joint work with colleagues at CNRS Lyon (France), INRIA Lille (France), Université Clermont Auvergne (France), and TU Eindhoven (Netherlands).


Speaker:


George Fletcher (PhD, Indiana University Bloomington) is an associate professor of computer science at Eindhoven University of Technology. His research interests span query language design and engineering, foundations of databases, and data integration. His current focus is on management of massive graphs such as social networks and linked open data. He is a member of the LDBC Graph Query Language Standardization Task Force.

18.00 Conference dinner

22 August

09.00 - 09.45 Semih Salihoglu, University of Waterloo

TBA


Abstract:


TBA


Speaker:


Semih Salihoglu is an assistant professor at University of Waterloo's Cheriton School of Computer Science. He is member of the Data Systems Research Group. Previously, he was a PhD student at Stanford University, advised by Jennifer Widom. His current research focuses graph databases and distributed graph processing engines.

09.45 - 10.30 Milos Nikolic, University of Oxford

TBA


Abstract:


TBA


Speaker:


Milos Nikolic is a departmental lecturer in the Department of Computer Science at the University of Oxford. His research focuses on the design and implementation of data-intensive systems. His work studies the incremental computation of complex analytical queries, such as database queries and machine learning models, in local and distributed streaming environments using novel approaches to query optimization and compilation. He received a Ph.D. in Computer Science from EPFL.

10.30 - 10.45 Tim D. Ward (Industrial Talk)

How to build a decision engine with Neo4j and Machine Learning Techniques


Abstract:


TBA


Speaker:


Tim is an Engineer at CluedIn where they specialize in combining Neo4j and other database technologies with Machine Learning. He has been using the Graph for over 5 years and the team at CluedIn are currently running a Neo4j Graph Database with over 180 Million nodes in production.



10.45 - 11.15 Coffee break
11.15 - 12.00 Ioana Manolescu, INRIA Saclay

TBA

TBA

12.00 - 12.45 Yongluan Zhou, University of Copenhagen

Parallel SPARQL Query Optimization


Abstract:


Existing parallel SPARQL query optimizers assume hash-based data partitioning and adopt plan enumeration algorithms with unnecessarily high complexity. Therefore, they cannot easily accommodate other partitioning methods and only consider an unnecessarily limited plan space. To address these problems, we first define a generic RDF data partitioning model to capture the common structure of various state-of-the-art RDF data partitioning methods. Then we propose a query plan enumeration algorithm that not only has an optimal efficiency, but also accommodates different data partitioning methods. Furthermore, based on a solid analysis of the complexity of the plan enumeration algorithm, we propose two new heuristic methods that can consider a much larger plan space than the existing methods, and at the same time can still confine the search space of the algorithm. An autonomous approach is proposed to choose one of the two methods by considering the structure and the size of a complex SPARQL query. We conduct extensive experiments using synthetic and a real-world dataset, which show the superiority of our algorithms in comparing to existing ones.


Speaker:


Yongluan is an Associate Professor in the Department of Computer Science at the University of Copenhagen. Before joining KU, he has been an Associate Professor at the University of Southern Denmark and a postdoc at Ecole Polytechnique Fédérale de Lausanne (EPFL). He obtained his PhD in Computer Science at the National University of Singapore. He also held visiting positions at several international institutes, including the University of Maryland at College Park, the Advanced Digital Sciences Center at the University of Illinois at Urbana-Champaign and INRIA Rennes. His research interests span database systems and distributed systems. His recent research focuses on scalable streaming data processing and graph data processing. He has published over 50 research articles in international journals and conference proceedings.

12.45 - 14.00 Lunch
14.00 - 14.45 Nikolay Yakovets, Eindhoven University of Technology

Optimization of Regular Path Queries in Graph Databases


Abstract:


Regular path queries offer a powerful navigational mechanism in graph databases. Recently, there has been renewed interest in such queries in the context of the Se- mantic Web. The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. While eminently useful, such queries are difficult to optimize and evaluate efficiently, however. We design and implement a cost-based optimizer we call Waveguide for SPARQL queries with property paths. Waveguide builds a query plan—which we call a waveplan (WP)—which guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude with the best plan often offering excellent performance. A WP’s costs can be estimated, which opens the way to cost-based optimization. We demonstrate that Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. We analyze the effective plan space which is enabled by Waveguide and design an efficient enumerator for it. We implement a prototype of a Waveguide cost-based optimizer on top of an open-source relational RDF store. Finally, we perform a comprehensive performance study of the state of the art for evaluation of SPARQL property paths and demonstrate the significant performance gains that Waveguide offers.


Speaker:


Nikolay Yakovets is an assistant professor of computer science at Eindhoven University of Technology. He obtained his PhD from Lassonde School of Engineering at York University in 2016. He worked on various database topics at IBM CAS Canada and Empress Software Canada. His current focus is on design and implementation of core database technologies, management of massive graph data, and efficient processing of queries on graphs.

15.00 - 17.00 Discussions and walking tour of the city

Venue

Room 4-0-02, Biocenter 
University of Copenhagen
Ole Maaløes Vej 5
2200 Copenhagen N