In the last years, huge graphs of several billions or trillions of edges arose like the Google knowledge graph or the schema.org graph. In order to cope with these huge graphs, distributed graph databases have been developed that combine the storage and computational power of many computers to store and process these graphs. In order to retrieve information from the stored graph, queries can be send to those databases. Those queries regularly consist of selection operations that select the relevant information from the graph, join operations that combine the results from the previous query operations, and projection operations that reduce the number of variables in each result.
When executing a query, a query plan is created first that defines the order in which the different select operation results are joined. This join order has an high impact on the query execution performance. For instance, if the number of intermediate results are reduced by an early join strongly, succeeding join operations will have less work. In order to predict the effort of a query plan, the effort of the individual query operations needs to be estimated using statistics about the stored graph. While the query planning is well-studied for graph databases on a single computer for small or moderate-sized graphs, it is still difficult in the context of huge graphs in a distributed graph database.
In the institute WeST we have developed the experimental distributed graph database Koral that is used to analyze the effect of different algorithms on a distributed database. Up to now the query operations are executed in the order they were defined in the query leading to potentially inefficient query plans. In the context of this master thesis, a query optimization strategy should be developed that tried to optimize the query plan by reducing the actual query execution effort. This includes among others, developing a cost function that is able to estimate the effort of a query plan executed in a distributed setting, defining the required statistical information, implementing it in Koral and evaluating it using huge graphs.
Useful links and literature:
- Overview on linked data
- Schmidt, M., Meier, M., & Lausen, G. (2010). Foundations of SPARQL Query Optimization. In Proceedings of the 13th International Conference on Database Theory (pp. 4–33). New York, NY, USA: ACM.
- Schwarte, A., Haase, P., Hose, K., Schenkel, R., & Schmidt, M. (2011). FedX: Optimization Techniques for Federated Query Processing on Linked Data. In L. Aroyo, C. Welty, H. Alani, J. Taylor, A. Bernstein, L. Kagal, … E. Blomqvist (Eds.), The Semantic Web -- ISWC 2011: 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I (pp. 601–616). Berlin, Heidelberg: Springer Berlin Heidelberg.
- Görlitz, O., & Staab, S. (2010). SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. In Proceedings of the Second International Conference on Consuming Linked Data - Volume 782 (pp. 13–24). Aachen, Germany, Germany: CEUR-WS.org.
- Saleem, M., Ngonga Ngomo, A.-C., Xavier Parreira, J., Deus, H. F., & Hauswirth, M. (2013). DAW: Duplicate-AWare Federated Query Processing over the Web of Data. In H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. X. Parreira, … K. Janowicz (Eds.), The Semantic Web -- ISWC 2013: 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I (pp. 574–590). Berlin, Heidelberg: Springer Berlin Heidelberg.
- Saleem, M., & Ngonga Ngomo, A.-C. (2014). HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. In V. Presutti, C. D’Amato, F. Gandon, M. D’Aquin, S. Staab, & A. Tordai (Eds.), The Semantic Web: Trends and Challenges: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings (pp. 176–191). Cham: Springer International Publishing.
- Vidal, M.-E., Ruckhaus, E., Lampo, T., Martínez, A., Sierra, J., & Polleres, A. (2010). Efficiently Joining Group Patterns in SPARQL Queries. In L. Aroyo, G. Antoniou, E. Hyvönen, A. ten Teije, H. Stuckenschmidt, L. Cabral, & T. Tudorache (Eds.), The Semantic Web: Research and Applications (Vol. 6088, pp. 228–242). Springer Berlin Heidelberg.