Cost Estimation for Distributed Joins[go to master theses]
Open Master Thesis - Contact a supervisor for more details!
In order to process large RDF graphs, distributed RDF stores were developed. Distributed RDF stores are databases that store and process an RDF graph on several computers. In order to retrieve information, SPARQL queries can be sent to these RDF stores. The most frequently query operations within these queries consist of select operations (which select the required data) and join operations (which combine the selected data). Thereby, the order in which the joins of a query are executed has a strong impact on the query execution time. The task of the query optimizer is to find a join order with a low (ideally minimal) query execution time. To find such an order, the optimizer uses statistical information about the stored RDF graph to estimate the costs for the individual query operations. In case of centralized RDF stores running on a single computer, the optimization of the join order is well-studied. To estimate the costs of a query operation the number of input and output intermediate results are estimated.
Many distributed RDF stores try to process as much as possible on the individual computers. If data from different computers needs to be combined, the data is sent to one specific computer which joins them. Thereby, the query optimizer can reuse the optimization techniques from the centralized RDF stores.
Joining all data on a single computer may exceed its processing capabilities. Therefor, distributed joins have been developed that join the data on several compute nodes in general. In this master thesis statistics should be identified that allow for estimating the costs for distributed joins.