In distributed RDF stores the strategy how data is distributed over several compute nodes affects query performance. When hash-based data distribution strategies are used, the query workload tends to be equally balanced among all compute nodes while a relatively high number of intermediate results must be transferred between compute nodes. Graph-clustering-based approaches reduce the number of transferred intermediate results while the query workload becomes more imbalanced. This paper presents a novel data distribution strategy that combines the advantages of both strategies. To this end, we collocate the individuals of small sets of closely connected data items on compute nodes. Our experimental evaluation showed that this strategy reduces the number of transferred intermediate results even more than graph clustering strategies do while workload becomes almost as well balanced as when using hash-based strategies. Overall, our new strategy reduces query execution time by up to 66%.
21.06.2018 - 10:15