RDF graphs are directed graphs in which vertices and edges are labelled. In big data settings, such a graph can consist of several billion or even trillions of edges. To handle these huge graphs, distributed graph databases such as Koral were developed, which distribute the graph over several computers. To query the graph efficiently, statistical information about the occurrences of labels on the individual computers is required. In the current implementation of Koral, this information is collected on a single computer for the complete graph and stored in a single huge file that is accessed randomly. The current implementation can be improved by several optimizations: parallelizing the statistics collection over several computers, using other storage layouts that improve the way the hard disk is accessed, applying compression techniques, using caches, etc.
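To make the notion of label statistics more concrete, the following minimal sketch counts, for every label, how often it occurs on each computer. It is only an illustration under stated assumptions: the in-memory layout, the split into subject, predicate and object positions, and the name `collect_statistics` are hypothetical and are not taken from Koral.

```python
from collections import defaultdict

# Assumption: every triple is a (subject, predicate, object) tuple of labels.
POSITIONS = ("subject", "predicate", "object")


def collect_statistics(chunks):
    """Count label occurrences per position and per computer.

    chunks[i] holds the triples assumed to be stored on computer i.
    The result maps label -> position -> list of counts, one per computer.
    """
    num_computers = len(chunks)
    stats = defaultdict(lambda: {p: [0] * num_computers for p in POSITIONS})
    for computer, triples in enumerate(chunks):
        for triple in triples:
            for position, label in zip(POSITIONS, triple):
                stats[label][position][computer] += 1
    return dict(stats)


# Hypothetical toy graph distributed over two computers.
chunks = [
    [("alice", "knows", "bob"), ("bob", "knows", "carol")],
    [("carol", "knows", "alice")],
]
stats = collect_statistics(chunks)
print(stats["knows"]["predicate"])  # occurrences of "knows" as edge label per computer
```

Because each computer only touches its own column of counts in this sketch, the counting for different computers could in principle run independently and be merged afterwards, which is the kind of parallelization mentioned above.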
In this bachelor thesis, an overview of optimization techniques should be given, and at least one improved statistics database should be developed by combining several optimization techniques. The improved statistics databases should be evaluated by comparing their performance against the initial implementation as well as against relational databases such as SQLite.
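A relational baseline for such a comparison could look roughly like the sketch below, which stores one counter per label, position and computer in SQLite via Python's standard `sqlite3` module. The table name, the column names and the helper functions are assumptions made for illustration; the thesis topic does not prescribe a schema.

```python
import sqlite3


def build_baseline(chunks):
    """Store occurrence counters in an in-memory SQLite database.

    chunks[i] holds the (subject, predicate, object) triples assumed to be
    stored on computer i. Positions are encoded as 0, 1 and 2.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrences ("
        " label TEXT NOT NULL,"
        " pos INTEGER NOT NULL,"
        " computer INTEGER NOT NULL,"
        " n INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (label, pos, computer))"
    )
    for computer, triples in enumerate(chunks):
        for triple in triples:
            for pos, label in enumerate(triple):
                key = (label, pos, computer)
                # Create the counter row if it is missing, then increment it.
                conn.execute(
                    "INSERT OR IGNORE INTO occurrences (label, pos, computer)"
                    " VALUES (?, ?, ?)",
                    key,
                )
                conn.execute(
                    "UPDATE occurrences SET n = n + 1"
                    " WHERE label = ? AND pos = ? AND computer = ?",
                    key,
                )
    conn.commit()
    return conn


def occurrences(conn, label, pos):
    """Return (computer, count) pairs for a label at the given position."""
    cursor = conn.execute(
        "SELECT computer, n FROM occurrences"
        " WHERE label = ? AND pos = ? ORDER BY computer",
        (label, pos),
    )
    return cursor.fetchall()


# Hypothetical toy graph distributed over two computers.
chunks = [
    [("alice", "knows", "bob"), ("bob", "knows", "carol")],
    [("carol", "knows", "alice")],
]
conn = build_baseline(chunks)
print(occurrences(conn, "knows", 1))
```

For an actual evaluation the database would be file-backed rather than in-memory, and the time for bulk loading and for point lookups would be measured on the same input as the other statistics databases.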