Entities are commonly modeled by explicitly describing their features and their relations to other entities. The most extensive collections of such entity-centric information are large-scale Knowledge Graphs (KGs) like DBpedia that describe the interdependencies of millions of real-world entities and abstract concepts.
One benefit of KG entities is that they are universal identifiers and thus provide a way to link content across languages and modalities once their occurrence in images and (multilingual) text has been annotated.
However, without its graph neighborhood and its grounding in real-world occurrences, a KG entity is meaningless. In recent years, another way of capturing the context of entities has been investigated extensively in the area of representation learning. While representation learning typically focuses on learning more abstract representations from raw sensory inputs, there has been growing interest in how symbolic knowledge like KG entities can be captured as well. The state-of-the-art neural network models for such graph embeddings build on a large body of previous work on graphical models, graph kernels, and tensor methods, all of which extract latent representations from knowledge graphs. The baseline optimization criterion has since been the prediction of unknown links in the graph.
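To make the link-prediction criterion concrete, the following minimal sketch uses a TransE-style translational score, one of many possible embedding models and not necessarily the one discussed here; the entity count, dimensionality, and random initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical sizes): 5 entities, 2 relations, 8-dim embeddings.
num_entities, num_relations, dim = 5, 2, 8
E = rng.normal(size=(num_entities, dim))   # entity embedding vectors
R = rng.normal(size=(num_relations, dim))  # relation embedding vectors

def transe_score(h, r, t):
    """Negative L2 distance ||E[h] + R[r] - E[t]||; higher = more plausible."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def rank_tails(h, r):
    """Link prediction: rank all candidate tail entities for (head, relation)."""
    scores = [transe_score(h, r, t) for t in range(num_entities)]
    return np.argsort(scores)[::-1]  # most plausible candidate first
```

In training, the embeddings themselves would be optimized so that observed triples score higher than corrupted ones; here they are random and only illustrate how the scoring and ranking fit together.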
The dense vector representations obtained by knowledge graph embedding techniques provide three fundamental advantages:
First, they enable the integration of information from different modalities like images and multilingual text with symbolic knowledge into one common representation. Such cross-modal embeddings provide measurable benefits for semantic similarity benchmarks and entity-type prediction tasks.
Second, they allow knowledge to be transferred across modalities, even for entities that are not represented in the other modalities.
Third, they are key to solving complex AI tasks beyond link prediction, like image captioning or multi-step decision making. Here, too, the transfer of information from other modalities can be beneficial. For example, cross-modal knowledge transfer assists the captioning of images that contain visual objects unseen in the image-caption parallel training data.
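The first advantage above, mapping different modalities into one common representation, can be sketched as follows. The linear projections and dimensionalities are hypothetical placeholders, not a specific published model:

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_txt, d_joint = 6, 4, 3  # assumed feature and joint-space sizes

# Hypothetical learned linear maps from image / text features into a
# shared embedding space (in practice trained jointly with KG entities).
W_img = rng.normal(size=(d_joint, d_img))
W_txt = rng.normal(size=(d_joint, d_txt))

def embed(x, W):
    z = W @ x
    return z / np.linalg.norm(z)  # unit-normalize for cosine similarity

img_vec = embed(rng.normal(size=d_img), W_img)
txt_vec = embed(rng.normal(size=d_txt), W_txt)

# Cosine similarity in the shared space links content across modalities.
sim = float(img_vec @ txt_vec)
```

Once images, text, and KG entities live in the same space, a single nearest-neighbor search covers all modalities, which is what enables the cross-modal transfer described above.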
Ultimately, this allows us to tackle several real-world application areas where knowledge-guided representation learning can provide considerable benefits, like media analytics, manufacturing or medical engineering.