Web content management systems and web front ends to databases usually generate and populate HTML documents from homogeneous templates, filling them with structured, semi-structured or plain-text data. Wrapper-based information extraction techniques rely on such templates as a cornerstone of their functionality, but they depend heavily on the availability of suitable training documents built from the specific template. Structural classification and structural clustering of web documents are therefore important contributing factors to the success of those methods. We introduce a novel technique to support these two tasks: template fingerprints. Template fingerprints are locality-sensitive hash values in the form of short character sequences that effectively represent the underlying template of a web document. Small changes in the document structure, as they may occur in template-based documents, lead to no or only minor variations in the corresponding fingerprint.
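To make the idea of a locality-sensitive structural hash concrete, here is a minimal Python sketch. It is not the template fingerprint algorithm from the paper: it swaps in a generic SimHash over shingles of HTML tag names, and all names in it (`TagCollector`, `tag_shingles`, `structural_fingerprint`, `hamming_distance`, the shingle size `k=3`, the 64-bit width) are assumptions chosen purely for illustration.

```python
import hashlib
from html.parser import HTMLParser

BITS = 64  # assumed fingerprint width, not taken from the paper


class TagCollector(HTMLParser):
    """Collects the sequence of opening and closing tags; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_shingles(html, k=3):
    """Return all runs of k consecutive tags as space-joined strings."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    tags = collector.tags
    if len(tags) < k:
        return [" ".join(tags)] if tags else []
    return [" ".join(tags[i:i + k]) for i in range(len(tags) - k + 1)]


def structural_fingerprint(html, k=3):
    """SimHash over tag shingles, returned as a 16-character hex string."""
    counts = [0] * BITS
    for shingle in tag_shingles(html, k):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:BITS // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = sum(1 << i for i, c in enumerate(counts) if c > 0)
    return format(fingerprint, "016x")


def hamming_distance(fp_a, fp_b):
    """Number of differing bits between two hex fingerprints."""
    return bin(int(fp_a, 16) ^ int(fp_b, 16)).count("1")


# Two pages with the same tag structure but different text, and a third
# page from the same hypothetical template with one extra list item.
page_a = "<html><body><h1>News</h1><ul><li>One</li><li>Two</li></ul></body></html>"
page_b = "<html><body><h1>Sport</h1><ul><li>Alpha</li><li>Beta</li></ul></body></html>"
page_c = "<html><body><h1>News</h1><ul><li>One</li><li>Two</li><li>Three</li></ul></body></html>"

print(structural_fingerprint(page_a) == structural_fingerprint(page_b))  # True: identical tag sequence
print(hamming_distance(structural_fingerprint(page_a), structural_fingerprint(page_c)))
```

Because only tag names enter the hash, pages that differ in text alone get identical fingerprints, while a structural variation changes only some of the shingles, so the two fingerprints tend to differ in comparatively few bits. Comparing fingerprints by Hamming distance is one plausible way to use such hashes for structural classification or clustering.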
On this page you can find the source code for computing template fingerprints over HTML documents as well as the data sets used for evaluation.
Computation of Template Fingerprints
Evaluation Data Sets
- 50 Templates data set: zip archive
Hachenberg, Christian; Gottron, Thomas (2013): Locality Sensitive Hashing for Scalable Structural Classification and Clustering of Web Documents. In: CIKM'13: Proceedings of the 22nd ACM Conference on Information and Knowledge Management.