Documentare: auxiliary intelligence for digital content analysis

Documentare is a Java software library that provides unsupervised clustering tools applicable to:

  • content stored in directories,
  • images produced by text detection and character segmentation in a digitized document; clustering these images can be used to build OCR reference bases.

The technological core of this library is a distance that measures the similarity between two sequences of bytes, regardless of the information they encode. This distance is universal in the sense that it can be applied to a wide variety of content, provided the code is suitably aligned.
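
To make the idea of a content-agnostic byte distance concrete, here is a minimal sketch. This description does not state which distance Documentare actually computes, so the sketch swaps in a well-known stand-in, the Normalized Compression Distance (NCD), computed with the JDK's DEFLATE compressor. The class name ByteDistanceSketch and its methods are hypothetical and are not part of the Documentare API.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative only: NCD is an assumed stand-in, not Documentare's own distance.
public class ByteDistanceSketch {

    // Returns the size in bytes of the DEFLATE-compressed form of the input.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    // where C(.) is the compressed size and xy is the concatenation of x and y.
    // Values near 0 suggest similar content; values near 1 suggest unrelated content.
    static double ncd(byte[] x, byte[] y) {
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        int cx = compressedSize(x);
        int cy = compressedSize(y);
        int cxy = compressedSize(xy);
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        byte[] a = "the quick brown fox jumps over the lazy dog ".repeat(50)
                .getBytes(StandardCharsets.UTF_8);
        byte[] b = "the quick brown cat jumps over the lazy dog ".repeat(50)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("ncd(a, a) = " + ncd(a, a));
        System.out.println("ncd(a, b) = " + ncd(a, b));
    }
}
```

The distance only ever sees raw bytes, which is why it applies equally to text, images, or any other file content. A pairwise matrix of such distances is the kind of input an unsupervised clustering step could consume.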

The associated tools are a statistical, geometry-based method for detecting and segmenting text in a digitized document, and the clustering tools themselves. Both are described in README.md on GitLab. The code is published under the GNU General Public License v2.0.