Image nodes and line-break terminal nodes are handled in visual block extraction.
Unnecessarily nested composite blocks are omitted.
The font size bug is resolved.
Blocks on the same horizontal line are grouped to form a composite block.
Content structure detection is separated from visual block extraction and handled in the ContentStructureDetection class.
DOM structure detection is separated from the segmentation process and handled in the DomStructureDetection class.
Unnecessary attributes, methods and operations are removed.
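
For readers unfamiliar with the grouping step mentioned above (blocks on the same horizontal line forming a composite block), the following is a minimal Python sketch of the idea, not the code from this commit. The Block type, its fields, the same_line test and the 0.5 overlap threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical visual block: a name plus its position on the page."""
    name: str
    left: float
    top: float
    bottom: float
    children: List["Block"] = field(default_factory=list)


def same_line(a: Block, b: Block, min_overlap: float = 0.5) -> bool:
    """Treat two blocks as being on one line when their vertical extents
    overlap by at least min_overlap of the shorter block's height
    (an assumed criterion)."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return shorter > 0 and overlap / shorter >= min_overlap


def wrap(row: List[Block]) -> Block:
    """Return a lone block unchanged, so no unnecessary nesting is created;
    otherwise wrap the row in a composite block ordered left to right."""
    if len(row) == 1:
        return row[0]
    ordered = sorted(row, key=lambda b: b.left)
    return Block(
        name="composite",
        left=ordered[0].left,
        top=min(b.top for b in ordered),
        bottom=max(b.bottom for b in ordered),
        children=ordered,
    )


def group_by_line(blocks: List[Block]) -> List[Block]:
    """Scan blocks from top to bottom and merge neighbours that share a line."""
    result: List[Block] = []
    current: List[Block] = []
    for block in sorted(blocks, key=lambda b: b.top):
        if current and same_line(current[-1], block):
            current.append(block)
        else:
            if current:
                result.append(wrap(current))
            current = [block]
    if current:
        result.append(wrap(current))
    return result
```

In this sketch a row with a single block is returned as-is rather than wrapped, which is one simple way to avoid the unnecessarily nested composite blocks that the message says are now omitted.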
14 files changed
tree: 9725406d73a787849c349bfac2dcd30e9192afc1
  1. features/
  2. others/
  3. plugins/