ViRED: Prediction of Visual Relations in Engineering Drawings

1University of Science and Technology of China, 2Tsinghua University, 3Nanyang Technological University
Under Review

Abstract

To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly treat text as the main modality, which makes them unsuitable for documents dominated by image content. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in a drawing. To address these issues, we propose ViRED, a vision-based relation detection model that identifies associations between tables and circuits in electrical engineering drawings. The model consists of three main parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED in PyTorch and validate its efficacy through a series of experiments. On our engineering drawing dataset, ViRED achieves 96% accuracy on the relation prediction task, a substantial improvement over existing methods. The results also show that ViRED performs inference quickly even when a single engineering drawing contains numerous objects.

Summary

In summary, the contributions of our work are as follows:

  • We present a novel vision-based relation detection approach, named ViRED, to address the problem of predicting relations among non-textual components in complex documents. We apply this approach specifically to circuit-to-table relation matching in electrical design drawings.
  • We develop a dataset of electrical engineering drawings derived from industrial design data, and we annotate the instances and their relationships within the dataset.
  • We evaluate our method using various metrics on the electrical engineering drawing dataset, and we compare its performance against existing approaches.
  • We perform extensive ablation studies to compare the impact of different model architectures, hyperparameters, and training methods on overall performance, and we refine our model architecture based on these analyses.

Methodology

Overview of the ViRED pipeline.

  1. Engineering drawings are processed through the Vision Encoder, Object Encoder, Relation Decoder, and Relation Prediction Model.
  2. The Object Encoder converts the instance masks and types into mask and type embeddings, which are then aggregated to form the object tokens.
  3. The Relation Decoder takes the object tokens as inputs and integrates them with the image features from the Vision Encoder through a cross-attention mechanism. Residual connections between layers are omitted for simplicity.
  4. During pretraining, the model encodes the document image together with a position mask and, after decoding through the relation decoder, predicts the class of the image region covered by the mask.
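The pipeline above can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the embedding dimension, mask resolution, number of instance types, and layer counts are all hypothetical placeholders, and the pairwise relation head is one plausible reading of "assessing relationships among all entity pairs".

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Converts instance masks and type ids into object tokens (step 2)."""
    def __init__(self, mask_size=32, num_types=8, d_model=128):
        super().__init__()
        # Flattened binary mask -> mask embedding
        self.mask_proj = nn.Linear(mask_size * mask_size, d_model)
        # Instance type id -> type embedding
        self.type_embed = nn.Embedding(num_types, d_model)

    def forward(self, masks, types):
        # masks: (B, N, H, W) binary; types: (B, N) long
        mask_emb = self.mask_proj(masks.flatten(2))
        type_emb = self.type_embed(types)
        # Aggregate the two embeddings into object tokens (B, N, d_model)
        return mask_emb + type_emb

class RelationDecoder(nn.Module):
    """Object tokens attend to image features via cross-attention (step 3)."""
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, object_tokens, image_features):
        # image_features: (B, L, d_model) patch tokens from the vision encoder
        return self.decoder(tgt=object_tokens, memory=image_features)

class RelationHead(nn.Module):
    """Scores every (subject, object) pair for a relation."""
    def __init__(self, d_model=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, tokens):
        B, N, D = tokens.shape
        # Build all N x N pairs by concatenating subject and object tokens
        subj = tokens.unsqueeze(2).expand(B, N, N, D)
        obj = tokens.unsqueeze(1).expand(B, N, N, D)
        return self.mlp(torch.cat([subj, obj], dim=-1)).squeeze(-1)  # (B, N, N)
```

A forward pass chains the three modules: object tokens from the encoder are refined against the image features in the decoder, and the head emits a score for each ordered pair of instances.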

Experiments

Comparison to Previous Works

Ablation Study

[Figure: Ablation for architectures.]
[Figure: Ablation for hyperparameters.]

Efficiency

[Figure: FLOPs.]

Examples

BibTeX

