<?xml version="1.0" encoding="UTF-8"?>
<article article-type="research-article" dtd-version="1.3" xml:lang="ru" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://metafora.rcsi.science/xsd_files/journal3.xsd">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">moitvivt</journal-id>
      <journal-title-group>
        <journal-title xml:lang="ru">Моделирование, оптимизация и информационные технологии</journal-title>
        <trans-title-group xml:lang="en">
          <trans-title>Modeling, Optimization and Information Technology</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2310-6018</issn>
      <publisher>
        <publisher-name>Издательство</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.26102/2310-6018/2025.48.1.030</article-id>
      <article-id pub-id-type="custom" custom-type="elpub">1830</article-id>
      <title-group>
        <article-title xml:lang="ru">Оценка качества полученного результата в задаче генерации исходного кода по изображению</article-title>
        <trans-title-group xml:lang="en">
          <trans-title>Assessing the quality of the result in the problem of source code generation from an image</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name-alternatives>
            <name name-style="eastern" xml:lang="ru">
              <surname>Никитин</surname>
              <given-names>Илья Владимирович</given-names>
            </name>
            <name name-style="western" xml:lang="en">
              <surname>Nikitin</surname>
              <given-names>Ilya Vladimirovich</given-names>
            </name>
          </name-alternatives>
          <email>vic096@yandex.ru</email>
          <xref ref-type="aff">aff-1</xref>
        </contrib>
      </contrib-group>
      <aff-alternatives id="aff-1">
        <aff xml:lang="ru">Российский экономический университет имени Г.В. Плеханова</aff>
        <aff xml:lang="en">Plekhanov Russian University of Economics</aff>
      </aff-alternatives>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>01</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <elocation-id>10.26102/2310-6018/2025.48.1.030</elocation-id>
      <permissions>
        <copyright-statement>Copyright © Авторы, 2026</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://moitvivt.ru/ru/journal/article?id=1830"/>
      <abstract xml:lang="ru">
        <p>Исследование представляет собой оценку возможности построения системы выполнения функциональных тестов для задачи генерации исходного кода из изображения. Существует много различных метрик для оценки качества предсказанного нейронной сетью текста: от математических, таких как BLEU, ROUGE, до таких, которые используют другую модель для оценки, как, например, BERTScore, BLEURT. Однако проблема генерации исходного кода программы состоит в том, что код представляет собой набор инструкций для выполнения определенной задачи. Актуальность состоит в том, что в публикациях, связанных с системой pix2code, отсутствовало упоминание об автоматизированной тестовой среде, которая сможет проверить соответствие полученного кода заданным условиям. В ходе проделанной работы была реализована подсистема, которая в автоматическом режиме может получить информацию о различиях между изображением, основанным на предсказанном коде, и изображением, основанным на эталонном коде. Также результаты работы этой системы сопоставлены с метрикой BLEU. Проведенный эксперимент позволяет сделать вывод о том, что значение BLEU и результаты выполнения тестов не имеют явной зависимости между собой, а значит, функциональные тесты необходимы для дополнительной проверки эффективности работы модели.</p>
      </abstract>
      <trans-abstract xml:lang="en">
        <p>This study assesses the feasibility of building a system for executing functional tests for the task of generating source code from an image. There are many metrics for assessing the quality of text predicted by a neural network, ranging from mathematical ones, such as BLEU and ROUGE, to those that use another model for evaluation, such as BERTScore and BLEURT. However, the difficulty with generating program source code is that code is a set of instructions for performing a specific task. The relevance of this work lies in the fact that publications related to the pix2code system make no mention of an automated test environment capable of checking whether the generated code meets the specified conditions. In the course of this work, a subsystem was implemented that can automatically obtain information about the differences between an image rendered from the predicted code and an image rendered from the reference code. The results produced by this subsystem were also compared against the BLEU metric. The experiment leads to the conclusion that the BLEU score and the test results show no obvious relationship, which means that functional tests are necessary as an additional check of the model's effectiveness.</p>
      </trans-abstract>
      <kwd-group xml:lang="ru">
        <kwd>кодогенерация</kwd>
        <kwd>изображение</kwd>
        <kwd>машинное обучение</kwd>
        <kwd>BLEU</kwd>
        <kwd>функциональные тесты</kwd>
      </kwd-group>
      <kwd-group xml:lang="en">
        <kwd>code generation</kwd>
        <kwd>image</kwd>
        <kwd>machine learning</kwd>
        <kwd>BLEU</kwd>
        <kwd>functional tests</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement xml:lang="ru">Исследование выполнено без спонсорской поддержки.</funding-statement>
        <funding-statement xml:lang="en">The study was performed without external funding.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="cit1">
        <label>1</label>
        <mixed-citation xml:lang="ru">Никитин И.В. Влияние версии библиотеки TensorFlow на качество генерации кода по изображению. Моделирование, оптимизация и информационные технологии. 2024;12(4). https://doi.org/10.26102/2310-6018/2024.47.4.040</mixed-citation>
      </ref>
      <ref id="cit2">
        <label>2</label>
        <mixed-citation xml:lang="ru">Zou D., Wu G. Automatic Code Generation for Android Applications Based on Improved Pix2code. Journal of Artificial Intelligence and Technology. 2024;4(4):325–331. https://doi.org/10.37965/jait.2024.0515</mixed-citation>
      </ref>
      <ref id="cit3">
        <label>3</label>
        <mixed-citation xml:lang="ru">Beltramelli T. pix2code: Generating Code from a Graphical User Interface Screenshot. In: EICS '18: Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems, 19–22 June 2018, Paris, France. New York: Association for Computing Machinery; 2018. https://doi.org/10.1145/3220134.3220135</mixed-citation>
      </ref>
      <ref id="cit4">
        <label>4</label>
        <mixed-citation xml:lang="ru">Zhu Zh., Xue Zh., Yuan Z. Automatic Graphics Program Generation Using Attention-Based Hierarchical Decoder. In: Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision: Revised Selected Papers: Part VI, 02–06 December 2018, Perth, Australia. Cham: Springer; 2019. pp. 181–196. https://doi.org/10.1007/978-3-030-20876-9_12</mixed-citation>
      </ref>
      <ref id="cit5">
        <label>5</label>
        <mixed-citation xml:lang="ru">Papineni K., Roukos S., Ward T., Zhu W.-J. BLEU: a Method for Automatic Evaluation of Machine Translation. In: ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 07–12 July 2002, Philadelphia, USA. Stroudsburg: Association for Computational Linguistics; 2002. pp. 311–318. https://doi.org/10.3115/1073083.1073135</mixed-citation>
      </ref>
      <ref id="cit6">
        <label>6</label>
        <mixed-citation xml:lang="ru">Doddington G. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In: HLT '02: Proceedings of the Second International Conference on Human Language Technology Research, 24–27 March 2002, San Diego, USA. San Francisco: Morgan Kaufmann Publishers Inc.; 2002. pp. 138–145. https://doi.org/10.3115/1289189.1289273</mixed-citation>
      </ref>
      <ref id="cit7">
        <label>7</label>
        <mixed-citation xml:lang="ru">Lin Ch.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, 25–26 July 2004, Barcelona, Spain. Association for Computational Linguistics; 2004. pp. 74–81.</mixed-citation>
      </ref>
      <ref id="cit8">
        <label>8</label>
        <mixed-citation xml:lang="ru">Popović M. chrF++: words helping character n-grams. In: Proceedings of the Second Conference on Machine Translation, 07–08 September 2017, Copenhagen, Denmark. Association for Computational Linguistics; 2017. pp. 612–618. https://doi.org/10.18653/v1/W17-4770</mixed-citation>
      </ref>
      <ref id="cit9">
        <label>9</label>
        <mixed-citation xml:lang="ru">Hendrycks D., Basart S., Kadavath S., et al. Measuring Coding Challenge Competence With APPS. In: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 06–14 December 2021, Online. https://doi.org/10.48550/arXiv.2105.09938</mixed-citation>
      </ref>
      <ref id="cit10">
        <label>10</label>
        <mixed-citation xml:lang="ru">Zhang T., Kishore V., Wu F., Weinberger K.Q., Artzi Y. BERTScore: Evaluating Text Generation with BERT. In: 8th International Conference on Learning Representations, ICLR 2020, 26–30 April 2020, Addis Ababa, Ethiopia. 2020. https://doi.org/10.48550/arXiv.1904.09675</mixed-citation>
      </ref>
      <ref id="cit11">
        <label>11</label>
        <mixed-citation xml:lang="ru">Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 02–07 June 2019, Minneapolis, USA. Association for Computational Linguistics; 2019. pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423</mixed-citation>
      </ref>
      <ref id="cit12">
        <label>12</label>
        <mixed-citation xml:lang="ru">Rei R., Stewart C., Farinha A.C., Lavie A. COMET: A Neural Framework for MT Evaluation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 16–20 November 2020, Online. Association for Computational Linguistics; 2020. pp. 2685–2702. https://doi.org/10.18653/v1/2020.emnlp-main.213</mixed-citation>
      </ref>
      <ref id="cit13">
        <label>13</label>
        <mixed-citation xml:lang="ru">Tran N., Tran H., Nguyen S., Nguyen H., Nguyen T. Does BLEU Score Work for Code Migration? In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), 25–26 May 2019, Montreal, USA. IEEE; 2019. pp. 165–176. https://doi.org/10.1109/ICPC.2019.00034</mixed-citation>
      </ref>
      <ref id="cit14">
        <label>14</label>
        <mixed-citation xml:lang="ru">Ren Sh., Guo D., Lu Sh., et al. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv. URL: https://doi.org/10.48550/arXiv.2009.10297 [Accessed 19th February 2025].</mixed-citation>
      </ref>
      <ref id="cit15">
        <label>15</label>
        <mixed-citation xml:lang="ru">Evtikhiev M., Bogomolov E., Sokolov Ya., Bryksin T. Out of the BLEU: How Should We Assess Quality of the Code Generation Models? Journal of Systems and Software. 2023;203:111741. https://doi.org/10.1016/j.jss.2023.111741</mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn fn-type="conflict">
        <p>The authors declare that there are no conflicts of interest present.</p>
      </fn>
    </fn-group>
  </back>
</article>