Synthetic Data Enhances Mathematical Reasoning of Language Models Based on Artificial Intelligence

Authors

  • Zeyu Han, Data Science and Analytics, Georgetown University, 3700 O St., N.W., Washington, D.C. 20057, United States
  • Weiwei Jiang

DOI:

https://doi.org/10.5755/j01.itc.54.1.39713

Keywords:

AI-generated data, artificial intelligence, text classification, data collection cost, mathematical question-answering, downstream task training

Abstract

Training current large language models (LLMs) requires extensive data and computing resources to handle multiple natural language processing (NLP) tasks. This paper aims to help individuals build practical mathematical question-answering (QA) language models for specific fields. We leveraged Gretel.ai, an accessible data generation platform, to generate high-quality mathematical QA data covering several areas, including definitions, theorems, and calculations related to linear algebra and abstract algebra. After fine-tuning through the OpenAI infrastructure, GPT-3 showed significant accuracy improvements, achieving a roughly 18.2% increase on the abstract algebra benchmark, an approximately 1.6x improvement on the linear algebra theorems benchmark, and an approximately 24.0% increase on the linear algebra calculations benchmark. Small language models (SLMs) such as Llama-2-7B/13B and Mistral-7B also achieved roughly 2x accuracy gains in linear algebra calculations. This study demonstrates the potential for individuals to develop customized SLMs for specialized mathematical domains using synthetic data generation and fine-tuning techniques.
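The abstract describes a workflow of generating synthetic mathematical QA data and then fine-tuning a model through OpenAI's infrastructure, but the paper's code is not shown here. The sketch below illustrates what the fine-tuning step could look like with the OpenAI Python SDK; the training file name, base model identifier, and epoch count are placeholder assumptions, not the authors' actual configuration.

```python
# Illustrative sketch only; the paper's implementation is not published here.
# Assumes a JSONL file of synthetic QA pairs (e.g. produced with Gretel.ai)
# in OpenAI's prompt/completion fine-tuning format, one JSON object per line:
# {"prompt": "Define a vector space.", "completion": " A vector space is ..."}
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the synthetic training data (placeholder file name).
training_file = client.files.create(
    file=open("linear_algebra_qa.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a fine-tuning job on the uploaded data.
# "davinci-002" and n_epochs=3 are assumptions for illustration.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",
    hyperparameters={"n_epochs": 3},
)
print(job.id, job.status)
```

The same synthetic dataset could in principle be reused to fine-tune open-weight SLMs such as Llama-2 or Mistral with standard supervised fine-tuning tooling, which is the comparison the abstract reports.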

Published

2025-04-01

Issue

Vol. 54 No. 1 (2025)

Section

Articles