The capabilities of modern artificial intelligence (AI) and machine learning (ML) have grown rapidly in recent years. Machine learning is increasingly applied to reasoning-intensive, quantitative fields such as mathematics. While many mathematics-related resources exist at the high school, undergraduate, and graduate levels, very few address the difficulties and ambiguities faced by professional mathematicians working on open-ended problems. The Algebraic Combinatorics Dataset Repository (ACD Repo) was introduced to fill this gap. It covers algebraic combinatorics, a field of mathematics that studies discrete structures arising from abstract algebra.
What sets this collection apart is that it targets not just results but also the conjecturing process. Each dataset is paired with a research-level open question and comes with millions of examples (in some cases up to 10 million) related to that question, from which new conjectures can be generated.
Machine Learning and Mathematics
Machine learning can extract complex patterns from large datasets. Advanced AI systems have also been shown to perform tasks that require serious logical reasoning. In mathematics, such systems are used primarily for proof writing and formalization. However, proof construction alone is not enough. To reach new theoretical conclusions, mathematicians must first study many examples, recognize patterns in them, and formulate the conjectures that may later become theorems.
Importance of Datasets
In recent years, machine learning has been applied to mathematics in a few main ways:
- Toy problems: Problems whose solutions are already known, which makes them well suited to interpretability research on deep learning models.
- Reinforcement learning: Used to search for counterexamples to conjectures.
- ML in mathematical research: In some cases, existing machine learning tools help solve specific problems.
- Foundation models: Large models applied to certain mathematical problems.
However, none of these efforts provides comprehensive datasets for open or challenging research-level problems. The ACD Repo addresses this. It includes nine datasets, each containing examples and related questions.
Basics of Algebraic Combinatorics
Algebraic combinatorics is a branch of mathematics that studies discrete structures (such as graphs, posets, partitions, and permutations). It primarily uses combinatorial techniques to solve problems involving abstract algebra, representation theory, and algebraic geometry.
Partitions
A partition of a positive integer n is a sequence of positive integers, written in weakly decreasing order, whose sum is n. For example, (3,2,2) is a partition of 7. It can be visualized as a Young diagram with 3 squares in the first row, 2 in the second, and 2 in the third.
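The definition can be checked mechanically. The following is a minimal Python sketch, not code from the ACD Repo; the function names `is_partition` and `young_diagram` are made up for illustration, and each square of the diagram is drawn as `[]`.

```python
def is_partition(parts, n):
    """Return True if `parts` is a weakly decreasing sequence of
    positive integers whose sum is n."""
    positive = all(p > 0 for p in parts)
    weakly_decreasing = all(a >= b for a, b in zip(parts, parts[1:]))
    return positive and weakly_decreasing and sum(parts) == n


def young_diagram(parts):
    """Render the Young diagram of a partition as text,
    one row of boxes per part."""
    return "\n".join("[]" * p for p in parts)


print(is_partition((3, 2, 2), 7))   # True
print(is_partition((2, 3, 2), 7))   # False: not weakly decreasing
print(young_diagram((3, 2, 2)))
```

The last call prints three rows containing 3, 2, and 2 boxes, matching the example above.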
Young Tableaux
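We create a Young tableau by writing a symbol, usually a positive integer, in each square of a Young diagram. There are two main types:
- Standard Young tableau: The numbers 1 through n each appear exactly once, and entries strictly increase along each row and down each column.
- Semistandard Young tableau: Entries weakly increase along each row (repeats are allowed) and strictly increase down each column.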
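The two conditions translate directly into code. This is an illustrative sketch that assumes a tableau is stored as a list of rows, top row first; the representation and the function names are assumptions, not the ACD Repo's data format.

```python
def has_partition_shape(tableau):
    """Rows must be non-empty and weakly decreasing in length."""
    lengths = [len(row) for row in tableau]
    return all(n > 0 for n in lengths) and all(
        a >= b for a, b in zip(lengths, lengths[1:])
    )


def rows_increase(tableau, strict):
    """Check that every row increases left to right (strictly or weakly)."""
    for row in tableau:
        for a, b in zip(row, row[1:]):
            if a > b or (strict and a == b):
                return False
    return True


def columns_strictly_increase(tableau):
    """Check that every column strictly increases from top to bottom."""
    for upper, lower in zip(tableau, tableau[1:]):
        # zip stops at the shorter (lower) row, so only real columns are compared
        if any(a >= b for a, b in zip(upper, lower)):
            return False
    return True


def is_semistandard(tableau):
    return (has_partition_shape(tableau)
            and rows_increase(tableau, strict=False)
            and columns_strictly_increase(tableau))


def is_standard(tableau):
    entries = sorted(x for row in tableau for x in row)
    return (entries == list(range(1, len(entries) + 1))
            and has_partition_shape(tableau)
            and rows_increase(tableau, strict=True)
            and columns_strictly_increase(tableau))


print(is_standard([[1, 2, 4], [3, 5]]))      # True
print(is_semistandard([[1, 1, 2], [2, 3]]))  # True
print(is_standard([[1, 1, 2], [2, 3]]))      # False: 1 is repeated
```

Every standard tableau is also semistandard, so `is_semistandard` accepts the first example as well.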
Permutations
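Permutations are widely used in machine learning. A permutation of the set {1,2,...,n} is a rearrangement of its elements, and it can be written in one-line notation by listing the images of 1, 2, ..., n in order. For example, the permutation of {1,2,3,4} that swaps 1 with 2 and 3 with 4 is written as 2 1 4 3.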
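One-line notation maps naturally onto a tuple. The sketch below is a hypothetical helper set, not part of the ACD Repo: it stores a permutation as a tuple whose entry in position i (counting from 1) is the image of i.

```python
def is_permutation(w):
    """Return True if w is one-line notation for a permutation of {1, ..., n}."""
    return sorted(w) == list(range(1, len(w) + 1))


def image(w, i):
    """Image of i under w, where i is counted from 1."""
    return w[i - 1]


def compose(p, q):
    """One-line notation of the composite 'apply q first, then p'."""
    return tuple(p[q[i] - 1] for i in range(len(q)))


w = (2, 1, 4, 3)                         # swaps 1 with 2 and 3 with 4
print(is_permutation(w))                 # True
print([image(w, i) for i in (1, 2, 3, 4)])   # [2, 1, 4, 3]
print(compose(w, w))                     # (1, 2, 3, 4): applying w twice undoes it
```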
Partial Orders (Posets)
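A poset (partially ordered set) is a set equipped with an order relation under which some pairs of elements may be incomparable. A standard example is the collection of all subsets of a given set, ordered by inclusion: {1} is below {1,2}, but {1} and {2} are incomparable.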
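The subset example can be made concrete. A partial order is usually required to be reflexive, antisymmetric, and transitive; the sketch below checks these three axioms by brute force on the subsets of {1,2,3}. The helper names are illustrative only.

```python
from itertools import combinations


def powerset(ground):
    """All subsets of `ground`, as frozensets."""
    items = sorted(ground)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def included(a, b):
    """The order relation: a is below b when a is a subset of b."""
    return a <= b


def is_partial_order(elements, leq):
    """Brute-force check of reflexivity, antisymmetry, and transitivity."""
    reflexive = all(leq(a, a) for a in elements)
    antisymmetric = all(a == b or not (leq(a, b) and leq(b, a))
                        for a in elements for b in elements)
    transitive = all(leq(a, c) or not (leq(a, b) and leq(b, c))
                     for a in elements for b in elements for c in elements)
    return reflexive and antisymmetric and transitive


def comparable(a, b, leq):
    return leq(a, b) or leq(b, a)


subsets = powerset({1, 2, 3})
print(len(subsets))                                              # 8
print(is_partial_order(subsets, included))                       # True
print(comparable(frozenset({1}), frozenset({2}), included))      # False
```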
Dataset Description
Each dataset in the ACD Repo contains:
- Research-Level Questions
- ML-Friendly Tasks
For example, the Schubert Polynomial Structure Constants task involves predicting a structure constant (a particular coefficient) determined by three permutations.
Some Key Datasets and Tasks
- Irreducible Symmetric Group Characters: Predicting the character value associated with two partitions.
- mHeight Function: Predicting the minimum height of 3412 patterns in a permutation.
- Grassmannian Cluster Algebras: Identifying cluster variables from a given Young tableau.
- Kazhdan-Lusztig Polynomials: Predicting the coefficients, degree by degree, of the polynomial associated with two permutations.
- RSK Correspondence: Recovering a permutation from a pair of Young tableaux.
- Schubert Polynomial Structure Constants: Predicting structure constants from three permutations.
- Lattice Paths Partial Orders: Predicting the partial order relationship of two lattice paths.
- Quiver Mutation Equivalence: Determining whether two quivers are mutation equivalent.
- Weaving Patterns: Classifying a given matrix as a valid weaving pattern.
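To give a feel for one of these tasks, here is a minimal sketch of the Robinson-Schensted row-insertion algorithm, the special case of RSK for permutations, together with its inverse. It is a textbook construction written for illustration, not code from the ACD Repo, and the repository's own conventions and data format may differ. The forward map sends a permutation to a pair of standard Young tableaux (P, Q) of the same shape; the inverse map recovers the permutation, which is the direction the dataset task asks a model to learn.

```python
from bisect import bisect_left, bisect_right


def robinson_schensted(perm):
    """Map a permutation (one-line notation) to its pair (P, Q) of tableaux."""
    P, Q = [], []
    for step, value in enumerate(perm, start=1):
        row = 0
        while True:
            if row == len(P):              # fell off the bottom: start a new row
                P.append([value])
                Q.append([step])
                break
            pos = bisect_right(P[row], value)   # first entry larger than value
            if pos == len(P[row]):         # value fits at the end of this row
                P[row].append(value)
                Q[row].append(step)
                break
            # bump the larger entry into the next row
            P[row][pos], value = value, P[row][pos]
            row += 1
    return P, Q


def inverse_robinson_schensted(P, Q):
    """Recover the permutation from a pair (P, Q) of same-shape tableaux."""
    P = [list(r) for r in P]               # work on copies
    Q = [list(r) for r in Q]
    n = sum(len(r) for r in Q)
    perm = [0] * n
    for step in range(n, 0, -1):
        # the largest remaining recording entry sits at the end of its row
        row = next(i for i, r in enumerate(Q) if r and r[-1] == step)
        Q[row].pop()
        value = P[row].pop()
        for above in range(row - 1, -1, -1):
            # reverse bump: swap with the largest entry smaller than value
            pos = bisect_left(P[above], value) - 1
            P[above][pos], value = value, P[above][pos]
        perm[step - 1] = value
    return tuple(perm)


P, Q = robinson_schensted((2, 1, 4, 3))
print(P)   # [[1, 3], [2, 4]]
print(Q)   # [[1, 3], [2, 4]]
print(inverse_robinson_schensted(P, Q))   # (2, 1, 4, 3)
```

In this example P and Q coincide, which is expected because the permutation 2 1 4 3 is its own inverse.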
Applications and Case Studies of ML
Graph Neural Networks and Quiver Mutation
In one case study, Graph Neural Networks combined with Explainable AI (XAI) techniques were used to study the rules of quiver mutation. This led to the rediscovery of well-known theorems and the development of new conjectures.
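For readers unfamiliar with quiver mutation, the operation itself is short to state in code. The sketch below is not taken from the case study; it assumes the common encoding of a quiver (without loops or 2-cycles) as a skew-symmetric integer matrix B, where B[i][j] is the number of arrows from vertex i to vertex j minus the number from j to i, and applies the standard matrix mutation rule at vertex k. The function name `mutate` is illustrative.

```python
def mutate(B, k):
    """Mutate the skew-symmetric exchange matrix B at vertex k (0-indexed)."""
    n = len(B)
    new = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k or j == k:
                # arrows touching k are reversed
                new[i][j] = -B[i][j]
            else:
                # each path i -> k -> j (or j -> k -> i) adds a shortcut arrow;
                # the bracket is always even, so integer division is exact
                new[i][j] = B[i][j] + (
                    abs(B[i][k]) * B[k][j] + B[i][k] * abs(B[k][j])
                ) // 2
    return new


# The linear quiver 1 -> 2 -> 3, with vertices stored as indices 0, 1, 2.
A3 = [[0, 1, 0],
      [-1, 0, 1],
      [0, -1, 0]]

cycle = mutate(A3, 1)
print(cycle)                   # [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
print(mutate(cycle, 1) == A3)  # True: mutating twice at the same vertex undoes it
```

Mutating the linear quiver at its middle vertex produces an oriented 3-cycle. Two quivers are mutation equivalent when a sequence of such steps turns one into the other, and that is the relation the dataset task is built around.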
Program Synthesis and Schubert Polynomials
In another example, code generation with LLMs produced programs that predicted Schubert Polynomial Structure Constants with near-perfect accuracy. The resulting programs proved to be more human-readable and interpretable than traditional models.
Challenges
Creating mathematical datasets is challenging for several reasons:
- Imbalance of data
- Complexity and diversity of patterns
- Limited number of analyzable signals
Despite these challenges, the ACD Repo has provided a valuable platform for machine learning-based mathematical research.
Conclusion
The ACD Repo exemplifies the intersection of machine learning and high-level mathematics. It gives not only professional mathematicians but also ML researchers an opportunity to study open and challenging mathematical problems. In the future, machine learning models trained on such datasets could yield more effective predictions and mathematical insights.
