Abstract
A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. Few proposals have been developed to tackle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. We represent the categorical predictor by a graph where the nodes are the categories and establish a probability distribution over meaningful partitions of this graph. Conditionally on the observed data, we obtain a posterior distribution for the aggregation of levels, allowing inference about the most probable clustering of the categories. Simultaneously, we obtain inference about all the other regression model parameters. We compare our method with state-of-the-art methods, showing that it has equally good predictive performance and more interpretable results. Our approach balances accuracy against interpretability, a current important concern in statistics and machine learning.
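
To make the idea of a probability distribution over partitions of a category graph concrete, the following is a minimal Python sketch and not the authors' implementation. It rests on two assumptions that the abstract does not state: a partition counts as "meaningful" when every block is a connected subgraph, and each block receives the hypothetical cohesion `alpha * (|S| - 1)!`. The function names, the toy graph, and the cohesion are all illustrative.

```python
from math import factorial, prod


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new singleton block.
        yield [[first]] + partition


def is_connected(block, edges):
    """Return True if `block` induces a connected subgraph of the graph."""
    block = set(block)
    start = next(iter(block))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            # Edges are undirected, so check both orientations.
            for u, v in ((a, b), (b, a)):
                if u == node and v in block and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return seen == block


def cohesion(block, alpha=1.0):
    """Hypothetical cohesion of one block (an assumption of this sketch)."""
    return alpha * factorial(len(block) - 1)


def partition_prior(nodes, edges, alpha=1.0):
    """Normalized product-of-cohesions prior over connected-block partitions."""
    weights = {}
    for partition in set_partitions(list(nodes)):
        if all(is_connected(block, edges) for block in partition):
            key = frozenset(frozenset(block) for block in partition)
            weights[key] = prod(cohesion(block, alpha) for block in partition)
    total = sum(weights.values())
    return {key: weight / total for key, weight in weights.items()}


# Toy example: three categories arranged on a path A - B - C.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]
prior = partition_prior(nodes, edges)
for partition, probability in prior.items():
    print(sorted(sorted(block) for block in partition), round(probability, 3))
```

On this toy graph the partition that groups A with C while leaving B alone is excluded, because A and C are not adjacent. A posterior over partitions would additionally multiply each weight by the marginal likelihood of the data within each block, which this sketch omits. Exhaustive enumeration is only feasible for a handful of categories.
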
| Translated title of the contribution | Manejo de características categóricas con muchos niveles utilizando un modelo de partición de productos |
|---|---|
| Original language | English (US) |
| Pages (from-to) | 786-814 |
| Number of pages | 29 |
| Journal | Annals of Applied Statistics |
| Volume | 17 |
| Issue number | 1 |
| DOIs | |
| State | Published - Mar 1 2023 |
All Science Journal Classification (ASJC) codes
- Computer Science Applications