Handling categorical features with many levels using a product partition model

Tulio L. Criscuolo, Renato M. Assunção, Rosangela H. Loschi, Wagner Meira Jr, Danna Lesley Cruz Reyes

Research output: Contribution to Journal, Research Article, peer-review

1 Scopus citation

Abstract

A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. Few proposals have been developed to tackle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. We represent the categorical predictor by a graph where the nodes are the categories and establish a probability distribution over meaningful partitions of this graph. Conditionally on the observed data, we obtain a posterior distribution for the aggregation of the levels, allowing inference about the most probable clustering of the categories. Simultaneously, we obtain inference about all the other regression model parameters. We compare our method with state-of-the-art methods, showing that it has equally good predictive performance and more interpretable results. Our approach balances accuracy against interpretability, a currently important concern in statistics and machine learning.

Translated title of the contribution: Manejo de características categóricas con muchos niveles utilizando un modelo de partición de productos
Original language: English (US)
Pages (from-to): 786-814
Number of pages: 29
Journal: Annals of Applied Statistics
Volume: 17
Issue number: 1
State: Published - Mar 1 2023

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
