Tuberculosis is an infectious disease that spreads through the air from one person to another and is one of the top ten causes of death worldwide according to the World Health Organization. In biomedical engineering, decision support systems based on artificial intelligence have shown advantages for healthcare personnel in tasks such as diagnosis and screening. One specific area of artificial intelligence is natural language processing; however, most of these approaches depend on the availability of data. This paper describes the construction of a dataset based on the medical records of subjects suspected of having tuberculosis. In addition, it presents an initial exploration of the contents of the constructed dataset and discusses how this approach can be followed by natural language processing to support tuberculosis diagnosis in data-demanding scenarios.

Clinical Relevance - In some developing countries, such as Colombia, it is difficult to develop systems based on artificial intelligence because of the limited availability of data. This proposal offers a strategy for building a dataset to train machine learning models and obtain diagnostic support tools, using the natural language text written by health professionals in the medical record. In this way, models trained on this available information can be employed in places where medical infrastructure is precarious.