Data Pre-processing challenges with generic and domain specific solutions: a critical review
DOI: https://doi.org/10.4314/jobasr.v4i2.42
Keywords: Data preprocessing, Machine learning, Generic and Domain-specific Solutions
Abstract
Data pre-processing is a critical phase in the machine learning and data analysis lifecycle, significantly influencing model accuracy, efficiency, and reliability. While numerous standard techniques such as normalization, encoding, and missing value imputation are widely used, existing literature provides limited guidance on how to address complex, context-dependent challenges that require non-generic solutions. This gap creates uncertainty for practitioners when selecting appropriate preprocessing strategies across diverse data scenarios. This study aims to critically review and systematically categorize data pre-processing challenges by distinguishing between those effectively addressed using generic techniques and those requiring domain-specific or context-aware approaches. A systematic literature review methodology was adopted, synthesizing findings from academic research and industry practices across multiple data modalities, including tabular, textual, image, and time-series data. The findings reveal that generic techniques are effective for routine data issues but are insufficient for handling semantic inconsistencies, complex feature interactions, and context-driven anomalies. To address this, the study proposes a structured, decision-oriented framework that guides practitioners in evaluating data characteristics, identifying preprocessing challenges, and selecting appropriate strategies. This work contributes a practical and unified approach that enhances decision-making in data pre-processing, ultimately improving the robustness, interpretability, and performance of machine learning models.
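As a rough illustration of the generic techniques the abstract names (missing value imputation, normalization, and encoding), the following is a minimal sketch in plain Python. It is not taken from the reviewed study; the helper names and the sample columns are hypothetical, and mean imputation, min-max scaling, and one-hot encoding are assumed here as representative choices among many.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Map each categorical label to a binary indicator vector.

    Columns follow the alphabetical order of the distinct labels.
    """
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Hypothetical sample data (not from the paper).
ages = [20, None, 40, 30]                  # numeric column with one gap
colors = ["red", "blue", "red", "green"]   # categorical column

ages_filled = impute_mean(ages)            # the gap is filled with the mean
ages_scaled = min_max_normalize(ages_filled)
colors_encoded = one_hot_encode(colors)
```

Steps like these treat every column the same way regardless of meaning, which is why they suit routine issues but cannot by themselves resolve the semantic inconsistencies or context-driven anomalies the abstract describes.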
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.