Abstract
Identifying the languages written in cuneiform script is challenging because resources are scarce and the text is difficult to tokenize. Seven languages and dialects written in cuneiform need to be distinguished: Sumerian and six dialects of Akkadian (Old Babylonian, Middle Babylonian Peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian, and Neo-Assyrian). This problem is addressed by the Cuneiform Language Identification task in VarDial 2019. This paper presents ten machine learning algorithms drawn from four types of machine learning: supervised, ensemble, instance-based, and Artificial Neural Network learning. These include the Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR), and Decision Tree (DT) algorithms within supervised learning; the K-Nearest Neighbors (KNN) algorithm within instance-based learning; and the Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Gradient Boosting (GB) algorithms within ensemble learning. In addition, n-grams, a natural language processing technique, are used to identify the cuneiform dialect. The best result is obtained by an ensemble of Random Forest classifiers working on character-level features, with a macro-averaged F1 score of 96%; the best outcome for the n-gram approach, obtained with di-grams, is 0.82%.
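To make the character-level n-gram idea and the macro-averaged F1 metric concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the add-one-smoothed scoring rule, the function names (`char_ngrams`, `train`, `predict`, `macro_f1`), and the toy strings with labels "A" and "B" are all assumptions made for illustration, and the toy strings are Latin placeholders rather than real cuneiform data.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text (n=2 gives di-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=2):
    """Count character n-grams per label; samples is an iterable of (text, label)."""
    counts = {}
    for text, label in samples:
        counts.setdefault(label, Counter()).update(char_ngrams(text, n))
    return counts


def predict(counts, text, n=2):
    """Return the label whose smoothed n-gram model scores text highest."""
    vocab = set()
    for label_counts in counts.values():
        vocab.update(label_counts)
    v = len(vocab) + 1  # one extra slot reserves probability mass for unseen n-grams
    best_label, best_score = None, -math.inf
    for label in sorted(counts):
        label_counts = counts[label]
        total = sum(label_counts.values())
        # Add-one smoothing (an assumption of this sketch, not taken from the paper).
        score = sum(
            math.log((label_counts[g] + 1) / (total + v))
            for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy placeholder data (hypothetical; not cuneiform, not from the paper).
toy_train = [("abababab", "A"), ("ababab", "A"), ("cdcdcdcd", "B"), ("cdcdcd", "B")]
toy_test = [("abab", "A"), ("cdcd", "B")]

model = train(toy_train)
preds = [predict(model, text) for text, _ in toy_test]
score = macro_f1([label for _, label in toy_test], preds)
```

Macro averaging weights every class equally, so a dialect with few test lines counts as much toward the score as a frequent one.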
More From: Iraqi Journal of Information and Communication Technology