Wibowo, Nadhif Ikbar (2022) Pengembangan Model Bahasa Indonesia Untuk Domain Kesehatan Dan Biomedis Menggunakan Teknik Transfer Learning. Other thesis, Institut Teknologi Sepuluh Nopember.
Text: 05211840000026-Undergraduate_Thesis.pdf (5MB). Restricted to Repository staff only.
Abstract
A high-quality language model (LM) is a key foundation for solutions to many NLP problems: it provides language representation capabilities learned by training a neural network on large amounts of data. However, the limited number of available corpora and limited access to large computing resources are among the main obstacles to developing NLP solutions for Indonesian. Resources become scarcer still when the usable corpus must be restricted to a specific domain. Previous research has shown that continued pretraining of a broad-domain pretrained language model (PLM), a method called domain-adaptive pretraining (DAPT), produces a model that performs better on in-domain downstream tasks while requiring relatively few resources. This research develops a Bidirectional Encoder Representations from Transformers (BERT) model by reusing IndoBERT, a language model pretrained on broad-domain Indonesian datasets, and applying DAPT to adapt it to the biomedical domain using an unannotated dataset of biomedical publication text. The finetuned models achieved state-of-the-art (SOTA) results on two downstream tasks without requiring an intensive finetuning process, which indicates that the adapted models were successfully finetuned for downstream tasks in the biomedical text domain. The best LM achieved a test accuracy of 47.45% on the Ohsumed text classification dataset and 99.12% on the token classification dataset.
Keywords: Language Model, Transfer Learning, Domain Adaptation, BERT, Biomedical Text.
| Item Type: | Thesis (Other) |
|---|---|
| Additional Information: | RSSI 006.35 Wib p-1 2022 |
| Uncontrolled Keywords: | Language Model; Transfer Learning; Domain Adaptation; BERT; Biomedical Text |
| Subjects: | H Social Sciences > HD Industries. Land use. Labor > HD30.213 Management information systems. Dashboards. Enterprise resource planning. |
| Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information System > 57201-(S1) Undergraduate Thesis |
| Depositing User: | Mr. Marsudiyana |
| Date Deposited: | 03 Jun 2026 02:33 |
| Last Modified: | 03 Jun 2026 02:33 |
| URI: | http://repository.its.ac.id/id/eprint/133505 |
