Practical Methods Validation For Variables Selection In The High Dimension Data: Application For Three Metabolomics Datasets

Choiruddin, Achmad (2015) Practical Methods Validation For Variables Selection In The High Dimension Data: Application For Three Metabolomics Datasets. Masters thesis, Institut Teknologi Sepuluh Nopember.

1312201905-Master Thesis.pdf - Published Version

Background: Variable selection in high-throughput metabolomics data is becoming unavoidable for extracting relevant information, since such data often exhibit a high degree of multicollinearity and, as a result, lead to severely ill-conditioned problems. In both supervised classification frameworks and machine learning algorithms, one solution is to reduce the data dimensionality, either by performing feature selection or by introducing artificial variables, in order to enhance the generalization performance of a given algorithm and to gain insight into the concept to be learned.
Objective: The main objective of this study is to select a set of features from the thousands of variables in a dataset. We divide this objective into two parts: (1) low-level analysis, which identifies small sets of features (fewer than 15) that could be used for diagnostic purposes in clinical practice, and (2) middle-level analysis, which identifies a larger set of features (around 50-100) that are related to the outcome of interest. In addition, we compare the performance of several proposed feature selection techniques for metabolomics studies.
Method: This study uses four techniques, two machine learning techniques (RSVM and RFFS) and two supervised classification techniques (PLS-DA VIP and sPLS-DA), to classify three datasets (human urine, rat urine, and rat plasma), each of which contains samples from two classes.
Results: RSVM with leave-one-out cross-validation (RSVM-LOO) consistently achieves higher accuracy than RSVM with the other two cross-validation methods, i.e., bootstrap and N-fold. However, RSVM is not the best performer overall, since RFFS achieves higher accuracy. PLS-DA and sPLS-DA, in turn, perform well in terms of both explained variability and predictive ability. From a biological standpoint, RFFS and PLS-DA VIP select more features in common with previous metabolomics work than RSVM and sPLS-DA do. This is confirmed by the statistical comparison, in which RFFS and PLS-DA achieve the highest similarity percentage of selected features. Furthermore, RFFS and PLS-DA VIP select three of the five metabolites confirmed in previous metabolomics work, which RSVM and sPLS-DA could not achieve.
Conclusion: RFFS appears to be the most appropriate technique for feature selection, particularly in low-level analysis, where a small set of features is often desirable. Both PLS-DA VIP and sPLS-DA perform well in terms of explained variability and predictive ability, but PLS-DA VIP is slightly better in terms of biological insight. RSVM, besides being limited to two-class problems, did not achieve good performance in either statistical or biological interpretation.

Item Type: Thesis (Masters)
Additional Information: RTSt 519.53 Cho p
Uncontrolled Keywords: High dimension data, Features selection, Classification analysis, Metabolomics
Subjects: Q Science > QA Mathematics > QA278 Cluster Analysis. Multivariate analysis. Correspondence analysis (Statistics)
Divisions: Faculty of Mathematics and Science > Statistics > 49101-(S2) Master Thesis
Depositing User: Mr. Tondo Indra Nyata
Date Deposited: 04 Jun 2018 02:14
Last Modified: 04 Jun 2018 02:32
