# Data Integration and Dimensionality Reduction: A Machine Learning Perspective

**Minta Thomas**

**Abstract**

In the last few years, machine learning techniques have been successfully applied in many areas, such as information retrieval [1], e-commerce, image processing [2], computational biology, and chemistry. To understand and explore real datasets, we often apply machine learning techniques such as clustering or classification in a high-dimensional space. However, developing these models on large datasets can be very time-consuming because of the high dimensionality of the data. Dimensionality reduction is a central technique in unsupervised learning for uncovering meaningful structure or previously unknown patterns in multivariate data.

Principal Component Analysis (PCA) [3] is a widely used approach for dimensionality reduction; however, if the data are concentrated in a nonlinear subspace, PCA will fail to work well. In this case, one may need to consider a nonlinear version of PCA, kernel principal component analysis (KPCA). KPCA has been studied intensively in machine learning over the last several years and has been applied successfully in many settings [4]. As a kernel method, KPCA suffers from the problem of choosing the hyperparameters of the kernel function, and no well-founded method for doing so in a purely unsupervised setting has been established. Most existing approaches to parameter estimation for KPCA are coupled with the final classifier, so the performance of KPCA depends on the choice of classifier. This shows the importance of a mathematical technique that selects the kernel hyperparameters based on unsupervised learning alone. We will discuss a data-driven bandwidth selection criterion for KPCA [5].
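To make the contrast between PCA and KPCA concrete, here is a minimal NumPy sketch under stated assumptions: a Gaussian RBF kernel, a toy dataset of two concentric rings, and an arbitrarily chosen bandwidth `sigma`. The function names (`rbf_kernel`, `kernel_pca`, `linear_pca`, `two_circles`) are hypothetical, and the bandwidth is not chosen by the selection criterion of [5]; the sketch only shows where that hyperparameter enters.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_pca(X, sigma, n_components):
    """Project the training points onto the leading kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    # Center the kernel matrix, i.e. center the data in feature space.
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    eigvals, eigvecs = np.linalg.eigh(Kc)          # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Training-point scores are the eigenvectors scaled by sqrt(eigenvalue).
    scores = eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
    return scores, eigvals

def linear_pca(X, n_components):
    """Ordinary PCA scores via the SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def two_circles(n_per_ring=100, radii=(1.0, 3.0), noise=0.05, seed=0):
    """Toy data: two noisy concentric rings, a nonlinear structure in 2-D."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=2 * n_per_ring)
    r = np.repeat(radii, n_per_ring) + noise * rng.normal(size=2 * n_per_ring)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    labels = np.repeat([0, 1], n_per_ring)
    return X, labels

if __name__ == "__main__":
    X, labels = two_circles()
    pca_scores = linear_pca(X, n_components=2)
    # sigma = 1.0 is an arbitrary illustrative bandwidth, not a tuned value.
    kpca_scores, eigvals = kernel_pca(X, sigma=1.0, n_components=2)
    print("PCA scores shape:", pca_scores.shape)
    print("KPCA scores shape:", kpca_scores.shape)
    print("Leading KPCA eigenvalues:", eigvals)
```

Linear PCA can only rotate the rings, whereas the KPCA scores depend on `sigma`. Rerunning the sketch with different bandwidths shows how strongly the embedding depends on this choice.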

Data integration is the process of combining data from multiple sources into meaningful and valuable information. It plays an important role in both business and scientific domains. Data integration aims to combine selected data sources into a single comprehensive view. A variety of applications benefit from it. In business intelligence, integrated data sources can be used for querying and reporting on business activities, and for statistical analysis, data mining, and machine learning to support forecasting, decision making, and prediction. Incorporating literature information into gene expression data analysis is another such scenario: the actual expression data are analyzed in conjunction with existing textual information on genes, proteins, diseases, and so on [6].

Although several researchers have already proposed nonlinear data integration models, these models are coupled with the selected classifiers. We propose a kernel-based mathematical framework for data integration and classification: a weighted LS-SVM classifier. Compared with existing approaches, it is a simple mathematical framework for kernel-based data integration. The framework can be applied to any two complex data sources that share a common space, where the final goal is to make a prediction or classification based on this common space [7]. The approach could serve as a standard mathematical framework for improving classification performance through heterogeneous data integration.
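One generic way to picture kernel-based integration of two data sources is to form a weighted sum of one kernel per source and train an LS-SVM classifier on the combined kernel. The sketch below illustrates that general idea only and should not be read as the exact weighting scheme of [7]. The weight `mu`, the bandwidths `sigma1` and `sigma2`, the regularization constant `gamma`, the toy data, and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def combined_kernel(A1, B1, A2, B2, mu, sigma1, sigma2):
    """Convex combination of one RBF kernel per data source (0 <= mu <= 1)."""
    return mu * rbf_kernel(A1, B1, sigma1) + (1.0 - mu) * rbf_kernel(A2, B2, sigma2)

def lssvm_fit(K, y, gamma):
    """Solve the LS-SVM classifier linear system for labels y in {-1, +1}.

    [ 0   y^T             ] [ b     ]   [ 0 ]
    [ y   Omega + I/gamma ] [ alpha ] = [ 1 ],  Omega_ij = y_i y_j K_ij
    """
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]  # alpha, b

def lssvm_predict(K_new_train, y_train, alpha, b):
    """Predicted labels: sign of sum_i alpha_i y_i K(x, x_i) + b."""
    f = K_new_train @ (alpha * y_train) + b
    return np.where(f >= 0.0, 1.0, -1.0)

def make_toy_sources(n_per_class=20, seed=0):
    """Two hypothetical feature sets describing the same samples."""
    rng = np.random.default_rng(seed)
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    X1 = 2.0 * y[:, None] + 0.3 * rng.normal(size=(2 * n_per_class, 2))
    X2 = 2.0 * y[:, None] + 0.3 * rng.normal(size=(2 * n_per_class, 3))
    return X1, X2, y

if __name__ == "__main__":
    X1, X2, y = make_toy_sources()
    # mu, sigma1, sigma2, and gamma are arbitrary illustrative values.
    K = combined_kernel(X1, X1, X2, X2, mu=0.5, sigma1=2.0, sigma2=2.0)
    alpha, b = lssvm_fit(K, y, gamma=10.0)
    pred = lssvm_predict(K, y, alpha, b)
    print("Training accuracy:", np.mean(pred == y))
```

A convex combination of valid kernels is itself a valid kernel, so the classifier needs no modification. In practice `mu` would be tuned, for example by cross-validation, together with the kernel bandwidths and `gamma`.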

**References**

1. Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34, 1 (2002), 1–47.

2. Lézoray, O., Charrier, C., Cardot, H., and Lefèvre, S. Machine learning in image processing. EURASIP Journal on Advances in Signal Processing 2008, 1 (2008), 1–2.

3. Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2, 11 (1901), 559–572.

4. Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, Proceedings of the 2001 (2001), pp. 849–856.

5. Thomas, M., De Brabanter, K., and De Moor, B. New bandwidth selection criterion for Kernel PCA: Approach to dimensionality reduction and classification problems. BMC Bioinformatics 15, 137 (2014).

6. Hamid, J. S., Hu, P., Roslin, N. M., Ling, V., Greenwood, C. M. T., and Beyene, J. Data integration in genetics and genomics: Methods and challenges. Hum Genomics Proteomics (2015).

7. Thomas, M., De Brabanter, K., Suykens, J. A. K., and De Moor, B. Predicting breast cancer using an expression values weighted clinical classifier. BMC Bioinformatics 15, 411 (2014).

03.07.17 - 16:15

B 016