CLUSTERING SUBJECTS IN LAMPUNG PROVINCIAL NATIONAL EXAMINATION OF JUNIOR HIGH SCHOOL THROUGH MAXIMUM SPANNING TREE

National examinations, held every year in Indonesia, have produced a large volume of data, including in Lampung Province. Between 2015 and 2018, approximately 11 million national examination results were collected. These data can provide a great deal of information for the improvement of Indonesian education. This paper discusses the clustering of subjects in the junior high school national examination in Lampung Province, with the aim of mapping the relationships among learning outcomes in Mathematics, Sciences, Indonesian, and English. The clustering is intended as a further analysis so that educational improvement can also begin to address the non-technical realm, such as scheduling, adjusting class and assignment time, and determining prerequisite subjects. The clustering analysis is performed using an algorithm based on maximum spanning tree (MST) graphs, supported by descriptive and inferential statistical techniques. The data used are the 2017 and 2018 national examination results in Lampung Province, totaling 15,876 records. The MST graph is constructed using Kruskal's algorithm, and the correlation is measured using the Pearson correlation coefficient. The experimental results show that Mathematics, Natural Sciences, and English have the strongest correlation. The subject most strongly correlated with Indonesian is English. Therefore, the subjects can be arranged into two groups, namely Mathematics-Sciences-English and English-Indonesian-Sciences. These results can serve as a basis for decision making on various technical aspects to optimize junior high school student learning outcomes, especially in Lampung Province.


INTRODUCTION
One of the evaluations of learning conducted by the Indonesian government is the national examination. According to the official website of the Ministry of Education and Culture of Indonesia, the educational assessment center releases an annual analysis of the national examination results. Those results are used as the basis for improving the quality of education. Based on data from www.unbk.kemendikbud.go.id, the numbers of schools that took the computer-based national examination in 2015, 2016, and 2017 were 555, 4,382, and 30,557, respectively (Admin, 2017). These numbers do not include the schools that took the national examination on paper. According to www.jpp.go.id, by 2018 the number of schools taking the computer-based national examination had risen to 69,113, with a predicted 6,291,323 participating students. On this basis, over 11 million national examination records are estimated to have been collected from 2015 to 2018, from elementary school through senior high school. So far, the analysis of the national examination data has been carried out only with the main objective of mapping students' cognitive abilities in Indonesia, measured by descriptive statistics extracted from the national examination scores (Maskar, 2020). Because so much data on national examination results is available, it would be a shame if the analysis were limited to finding the average national examination scores. These data can provide much more in-depth and useful information for the improvement of education in Indonesia if analyzed thoroughly and precisely.
Data processing and analysis techniques have developed significantly; one of them is educational data mining (EDM). EDM is a subset of data mining and a new area concerned with processing educational data by designing and using appropriate algorithms. This area has great potential to produce a variety of benefits in several sectors of education, including decision making, guidance, feedback to students, teachers, or scientists, and improved administration of learning items, all of which can be carried out by the authorities of educational institutions (Hussain et al., 2019; Mishra et al., 2017; Silva & Fonseca, 2017).
The area of EDM consists of data visualization, statistical analysis, data training, machine learning, grouping or clustering, classification, and outlier analysis (Kumar & Pal, 2011; Shingari et al., 2017). The main purpose of EDM is to turn large volumes of data into useful information that can be used as a basis for decision making by educational institutions. One method that can be used is data clustering, which has been successfully applied in data mining (Cheng et al., 2014). Data clustering can be done with several techniques or algorithms, including single link, average link, minimum spanning tree, and k-means (Li et al., 2019; Yang, 2017). Data clustering aims to extract relationships in a dataset and to identify interesting patterns based on the similarity of the samples, by grouping data objects into clusters so that objects in a cluster have a high degree of similarity to one another and are very different from objects in other clusters. Clustering techniques have been studied and applied in the fields of statistics, machine learning, pattern recognition, and image processing (Cheng et al., 2014; Li et al., 2019; Yang, 2017). The contribution of this paper is to determine the relationship between junior high school national examination 10. This fact shows that the average value of the junior high school national examination in Lampung Province is still below the national average. It is interesting that, when referring to the national average data, the average value does not even reach 55. This shows that there is a problem with cognitive abilities, at a minimum, not only in Lampung Province but also nationally.
These educational problems are found not only in matters of substance, or technical aspects, but also in the non-technical aspects, including scheduling, determining the length of time to study in class, setting assignments for students, and other non-technical aspects related to the teaching and learning process. The curriculum change to "Kurikulum 2013" certainly has an effect on this as well, and it also indirectly changes the non-technical aspects of the learning process. This is often overlooked in efforts to address education problems in Indonesia. Improvement of technical or substance factors still dominates over improvement of non-technical factors. Therefore, analysis of non-technical factors needs to be popularized and carried out.
Previously, Nurviana (2016) and Maskar (2020) conducted similar research in different places using national examination score data at the senior high school level. Nurviana worked with data from one school in Aceh, with results showing that mathematics is the center of the subjects, which means that other subjects follow mathematical learning patterns. Maskar analyzed the national examination dataset for the science program in Lampung Province, with results showing science and mathematics as the center of the subjects. The analysis in both studies used a maximum spanning tree graph model.
Clustering of subjects can be used as a basis for decision making by educational institutions on a number of technical matters. The aim is to optimize the learning outcomes of junior high school students, especially in Lampung Province. These technical matters include competency mapping in subjects, preparation of lesson schedules, distribution of subject learning hours, and other policies. Effective and efficient policy formulation is one of the keys to the success of an institution in achieving its stated learning goals. Therefore, educational data mining is fundamental as the basis for making decisions to develop policies in educational institutions. ISSN 2089-8703 (Print) Volume 10, No. 4, 2021, 2268-2282ISSN 2442

Research Procedure
This study uses a quantitative research design and processes data on the results of the 2017 and 2018 junior high school national examinations. The data processing procedure starts with descriptive statistical methods through scatterplot analysis, continues with inferential statistical analysis through correlation coefficient values, and ends with the maximum spanning tree graph model. The three analyses are then combined to draw conclusions regarding the grouping of subjects according to the abilities of junior high school students in those years.

Dataset
The 2017 and 2018 data for junior high schools in Lampung Province were obtained from the Lampung provincial education office and the website www.hasilun.puspendik.kemendikbud.go.id. The data cover 1,994 and 1,975 schools in Lampung Province in 2017 and 2018, respectively. In total, 15,876 records were collected, covering four subjects: Science, Mathematics, Indonesian Language, and English.

Formation of the MST Graph Model
The subject clustering process uses the maximum spanning tree (MST) graph method, which is a modification of the minimum spanning tree. The modification is needed because the edge weights used for clustering subjects are correlation coefficients: the higher the coefficient (at most 1), the stronger the relationship between two subjects, so the tree should maximize rather than minimize the total weight. A minimum spanning tree is a subgraph without cycles that connects all vertices of the graph with minimum total weight, which makes it a suitable graph model for determining the relationships between vertices. The vertex with the strongest relationships is the one with the most branches, that is, the highest degree (Crobe et al., 2016; Kenneth H. Rosen, 2013).
A minimum spanning tree can be constructed using Prim's or Kruskal's algorithm. It is built from all the data by assigning a weight to each edge of the graph and then, over a number of steps, forming a new graph by removing edges whose weight is greater than the previous weight.
The basic idea is to calculate the sum of all the weights in the initial graph and to begin from some vertex of the given graph G = (V, E). All vertices directly connected to the starting point are taken in stages, and edge weights are subtracted from the total one by one (Lingaya et al., 2019; Walter & Dubey, 2016).
The Maximum Spanning Tree (MST) graph model for subject clustering is formed by interpreting each vertex of the graph as a subject. There are four subjects: Science, Mathematics, Indonesian Language, and English. Each edge of the graph is interpreted as the correlation between two subjects, with the value of the correlation coefficient as the edge weight. The correlation coefficient used is the Spearman correlation coefficient. As a result, an initial graph can be formed: a complete graph that models the relationships between subjects in the junior high school national examination based on the correlation coefficient values, as shown in Figure 1, where MATH denotes Mathematics, SCIENCE denotes Science, ENG denotes English, and IND denotes Indonesian.
The values of f, g, h, i, j, and k are the correlation coefficients between subjects. Based on the initial graph, the MST graph is formed using Kruskal's algorithm, taking the correlation coefficients from largest to smallest, based on the following provisions (Sugiyono, 2015):  (2013), with a slight modification, the formation of the Maximum Spanning Tree graph with Kruskal's algorithm is as follows:
1) Sort the edges of the graph G by weight, from the largest to the smallest.
2) Select the edge (u, v) with the maximum weight that does not form a cycle, and add (u, v) to T.
3) Repeat step 2 until the Maximum Spanning Tree is formed, that is, until the number of edges in the spanning tree T is n-1 (n is the number of vertices in the graph).
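The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `maximum_spanning_tree` and the edge weights are hypothetical, chosen only to make the example runnable, and are not the coefficients reported in this paper. A simple union-find structure is assumed for the cycle check.

```python
def maximum_spanning_tree(vertices, weighted_edges):
    """Kruskal's algorithm with edges taken from largest to smallest weight."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    # Step 1: sort the edges from the largest weight to the smallest.
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        # Step 2: accept the edge only if it does not close a cycle.
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
        # Step 3: stop once the tree has n - 1 edges.
        if len(tree) == len(vertices) - 1:
            break
    return tree


# HYPOTHETICAL correlation coefficients, for illustration only.
SUBJECTS = ["MATH", "SCIENCE", "ENG", "IND"]
EDGES = [
    ("MATH", "SCIENCE", 0.90),
    ("SCIENCE", "ENG", 0.85),
    ("MATH", "ENG", 0.80),
    ("SCIENCE", "IND", 0.70),
    ("ENG", "IND", 0.65),
    ("MATH", "IND", 0.60),
]

mst = maximum_spanning_tree(SUBJECTS, EDGES)
for u, v, w in mst:
    print(f"{u} - {v}: {w:.2f}")
```

With these made-up weights, the MATH-ENG edge is skipped because MATH and ENG are already connected through SCIENCE, so adding it would close a cycle.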
The clustering is determined from several analyses: scatterplot graphs from descriptive statistical techniques, correlation coefficient values, and finally neighborhood relations on the MST graphs. The analysis is carried out carefully so that the subject clusters reflect the closest relationships and conform to good clustering rules.

Basic Analysis
This basic analysis aims to obtain initial information from the dataset of national examination results in Lampung Province in 2017 and 2018. The data are processed using SPSS software. Table 2 shows the numerical summary of the data as initial information for further analysis. Two important pieces of information can be drawn from the averages and standard deviations of the two data groups. First, the average of both data sets is relatively low, below 50. The standard deviations of the two data sets indicate that the averages are representative of the majority of the individual national examination results, which supports this observation about the mean values. Second, the two data groups tend to be nearly identical, as can be seen from each item of the numerical summary in Table 2. Therefore, the subject clustering analysis tends to be easier, because the two data groups do not differ significantly even though they were taken in different years.

Scatterplot Analysis
A scatterplot is a graph used to analyze the relationship, or linearity, between two groups of data. The purpose of the scatterplot analysis is to examine correlations between subjects using a graphical approach. The scatterplots of the national examination data in Lampung Province in 2017 and 2018 are shown in Figure 2. The analysis shows that, in the 2017 and 2018 national examination results in Lampung Province, a strong relationship holds among three subjects, namely Natural Sciences, Mathematics, and English, while Indonesian is correlated with English. This gives an initial picture of the subject mapping based on the strongest relationships: the subjects are divided into two groups, English belongs to both groups, and Indonesian is in one group with English. Figure 3 shows the initial mapping based on the results of the scatterplot analysis. The initial cluster in Figure 3 forms the basis of the subsequent analysis using correlation coefficient values and the MST graph model.

Correlation Coefficient analysis
The correlation coefficient used in this paper is the Spearman correlation coefficient. It can be applied to groups of data whether or not they are normally distributed, which is the reason for choosing it. The Spearman correlation coefficients, computed with the aid of SPSS software, are given in Table 3. Table 3 shows some important information: three subjects, Mathematics, Science, and English, dominate as the three subjects with a close relationship. This supports the finding of the previous scatterplot analysis. As a result, the first cluster remains occupied by these three subjects. The second cluster differs slightly from the previous one. Table 3 shows that in 2017 and 2018 the relationship between Indonesian Language and Science was relatively consistent (strong), with coefficients of 0.689 and 0.684, respectively. Meanwhile, the relationship between Indonesian and English strengthened considerably in 2018 compared with 2017, with a correlation coefficient greater than that of Indonesian-Science. The analysis yields a new cluster that differs slightly from the previous analysis, with Science added to the second cluster, as shown in Figure 4.
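The Spearman coefficient is the Pearson correlation computed on the ranks of the data, which is why it does not depend on normality. The paper obtains its values from SPSS; the following is only a minimal pure-Python sketch of the same quantity. The function names and the two score lists are hypothetical and serve only as an illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# HYPOTHETICAL school-level mean scores, for illustration only.
math_scores = [42.0, 55.5, 38.0, 61.0, 47.5]
science_scores = [45.0, 58.0, 41.0, 57.0, 50.0]

rho = spearman(math_scores, science_scores)
print(round(rho, 3))
```

Computing one such coefficient for each of the six subject pairs gives the edge weights of the complete graph on the four subjects.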

Maximum Spanning Tree Analysis
Based on the correlation coefficients above, the initial graph model of the relationships between subjects in the national examinations in Lampung Province in 2017 and 2018 is shown in Figure 5. The correlation coefficients between subjects listed in Table 3 are sorted from the strongest to the weakest relationship for each year, 2017 and 2018. The ordering is needed to apply Kruskal's algorithm to form a maximum spanning tree (MST) graph. Table 4 lists the sorted correlation coefficients for 2017. Table 5 lists the sorted correlation coefficients between subjects in the national examinations in Lampung Province in 2018, used to form the second MST graph model. The results of the junior high school national examinations were eventually modeled by three different MST graphs, one for the 2017 data and two for the 2018 data. Figure 6 shows the relationships between subjects in the Lampung Province junior high school national examinations in 2017 and 2018 based on the principle of neighborhood on the MST graph. In the 2017 MST graph, Science is clearly the center of the subjects, with the highest degree (three), meaning that students' ability in Science indirectly influences their learning outcomes in the other subjects of the national examination (UN). Note that in the process of forming the 2017 MST graph, the MATH-SCIENCE-ENG cycle is encountered. As a result, the closest subject relationship belongs to the group consisting of these three subjects (MATH-ENG-SCIENCE), and the subject with the closest relationship to Indonesian is Science, with a correlation coefficient of 0.684, so Science is grouped with Indonesian. For 2018, Figure 6 shows two variations of the MST graph.
The first 2018 MST graph, on the left in Figure 6, has its central vertices at Science and English: the neighbors of Science are Mathematics and English, while the neighbors of English are Science and Indonesian. The second 2018 MST graph, on the right, also has two central vertices, namely Mathematics and English: the neighbors of Mathematics are Science and English, while the neighbors of English are Mathematics and Indonesian. Based on the 2018 MST graph model, the first subject cluster is SCIENCE-MATH-ENG and the second cluster is SCIENCE-ENG-IND, on the consideration that the relationship between Mathematics and Indonesian is relatively weaker than with the other subjects.
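The notion of a central subject used above, the vertex with the highest degree in the tree, can be checked mechanically. The sketch below is an illustration only: the helper name `central_subjects` is hypothetical, and the edge lists are transcribed from the verbal description of the three MST graphs in the text rather than taken from Figure 6 itself.

```python
from collections import Counter


def central_subjects(tree_edges):
    """Return the degree of every vertex and the vertices of maximum degree."""
    degree = Counter()
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    centers = sorted(s for s, d in degree.items() if d == top)
    return dict(degree), centers


# Edge lists as described in the text (assumed transcription of Figure 6).
mst_2017 = [("SCIENCE", "MATH"), ("SCIENCE", "ENG"), ("SCIENCE", "IND")]
mst_2018_first = [("MATH", "SCIENCE"), ("SCIENCE", "ENG"), ("ENG", "IND")]
mst_2018_second = [("SCIENCE", "MATH"), ("MATH", "ENG"), ("ENG", "IND")]

for name, edges in [
    ("2017", mst_2017),
    ("2018 (first)", mst_2018_first),
    ("2018 (second)", mst_2018_second),
]:
    degrees, centers = central_subjects(edges)
    print(name, degrees, centers)
```

Under this transcription, Science is the only center in 2017 with degree three, while each 2018 tree is a path with two degree-two centers.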
Based on the analysis of the MST graph models for the two years, the final clusters of subjects in the national examinations in Lampung Province, based on the 2017 and 2018 data, are shown in Figure 7. Another advantage of clustering with the MST graph model is that it can order the members of each cluster according to how strongly each influences the other subjects. The first cluster is unchanged across the scatterplot analysis and the correlation coefficient analysis, and it is confirmed by the analysis of the MST graph model. The final analysis reinforces the previous argument and adds information on the order of the subjects within the cluster according to their level of influence, so that the cluster members are Natural Sciences, Mathematics, and English, in that order.
The members of the second cluster are Science, English, and Indonesian, in that order. The cluster was formed on the consideration that Science has a strong influence on all subjects in the Lampung Province national examination, based on the 2017 MST graph. This reinforces the argument for the second cluster based on the correlation coefficient values. Science is placed first in the order on the basis of the 2017 MST graph, and because Indonesian has a close relationship only with English, the cluster members are Science, English, and Indonesian, respectively.

Discussion
The final grouping shows a close relationship among Science, Mathematics, and English in the first cluster, and another relationship among Science, English, and Indonesian in the second cluster. The two clusters have an interesting feature, shown in Figure 8: in the first cluster there is one language subject, namely English, alongside Mathematics and Science, and in the second cluster there is one science subject between two language subjects (English and Indonesian). In this section, a careful analysis of the two clusters is carried out through a theoretical study based on relevant sources.
The first cluster is the one with the closest relationship based on the correlation coefficient values in Table 5. In this cluster there is a strong relationship among three subjects: English, Mathematics, and Science. The interesting point here is the relationship between English and Mathematics-Science. Studies in countries where English is not the main language show that English language skills indirectly have a significant effect on academic achievement in Science and Mathematics for students in Grades 10 and 11 and at the beginning of the first year of higher education (Fenoll, 2018; Maskar, 2020; Murray, 2017; P. Ferrer & J. Dela Cruz, 2017; Stoffelsma & Spooren, 2019). The role of English in Mathematics learning lies in the delivery of mathematical concepts. Mathematical concepts, consisting of definitions, theorems, postulates, and others, come from English-language sources. Therefore, good English skills are needed to understand mathematical concepts well, and the arguments behind a concept are easier to follow when the primary source is consulted directly. However, a significant effect on academic achievement can occur only if the Mathematics teacher understands mathematical concepts well and also has good English skills (Stacey, 2016). The 2017 and 2018 Lampung Province national examination results show that this influence has not yet been able to support student academic achievement. This can be seen from the average Mathematics score in the UN, which is still below 50%, and which correlates with the English score, whose average is also below 50%. Furthermore, the role of English in Science learning lies in understanding the use of scientific terms. In addition, reading skills have a major effect on increasing academic achievement in Science learning (A. Imam, 2016; Hacioğlu et al., 2016).
The reading skills in this cluster are certainly not skills practiced directly in the classroom or taught by teachers, because the language of instruction in the majority of schools in Lampung Province is Indonesian. Rather, they are an additional skill of students who like to explore and consult wider sources in English. This skill certainly needs to be supported by the financial capacity of the students' parents, so the influence of English reading skills on academic achievement in Science is itself influenced by other variables, such as facilities, money, and the home environment (Prinsloo & Harvey, 2020). This influence is also still relatively low when referring to the 2017 and 2018 (Maskar, 2018; Maskar & Anderha, 2019; Wulantina & Maskar, 2019). The learning method through this cultural approach requires good skills in reading literacy and other relevant knowledge. Therefore, the right grouping of subjects is important so that learning objectives can be achieved. In addition, the use of technology greatly supports students in understanding concepts and makes it easier to carry out learning simulations in class or at home. The use of technology has a significant influence in helping teachers and students support the learning process. In Mathematics learning, GeoGebra is one of the technology-based media that can be used for free but still plays an optimal role (Maskar & Dewi, 2020; Maskar & Wulantina, 2019).
The second cluster consists of Indonesian, Science, and English. The relationship between English and Science has been discussed above: it lies in reading skills applied to resources beyond those provided at school, through which students gain more knowledge and can understand scientific terms in Science learning.
In this context, Indonesian language lessons should have a more significant effect, considering that the general language of instruction in junior high schools in Lampung Province is Indonesian. In general, the correlation of Indonesian with Science learning lies in students' reading skills, which allow them to understand concepts well. Meanwhile, the relationship between English and Science lies in the ability of students to understand scientific terms in Science learning and to consult wider sources. These skills certainly need to be optimized through the right techniques in Science learning. This optimization can be done using a technique developed by Brown and Ryoo (Prinsloo & Harvey, 2020) in 2008 when they conducted research in Africa. The approach introduces and teaches science phenomena in everyday language first and then switches to scientific language. The technique is described as teaching science concepts linguistically and ethnically. This method is significant in Africa because it can separate the concepts of Science from its linguistic components. The technique is also a solution for minority learners or learners with difficult access to learning resources and a lack of facilities in the school and home environment.
These clusters can be used to determine the learning schedule, the number of lesson hours, the prerequisite subjects, and the group of subjects taught each day, as well as other technical matters. In addition, the clustering is useful for determining which subjects can be taught in collaboration so that learning runs optimally. Given the current STEM or STEAM trend, with a little modification based on this reference, such collaboration can be rearranged with optimal results because it matches the situation of students in the education unit. However, referring to existing theoretical studies, the influence within each of these clusters is also affected by other variables, one of the most important being the facilities and availability of learning resources for students, especially for students whose parents are economically disadvantaged.

CONCLUSION AND SUGGESTION
Two final clusters were formed based on the Lampung Province National Examination (UN) scores in 2017 and 2018, with a total of 15,876 records comprising scores in Science, Mathematics, English, and Indonesian. The first cluster consists of Mathematics, Science, and English, and the second cluster consists of Indonesian, English, and Science.
This clustering study is expected to serve as an initial basis for schools, especially in Lampung Province, to develop similar research with larger data sets so that the results can be used as a reference for decision making on technical and non-technical matters in order to improve student achievement in Lampung Province. If these clusters are to be applied to policy making in schools or institutions where the majority still lack facilities and learning resources, a solution for procuring these facilities is needed first. Learning will then run effectively and optimally by improving every aspect: technical management in education, competence in the curriculum, and the competence of human resources, especially teachers.
The author hopes that this study will continue to be developed with greater data and at every level of education, in primary and secondary schools. Considering that the results of the National Examination produced more than 11 million data from the beginning of its implementation, of course it would be very unfortunate if the data was not processed into useful information for the improvement of education in Indonesia.