promotor: prof. dr. M.G.J. van den Brand (TU/e)
copromotor: dr.ir. L.G.W.A. Cleophas (TU/e)
Eindhoven University of Technology
Date: 20 February 2019
Nowadays, we are witnessing an ever-increasing complexity of software and systems, ranging from cyber-physical systems to the Internet of Things. Model-Driven Engineering (MDE) is a methodology which promotes the use of models, metamodels and model transformations as first-class citizens to tackle the complexity of building and maintaining those systems. With increased adoption of MDE, however, the number of related artifacts in use, notably models, greatly increases. To confirm this, we present in this thesis quantitative evidence from both academia (in terms of repositories and datasets) and industry (in terms of large domain-specific language ecosystems). To tackle this dimension of scalability in MDE, we propose to treat the artifacts as data, and to apply various techniques, ranging from information retrieval to machine learning, to analyze and manage those artifacts in a holistic, scalable and efficient way.
Based on this idea, we have developed a framework called SAMOS (Statistical Analysis of MOdelS) to analyze, compare and visualize large datasets of models. The essence of the approach involves (1) extracting pieces of information (i.e. features) from models such as names, types and structure (e.g. linear chunks of n-grams, or subtrees); (2) defining comparison schemes (e.g. name comparison using Natural Language Processing for typos and synonyms, and structural comparison using edit distance) to obtain a Vector Space Model; and (3) applying distance measures and clustering to discover e.g. groups of similar models or model fragments. Working on different types of models such as EMF metamodels, state charts, feature models and industrial domain-specific models, we have used and evaluated SAMOS in various settings and application areas throughout the thesis.
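The three steps above can be illustrated with a minimal sketch. It is not the SAMOS implementation: the function names, the toy models and the choice of normalized edit distance and cosine distance are assumptions made for illustration only. Each model is reduced to a list of element names, names are compared with an edit-distance similarity so that typos still match partially, and the resulting vectors are compared pairwise.

```python
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical names, ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def build_vsm(models):
    """Rows are models, columns are the distinct features of the dataset.

    Each cell holds the best similarity between the column feature and any
    feature of the model, so near matches (e.g. typos) still contribute.
    """
    vocabulary = sorted({f for feats in models.values() for f in feats})
    vsm = {
        name: [
            max((name_similarity(col, f) for f in feats), default=0.0)
            for col in vocabulary
        ]
        for name, feats in models.items()
    }
    return vocabulary, vsm


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical toy dataset: element names extracted from three models.
toy_models = {
    "library_a": ["Book", "Author", "Library"],
    "library_b": ["Book", "Autor", "Library"],  # contains a typo
    "statemachine": ["State", "Transition", "Event"],
}
vocabulary, vsm = build_vsm(toy_models)
d_similar = cosine_distance(vsm["library_a"], vsm["library_b"])
d_different = cosine_distance(vsm["library_a"], vsm["statemachine"])
```

In this toy setting the two library models end up closer to each other than either is to the state machine model. The pairwise distances would then be passed to a clustering algorithm.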
Chapter 2 introduces the basics of the approach used in this thesis. We make a first step towards handling large datasets of models, EMF metamodels in particular, from a statistical perspective. Using clustering based on a Vector Space Model (VSM), with models represented by simple features that include name and type information only, many characteristics of and relations among the metamodels, such as clusters, sub-clusters and outliers, can be analyzed and visualized. We have explored two scenarios in which our approach can be utilized, namely model searching and repository exploration. In the first case study, the search results show distinct outliers and groupings. This information can be used to improve the navigation or precision of the search results. The second case study, on the other hand, deals with a heterogeneous set of domains and allows identifying domains, subdomains and the proximities between related ones. This grouping information can be used for domain model recovery as well as for model repository management.
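How clusters and outliers might be read off a set of pairwise model distances can be sketched as follows. This is a deliberately simplified stand-in (threshold-based single-linkage grouping with a union-find structure), not necessarily the clustering algorithm used in the thesis. The model names, distance values and threshold are hypothetical.

```python
def threshold_clusters(names, distance, threshold):
    """Group items connected by pairwise distances <= threshold.

    Returns (clusters, outliers): clusters are groups with at least two
    members, outliers are items that are not close to any other item.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative, compressing paths.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(a, b) <= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    clusters = [g for g in groups.values() if len(g) > 1]
    outliers = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, outliers


# Hypothetical distances between five metamodels; unlisted pairs are far apart.
toy_distances = {
    frozenset(("bibtex_a", "bibtex_b")): 0.2,
    frozenset(("uml_a", "uml_b")): 0.1,
}


def toy_distance(a, b):
    return toy_distances.get(frozenset((a, b)), 0.9)


names = ["bibtex_a", "bibtex_b", "uml_a", "uml_b", "petrinet"]
clusters, outliers = threshold_clusters(names, toy_distance, threshold=0.3)
```

With these toy values, the two BibTeX models and the two UML models each form a cluster, while the Petri net model is reported as an outlier.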
In Chapter 3, we have extended SAMOS to incorporate structural context into clustering. We have identified a shortcoming of the basic approach of the previous chapter, namely that it ignores the context of model elements, and have proposed an n-gram based representation and comparison, which can be considered a compromise between context-less clustering approaches and advanced pairwise structural techniques. We have evaluated our approach on an Ecore dataset and shown that n-grams improve the clustering accuracy on average. Picking an n > 1 is shown to increase complexity (though not monotonically), and using n = 2 is suggested for smaller datasets and precision-oriented tasks.
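The idea of n-grams over a model's structure can be sketched as follows. The sketch assumes that a model is given as a directed graph of element names (e.g. containment relations) and that an n-gram is a path of n connected elements. The graph, the function name and the restriction to vertex names only are illustrative assumptions, not the representation used in SAMOS.

```python
def extract_ngrams(graph, n):
    """Return all directed paths with exactly n vertices, as tuples.

    graph maps each vertex to the list of its direct successors.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    def extend(path):
        if len(path) == n:
            yield tuple(path)
            return
        for successor in graph.get(path[-1], []):
            if successor not in path:  # do not revisit vertices in cyclic graphs
                yield from extend(path + [successor])

    # Include vertices that only appear as successors (leaves).
    vertices = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    ngrams = []
    for vertex in vertices:
        ngrams.extend(extend([vertex]))
    return ngrams


# Hypothetical metamodel fragment: classes and the elements they contain.
toy_graph = {
    "Library": ["Book", "Member"],
    "Book": ["title", "Author"],
    "Author": ["name"],
}
unigrams = extract_ngrams(toy_graph, 1)  # context-less features
bigrams = extract_ngrams(toy_graph, 2)   # each element with one neighbor
trigrams = extract_ngrams(toy_graph, 3)
```

For n = 1 this reduces to the context-less features of the basic approach. For n = 2 each feature pairs an element with a direct neighbor, so two models that share names but arrange them differently no longer produce identical feature sets.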
In Chapter 4, we have shown how our approach can be used to improve the results of variability mining from models of block-based languages. For this purpose, we have demonstrated and discussed how our cluster and outlier detection can improve the variability information generated by the family mining approach developed by our collaboration partners. Using the presented extension, it is possible to remove outliers (e.g. completely unrelated variants) from a set of input models, i.e. state charts, and to cluster the remaining models into more meaningful sets (e.g. relevant for particular users).
We have presented in Chapter 5 an application of our generic model clustering technique to comparing feature models. With two exploratory case studies on the 1034-model dataset in the S.P.L.O.T. repository, we have obtained (1) an overview of the repository and the major domains therein, and (2) very similar models in the repository, such as duplicates and clones. Based on the studies, we conclude that our approach can help with the use and maintenance of emerging repositories such as S.P.L.O.T. Clone detection is treated in depth in Chapter 6. There we have extended SAMOS with additional scoping, feature extraction and comparison schemes, customized distance measures and clustering algorithms in the context of metamodel clone detection. We have evaluated our approach using a variety of case studies involving both synthetic and real data, and have identified the strengths and weaknesses of our approach along with those of two other state-of-the-art clone detectors. We conclude that SAMOS stands out with its higher accuracy while still being substantially scalable.
In Chapter 7, we have applied our approach in an industrial context, with various analyses on ASML’s MDE ecosystems. We have used and extended SAMOS to operate on ASML’s languages and models. We have elaborated the domain-specific extension of SAMOS, specifically for ASML’s ASOME data and control models to enable clone detection on those models. In extensive case studies, we have performed clone detection on ASML’s models, and additionally language-level analyses ranging from cross-DSL conceptual analysis and clone detection to architectural analysis for the CARM2G ecosystem. We have presented our findings along with valuable feedback from the domain experts on the nature of cloning in the ecosystems, and indicated opportunities such as refactoring to support the maintenance and quality of the ever-growing and evolving ecosystems.
Moreover, we have experimented with a distributed data processing back-end for SAMOS (Chapter 8), as a means to improve its scalability further, to help with e.g. larger datasets or more expensive analyses. Along with integration into the Eclipse KNIME data mining ecosystem, this is part of our efforts in transforming SAMOS into a mature open framework to be used and extended in further scenarios.