Big data, business intelligence, data mining and many similar terms have been the talk of the town for some time. But what does the term data mining actually mean? In this article, you will learn about the benefits and challenges of data mining, what it can achieve and how it is typically used in projects.
Data mining refers to the systematic, computer-aided and highly automated application of statistical algorithms to identify correlations, patterns, trends and connections in very large data sets (big data). The results are then transferred into usable data structures and made available for further processing.
In a narrower sense, data mining refers only to the analysis step of the "knowledge discovery in databases" (KDD) process, which aims to identify new relationships in existing data sets. In practice, however, the terms are often used interchangeably to describe not only the actual analysis but also the preparation of the data (e.g. in data warehouses), as well as the evaluation and interpretation of the results.
Data mining is a branch of business intelligence (BI) and is also closely linked to predictive analytics, i.e. the prediction of future situations based on existing data.
Data mining is mainly used to analyse existing data sets, to recognise patterns and to make decisions based on the results.
The aim is to make practical predictions about the future, to recognise emerging trends early, to confirm or disprove assumptions about correlations and to improve business processes.
Specific use cases include determining the creditworthiness of customers, calculating available credit limits, discovering purchase patterns and trends (shopping basket analysis such as "product Y is often bought together with product X"), evaluating the connection between diseases and the effectiveness of treatments in drug development, or detecting fraud, for example based on the patterns of credit card transactions.
Depending on the application and the task at hand, data mining software tools employ different algorithms, machine learning and AI to extract information from the data. In practice, a distinction is made between the following mining methods, each of which pursues a specific goal:
This method aims to detect unusual data sets, such as outliers or data errors, that require further investigation. Where possible, data errors or unusable anomalies that would distort the results are then excluded from further analysis. In some cases, however, it is precisely these outliers that need to be identified (e.g. in fraud detection).
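A minimal sketch of this idea is z-score-based outlier detection: any value that lies unusually far from the mean of the data is flagged. The transaction amounts and the threshold below are illustrative assumptions, not part of any particular product; real anomaly detection uses far more robust methods.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.5):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# illustrative credit card transaction amounts; one is clearly anomalous
transactions = [12, 15, 11, 14, 13, 12, 950, 16, 14, 13]
print(find_outliers(transactions))  # [950]
```

Note that a single large outlier also inflates the standard deviation itself, which is why the threshold here is set lower than the textbook value of 3; production systems typically use robust statistics (e.g. the median absolute deviation) instead.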
In cluster analysis, the aim is to group data records on the basis of their similarities, without relying on any previously known structure in the data.
Classification means the allocation of data to certain higher-level classes, e.g. classifying emails as spam or dividing customers into risk groups according to their creditworthiness.
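To make the spam example concrete, here is a toy rule-based classifier. The keyword list and threshold are invented purely for illustration; real spam filters learn their features and weights from labelled training data rather than using a hand-written list.

```python
# hypothetical spam keyword list and threshold, chosen for illustration only
SPAM_WORDS = {"winner", "free", "prize", "urgent"}

def classify_email(text, threshold=2):
    """Label an email 'spam' if it contains at least `threshold` known spam keywords."""
    words = set(text.lower().split())
    hits = len(words & SPAM_WORDS)
    return "spam" if hits >= threshold else "ham"

print(classify_email("URGENT you are a winner claim your FREE prize"))  # spam
print(classify_email("agenda for tomorrow's project meeting"))          # ham
```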
Association rule learning is used to identify connections and dependencies in the data. A classic example is shopping basket analysis, i.e. identifying which products are often purchased in combination with one another.
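The shopping basket example can be sketched by counting how often item pairs co-occur and deriving the two standard association-rule measures: support (the share of all baskets containing both items) and confidence (how often baskets with X also contain Y). The baskets and thresholds below are invented for illustration.

```python
from itertools import combinations
from collections import Counter

# illustrative shopping baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"beer", "crisps"},
    {"bread", "butter", "jam"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # count every unordered item pair occurring in this basket
    pair_counts.update(frozenset(p) for p in combinations(basket, 2))

# rule X -> Y: support = share of baskets with both, confidence = P(Y in basket | X in basket)
n = len(baskets)
for pair, count in pair_counts.items():
    x, y = sorted(pair)
    support = count / n
    confidence = count / item_counts[x]
    if support >= 0.4 and confidence >= 0.7:
        print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

With this data, only the rule "bread -> butter" clears both thresholds: butter appears in 3 of the 4 baskets that contain bread. Full-scale implementations (e.g. the Apriori algorithm) prune the search so it scales to millions of baskets.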
The purpose of regression analysis is to identify relationships between variables, such as the influence of price and customer purchasing power on sales volume.
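The simplest instance is ordinary least squares with a single explanatory variable. The sketch below fits a line to invented price/sales figures; a real analysis would involve multiple variables, noisy data and diagnostic checks.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# illustrative data: unit price vs. units sold
prices = [10, 12, 14, 16, 18]
sales = [200, 180, 160, 140, 120]
a, b = linear_fit(prices, sales)
print(f"sales = {a:.1f} + {b:.1f} * price")  # negative slope: higher price, fewer sales
```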
As a rule, the data mining process follows the so-called cross-industry standard process for data mining (CRISP-DM), which a number of well-known industrial companies developed within the framework of an EU-funded project. The aim was to create a standardised process model for data mining that could be applied to search and analyse any data set.
The process model is based on six phases, some of which have to be run several times:
Phase 1 involves the definition of the objectives and business requirements, in order to determine the specific goals and how they are to be achieved.
Once the objectives and the procedure have been determined, the existing data can be analysed. In addition, this phase includes an examination of the data quality and an assessment of whether the quality is sufficient for the stated objectives. Should this not be the case, the objectives and requirements may need to be revised.
As soon as the objectives and the data are available, the data can be prepared for evaluation. Data preparation is usually the phase that takes the most time.
Based on the prepared data, one or more data models can be created by selecting and applying one or more data mining methods. During the modelling phase it often becomes apparent that the preparation of the data needs to be adapted in order to apply the selected methods.
After the data models have been created, they are evaluated to determine whether the stated objectives have been achieved. Either the most suitable model is selected or – if the results prove unsatisfactory – phase 1 is repeated to revise the objectives and requirements.
At the end of the process, the findings are processed and made available in a suitable format.
The evaluation of the data as well as the correlations and insights that have been obtained can be used to discover trends, predict future developments and thus support the management in making decisions.
Efficient analysis of large amounts of data and the information extracted from them can be used to gain a competitive advantage, while the detection of process errors and issues leads to cost reductions.
Business process improvements:
Data mining can be used to confirm or disprove assumptions about problems in business processes and to uncover process weaknesses. Over the years, this has given rise to the special field of process mining, which focuses specifically on the analysis and optimisation of business processes.
Highly qualified data mining experts are required:
Having powerful tools is one thing – using them properly is another. In order to obtain valuable and accurate results with data mining, it is essential that the relevant software is operated by specialists who need to understand the source data to be able to prepare them correctly. Similarly, they also need to be able to assess whether the patterns, connections, interrelationships and results provided by the software are generally accurate and relevant.
Poor data quality:
As with all evaluation methods, the quality of the data is a decisive prerequisite for obtaining valid results. Errors and incomplete data sets inevitably degrade the quality of the results, which at worst may prove entirely false. Decisions based on such flawed results may in turn be the wrong ones.
Privacy & security:
The collection of large amounts of data inevitably comes with privacy and security risks. The data sets may contain a lot of user-related data that should not be used or linked to one another. On the other hand, the process also creates opportunities for identifying security risks and breaches and subsequently remedying them.
For companies, data mining can bring about a significant improvement in their operations. They can gain new insights by evaluating the growing volumes of data they collect. Many software products already incorporate BI and thus also data mining, and yet companies worldwide often use them without harnessing the full potential for improvement that they offer. The data mining trend is set to continue, especially in connection with process mining, as it gives companies the opportunity to dramatically optimise their business processes and achieve enormous cost savings.