What is Data Mining?

Charana Sonnadara
3 min read · Feb 5, 2021

The term Data Mining is ubiquitous these days. Everyone talks about Data Mining, boasting about what we can do with it and how important it is, yet extracting information from readily available data isn’t easy.

In general, Data Mining is the process of discovering knowledge from data. That knowledge takes the form of interesting patterns we can recognize in the abundant data stored in a data warehouse, a data lake, or dynamic data streams.

The knowledge we generate from this data can optimize existing processes that we encounter in our daily lives, so it is useful in many industries. Data mining is not merely an expert system; it can interact with human knowledge. Hence, Data Mining must be performed with great care, avoiding possible errors and biases. These potential pitfalls can be hugely costly and time-consuming. Consequently, data mining needs to be done systematically to avoid producing erroneous decisions.

There are several sub-processes associated with knowledge discovery from data.

1) Data cleaning
When sources generate data, the data may not be clean: there could be duplicate records or missing fields. It is vital to clean the data thoroughly by removing noise and inconsistencies. This stage should be performed carefully and systematically before the later stages described in this article.
Usually, Data Engineers perform this step as part of the ETL (or EL) process in organizations. They should handle denoising and inconsistencies carefully so that the process won’t add biases to the system.
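As a minimal sketch of this step, the snippet below cleans a small hypothetical table with pandas (the column names and imputation choices are assumptions for illustration, not a prescribed method):

```python
import pandas as pd
import numpy as np

# Hypothetical raw records containing a duplicate row and missing fields.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age": [34, 34, np.nan, 29, 51],
    "country": ["LK", "LK", "US", None, "UK"],
})

# Remove exact duplicates, impute a missing numeric field with the
# median, and drop rows that lack a key field entirely.
clean = raw.drop_duplicates()
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))
clean = clean.dropna(subset=["country"])
```

How to treat missing values (impute versus drop) is exactly the kind of decision that, as noted above, can introduce bias if made carelessly.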

2) Data Integration

Organizations can have multiple data sources, producing either batch data or streaming data. It is one of the organization’s Data Engineers’ responsibilities to create sustainable data extraction mechanisms so that Data Analysts can fetch all the necessary data.
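A small sketch of integration, assuming two hypothetical order sources (a database export and an API feed) that share an `order_id` key:

```python
import pandas as pd

# Hypothetical sources: a batch export and an API snapshot.
sales_db = pd.DataFrame({"order_id": [101, 102], "amount": [250.0, 80.0]})
sales_api = pd.DataFrame({"order_id": [103], "amount": [120.0]})
customers = pd.DataFrame({"order_id": [101, 102, 103],
                          "region": ["south", "west", "south"]})

# Stack records from both sources, then enrich them with customer data.
orders = pd.concat([sales_db, sales_api], ignore_index=True)
integrated = orders.merge(customers, on="order_id", how="left")
```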

We can categorize these two sub-processes as preprocessing steps. It is common practice to perform preprocessing before injecting data into a Data Warehouse, but it isn’t always mandatory; it also depends on the organization’s objectives.

Before going into the other sub-processes, let’s take a quick look at the drawbacks of ETL and why there is a discussion about moving from ETL to EL. In EL, all the extracted data is loaded into the Data Warehouse, and the transformation happens inside the Data Warehouse.

ETL is inflexible. It requires Data Analysts to know beforehand what they are going to do with the data. There are numerous data fields available from the data sources, and the ETL process expects Data Analysts to identify in advance all the fields they need to produce useful reports. They also have to anticipate future requirements to avoid costly revamps of the ETL layer.

Hence, the Data Engineering field is moving from ETL towards EL. In EL, no transformation is performed before data is loaded into the Data Warehouse, so Data Analysts have access to all the data produced by the data sources.

ETL can also obscure Data Analysts’ visibility into the source data. Without knowing the assumptions made by the Data Engineers, they could produce inaccurate results.

Some data might require different forms of transformation for different use cases; keeping the raw data and decoupling transformation from loading is useful in those situations.
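The EL idea described above can be sketched as follows, with Python’s built-in sqlite3 standing in for a real Data Warehouse (the table and field names are invented for illustration): raw rows are loaded untouched, and the transformation runs later as a query inside the warehouse.

```python
import sqlite3

# Load step: raw rows go into the warehouse with no prior transformation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(101, 250.0, "south"), (102, 80.0, "west"), (103, 120.0, "south")])

# Transform step: analysts aggregate on demand, while every raw field
# stays available for future, unanticipated questions.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM raw_orders GROUP BY region ORDER BY region"
).fetchall()
```

Because the aggregation is just a query, changing it later does not require revamping an extraction pipeline.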

3) Data selection
Data selection is the process of retrieving, from the database, the subset of the available data that is relevant to the analysis task.
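A brief sketch of selection with pandas, using a hypothetical transaction table: only the rows and columns relevant to the task (here, recent high-value orders) are kept.

```python
import pandas as pd

# Hypothetical transaction table with more fields than the task needs.
transactions = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "year": [2019, 2020, 2020, 2021],
    "amount": [40.0, 500.0, 30.0, 700.0],
    "notes": ["a", "b", "c", "d"],
})

# Select relevant rows (recent, high-value) and relevant columns only.
selected = transactions.loc[
    (transactions["year"] >= 2020) & (transactions["amount"] > 100),
    ["order_id", "amount"],
]
```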

4) Data transformation
Data is transformed and consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
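One common transformation is min-max normalization, which rescales a numeric attribute to the range [0, 1] before mining. A plain-Python sketch with made-up income values:

```python
# Min-max normalization: rescale values to [0, 1].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical income attribute, before and after normalization.
incomes = [30_000, 45_000, 60_000, 90_000]
normalized = min_max_normalize(incomes)
# normalized -> [0.0, 0.25, 0.5, 1.0]
```

Normalization like this keeps attributes with large ranges from dominating distance-based mining methods.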

5) Data mining
The essential step, where intelligent methods are applied to extract patterns from the data.

6) Pattern evaluation
The truly interesting patterns representing knowledge are identified, based on interestingness measures.
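Two standard interestingness measures for association rules are support and confidence (as defined in Han, Kamber & Pei, the reference below). A small sketch over an invented market-basket dataset:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the full rule divided by support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Evaluate the rule {bread} -> {milk}.
s = support({"bread", "milk"})        # appears in 2 of 4 baskets
c = confidence({"bread"}, {"milk"})   # 0.5 / 0.75
```

Patterns whose support and confidence fall below chosen thresholds would be discarded as uninteresting.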

7) Knowledge presentation
Visualization and knowledge representation techniques are used to present the mined knowledge to users.

Steps 1–4 mainly focus on data preprocessing, where the data is prepared for mining.

If you feel that the article is missing anything, please start a discussion in the comments section.

Reference
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques, third edition (3rd ed.). Morgan Kaufmann Publishers.
