“Big Data” has become a prominent buzzword in recent years, largely because companies mine large datasets for patterns and trends among the users of their products. Pulling useful knowledge out of that much information calls for a well-defined process, which is why many practitioners follow the “KDD process” (knowledge discovery in databases). The process consists of four main steps. The first is data cleaning, where we remove bad data points and noise from the dataset. The second is data transformation, where we convert the data into a more usable form, for example through feature transformations. The third step, data mining, is the bulk of the work: we apply methods from mathematics and machine learning to find meaningful relationships in the data, such as clusters or decision boundaries. The final step is pattern evaluation, where we examine the patterns we have found and draw higher-level conclusions and knowledge from them.
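To make the four steps concrete, here is a minimal sketch of a KDD-style pipeline in plain Python. Everything in it is an illustrative assumption rather than part of any standard: the function names (`clean`, `transform`, `mine`, `evaluate`), the toy `purchases` data, the choice of min-max scaling for the transformation step, and the use of a tiny one-dimensional k-means (with k = 2) as the mining method.

```python
from statistics import mean

def clean(records):
    """Step 1, data cleaning: drop missing values and invalid (negative) amounts."""
    return [r for r in records if r is not None and r >= 0]

def transform(values):
    """Step 2, data transformation: min-max scale the values into [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [(v - lo) / span for v in values]

def mine(values, iterations=10):
    """Step 3, data mining: 1-D k-means with k = 2, seeded at the min and max."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearest of the two centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def evaluate(clusters):
    """Step 4, pattern evaluation: summarize each cluster for a human reader."""
    return [{"size": len(c), "mean": mean(c) if c else None} for c in clusters]

def kdd(records):
    """Run the four steps in order."""
    return evaluate(mine(transform(clean(records))))

# Hypothetical purchase amounts, including one missing and one invalid entry.
purchases = [12.0, 15.0, None, 14.0, -1.0, 95.0, 102.0, 99.0]
summary = kdd(purchases)
print(summary)
```

On this toy data the mining step separates the small purchases from the large ones, and the evaluation step reduces each group to a size and an average that a person can interpret. A real pipeline would use far richer cleaning rules and mining methods, but the order of the steps is the same.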
The rise of data mining has led many people to raise privacy concerns about companies using their personal information for the company’s gain, for example by building recommendations from someone’s personal data without their consent. In a widely reported case, Target reportedly used a teenager’s shopping data, without her consent, to predict that she was pregnant and mailed pregnancy-related coupons to her home, which revealed the pregnancy to her family before she had told them. Issues like this led to the rise of privacy-preserving data mining (PPDM). PPDM has two main goals: sensitive raw data, such as ID or credit card numbers, should not be used directly for mining, and mining results whose disclosure could cause a privacy violation should not be published. As Big Data and data mining continue to grow, PPDM has become increasingly important for protecting user data.
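The two PPDM goals can be sketched as two filters around a mining step: one applied to the input, one applied to the output. This is only a toy illustration under stated assumptions. The field names (`id_number`, `credit_card`), the `SENSITIVE_CATEGORIES` set, and the "most frequent purchase category" mining rule are all hypothetical, and simple filtering like this does not by itself amount to a formal privacy guarantee.

```python
from collections import Counter

# Hypothetical names chosen for this example only.
SENSITIVE_FIELDS = {"id_number", "credit_card"}
SENSITIVE_CATEGORIES = {"pregnancy"}

def strip_sensitive(record):
    """Goal 1: keep sensitive raw fields out of the data that gets mined."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

def mine_top_category(records):
    """Toy mining step: find each user's most frequent purchase category."""
    per_user = {}
    for r in records:
        per_user.setdefault(r["user"], Counter())[r["category"]] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}

def publishable(results):
    """Goal 2: withhold mining results that would reveal something sensitive."""
    return {u: c for u, c in results.items() if c not in SENSITIVE_CATEGORIES}

# Made-up records for two users.
raw = [
    {"user": "u1", "id_number": "X123", "credit_card": "0000", "category": "garden"},
    {"user": "u1", "id_number": "X123", "credit_card": "0000", "category": "garden"},
    {"user": "u2", "id_number": "Y456", "credit_card": "1111", "category": "pregnancy"},
]

safe_records = [strip_sensitive(r) for r in raw]
results = mine_top_category(safe_records)
published = publishable(results)
print(published)
```

Here the mining step never sees the ID or card numbers, and the inference about user `u2` is computed but not released.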
Authors:
Lei Xu; Chunxiao Jiang; Jian Wang; Jian Yuan; Yong Ren