Classification: Classification is probably the most widely used data mining approach. Most scoring models, such as credit scoring, lending-risk scoring, and fraud scoring models, are based on classification methods. Classification methods (or classifiers) assign records (or entities) to two or more pre-defined classes. A classification algorithm requires a training set of pre-classified examples. In the tax audit domain, the two classes could be compliant and non-compliant filings, with the training set assembled from historical audits. The classifier calibration algorithm uses the pre-classified examples to determine the parameters needed to discriminate between the classes, and then encodes these parameters into a model called a classifier. Once calibrated, the classifier can assign new filings to either class. Many algorithms can be used for classification, including decision trees, neural networks, and logistic regression.
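As an illustration of calibration followed by assignment, the sketch below fits a small logistic regression by gradient descent. The two features (deduction-to-income ratio and count of prior late filings), the training records, and the function names are hypothetical, chosen only to make the example concrete; they do not come from any real audit data.

```python
import math

def predict_proba(weights, x):
    """Probability of the positive class (non-compliant) for feature vector x."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, labels, lr=0.1, epochs=2000):
    """Calibrate weights (bias first) by batch gradient descent on log-loss."""
    weights = [0.0] * (len(examples[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(examples, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights

def classify(weights, x, threshold=0.5):
    """Assign a new filing to one of the two pre-defined classes."""
    return "non-compliant" if predict_proba(weights, x) >= threshold else "compliant"

# Hypothetical pre-classified examples, as if taken from historical audits:
# (deduction-to-income ratio, prior late filings); label 1 = non-compliant.
training = [(0.05, 0), (0.10, 0), (0.08, 1), (0.12, 0),
            (0.45, 2), (0.50, 3), (0.40, 2), (0.55, 1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights = train_logistic(training, labels)
print(classify(weights, (0.07, 0)))
print(classify(weights, (0.50, 3)))
```

The learned weights play the role of the parameters described above; a decision tree or neural network would encode them differently but would be used in the same train-then-assign way.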
Clustering: Clustering is an exploratory method used to discover natural groupings within records or entities. Clustering approaches are commonly used for segmentation: for example, identifying natural segments or groups within the taxpayer population. Clustering algorithms allow entities with a large number of attributes to be partitioned into a few distinct groups or “segments”. Clustering differs from classification, which works with a known set of classes. While the objective in classification is to assign new observations to one of the classes known a priori, cluster analysis makes no assumption about the number of underlying groups or any other structure.
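A minimal sketch of segmentation without a pre-set number of groups is shown below. It uses single-linkage grouping with a distance threshold, which is only one of many clustering techniques; the taxpayer records (income and deductions, in thousands), the threshold `max_gap`, and the function name are assumptions made for illustration.

```python
import math
from itertools import combinations

def threshold_clusters(points, max_gap):
    """Single-linkage grouping: any two points within max_gap share a cluster."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= max_gap:
            parent[find(i)] = find(j)

    clusters = {}
    for i, point in enumerate(points):
        clusters.setdefault(find(i), []).append(point)
    return sorted(sorted(group) for group in clusters.values())

# Hypothetical taxpayers: (income, deductions), both in thousands.
taxpayers = [(30, 2), (32, 3), (31, 2), (90, 15), (95, 14), (92, 16), (200, 60)]

segments = threshold_clusters(taxpayers, max_gap=10)
for segment in segments:
    print(segment)
```

The number of segments is not supplied; it emerges from the data and the chosen threshold. Algorithms such as k-means instead require the analyst to pick the number of clusters up front.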
Association Rules: Association rules are basic types of patterns or regularities found in transactional-type data. Association rule analysis has its origins in traditional retail marketing, where it is used to discover affinities between items that occur within a particular shopping trip (for example, which items typically co-occur as contents of a shopping basket). Hence, an alternative name for this type of analysis is “market-basket analysis”. From a set of transaction data (for example, tax filings or insurance claims), association rules can discover characteristics within a transaction that imply the presence of other characteristics in the same transaction. For two sets of characteristics X and Y, an association rule is usually denoted as X → Y to convey that the presence of the characteristics X in a transaction frequently implies the presence of the characteristics Y.
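How strongly a rule X → Y holds is commonly summarized by two measures: support (the fraction of transactions containing both X and Y) and confidence (the fraction of transactions containing X that also contain Y). The sketch below computes both for a single candidate rule; the filing characteristics and the function name `rule_stats` are hypothetical.

```python
def rule_stats(transactions, x, y):
    """Return (support, confidence) for the rule x -> y over a list of item sets."""
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= t)          # transactions containing X
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # transactions containing X and Y
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Hypothetical filings, each described by the characteristics it exhibits.
filings = [
    {"home_office", "vehicle_expense", "self_employed"},
    {"home_office", "vehicle_expense"},
    {"home_office", "self_employed"},
    {"vehicle_expense"},
    {"home_office", "vehicle_expense", "charity"},
]

support, confidence = rule_stats(filings, {"home_office"}, {"vehicle_expense"})
print(f"support={support:.2f} confidence={confidence:.2f}")
# prints: support=0.60 confidence=0.75
```

A full association rule miner would enumerate candidate rules and keep those above chosen support and confidence thresholds; this sketch only scores one rule.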
Sequential Pattern Detection: Sequential patterns involve mining frequently occurring patterns of activity over a period of time. In many situations, not only may the coexistence of items within a transaction be important (which would be discovered by association rules algorithms), but also the order in which those items appear across ordered transactions, and the amount of time between transactions (which would be discovered by sequential pattern detection algorithms). Thus, sequential pattern detection methods are similar to association rules, except that they look for patterns across time (as opposed to patterns within transactions). This could be a pattern that represents a sequence of tax filings over time, or a sequence of purchases over time, etc.
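The difference from association rules can be sketched by checking whether events occur in a given order across an entity's history. The example below measures the share of taxpayers whose filing history contains a pattern as an ordered subsequence. It ignores the amount of time between events, and the event labels, histories, and function names are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's events appear in the sequence in the same order."""
    remaining = iter(sequence)
    # Each pattern step consumes the iterator up to its first match,
    # so later steps can only match later events.
    return all(any(step == event for event in remaining) for step in pattern)

def sequence_support(histories, pattern):
    """Fraction of entities whose ordered history contains the pattern."""
    hits = sum(1 for events in histories.values() if contains_in_order(events, pattern))
    return hits / len(histories)

# Hypothetical filing histories, one time-ordered list of events per taxpayer.
histories = {
    "A": ["on_time", "late", "amended"],
    "B": ["late", "on_time", "amended"],
    "C": ["late", "amended", "late"],
    "D": ["amended", "late"],
}

print(sequence_support(histories, ["late", "amended"]))   # 0.75
print(sequence_support(histories, ["amended", "late"]))   # 0.5
```

Both patterns involve the same two events, so an association rule over event sets could not tell them apart; only the ordering distinguishes them.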
Change and Deviation Detection: These techniques are useful for identifying significant changes in a data set from previously measured or normative values. Once a deviation is discovered, further analysis can be carried out to determine whether it is due to noise or to causal reasons. Deviation detection is typically carried out using simple linear projection of certain measures based on previous values, and comparing these projections against normative values.
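A simple linear projection can be sketched as a least-squares line fitted to previous values and extended one period ahead. In the example below, the projection is treated as the expected value and a new observation is flagged when it differs by more than a relative tolerance. The income series, the 10% tolerance, and the function names are assumptions for illustration.

```python
def linear_projection(history):
    """Project the next value from a least-squares line over equally spaced periods."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # value of the fitted line at the next period

def is_deviation(history, observed, tolerance=0.10):
    """Flag the observation if it differs from the projection by more than tolerance."""
    projected = linear_projection(history)
    return abs(observed - projected) > tolerance * abs(projected)

# Hypothetical reported income (in thousands) over four prior years.
reported_income = [100, 110, 120, 130]

print(linear_projection(reported_income))   # 140.0
print(is_deviation(reported_income, 142))   # False
print(is_deviation(reported_income, 90))    # True
```

A flagged value is only a candidate for follow-up; deciding whether it reflects noise or an underlying cause requires the further analysis described above.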