Friday, October 30, 2015

The infamous three circles of information architecture


How to Solve Missing Values


  1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
  2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
  3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common—that of “Unknown.” Hence, although this method is simple, it is not foolproof.
  4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of AllElectronics customers is $56,000. Use this value to replace the missing value for income.
  5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
  6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
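
As a rough illustration of methods 1, 3, 4, and 5 above, here is a minimal sketch in plain Python. The toy customer records, the attribute names ("risk", "income"), and the helper function names are all assumptions made up for this example; they are not from the original text. Method 6 is left out because it needs a predictive model.

```python
from statistics import mean

# Hypothetical toy records (not from the source); None marks a missing income.
customers = [
    {"risk": "low", "income": 60000},
    {"risk": "low", "income": None},
    {"risk": "low", "income": 80000},
    {"risk": "high", "income": 30000},
    {"risk": "high", "income": None},
    {"risk": "high", "income": 40000},
]


def drop_missing(rows, attr):
    """Method 1: ignore (drop) every tuple whose attribute is missing."""
    return [r for r in rows if r[attr] is not None]


def fill_constant(rows, attr, constant="Unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Method 4: replace missing values with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label):
    """Method 5: replace missing values with the mean of the same class.

    Assumes every class has at least one observed value for attr;
    otherwise the lookup below raises KeyError.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[label], []).append(r[attr])
    class_means = {cls: mean(vals) for cls, vals in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


if __name__ == "__main__":
    print(len(drop_missing(customers, "income")))
    print(fill_constant(customers, "income")[1])
    print(fill_mean(customers, "income")[1])
    print(fill_class_mean(customers, "income", "risk")[1])
```

Each helper returns a new list and leaves the input untouched, so the results of the different strategies can be compared side by side. In this toy data the overall mean and the per-class means differ, which is the point of method 5: the class-conditional fill stays closer to what similar customers actually report.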

Wednesday, October 28, 2015

Mean, median, and mode of symmetric versus positively and negatively skewed data


Quality decisions must be based on quality data

Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.