Wednesday, November 30, 2016

Steps to developing a usable algorithm.


  • Model the problem.
  • Find an algorithm to solve it. 
  • Fast enough? Fits in memory?
  • If not, figure out why.
  • Find a way to address the problem. 
  • Iterate until satisfied.
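
As a rough illustration of one pass through these steps, here is a minimal Python sketch. The problem (counting pairs that sum to a target), the function names, and the sample data are all made up for the example. The first attempt checks every pair; after measuring it and noticing that the nested loop is the cost, the second attempt uses a single pass with a dictionary.

```python
import time

def count_pairs_quadratic(values, target):
    """First attempt: compare every pair, O(n^2) time."""
    count = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] + values[j] == target:
                count += 1
    return count

def count_pairs_linear(values, target):
    """Second attempt: one pass, remembering how often each value was seen."""
    seen = {}
    count = 0
    for v in values:
        count += seen.get(target - v, 0)  # earlier values that pair with v
        seen[v] = seen.get(v, 0) + 1
    return count

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

data = list(range(1000))  # hypothetical input
slow_result, slow_seconds = timed(count_pairs_quadratic, data, 999)
fast_result, fast_seconds = timed(count_pairs_linear, data, 999)
print(slow_result, fast_result)
print(f"quadratic: {slow_seconds:.4f}s, linear: {fast_seconds:.4f}s")
```

Both versions must return the same answer; only the running time should change between iterations.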

Thursday, November 24, 2016

A Checklist to Recognize and Measure Data Quality

Accuracy. The value stored in the system for a data element is the right value for that occurrence of the data element. If you have a customer name and an address stored in a record, then the address is the correct address for the customer with that name. If you find the quantity ordered as 1000 units in the record for order number 12345678, then that quantity is the accurate quantity for that order.

Domain Integrity. The data value of an attribute falls in the range of allowable, defined values. The common example is the allowable values being “male” and “female” for the gender data element.
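
A domain check like this can be sketched in a few lines of Python. The record layout, the field name "gender", and the helper name are assumptions for illustration; only the allowable values come from the example above.

```python
# Allowable values taken from the example above; everything else is hypothetical.
ALLOWED_GENDERS = {"male", "female"}

def domain_violations(records, field, allowed):
    """Return the records whose value for `field` is outside `allowed`."""
    return [rec for rec in records if rec.get(field) not in allowed]

customers = [
    {"id": 1, "gender": "female"},
    {"id": 2, "gender": "M"},      # a code instead of a defined value
    {"id": 3, "gender": "male"},
]
bad = domain_violations(customers, "gender", ALLOWED_GENDERS)
print([rec["id"] for rec in bad])
```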

Data Type. The value of a data attribute is actually stored as the data type defined for that attribute. When the data type of the store name field is defined as “text,” all instances of that field contain the store name shown in textual format and not numeric codes.

Consistency. The form and content of a data field is the same across multiple source systems. If the product code for product ABC in one system is 1234, then the code for this product is 1234 in every source system.

Redundancy. The same data must not be stored in more than one place in a system. If, for reasons of efficiency, a data element is intentionally stored in more than one place in a system, then the redundancy must be clearly identified and verified.

Completeness. There are no missing values for a given attribute in the system. For example, in a customer file, there must be a valid value for the “state” field for every customer. In the file for order details, every detail record for an order must be completely filled.
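
One possible way to measure completeness is the fraction of records with a non-empty value for a field. This is a minimal sketch; the customer records and the decision to treat both a missing key and an empty string as "missing" are assumptions.

```python
def completeness(records, field):
    """Fraction of records that have a non-empty value for `field`."""
    if not records:
        return 1.0  # nothing to be incomplete
    filled = sum(1 for rec in records if rec.get(field) not in (None, ""))
    return filled / len(records)

# Hypothetical customer file
customers = [
    {"id": 1, "state": "NJ"},
    {"id": 2, "state": ""},    # empty value
    {"id": 3, "state": "CA"},
    {"id": 4},                 # field missing entirely
]
ratio = completeness(customers, "state")
print(f"state completeness: {ratio:.0%}")
```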

Duplication. Duplication of records in a system is completely resolved. If the product file is known to have duplicate records, then all the duplicate records for each product are identified and a cross-reference created.
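
The cross-reference idea can be sketched as a mapping from each duplicate record to the record chosen to survive. The matching rule here (names equal after trimming and lowercasing) and the choice of the lowest id as survivor are assumptions; real matching is usually fuzzier.

```python
from collections import defaultdict

def build_cross_reference(products, key):
    """Map each duplicate product id to the surviving id for its group."""
    groups = defaultdict(list)
    for product in products:
        groups[key(product)].append(product["id"])
    xref = {}
    for ids in groups.values():
        survivor = min(ids)  # assumption: keep the lowest id
        for pid in ids:
            if pid != survivor:
                xref[pid] = survivor
    return xref

# Hypothetical product file
products = [
    {"id": 101, "name": "Widget A"},
    {"id": 205, "name": "widget a "},  # same product, entered differently
    {"id": 310, "name": "Gadget B"},
]
xref = build_cross_reference(products, key=lambda p: p["name"].strip().lower())
print(xref)
```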

Conformance to Business Rules. The values of each data item adhere to prescribed business rules. In an auction system, the hammer or sale price cannot be less than the reserve price. In a bank loan system, the loan balance must always be positive or zero.
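
The two rules above translate directly into checks. In this sketch the field names and sample records are hypothetical; the rules themselves are the ones stated in the paragraph.

```python
def check_auction_lot(lot):
    """Hammer (sale) price cannot be less than the reserve price."""
    return lot["hammer_price"] >= lot["reserve_price"]

def check_loan(loan):
    """Loan balance must be positive or zero."""
    return loan["balance"] >= 0

# Hypothetical records
lots = [
    {"lot": "A1", "reserve_price": 500, "hammer_price": 650},
    {"lot": "A2", "reserve_price": 800, "hammer_price": 700},  # violates the rule
]
loans = [
    {"loan": "L1", "balance": 0},
    {"loan": "L2", "balance": -25},  # violates the rule
]

bad_lots = [lot["lot"] for lot in lots if not check_auction_lot(lot)]
bad_loans = [loan["loan"] for loan in loans if not check_loan(loan)]
print(bad_lots, bad_loans)
```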

Structural Definiteness. Wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure. For example, an individual’s name naturally divides into first name, middle initial, and last name. Values for names of individuals must be stored as first name, middle initial, and last name. This characteristic of data quality simplifies enforcement of standards and reduces missing values.

Data Anomaly. A field must be used only for the purpose for which it is defined. If the field Address-3 is defined for any possible third line of address for long addresses, then this field must be used only for recording the third line of address. It must not be used for entering a phone or fax number for the customer.

Clarity. A data element may possess all the other characteristics of quality data, but if the users do not understand its meaning clearly, then the data element is of no value to the users. Proper naming conventions help to make the data elements well understood by the users.

Timely. The users determine the timeliness of the data. If the users expect customer dimension data not to be older than one day, the changes to customer data in the source systems must be applied to the data warehouse daily.

Usefulness. Every data element in the data warehouse must satisfy some requirements of the collection of users. A data element may be accurate and of high quality, but if it is of no value to the users, then it is totally unnecessary for that data element to be in the data warehouse.

Adherence to Data Integrity Rules. The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules. Any table that permits null as the primary key does not have entity integrity. Referential integrity forces the establishment of the parent–child relationships correctly. In a customer-to-order relationship, referential integrity ensures the existence of a customer for every order in the database.
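
In a relational database these rules are normally enforced by primary key and foreign key constraints. When auditing extracted data outside the database, the same checks can be sketched in Python. The table contents, column names, and helper names below are assumptions for illustration.

```python
def rows_missing_key(rows, pk):
    """Entity integrity: rows whose primary key is null."""
    return [row for row in rows if row.get(pk) is None]

def orphans(children, fk, parents, pk):
    """Referential integrity: child rows whose foreign key matches no parent."""
    parent_keys = {p[pk] for p in parents if p.get(pk) is not None}
    return [child for child in children if child.get(fk) not in parent_keys]

# Hypothetical extracts from the source system
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": None}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 7},     # no such customer
    {"order_id": 12, "customer_id": None},  # no customer at all
]

no_key = rows_missing_key(customers, "customer_id")
orphan_orders = orphans(orders, "customer_id", customers, "customer_id")
print(len(no_key), [o["order_id"] for o in orphan_orders])
```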