Sunday, June 26, 2022

Attention Rebellion

  • Cause One: The Increase in Speed, Switching, and Filtering
  • Cause Two: The Crippling of Our Flow States
  • Cause Three: The Rise of Physical and Mental Exhaustion
  • Cause Four: The Collapse of Sustained Reading
  • Cause Five: The Disruption of Mind-Wandering
  • Cause Six: The Rise of Technology That Can Track and Manipulate You
  • Cause Seven: The Rise of Cruel Optimism
  • Cause Eight: The Surge in Stress and How It Is Triggering Vigilance
  • Causes Nine and Ten: Our Deteriorating Diets and Rising Pollution
  • Cause Eleven: The Rise of ADHD and How We Are Responding to It
  • Cause Twelve: The Confinement of Our Children, Both Physically and Psychologically

Friday, June 24, 2022

Complex Is Better Than Complicated

The Zen of Python—a collection of principles that summarize the core philosophy of the language—has crystal-clear points like these:

  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Readability counts.

If the implementation is easy to explain, it may be a good idea.
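
For reference, the full list of aphorisms ships with the interpreter itself and can be printed from any Python prompt:

  # Print the complete Zen of Python (PEP 20) to the console.
  import this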

DATA INTEGRITY

Data integrity refers to the physical characteristics of collected data that determine the reliability of the information. Data integrity is based on parameters such as completeness, uniqueness, timeliness, accuracy, and consistency.

Completeness

Data completeness refers to collecting all the items necessary for a full description of the states of the object or process under consideration. A data item is considered complete if its digital description contains all the attributes strictly required for human or machine comprehension. In other words, it may be acceptable to have missing pieces in the expected records (e.g., no contact information) as long as the remaining data is comprehensive enough for the domain.

For example, when a sensor (e.g., an IoT sensor) is involved, you might want it to sample data every 10 minutes, or even more often if the scenario requires it. At the same time, you want to be sure that the timeline is continuous, with no gaps in between. If you plan to use that data to predict possible hardware failures, then you need to be sure the collection keeps a close enough eye on the target event and doesn't miss anything along the way.

Completeness results from having no gaps between what was supposed to be collected and what was actually collected. In automatic data collection (e.g., from IoT sensors), this aspect is also related to physical connectivity and data availability.
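
As a minimal sketch of such a completeness check, assuming the readings land in a pandas DataFrame with a timestamp column and a nominal 10-minute sampling rate (the column name and the rate are illustrative assumptions), the expected timeline can be rebuilt and diffed against what actually arrived:

  import pandas as pd

  # Hypothetical sensor readings with a nominal 10-minute sampling rate.
  readings = pd.DataFrame({
      "timestamp": pd.to_datetime([
          "2022-06-24 00:00", "2022-06-24 00:10",
          "2022-06-24 00:30", "2022-06-24 00:40"]),  # 00:20 was never collected
      "value": [1.2, 1.3, 1.1, 1.4],
  })

  # Rebuild the timeline that should exist and diff it against what arrived.
  expected = pd.date_range(readings["timestamp"].min(),
                           readings["timestamp"].max(), freq="10min")
  missing = expected.difference(readings["timestamp"])

  completeness = 1 - len(missing) / len(expected)
  print(list(missing))                          # the 00:20 gap
  print(f"completeness: {completeness:.0%}")    # 80%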

Uniqueness

When large chunks of data are collected and sampled for further use, there's the concrete risk that some data items are duplicated. Depending on the business requirements, duplicates may or may not be a problem. Poor data uniqueness becomes an issue when, for example, it leads to skewed results and inaccuracies.

Uniqueness is fairly easy to define mathematically. It is 100 percent if there are no duplicates. The definition of duplicates, however, depends on the context. For example, two records about Joseph Doe and Joe Doe look unique but may refer to the same individual and thus be duplicates that must be cleaned up.
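
As a sketch, exact duplicates are trivial to count, while context-dependent duplicates like Joseph Doe versus Joe Doe need a fuzzier rule; the names, columns, and similarity threshold below are arbitrary assumptions:

  import difflib
  import pandas as pd

  people = pd.DataFrame({
      "name":  ["Joseph Doe", "Joe Doe", "Jane Roe"],
      "email": ["jdoe@example.com", "jdoe@example.com", "jroe@example.com"],
  })

  # Exact duplicates: uniqueness is 100 percent when this count is zero.
  exact_duplicates = people.duplicated().sum()

  # Context-specific duplicates: a crude name-similarity pass for human review.
  def probably_same(a, b, threshold=0.7):
      return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

  suspects = [(i, j)
              for i in range(len(people)) for j in range(i + 1, len(people))
              if probably_same(people.loc[i, "name"], people.loc[j, "name"])]
  print(exact_duplicates, suspects)   # 0 exact duplicates; (0, 1) flagged for review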

Timeliness

Data timeliness refers to the distribution of data records within an acceptable time frame. What counts as acceptable is, again, context-specific and covers two aspects: the duration of the time frame and the appropriate sampling timeline.

In predictive maintenance, for example, the timeline varies by industry. A 10-minute sampling interval is usually more than acceptable, but not for reliable fault prediction in wind turbines; in that case, even a 5-minute interval is debated, and some experts suggest an even shorter sampling interval.

Duration is the overall time interval for which data collection should occur to ensure reliable analysis of data and satisfactory results. In predictive maintenance, an acceptable duration is on the order of two years’ worth of data.
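
A sketch of a timeliness check along those lines, using the 5-minute interval and the two-year duration mentioned above as targets (they are examples from this scenario, not universal thresholds):

  import pandas as pd

  def check_timeliness(timestamps,
                       max_interval=pd.Timedelta(minutes=5),
                       min_duration=pd.Timedelta(days=730)):
      """Verify both the sampling rate and the overall collection duration."""
      ts = timestamps.sort_values()
      intervals = ts.diff().dropna()
      return {
          "sampling_rate_ok": bool(intervals.max() <= max_interval),
          "duration_ok": bool(ts.iloc[-1] - ts.iloc[0] >= min_duration),
      }

  # Two years of 10-minute data: long enough, but sampled too coarsely for turbines.
  ts = pd.Series(pd.date_range("2020-01-01", "2022-01-01", freq="10min"))
  print(check_timeliness(ts))   # {'sampling_rate_ok': False, 'duration_ok': True}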

Accuracy

Data accuracy measures the degree to which the record correctly describes the observed real-world item. Accuracy is primarily about the correctness of the data acquired. The business requirements set the specifications of what would be a valid range of values for any expected data item.

When inaccuracies are detected, some policies should be applied to minimize the impact on decisions. Common practices are to replace out-of-range values with a default value or with the arithmetic mean of values detected in a realistic interval.
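
A minimal sketch of the second policy, with a made-up valid range standing in for whatever the business requirements specify:

  import pandas as pd

  def fix_out_of_range(values, low, high):
      """Replace values outside [low, high] with the mean of the in-range ones."""
      in_range = values.between(low, high)
      return values.where(in_range, values[in_range].mean())

  temps = pd.Series([21.5, 22.0, -999.0, 23.1])    # -999 is a sensor error code
  print(fix_out_of_range(temps, low=-40, high=60)) # -999 becomes 22.2, the mean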

Consistency

Data consistency measures the difference between the values reported by data items that represent the same object. An example of inconsistency is a negative output value when no other field reports a failure of any kind. Definitions of data consistency are, however, also highly influenced by business requirements.
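
A sketch of that particular rule, with hypothetical column names: flag any record that reports a negative output while no failure is reported anywhere else in the record.

  import pandas as pd

  log = pd.DataFrame({
      "machine_id": [1, 2, 3],
      "output_units": [120, -4, 95],          # -4 units makes no physical sense
      "failure_reported": [False, False, True],
  })

  # Inconsistent rows: negative output with no failure reported.
  inconsistent = log[(log["output_units"] < 0) & ~log["failure_reported"]]
  print(inconsistent)   # machine 2 is flagged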




Sunday, June 19, 2022

Classifying Objects

The classification problem is about identifying the category an object belongs to. In this context, an object is a data item and is fully represented by an array of values (known as features). Each value refers to a measurable property that makes sense to consider in the scenario under analysis. It is key to note that classification can predict values only in a discrete, categorical set.


Variations of the Problem

The actual rules that govern the object-to-category mapping process lead to slightly different variations of the classification problem and, consequently, different implementation tasks.

Binary Classification. The algorithm has to assign the processed object to one of only two possible categories. An example is deciding whether, based on a battery of tests for a particular disease, a patient should be placed in the “disease” or “no-disease” group.

Multiclass Classification. The algorithm has to assign the processed object to one of many possible categories. Each object can be assigned to one and only one category. For example, when classifying the competency of a candidate, the outcome can be any one of poor/sufficient/good/great, but never two at the same time.

Multilabel Classification. The algorithm is expected to provide an array of categories (or labels) that the object belongs to. An example is how to classify a blog post. It can be about sports, technology, and perhaps politics at the same time.

Anomaly Detection. The algorithm aims to spot objects in the dataset whose property values are significantly different from the values of the majority of other objects. Those anomalies are also often referred to as outliers.
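
A hedged sketch of the first three variations using scikit-learn; the feature arrays and labels are made up purely for illustration:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.multiclass import OneVsRestClassifier
  from sklearn.preprocessing import MultiLabelBinarizer

  X = np.array([[0.2, 1.0], [0.9, 0.3], [0.4, 0.8], [0.8, 0.1]])   # feature arrays

  # Binary: one of exactly two categories (disease / no-disease).
  y_binary = np.array([0, 1, 0, 1])
  print(LogisticRegression().fit(X, y_binary).predict([[0.3, 0.9]]))

  # Multiclass: one and only one of several categories.
  y_multi = np.array(["poor", "good", "sufficient", "great"])
  print(LogisticRegression().fit(X, y_multi).predict([[0.3, 0.9]]))

  # Multilabel: any subset of the labels (topics of a blog post).
  topics = [["sports"], ["technology", "politics"],
            ["sports", "politics"], ["technology"]]
  Y = MultiLabelBinarizer().fit_transform(topics)
  multilabel = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
  print(multilabel.predict([[0.3, 0.9]]))   # one 0/1 flag per label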

The Dev DataOps Agile cycle

For data science and development teams to work together (and alongside domain experts), their skills need to converge: data scientists learning about programming and user experience and, more importantly, developers learning about the intricacies and internal mechanics of machine learning.

Saturday, June 11, 2022

Algorithms are presented in seven groups, or kingdoms, distilled from the broader fields of study:

  1. Stochastic Algorithms that focus on the introduction of randomness into heuristic methods.
  2. Evolutionary Algorithms inspired by evolution by means of natural selection.
  3. Physical Algorithms inspired by physical and social systems.
  4. Probabilistic Algorithms that focus on methods that build models and estimate distributions in search domains.
  5. Swarm Algorithms that focus on methods that exploit the properties of collective intelligence.
  6. Immune Algorithms inspired by the adaptive immune system of vertebrates.
  7. Neural Algorithms inspired by the plasticity and learning qualities of the human nervous system.