We interact with data constantly, in daily life and in business, so understanding, using, and analyzing data is essential. Data literacy can be used, for example, to interpret a political party's vote share in an election. There are many examples like this in all areas of life.
Data literacy is the skill that underpins each of these tasks: understanding, using, and analyzing data.
Population: The entire group that a study targets and wants to draw conclusions about.
Sample: A subset drawn from the population.
Observation Unit: Each unit in the sample, which corresponds to one row in the data set we examine, is called an observation unit.
Variables and Variable Types: The columns in the data set are called variables. Variable types are divided into Numerical Variables (quantitative) and Categorical Variables (qualitative).
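The terms above can be sketched in a few lines of Python. This is only an illustration: the ages are made-up data, and the names `population`, `sample`, and `rows` are chosen for this example.

```python
import random

# Hypothetical population: the ages of 1,000 people (made-up data).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(18, 90) for _ in range(1000)]

# Sample: a subset drawn from the population without replacement.
sample = rng.sample(population, k=50)

# Each element of the sample is one observation unit (one row);
# "age" is the single variable (column) measured on it.
rows = [{"age": age} for age in sample]

print(len(population), len(sample), rows[0])
```

Here the sample is drawn at random, which is one common way to make it resemble the population; other sampling schemes exist.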
Numerical variables have values that describe a measurable quantity as a number, such as how many or how much.
Ratio: The scale type of numerical variables that have a true zero point, where zero means the quantity is absent. Examples include the number of students, weight and height measurements, and speed.
Interval: The scale type of numerical variables whose zero point is arbitrary rather than a true absence of the quantity. For example, temperature in degrees Celsius is a numerical variable on an interval scale, so it can also take negative values.
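One practical consequence of the two scale types is that ratios only make sense on a ratio scale, while differences make sense on both. A minimal sketch with made-up weights and temperatures:

```python
# Ratio scale: zero means "none", so ratios are meaningful.
weight_a_kg, weight_b_kg = 80.0, 40.0
weight_ratio = weight_a_kg / weight_b_kg  # A is twice as heavy as B

# Interval scale: 0 degrees Celsius is an arbitrary reference point.
temp_a_c, temp_b_c = 20.0, 10.0
naive_ratio = temp_a_c / temp_b_c  # 2.0, but 20 C is not "twice as hot"

# The same temperatures in kelvin (a ratio scale) give a different ratio.
kelvin_ratio = (temp_a_c + 273.15) / (temp_b_c + 273.15)

# Differences are still meaningful on an interval scale.
difference_c = temp_a_c - temp_b_c

print(weight_ratio, naive_ratio, round(kelvin_ratio, 4), difference_c)
```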
Categorical variables can take values that describe a data unit's quality or characteristic, like what type or which category.
Nominal: A scale type in which the classes of a categorical variable have no natural order. For example, gender and marital status are categorical variables on this scale.
Ordinal: A scale type in which the classes of a categorical variable have a natural order. For example, educational background and military rank.
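The distinction matters in code: nominal values can only be counted or compared for equality, while ordinal values can also be sorted. A small sketch, where the education levels and their order are assumptions made for this example:

```python
from collections import Counter

# Nominal: categories have no order; counting is the natural operation.
marital_status = ["single", "married", "single", "divorced"]
counts = Counter(marital_status)

# Ordinal: categories have an order (hypothetical education levels).
education_order = ["primary", "secondary", "bachelor", "master"]
rank = {level: i for i, level in enumerate(education_order)}

people = ["master", "primary", "bachelor", "secondary"]
sorted_people = sorted(people, key=rank.__getitem__)
highest = max(people, key=rank.__getitem__)

print(counts, sorted_people, highest)
```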
Measures of Central Tendency
Measures of central tendency help you find the middle, or the average, of a data set. The most common measures of central tendency are the mean, median, and mode. Quartiles, which describe position within the data rather than its center, are covered alongside them here.
Mean: This is the simple arithmetic average, computed by dividing the sum of the values by the number of values.
The mean is sensitive to outliers in the data. An outlier is a value that deviates strongly from the other values in the same set or column; it is usually much higher or much lower than the rest.
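A quick way to see this sensitivity is to add one extreme value to a small data set. The salary figures below are made up for illustration:

```python
from statistics import mean

salaries = [30, 32, 35, 38, 40]   # made-up values, in thousands
with_outlier = salaries + [400]   # the same data plus one extreme value

mean_without = mean(salaries)     # 175 / 5
mean_with = mean(with_outlier)    # 575 / 6

print(mean_without, round(mean_with, 2))
```

A single outlier moves the mean from 35 to roughly 95.8, a value that describes none of the six observations well.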
Median: This is the midpoint of the data. To find it, arrange the values in ascending or descending order and take the value in the middle.
If the number of values is even, add the two middle values and divide by two; the result is the median.
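The two cases can be written out directly. The helper `manual_median` is a name made up for this sketch; Python's standard `statistics.median` follows the same rule and is shown for comparison.

```python
from statistics import median

def manual_median(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [7, 1, 5]       # sorted: 1, 5, 7
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7

print(manual_median(odd), median(odd))
print(manual_median(even), median(even))
```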
Mode: This is the value that occurs most frequently in the data.
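As a short illustration with made-up data, the standard library offers `mode` for a single most frequent value and `multimode` (Python 3.8+) for data in which several values tie:

```python
from statistics import mode, multimode

colors = ["red", "blue", "red", "green", "blue", "red"]
most_common = mode(colors)       # "red" appears three times

# When several values tie for most frequent, multimode returns all of them.
tied = multimode([1, 1, 2, 2, 3])

print(most_common, tied)
```

Unlike the mean and median, the mode also works for categorical data such as the colors here.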
Quartiles: Quartiles are values that divide your data into quarters.
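A minimal sketch using `statistics.quantiles` on made-up data follows. Several conventions exist for computing quartiles, and they can give slightly different cut points; the `"inclusive"` method used here is just one choice.

```python
from statistics import median, quantiles

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == median(data))  # the second quartile is the median
```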
Measures of Dispersion
The measures of central tendency are not adequate to describe data. Two data sets can have the same mean, but they can be entirely different. Thus to describe data, one needs to know the extent of variability. The measures of dispersion give this. Range, interquartile range, and standard deviation are the three commonly used measures of dispersion.
Range: This is the difference between the maximum and minimum values in the data.
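The range, and the interquartile range mentioned above (the distance between the first and third quartiles), can be sketched as follows. The data are made up, and the `"inclusive"` quartile method is an assumption of this example.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]
value_range = max(data) - min(data)
data_iqr = iqr(data)

# One extreme value stretches the range far more than the IQR.
with_outlier = data + [100]
range_with_outlier = max(with_outlier) - min(with_outlier)
iqr_with_outlier = iqr(with_outlier)

print(value_range, data_iqr)
print(range_with_outlier, iqr_with_outlier)
```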
Standard Deviation: This is the square root of the variance. Taking the square root expresses the dispersion in the original units of the variable rather than in squared units.
If the standard deviation is small, it indicates that the data are distributed close to the mean and that a particular variable or population is homogeneous. However, if the standard deviation is large, the data are scattered far from the mean.
Variance: Variance measures how far a data set is spread out. The technical definition is “the average of the squared differences from the mean,” but in practice it gives you a general idea of the spread of your data. A value of zero means no variability: all the numbers in the data set are the same.
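The sketch below uses two made-up data sets with the same mean but very different spread. It uses `pvariance` and `pstdev`, the population versions that match the "average of the squared differences" definition; `statistics.variance` and `statistics.stdev` divide by n - 1 instead and are meant for samples.

```python
from statistics import mean, pstdev, pvariance

tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

# Same center...
print(mean(tight), mean(wide))

# ...very different spread.
var_tight, var_wide = pvariance(tight), pvariance(wide)
sd_tight, sd_wide = pstdev(tight), pstdev(wide)
print(var_tight, var_wide)
print(round(sd_tight, 2), round(sd_wide, 2))

# Identical values have zero variance.
var_constant = pvariance([5, 5, 5, 5])
print(var_constant)
```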
Skewness: Skewness describes how asymmetric a distribution is around its center. The concept is intuitive: when we look at a chart of a distribution, we readily notice whether the values pile up on one side and trail off on the other.
Skewness Coefficient (SK) = 3 × (Arithmetic Mean - Median) / Standard Deviation
If SK > 0, the distribution is skewed in the positive direction (right-skewed)
If SK = 0, the distribution is symmetrical
If SK < 0, the distribution is skewed in the negative direction (left-skewed)
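The skewness coefficient above can be computed directly. In this sketch the function name `pearson_skewness` and the data are made up, and using the population standard deviation (`pstdev`) is an assumption, since the formula does not say which standard deviation is meant.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long tail to the right
symmetric = [1, 2, 3, 4, 5]
left_skewed = [-x for x in right_skewed]      # mirror image: tail to the left

sk_right = pearson_skewness(right_skewed)
sk_sym = pearson_skewness(symmetric)
sk_left = pearson_skewness(left_skewed)

print(round(sk_right, 3), sk_sym, round(sk_left, 3))
```

The sign of the result follows whichever side the mean is pulled toward relative to the median.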
Kurtosis: Kurtosis describes how peaked or flat the distribution of a variable is. The kurtosis coefficient is rarely encountered in practice. A value of 3 matches the normal distribution; a larger value indicates a more pointed distribution, and a smaller value a flatter one.
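The text does not give a formula for the kurtosis coefficient. One common definition, assumed here, is the fourth central moment divided by the squared variance, which is the definition under which the normal distribution scores 3. The function name and data below are made up for illustration.

```python
import random

def kurtosis(values):
    """Fourth central moment divided by the squared (population) variance."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2

flat = [1, 2, 3, 4, 5]            # evenly spread values
peaked = [0] * 8 + [-10, 10]      # most values at the center, two far out

# Simulated draws from a standard normal distribution (fixed seed).
rng = random.Random(0)
normal_like = [rng.gauss(0, 1) for _ in range(50_000)]

k_flat = kurtosis(flat)
k_peaked = kurtosis(peaked)
k_normal = kurtosis(normal_like)

print(round(k_flat, 2), round(k_peaked, 2), round(k_normal, 2))
```

The evenly spread data score below 3, the data with a sharp center and two distant values score above 3, and the simulated normal sample lands close to 3.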