Understanding DATA

I, Rushi Prajapati, welcome you to another blog in my “Simplifying Series”, in which I try to explain complex topics by simplifying them. So far in this series I’ve written five blogs: Computer Vision, ML-DL, Neural Networks, Activation Functions, and Data the New Oil. Today I’m presenting another blog, this time on understanding data in the domain of data science.

Rushi Prajapati
6 min read · Sep 6, 2023

In data science, data refers to the raw facts, observations, measurements, or records collected about individuals, events, or entities. These data are the foundation of any data analysis, modeling, or machine learning task. Data can come in various forms and types, each of which is important for understanding and drawing insights from the available information.

Data Formats:

  1. Structured Data:- Well-organized data with a predefined schema, typically found in databases and spreadsheets.
  2. Unstructured Data:- Data without a predefined structure, often found in text documents, images, videos, and audio files.
  3. Semi-Structured Data:- Data that does not conform to a strict schema but has some level of organization, like JSON (JavaScript Object Notation) or XML (Extensible Markup Language) files.
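
To make the three formats concrete, here is a minimal Python sketch that uses only the standard library. The sample records (names, fields, and the review text) are made up purely for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same predefined columns (the schema).
csv_text = "name,age,city\nAsha,29,Pune\nRavi,34,Surat\n"
structured_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON records are organized, but fields may differ per record.
json_text = '[{"name": "Asha", "skills": ["python", "sql"]}, {"name": "Ravi", "age": 34}]'
semi_structured = json.loads(json_text)

# Unstructured: free text has no schema, so we can only work with the raw content.
unstructured = "Loved the product, but delivery took two weeks."

print(structured_rows[0]["city"])         # Pune
print(sorted(semi_structured[0].keys()))  # ['name', 'skills']
print(sorted(semi_structured[1].keys()))  # ['age', 'name']
print(len(unstructured.split()))          # 8 words
```

Notice that the CSV rows can be queried by column name right away, the JSON records need a check for which fields exist, and the free text needs further processing before it tells us anything.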

Data Sources:

  1. Primary Data:- Data collected directly from the source through surveys, experiments, or observations.
  2. Secondary Data:- Data that has already been collected by other parties for different purposes and is now reused for analysis.

Four Main Categories:

  1. Text Data:- Text data consists of characters, words, sentences, or paragraphs in a human-readable format. It is prevalent in documents, emails, social media posts, articles, books, and web pages.
  2. Numerical Data:- Numerical data consists of numbers and quantitative values. It can be discrete (countable) or continuous (measurable).
  3. Image Data:- Image data represents visual information in the form of pixels. Images can be photographs, graphics, or any visual representation captured by cameras or generated digitally.
  4. Video Data:- Video data is a sequence of images displayed in rapid succession, creating the illusion of motion. It is composed of multiple frames, where each frame is an individual image.
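
As a rough illustration of the image and video descriptions above, the sketch below models a tiny grayscale image as a grid of pixel values and a video as an ordered list of such frames. The pixel values and the `brighten` helper are hypothetical; real images are usually loaded with an imaging library rather than typed by hand.

```python
# A tiny 2x3 grayscale "image": rows of pixel intensities from 0 (black) to 255 (white).
image = [
    [0, 128, 255],
    [64, 192, 32],
]

height = len(image)
width = len(image[0])


def brighten(frame, amount):
    """Return a new frame with every pixel raised by `amount`, capped at 255."""
    return [[min(pixel + amount, 255) for pixel in row] for row in frame]


# A "video" is an ordered sequence of frames, and each frame is itself an image.
video = [image]
for _ in range(2):
    video.append(brighten(video[-1], 10))

print(height, width)  # 2 3
print(len(video))     # 3
print(video[2][0])    # [20, 148, 255]
```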

In addition to the four main categories of data mentioned earlier, there are several other data types that are commonly used in various applications:

  • Audio Data: Represents sound information, used in speech recognition, audio processing, and music analysis.
  • Time Series Data: A sequence of data points recorded at successive points in time, used in finance, weather, and sensor readings.
  • Geospatial Data: Data with geographic coordinates for mapping, navigation, and spatial analysis.
  • Categorical Data: Represents qualitative variables organized into categories, used for classification tasks.
  • Binary Data: Takes only two possible values (0 and 1), used in computer systems and digital communications.
  • Genetic Data: Represents an individual’s DNA sequence, important in genetics research and medical diagnostics.
  • Sensor Data: Comes from sensors measuring physical properties, used in IoT applications and environmental monitoring.
  • Network Data: Represents relationships between nodes in a network, used in social network analysis and network security.
  • Metadata: Provides information about other data, includes data type, source, and creation date.

Types Of Data

Qualitative Data:

This type of data deals with qualities or characteristics that cannot be easily measured with numbers. Qualitative data is usually non-numeric and is expressed in words, images, symbols, etc.

Now, within qualitative data, there are two subcategories:

(a) Nominal Data:- This is data where there is no inherent order or ranking among the categories. It means the categories are just labels, and there’s no particular sequence or hierarchy.

For example, think of data like gender (male, female) or race (White, Black, Asian, etc.). These categories don’t have any natural order, and they are independent of each other.

(b) Ordinal Data:- This is data where there is an order or ranking among the categories, but the differences between the categories may not be uniform or measurable. In other words, we can arrange the data in a series from the lowest to the highest, but we can’t say how much higher one category is compared to another.

For example, if we have a survey where people rate their satisfaction as “Very Satisfied,” “Satisfied,” “Neutral,” “Dissatisfied,” and “Very Dissatisfied,” we can see an order from most satisfied to least satisfied, but we can’t quantify the difference in satisfaction between each category.
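
Here is a small Python sketch of how the difference shows up in practice. With nominal labels we can only count and compare for equality, while ordinal labels can also be sorted. The sample responses and the `colors` list are invented for illustration, and the rank numbers are just positions in the order, not measurements.

```python
from collections import Counter

# Nominal: labels only. Counting is meaningful, ordering is not.
colors = ["red", "blue", "red", "green"]
most_common_color = Counter(colors).most_common(1)[0][0]

# Ordinal: the same kind of labels, plus an agreed order from lowest to highest.
SATISFACTION_ORDER = [
    "Very Dissatisfied",
    "Dissatisfied",
    "Neutral",
    "Satisfied",
    "Very Satisfied",
]
rank = {label: position for position, label in enumerate(SATISFACTION_ORDER)}

responses = ["Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"]

# Sorting and taking the middle value (the median) make sense for ordinal data.
sorted_responses = sorted(responses, key=rank.get)
median_response = sorted_responses[len(sorted_responses) // 2]

print(most_common_color)  # red
print(median_response)    # Satisfied
```

Averaging the rank numbers would be misleading here, because the gap between “Neutral” and “Satisfied” is not known to be the same as the gap between “Satisfied” and “Very Satisfied”.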

Consider the two examples to better understand nominal and ordinal data:

1. How was your job experience?

_______

2. How was your job experience?

  • Good
  • Neutral
  • Bad

The answers collected from example 1 are free-form responses with no inherent order, so they are treated as nominal data, while the fixed, ranked options in example 2 produce ordinal data.

Quantitative Data:

Quantitative data is information that has to do with quantities or numbers. It’s all about things we can count or measure, such as how far you travel to college or the number of children in a family.

Discrete Data: Discrete data is information that can only take specific, separate values. These values are usually whole numbers and cannot be subdivided any further. Think of it as data that you can count, and there are distinct categories or individual points.

Examples of discrete data:

Number of Siblings: The number of siblings you have is a discrete data point. It can only be a whole number, such as 0, 1, 2, 3, etc. You can’t have 2.5 siblings; it doesn’t make sense in this context.

Rolling a Die: When you roll a standard six-sided die, the result you get is discrete data. The possible values are 1, 2, 3, 4, 5, or 6. You can’t get a value like 2.75 on a single die roll.

Number of Cars in a Parking Lot: If you count the number of cars parked in a parking lot, you’ll get discrete data, such as 10 cars, 50 cars, etc.

Number of Emails Received: You can count the emails you get, and it will be in whole numbers like 20 emails, 45 emails, etc.

Continuous Data: Continuous data is information that can take any value within a range. It’s measured on a scale, and it can be divided into finer and finer increments, including fractions and decimals.

Examples of continuous data:

Height of a Person: The height of a person is continuous data. People’s heights can vary in small increments, such as 160.5 cm, 167 cm, or 175.2 cm. It’s not limited to whole numbers, and you can measure it with greater precision.

Temperature: Temperature is continuous data. It can be measured with a thermometer and can have values like 20.5°C, 24.2°C, or 30.1°C. It’s not restricted to specific values and can change continuously.

Time Taken to Complete a Task: The time taken to complete a task is continuous data. It can be measured with a stopwatch and can have values like 5.25 minutes, 10.5 minutes, etc. It’s not restricted to just whole numbers.

Distance Traveled by a Train: The distance traveled by a train is continuous data. It can be measured and recorded in kilometers or miles, and it can have values like 120.3 km, 235.7 km, and so on.

In essence, discrete data is like counting and has specific, separate values, while continuous data is like measuring on a scale and can have any value within a range, even fractions or decimals.
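
The contrast can be sketched in a few lines of Python. Discrete values can be tallied directly, while continuous values are usually grouped into ranges first. The sample numbers and the `bin_label` helper (with its 10 cm bin width) are assumptions made for this example.

```python
from collections import Counter

die_rolls = [3, 6, 1, 3, 5, 3]                    # discrete: counted
heights_cm = [160.5, 167.0, 175.2, 162.3, 171.8]  # continuous: measured

# Discrete values repeat exactly, so we can count each one directly.
roll_counts = Counter(die_rolls)


def bin_label(value, width=10):
    """Map a measurement to the range it falls in, e.g. 167.0 -> '160-170'."""
    lower = int(value // width) * width
    return f"{lower}-{lower + width}"


# Continuous values rarely repeat exactly, so we count how many fall in each range.
height_bins = Counter(bin_label(h) for h in heights_cm)

print(roll_counts[3])          # 3
print(height_bins["160-170"])  # 3
print(height_bins["170-180"])  # 2
```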

In practice, datasets often contain a mix of discrete and continuous data, along with other types like categorical data. Machine learning workflows can handle such mixed data, but most algorithms expect numeric input, so each type usually needs to be prepared in its own way before training and inference.

It’s important to preprocess the data appropriately based on its type before applying machine learning algorithms. For example, converting categorical variables to numerical representations (e.g., one-hot encoding) and scaling continuous data to a common range can help in improving model performance.
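
As a minimal sketch of these two preprocessing steps, the plain-Python functions below implement one-hot encoding and min-max scaling to the range 0 to 1. The function names and sample values are my own for this example; in real projects, libraries such as scikit-learn offer ready-made versions (`OneHotEncoder`, `MinMaxScaler`).

```python
def one_hot(values):
    """Turn each category into a list of 0s and 1s, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories


def min_max_scale(values):
    """Rescale numbers so the smallest becomes 0.0 and the largest becomes 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


cities = ["Pune", "Surat", "Pune"]   # categorical
temperatures = [20.0, 25.0, 30.0]    # continuous

encoded, categories = one_hot(cities)
scaled = min_max_scale(temperatures)

print(categories)  # ['Pune', 'Surat']
print(encoded)     # [[1, 0], [0, 1], [1, 0]]
print(scaled)      # [0.0, 0.5, 1.0]
```

One-hot encoding avoids suggesting a fake order between categories, and scaling keeps a feature with large numbers from dominating one with small numbers.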

Overall, whether a variable is treated as discrete or continuous depends on the specific problem at hand and the nature of the data collected. A good understanding of the data and the problem domain is essential for selecting appropriate techniques and algorithms in machine learning and data analysis.

Conclusion

Understanding these aspects of data is fundamental for anyone involved in data science and analysis. It allows one to make informed decisions about data handling, preprocessing, and model selection based on the nature of the data they are working with. In the world of data science, data is not just information; it is the key to insights, predictions, and informed decision-making. Its proper management and analysis enable us to unlock the potential of data-driven solutions and innovations across various domains.

I hope this blog gave you a simplified understanding of data, its formats, and its types. Keep an eye out for more blogs in the “Simplifying Series.”

Thank you for reading! If you’d like to connect and continue the conversation, feel free to reach out to me on LinkedIn. Let’s explore the fascinating world of data science together!

Written by Rushi Prajapati

Data Science enthusiast trying to explain the tech in simple terms || Machine learning || Deep learning || Data Analytics || Computer Vision || Python