It is common to hear about Big Data, largely because of the growing recognition of how valuable the insights obtained from data can be. The term is so widespread that it is discussed even in non-specialized media such as magazines and web portals. In 2012, the US government announced the use of Big Data as a way to strengthen national security and transform teaching and learning.
Even after this popularization, some misconceptions persist, the most common being that the amount of data is the only thing that matters. The following are basic concepts associated with Big Data. I will not present tools; the purpose is to understand the fundamentals so that we can evaluate when to use them.
Types of Data
Data can be classified into two types:
- Structured Data: Data with a high degree of organization, which can be represented by rows and columns and quickly sorted and processed by simple algorithms.
- Unstructured Data: Data with no identifiable internal structure, such as PDF files, videos, audio, social media posts, and email. Devices such as sensors, tablets, and cell phones are typical sources of this kind of data, and the Internet of Things also tends to contribute considerably to its growth. A small code contrast between the two types appears after this list.
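As a purely illustrative sketch (the field names and sample text below are invented), the snippet shows why structured rows can be handled by simple algorithms, while unstructured text first needs some structure extracted from it:

```python
import csv
import io
import re

# Structured data: rows and columns, trivially sorted and filtered.
structured = io.StringIO("user_id,age,city\n42,31,Lisbon\n7,25,Porto\n")
rows = list(csv.DictReader(structured))
rows.sort(key=lambda r: int(r["age"]))  # a "simple algorithm" works directly
print(rows[0]["city"])                  # -> Porto

# Unstructured data: free text with no internal schema; any structure
# has to be extracted first (here, a naive regex).
post = "Just turned 25 and moved to Porto, loving it so far!"
match = re.search(r"turned (\d+)", post)
print(int(match.group(1)) if match else None)  # -> 25
```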
A few years ago, practically all the data we had was structured. Today, an estimated 85% of the data produced is unstructured.
What Defines Big Data
The term Big Data was coined to describe the growth of data generated by society and collected by organizations. It has been used successfully in several areas, such as politics, sales, search tools, and sports; it has even been argued that Germany's secret to winning the 2014 World Cup was the use of Big Data tools. The first characteristic that comes to mind when discussing Big Data is a large amount of data, but the concept actually rests on three dimensions, Volume, Variety, and Velocity, known as the 3 V's.
The TechAmerica Association defines Big Data as:
Big Data is a term that describes large volumes of high-velocity, complex, and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information. (TechAmerica Foundation's Federal Big Data Commission, 2012)
- Volume: Corresponds to the magnitude of the data, which may vary depending on the frequency and type of data recorded. What is considered big today may not be tomorrow, as storage capacity keeps increasing.
- Variety: Relates to the heterogeneity of the data repository, which mixes data of many formats and origins.
- Velocity: Refers to the rate at which data is generated and the speed at which it must be analyzed and presented. Digital devices such as smartphones and sensors are responsible for a high growth rate of data creation that demands real-time analysis. Traditional data management systems cannot handle receiving this vast amount of data instantly; this is where Big Data technologies come in, enabling companies to build real-time intelligence.
In addition to the 3 V’s, other dimensions have been cited:
- Veracity: Corresponds to cases where the reliability of the data is not guaranteed, as with social media messages. Dealing with uncertain data is another Big Data challenge, requiring tools and analysis techniques for managing and mining it.
- Variability: The variation in the rate at which data flows, with moments of high and low throughput.
- Value: Data by itself adds nothing; value is obtained by analyzing a large volume of data.
Data Life Cycle
The potential of Big Data is only harnessed when used to drive decision-making. Therefore, in addition to the data, an efficient process is necessary to obtain significant insights from a large volume of diverse and dynamic data.
The process of extracting information from Big Data can be divided into the following phases:
- Acquisition: Data sources generate a huge amount of information, much of which is useless. The great challenge of this step is applying filters that discard useless information without losing what is relevant. These filters should be applied in real time, since storing all of the data just to delete it later would be very costly; a minimal sketch of such a filter appears after this list.
- Extraction: The data that has been acquired and filtered is usually not ready for analysis. As previously mentioned, it exists in several formats: audio, video, text, and others. This requires an extraction strategy that integrates data originating from different repositories into a form that can be consumed. Extract-Transform-Load (ETL) is the process that covers the stages of collecting data, adjusting it to the appropriate format, and storing it; a small ETL sketch also follows this list.
- Analysis: Technological advances are making it possible and cost-effective to analyze unstructured data. Distributed computing on easily scalable architectures, massive processing of non-relational data, and parallelism in relational databases are redefining data governance and management.
- Interpretation: The most crucial aspect of the success of a Big Data system is the presentation of the data in a clear, friendly, and reusable format.
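To make the acquisition step concrete, here is a minimal sketch of a streaming filter. The event fields and the filtering rule are hypothetical; the point is only that irrelevant records are dropped as they arrive instead of being stored for later deletion.

```python
from typing import Iterable, Iterator

def acquire(events: Iterable[dict]) -> Iterator[dict]:
    """Yield only the events worth keeping, as they arrive (hypothetical rules)."""
    for event in events:
        if event.get("type") == "heartbeat":   # drop pure keep-alive messages
            continue
        if event.get("value", 0.0) < 0.1:      # drop readings below a noise floor
            continue
        yield event

stream = [
    {"type": "heartbeat"},
    {"type": "sensor", "value": 0.05},   # below the noise floor, discarded
    {"type": "sensor", "value": 3.2},    # kept
]
for kept in acquire(stream):
    print(kept)                          # -> {'type': 'sensor', 'value': 3.2}
```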
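Similarly, here is a minimal ETL sketch for the extraction step, assuming a JSON-lines source and a SQLite target (both purely illustrative, not a prescribed stack): extract the raw records, transform them into a uniform schema, and load them somewhere they can be queried.

```python
import json
import sqlite3

raw_lines = [
    '{"user": "ana", "amount_eur": "12.50"}',
    '{"user": "bruno", "amount_eur": "7.00"}',
]

# Extract: parse the raw records from the source format.
records = [json.loads(line) for line in raw_lines]

# Transform: normalize field names and types into a consistent schema.
rows = [(r["user"], float(r["amount_eur"])) for r in records]

# Load: store the result where analysis tools can consume it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
total, = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()
print(total)  # -> 19.5
```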
Conclusion
Currently, most solutions operate only on structured data, so by the criterion of the 3 V's alone the adoption of Big Data is not justified. However, as the use of social networks keeps growing, it will become increasingly possible to capture and analyze data from these sources. Even if we do not adopt this kind of solution now, it is worth sharing the knowledge acquired and clarifying that Big Data should not be associated only with data volume. When thinking about adopting Big Data, it is important to remember at least the 3 V's: volume, velocity, and variety.
How about you? Share some of your experience with Big Data in the comments below!
Contribution is always welcome!