BIG DATA FOR DUMMIES CHEAT SHEET
From Big Data For Dummies
By Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman
Companies must find a practical way to deal with big data to stay competitive — to learn new ways to capture and analyze growing amounts of information about customers, products, and services. Data is becoming increasingly complex in structured and unstructured ways. New sources of data come from machines, such as sensors; social business sites; and website interaction, such as click-stream data. Meeting these changing business requirements demands that the right information be available at the right time.
DEFINING BIG DATA: VOLUME, VELOCITY, AND VARIETY
Big data enables organizations to store, manage, and manipulate vast amounts of disparate data at the right speed and at the right time. To gain the right insights, big data is typically broken down by three characteristics:
Volume: How much data
Velocity: How fast data is processed
Variety: The various types of data
While it is convenient to simplify big data into the three Vs, it can be misleading and overly simplistic. For example, you may be managing a relatively small amount of very disparate, complex data or you may be processing a huge volume of very simple data. That simple data may be all structured or all unstructured.
Even more important is the fourth V, veracity. How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense? Data must be able to be verified based on both accuracy and context. An innovative business may want to be able to analyze massive amounts of data in real time to quickly assess the value of a customer and the potential to provide additional offers to that customer. It is necessary to identify the right amount and types of data that can be analyzed in real time to impact business outcomes.
Big data incorporates all the varieties of data, including structured data and unstructured data from e-mails, social media, text streams, and so on. This kind of data management requires companies to leverage both their structured and unstructured data.
UNDERSTANDING UNSTRUCTURED DATA
Unstructured data differs from structured data in that its structure is unpredictable. Examples of unstructured data include documents, e-mails, blogs, digital images, videos, and satellite imagery. It also includes some data generated by machines or sensors. In fact, unstructured data accounts for the majority of the data on your company’s premises as well as the data external to your company in online private and public sources such as Twitter and Facebook.
In the past, most companies weren’t able to either capture or store this vast amount of data. It was simply too expensive or too overwhelming. Even if companies were able to capture the data, they didn’t have the tools to easily analyze the data and use the results to make decisions. Very few tools could make sense of these vast amounts of data. The tools that did exist were complex to use and did not produce results in a reasonable time frame.
In the end, those who really wanted to go to the enormous effort of analyzing this data were forced to work with snapshots of data. This had the undesirable effect of missing important events because they were not in a particular snapshot.
One approach that is becoming increasingly valued as a way to gain business value from unstructured data is text analytics, the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways. The analysis and extraction processes take advantage of techniques that originated in computational linguistics, statistics, and other computer science disciplines.
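As a minimal sketch of that idea, the Python snippet below pulls a few fields out of free-form text with regular expressions and returns them as a structured record. The sample message, the field names, the word list, and the extract_record helper are all illustrative assumptions; real text analytics relies on much richer linguistic and statistical techniques than pattern matching.

```python
import re

# Hypothetical patterns and word list, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORDER_PATTERN = re.compile(r"\border\s+#?(\d+)", re.IGNORECASE)
NEGATIVE_WORDS = {"late", "broken", "refund", "disappointed"}


def extract_record(text: str) -> dict:
    """Turn one piece of unstructured text into a structured record."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    order = ORDER_PATTERN.search(text)
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "order_id": int(order.group(1)) if order else None,
        "negative_terms": sorted(words & NEGATIVE_WORDS),
    }


if __name__ == "__main__":
    message = "My order #4821 arrived late and broken. Contact me at pat@example.com."
    print(extract_record(message))
```

Once the text has been reduced to records like this, it can be stored, filtered, and counted in the same way as any other structured data.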
THE ROLE OF TRADITIONAL OPERATIONAL DATA IN THE BIG DATA ENVIRONMENT
Knowing what data is stored and where it is stored are critical building blocks in your big data implementation. It’s unlikely that you’ll use RDBMSs for the core of the implementation, but it’s very likely that you’ll need to rely on the data stored in RDBMSs to create the highest level of value to the business with big data.
Most companies, large and small, probably store most of their important operational information in relational database management systems (RDBMSs), which are built on one or more relations and represented by tables. The data is stored in these tables, which are organized in rows and columns. RDBMSs follow a consistent approach in the way that data is stored and retrieved.
To get the most business value from your real-time analysis of unstructured data, you need to understand that data in context with your historical data on customers, products, transactions, and operations. In other words, you will need to integrate your unstructured data with your traditional operational data.
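The following sketch shows one way such an integration might look, assuming the unstructured text has already been reduced to simple records. It uses Python's built-in sqlite3 module as a stand-in for an operational RDBMS; the table names, columns, and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical operational data, as it might sit in an RDBMS.
CUSTOMERS = [(1, "Avery", 1200.0), (2, "Blake", 90.0)]
# Hypothetical records already extracted from unstructured text (e.g. support e-mails).
COMPLAINTS = [(1, "late delivery"), (1, "damaged item"), (2, "billing question")]


def complaints_by_customer() -> list:
    """Join extracted text records with operational customer data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, lifetime_spend REAL)"
    )
    conn.execute("CREATE TABLE complaints (customer_id INTEGER, topic TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", CUSTOMERS)
    conn.executemany("INSERT INTO complaints VALUES (?, ?)", COMPLAINTS)
    rows = conn.execute(
        """
        SELECT c.name, c.lifetime_spend, COUNT(k.topic) AS complaint_count
        FROM customers AS c
        LEFT JOIN complaints AS k ON k.customer_id = c.id
        GROUP BY c.id, c.name, c.lifetime_spend
        ORDER BY c.id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, spend, count in complaints_by_customer():
        print(name, spend, count)
```

The join puts each customer's complaint count next to that customer's spending history, which is the kind of context that neither data set provides by itself.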
BASICS OF BIG DATA INFRASTRUCTURE
Big data is all about high velocity, large volumes, and wide data variety, so the physical infrastructure will literally “make or break” the implementation. Most big data implementations need to be highly available, so the networks, servers, and physical storage must be resilient and redundant.
Resiliency and redundancy are interrelated. An infrastructure, or a system, is resilient to failure or changes when sufficient redundant resources are in place ready to jump into action. Resiliency helps to eliminate single points of failure in your infrastructure. For example, if only one network connection exists between your business and the Internet, you have no network redundancy, and the infrastructure is not resilient with respect to a network outage.
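To make the network example concrete, the sketch below estimates the chance that at least one connection is up when several redundant links are in place. It assumes the links fail independently of one another, which is a simplification (links that share a provider or a physical route often fail together), and the 99 percent figure is purely illustrative.

```python
def combined_availability(link_availability: float, link_count: int) -> float:
    """Probability that at least one of several independent links is up."""
    if not 0.0 <= link_availability <= 1.0:
        raise ValueError("link_availability must be between 0 and 1")
    if link_count < 1:
        raise ValueError("link_count must be at least 1")
    # All links must be down at once for the connection to be lost.
    return 1.0 - (1.0 - link_availability) ** link_count


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, combined_availability(0.99, count))
```

With a single link, the availability of the whole connection is simply the availability of that link, which is what makes it a single point of failure.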
In large data centers with business continuity requirements, most of the redundancy is in place and can be leveraged to create a big data environment. In new implementations, the designers have the responsibility to map the deployment to the needs of the business based on costs and performance.
MANAGING BIG DATA WITH HADOOP: HDFS AND MAPREDUCE
Hadoop, an open-source software framework, uses HDFS (the Hadoop Distributed File System) and MapReduce to analyze big data on clusters of commodity hardware—that is, in a distributed computing environment.
The Hadoop Distributed File System (HDFS) was developed to allow companies to more easily manage huge volumes of data in a simple and pragmatic way. Hadoop allows big problems to be decomposed into smaller elements so that analysis can be done quickly and cost-effectively. HDFS is a versatile, resilient, clustered approach to managing files in a big data environment.
HDFS is not the final destination for files. Rather it is a data “service” that offers a unique set of capabilities needed when data volumes and velocity are high.
MapReduce is a software framework that enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors. MapReduce was designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode.
The “map” component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called “reduce” aggregates all the elements back together to provide a result. An example of MapReduce usage would be to determine how many pages of a book are written in each of 50 different languages.
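The page-counting example can be sketched in a few lines of Python. This is a single-process illustration of the map, group, and reduce steps only; it is not Hadoop code, and it does none of the distribution, load balancing, or failure recovery that a real MapReduce framework provides. The input format of (page number, language) pairs and all function names are assumptions made for the sketch.

```python
from collections import defaultdict


def map_page(page: tuple) -> tuple:
    """Map step: emit a (language, 1) pair for one (page_number, language) input."""
    _page_number, language = page
    return (language, 1)


def group_by_key(pairs: list) -> dict:
    """Group the mapped pairs by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(values: list) -> int:
    """Reduce step: aggregate all values emitted for one language."""
    return sum(values)


def count_pages_by_language(pages: list) -> dict:
    """Run the map, group, and reduce steps in sequence on one machine."""
    mapped = [map_page(page) for page in pages]
    grouped = group_by_key(mapped)
    return {language: reduce_counts(values) for language, values in grouped.items()}


if __name__ == "__main__":
    book = [(1, "en"), (2, "fr"), (3, "en"), (4, "de")]
    print(count_pages_by_language(book))
```

Because each call to map_page depends only on its own page, the map step is the part that a framework can spread across many machines.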
LAYING THE GROUNDWORK FOR YOUR BIG DATA STRATEGY
Companies are swimming in big data. The problem is that they often don’t know how to pragmatically use that data to be able to predict the future, execute important business processes, or simply gain new insights. The goal of your big data strategy and plan should be to find a pragmatic way to leverage data for more predictable business outcomes.
Begin your big data strategy by embarking on a discovery process. You need to get a handle on what data you already have, where it is, who owns and controls it, and how it is currently used. For example, what are the third-party data sources that your company relies on? This process can give you a lot of insights:
You can determine how many data sources you have and how much overlap exists.
You can identify gaps that exist in knowledge about those data sources.
You might discover that you have lots of duplicate data in one area of the business and almost no data in another area.
You might ascertain that you are dependent on third-party data that isn’t as accurate as it should be.
Spend the time you need to do this discovery process because it will be the foundation for your planning and execution of your big data strategy.
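Part of the discovery process described above can be sketched as a simple inventory check. The snippet below assumes you have recorded, for each data source, who owns it and which fields it holds; it then reports which sources share fields (overlap) and which have no recorded owner (a knowledge gap). The source names, owners, and fields are hypothetical.

```python
from itertools import combinations

# Hypothetical inventory gathered during discovery.
INVENTORY = {
    "crm": {"owner": "sales", "fields": {"customer_id", "email", "region"}},
    "billing": {"owner": "finance", "fields": {"customer_id", "email", "invoice_total"}},
    "web_logs": {"owner": None, "fields": {"session_id", "url"}},
}


def field_overlap(inventory: dict) -> dict:
    """Return the shared fields for every pair of sources that overlap."""
    overlap = {}
    for first, second in combinations(sorted(inventory), 2):
        shared = inventory[first]["fields"] & inventory[second]["fields"]
        if shared:
            overlap[(first, second)] = sorted(shared)
    return overlap


def sources_without_owner(inventory: dict) -> list:
    """Return the names of sources for which no owner has been recorded."""
    return sorted(name for name, info in inventory.items() if not info["owner"])


if __name__ == "__main__":
    print(field_overlap(INVENTORY))
    print(sources_without_owner(INVENTORY))
```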