Data Engineering Nanodegree - Part 3 - Data Lakes with Spark

Module 4

Data Wrangling With Spark

Q: In which language is Spark written? Answer: Scala

Q: How is it possible that Spark programs can be written in Python if Python is not a functional programming language? Answer:
  • The PySpark API allows you to write Spark programs in Python and ensures that your code uses functional programming practices.
  • Under the hood, the Python code uses py4j to make calls to the Java Virtual Machine (JVM), as the sketch below illustrates.

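A minimal sketch of this functional style in PySpark (the app name and data here are made up for illustration): map takes a pure, side-effect-free function, and py4j forwards the actual work to the JVM.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("functional-demo").getOrCreate()
    nums = spark.sparkContext.parallelize([1, 2, 3, 4])
    # map takes a pure function (no shared state is mutated);
    # PySpark hands the job to the JVM through py4j
    squared = nums.map(lambda x: x * x)
    print(squared.collect())  # [1, 4, 9, 16]
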
Q: What are resilient distributed datasets (RDDs)? Answer:
  • RDDs are exactly what their name says: fault-tolerant datasets distributed across a cluster.
  • This is how Spark stores data (see the sketch below).

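A short sketch of creating and using an RDD directly (app name, data, and partition count are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    # Distribute a local collection across the cluster as an RDD with 4 partitions
    rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
    # Transformations are lazy; Spark records the lineage, which is what makes
    # the dataset fault-tolerant: lost partitions can be recomputed
    evens = rdd.filter(lambda x: x % 2 == 0)
    print(evens.count())  # 50
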
The Power Of Spark

Q: What's the CPU? Answer: The CPU is the "brain" of the computer. Remarks:
  • Every process on your computer is eventually handled by your CPU.
  • This includes calculations and also instructions for the other components of the computer.

Q: What's the memory (RAM)? Answer:
  • When your program runs, data gets temporarily stored in memory before getting sent to the CPU. 
  • Memory is ephemeral storage - when your computer shuts down, the data in the memory is lost.

Q: What's the storage (SSD or Magnetic Disk)? Answer:
  • Storage is used for keeping data over long periods of time. 
  • When a program runs, the CPU will direct the memory to temporarily load data from long-term storage.

Q: What's the Network (LAN or Internet)? Answer:
  • Network is the gateway for anything that you need that isn't stored on your computer. 
  • The network could connect to other computers in the same room (a Local Area Network) or to a computer on the other side of the world, connected over the internet.

Q: Rank the following hardware components in order from fastest to slowest: Memory, Disk Storage, Network, CPU. Answer:
  1. CPU
  2. Memory (RAM)
  3. Disk Storage (SSD)
  4. Network
Remarks: CPU operations are fastest. Operations in memory (RAM) are the second fastest. Then comes hard disk storage and finally transferring data across a network. Keep these relative speeds in mind. They'll help you understand the constraints when working with big data.

Q: What are the functions of the CPU? Answer:
  • It has a few different functions including directing other components of a computer as well as running mathematical calculations. 
  • The CPU can also store small amounts of data inside itself in what are called registers.
Example:
  • For example, say you write a program that reads in a 40 MB data file and then analyzes the file. 
  • When you execute the code, the instructions are loaded into the CPU. 
  • The CPU then instructs the computer to take the 40 MB from disk and store the data in memory (RAM). 
  • If you want to sum a column of data, then the CPU will essentially take two numbers at a time and sum them together. 
  • The accumulation of the sum needs to be stored somewhere while the CPU grabs the next number. This cumulative sum will be stored in a register. 
  • The registers make computations more efficient: the registers avoid having to send data unnecessarily back and forth between memory (RAM) and the CPU.

Q: What does it mean for a CPU to be 2.5 Gigahertz? Answer: It means that the CPU processes 2.5 billion operations per second.

Q: Knowing that tweets create approximately 104 billion bytes of data per day, how long would it take the 2.5 Gigahertz CPU to analyze a full day of tweets? Answer: Assuming the CPU handles 8 bytes per operation, it processes 2.5 billion operations/second × 8 bytes/operation = 20 billion bytes per second, so: 104 billion bytes × (1 second / 20 billion bytes) = 5.2 seconds. Remarks:
  • Twitter generates about 6,000 tweets per second, and each tweet contains 200 bytes. So in one day, Twitter generates data on the order of:
  • (6000 tweets / second) x (86400 seconds / day) x (200 bytes / tweet) = 104 billion bytes / day

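The same back-of-the-envelope arithmetic in plain Python (the 8 bytes per operation is the assumption stated above):

    tweets_per_second = 6_000
    bytes_per_tweet = 200
    seconds_per_day = 86_400
    daily_bytes = tweets_per_second * seconds_per_day * bytes_per_tweet  # ~104 billion

    ops_per_second = 2.5e9   # 2.5 Gigahertz CPU
    bytes_per_op = 8         # assumed bytes handled per operation
    throughput = ops_per_second * bytes_per_op  # 20 billion bytes/second

    print(daily_bytes / throughput)  # ~5.2 seconds
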
Q: What are the limitations of memory (RAM)? Answer:
  1. It's relatively expensive
  2. It's ephemeral (data stored in RAM gets erased when the computer shuts down)
Remarks: However, it is efficient: operations in RAM are relatively fast compared to reading and writing from disk or moving data across a network.

Q: What is shuffling? Answer: Moving data back and forth between different nodes of a cluster. Remarks: Since shuffling is very expensive in time, Spark tries to reduce it (see the sketch below).

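One place this shows up in practice: reduceByKey pre-aggregates values on each node before shuffling, while groupByKey ships every record across the network first. A minimal sketch (the data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # groupByKey moves every (key, value) pair to the node owning the key
    grouped = pairs.groupByKey().mapValues(sum)

    # reduceByKey combines values locally first, so far less data is shuffled
    reduced = pairs.reduceByKey(lambda a, b: a + b)
    print(reduced.collect())  # [('a', 4), ('b', 2)] (order may vary)
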
Q: List the key ratios of processing speed between the major hardware components. Answer:
  1. CPU: 200x faster than memory
  2. Memory: 15x faster than SSD
  3. SSD: 20x faster than network

Q: What's the difference between parallel computing and distributed computing? Answer:
  • At a high level, distributed computing implies multiple CPUs each with its own memory.
  • Parallel computing uses multiple CPUs sharing the same memory.

Q: What are the four components of Hadoop? Answer:
  1. Hadoop - an ecosystem of tools for big data storage and data analysis. Hadoop is an older system than Spark but is still used by many companies.
  2. Hadoop MapReduce - a system for processing and analyzing large data sets in parallel.
  3. Hadoop YARN - a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
  4. Hadoop Distributed File System (HDFS) - a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.

Remarks: The major difference between Spark and Hadoop is how they use memory. Hadoop writes intermediate results to disk whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.

Q: How does Spark differ from Hadoop? Answer:
  1. Spark is generally faster than Hadoop. This is because Hadoop writes intermediate results to disk whereas Spark tries to keep intermediate results in memory whenever possible.
  2. The Hadoop ecosystem includes a distributed file storage system called HDFS (Hadoop Distributed File System). Spark, on the other hand, does not include a file storage system. You can use Spark on top of HDFS but you do not have to. Spark can read in data from other sources as well such as Amazon S3.

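For instance, reading data straight from Amazon S3 without HDFS might look like the sketch below; the bucket and path are placeholders, and the cluster needs the S3 connector (hadoop-aws) plus AWS credentials configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-read").getOrCreate()
    # "s3a://" is the Hadoop S3 connector scheme; bucket and key are hypothetical
    df = spark.read.csv("s3a://my-bucket/logs/*.csv", header=True)
    df.show(5)
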
Q: What is MapReduce? Answer: MapReduce is a programming technique for manipulating large data sets. Remarks: "Hadoop MapReduce" is a specific implementation of this programming technique.

Q: How does MapReduce work? Answer:
  • The technique works by first dividing up a large dataset and distributing the data across a cluster.
  • In the map step, each data point is analyzed and converted into a (key, value) pair.
  • Then these key-value pairs are shuffled across the cluster so that all pairs with the same key end up on the same machine.
  • In the reduce step, the values with the same keys are combined together (see the word-count sketch below).

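The classic word count maps cleanly onto these steps. A PySpark sketch (the input file is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapreduce-demo").getOrCreate()
    lines = spark.sparkContext.textFile("input.txt")  # hypothetical input

    counts = (lines.flatMap(lambda line: line.split())  # split records into words
                   .map(lambda word: (word, 1))         # map step: (key, value) pairs
                   .reduceByKey(lambda a, b: a + b))    # shuffle + reduce per key
    print(counts.take(10))
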
Q: What happens in the shuffle step of MapReduce? Answer:
  • The shuffle step finds all of the data across the cluster that share the same key.
  • All of those data points are brought to the same network node for further analysis.

Q: What are the four modes to set up Spark? Answer (see the sketch after this list):
  1. Local
  2. Spark standalone
  3. YARN
  4. Mesos

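The mode is chosen through the master URL when the session is created (or via spark-submit --master). A sketch with typical values; the host names are placeholders:

    from pyspark.sql import SparkSession

    # Local mode uses all cores on this machine with no cluster manager;
    # the commented alternatives select the other three modes
    spark = (SparkSession.builder
             .master("local[*]")  # or "spark://host:7077", "yarn", "mesos://host:5050"
             .appName("mode-demo")
             .getOrCreate())
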
Module 5

Introduction To Data Lakes

Q: List differences between Data Warehouses and Data Lakes. Answer (data warehouse vs. data lake):
  1. Data form: tabular vs. all formats
  2. Data value: high only vs. high- or medium-value, or still to be discovered
  3. Ingestion: ETL vs. ELT
  4. Data model: star & snowflake schemas with conformed dimensions or data marts and OLAP cubes vs. star & snowflake schemas and OLAP cubes are possible, but so are other ad-hoc representations
  5. Schema: known before ingestion (schema-on-write) vs. on-the-fly at the time of analysis (schema-on-read)
  6. Technology: expensive MPP databases with expensive disks and connectivity vs. commodity hardware with parallelism as a first principle
  7. Data quality: high, with effort for consistency and clear rules for accessibility vs. mixed; some data remains in raw format, some is transformed to higher quality
  8. Users: business analysts vs. data scientists, business analysts & ML engineers
  9. Analytics: reports and business intelligence visualizations vs. machine learning, graph analytics, and data exploration

Q: What is schema-on-read? Answer:
  • Schema-on-read refers to a data analysis strategy used in newer data-handling tools like Hadoop and other more involved database technologies.
  • In schema-on-read, the data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes in (see the sketch below).

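In Spark this is the default behavior when reading semi-structured files: the schema is inferred at read time rather than declared up front. A sketch (the file path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()
    # No schema is declared beforehand; Spark infers it while scanning the file
    df = spark.read.json("events.json")
    df.printSchema()
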
Q: Difference between data warehouses and data lakes. Answer (data warehouse vs. data lake):
  • Data: relational data from transactional systems, operational databases, and line-of-business applications vs. non-relational and relational data from IoT devices, web sites, mobile apps, social media, and corporate applications
  • Schema: designed prior to the DW implementation (schema-on-write) vs. written at the time of analysis (schema-on-read)
  • Price/performance: fastest query results using higher-cost storage vs. query results getting faster using low-cost storage
  • Data quality: highly curated data that serves as the central version of the truth vs. any data that may or may not be curated (i.e. raw data)
  • Users: business analysts vs. data scientists, data developers, and business analysts (using curated data)
  • Analytics: batch reporting, BI, and visualizations vs. machine learning, predictive analytics, data discovery and profiling

Q: What is a data lake? Answer:
  • A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
  • You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Q: Describe the bottled water vs. data lake analogy. Answer:
  • A data warehouse is like a producer of bottled water: you are handed water in a particular size and shape.
  • In contrast, a data lake is a lake into which many streams flow, and everyone is free to take the water in whatever way they want.

Q: What are issues of data lakes? Answer: Data lakes are prone to becoming a chaotic data garbage dump (a "data swamp"). To prevent this, detailed metadata (e.g. a data catalog) should be put in place.

Acronyms

RDD: Resilient Distributed Dataset

JVM: Java Virtual Machine

CPU: Central Processing Unit

RAM: Random Access Memory

SSD: Solid State Drive

LAN: Local Area Network

HDFS: Hadoop Distributed File System