Data Engineering Nanodegree - Part 3 - Data Lakes with Spark

Module 4

Data Wrangling With Spark

Q: In which language is Spark written? Answer: Scala

Q: How is it possible that Spark programs can be written in Python if Python is not a functional programming language? Answer:
  • The PySpark API allows you to write Spark programs in Python and ensures that your code uses functional programming practices.
  • Under the hood, the Python code uses py4j to make calls to the Java Virtual Machine (JVM), as the sketch below illustrates.

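A minimal sketch of this functional style in PySpark (the app name and data here are made up for illustration): map takes a pure, side-effect-free function, and py4j forwards the actual work to the JVM.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("functional-demo").getOrCreate()
    nums = spark.sparkContext.parallelize([1, 2, 3, 4])
    # map takes a pure function (no shared state is mutated);
    # PySpark hands the job to the JVM through py4j
    squared = nums.map(lambda x: x * x)
    print(squared.collect())  # [1, 4, 9, 16]
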
Q: What are resilient distributed datasets (RDDs)? Answer:
  • RDDs are exactly what their name says: fault-tolerant datasets distributed across a cluster.
  • This is how Spark stores data (see the sketch below).

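A short sketch of creating and using an RDD directly (app name, data, and partition count are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    # Distribute a local collection across the cluster as an RDD with 4 partitions
    rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
    # Transformations are lazy; Spark records the lineage, which is what makes
    # the dataset fault-tolerant: lost partitions can be recomputed
    evens = rdd.filter(lambda x: x % 2 == 0)
    print(evens.count())  # 50
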
The Power Of Spark

Q: What's the CPU? Answer: The CPU is the "brain" of the computer. Remarks:
  • Every process on your computer is eventually handled by your CPU.
  • This includes calculations and also instructions for the other components of the computer.

Q: What's the memory (RAM)? Answer:
  • When your program runs, data gets temporarily stored in memory before getting sent to the CPU. 
  • Memory is ephemeral storage - when your computer shuts down, the data in the memory is lost.

Q: What's the storage (SSD or Magnetic Disk)? Answer:
  • Storage is used for keeping data over long periods of time. 
  • When a program runs, the CPU will direct the memory to temporarily load data from long-term storage.

Q: What's the Network (LAN or Internet)? Answer:
  • Network is the gateway for anything that you need that isn't stored on your computer. 
  • The network could connect to other computers in the same room (a Local Area Network) or to a computer on the other side of the world, connected over the internet.

Q: Rank the following hardware components in order from fastest to slowest: Memory, Disk Storage, Network, CPU. Answer:
  1. CPU
  2. Memory (RAM)
  3. Disk Storage (SSD)
  4. Network
Remarks: CPU operations are fastest. Operations in memory (RAM) are the second fastest. Then comes hard disk storage and finally transferring data across a network. Keep these relative speeds in mind. They'll help you understand the constraints when working with big data.

Q: What are the functions of the CPU? Answer:
  • It has a few different functions including directing other components of a computer as well as running mathematical calculations. 
  • The CPU can also store small amounts of data inside itself in what are called registers.
Example:
  • For example, say you write a program that reads in a 40 MB data file and then analyzes the file. 
  • When you execute the code, the instructions are loaded into the CPU. 
  • The CPU then instructs the computer to take the 40 MB from disk and store the data in memory (RAM). 
  • If you want to sum a column of data, then the CPU will essentially take two numbers at a time and sum them together. 
  • The accumulation of the sum needs to be stored somewhere while the CPU grabs the next number. This cumulative sum will be stored in a register. 
  • The registers make computations more efficient: the registers avoid having to send data unnecessarily back and forth between memory (RAM) and the CPU.

Q: What does it mean for a CPU to be 2.5 Gigahertz? Answer: It means that the CPU processes 2.5 billion operations per second.

Q: Knowing that tweets create approximately 104 billion bytes of data per day, how long would it take the 2.5 Gigahertz CPU to analyze a full day of tweets? Answer: Assuming the CPU handles 8 bytes per operation, it processes 2.5 billion operations/second × 8 bytes/operation = 20 billion bytes per second, so: 104 billion bytes × (1 second / 20 billion bytes) = 5.2 seconds. Remarks:
  • Twitter generates about 6,000 tweets per second, and each tweet contains 200 bytes. So in one day, Twitter generates data on the order of:
  • (6000 tweets / second) x (86400 seconds / day) x (200 bytes / tweet) = 104 billion bytes / day

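The same back-of-the-envelope arithmetic in plain Python (the 8 bytes per operation is the assumption stated above):

    tweets_per_second = 6_000
    bytes_per_tweet = 200
    seconds_per_day = 86_400
    daily_bytes = tweets_per_second * seconds_per_day * bytes_per_tweet  # ~104 billion

    ops_per_second = 2.5e9   # 2.5 Gigahertz CPU
    bytes_per_op = 8         # assumed bytes handled per operation
    throughput = ops_per_second * bytes_per_op  # 20 billion bytes/second

    print(daily_bytes / throughput)  # ~5.2 seconds
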
Q: What are the limitations of memory (RAM)? Answer:
  1. It's relatively expensive
  2. It's ephemeral (data stored in RAM gets erased when the computer shuts down)
Remarks: However, it is efficient: operations in RAM are relatively fast compared to reading and writing from disk or moving data across a network.

Q: What is shuffling? Answer: Moving data back and forth between different nodes of a cluster. Remarks: Since shuffling is very expensive in time, Spark tries to reduce it (see the sketch below).

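One place this shows up in practice: reduceByKey pre-aggregates values on each node before shuffling, while groupByKey ships every record across the network first. A minimal sketch (the data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # groupByKey moves every (key, value) pair to the node owning the key
    grouped = pairs.groupByKey().mapValues(sum)

    # reduceByKey combines values locally first, so far less data is shuffled
    reduced = pairs.reduceByKey(lambda a, b: a + b)
    print(reduced.collect())  # [('a', 4), ('b', 2)] (order may vary)
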
Q: List the key ratios of processing speed between the major hardware components. Answer:
  1. CPU: 200x faster than memory
  2. Memory: 15x faster than SSD
  3. SSD: 20x faster than network

Q: What's the difference between parallel computing and distributed computing? Answer:
  • At a high level, distributed computing implies multiple CPUs each with its own memory.
  • Parallel computing uses multiple CPUs sharing the same memory.

Q: What are the four components of Hadoop? Answer:
  1. Hadoop - an ecosystem of tools for big data storage and data analysis. Hadoop is an older system than Spark but is still used by many companies.
  2. Hadoop MapReduce - a system for processing and analyzing large data sets in parallel.
  3. Hadoop YARN - a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
  4. Hadoop Distributed File System (HDFS) - a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.

Remarks: The major difference between Spark and Hadoop is how they use memory. Hadoop writes intermediate results to disk whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.

Q: How does Spark differ from Hadoop? Answer:
  1. Spark is generally faster than Hadoop. This is because Hadoop writes intermediate results to disk whereas Spark tries to keep intermediate results in memory whenever possible.
  2. The Hadoop ecosystem includes a distributed file storage system called HDFS (Hadoop Distributed File System). Spark, on the other hand, does not include a file storage system. You can use Spark on top of HDFS but you do not have to. Spark can read in data from other sources as well such as Amazon S3.

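For instance, reading data straight from Amazon S3 without HDFS might look like the sketch below; the bucket and path are placeholders, and the cluster needs the S3 connector (hadoop-aws) plus AWS credentials configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-read").getOrCreate()
    # "s3a://" is the Hadoop S3 connector scheme; bucket and key are hypothetical
    df = spark.read.csv("s3a://my-bucket/logs/*.csv", header=True)
    df.show(5)
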
Q: What is MapReduce? Answer: MapReduce is a programming technique for manipulating large data sets. Remarks: "Hadoop MapReduce" is a specific implementation of this programming technique.

Q: How does MapReduce work? Answer:
  • The technique works by first dividing up a large dataset and distributing the data across a cluster.
  • In the map step, each data point is analyzed and converted into a (key, value) pair.
  • Then these key-value pairs are shuffled across the cluster so that all pairs with the same key end up on the same machine.
  • In the reduce step, the values with the same keys are combined together (see the word-count sketch below).

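The classic word count maps cleanly onto these steps. A PySpark sketch (the input file is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapreduce-demo").getOrCreate()
    lines = spark.sparkContext.textFile("input.txt")  # hypothetical input

    counts = (lines.flatMap(lambda line: line.split())  # split records into words
                   .map(lambda word: (word, 1))         # map step: (key, value) pairs
                   .reduceByKey(lambda a, b: a + b))    # shuffle + reduce per key
    print(counts.take(10))
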
Q: What happens in the shuffle step of MapReduce? Answer:
  • The shuffle step finds all of the data across the cluster that share the same key.
  • All of those data points are brought to the same network node for further analysis.

Q: What are the four modes to set up Spark? Answer (see the sketch after this list):
  1. Local
  2. Spark standalone
  3. YARN
  4. Mesos

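The mode is chosen through the master URL when the session is created (or via spark-submit --master). A sketch with typical values; the host names are placeholders:

    from pyspark.sql import SparkSession

    # Local mode uses all cores on this machine with no cluster manager;
    # the commented alternatives select the other three modes
    spark = (SparkSession.builder
             .master("local[*]")  # or "spark://host:7077", "yarn", "mesos://host:5050"
             .appName("mode-demo")
             .getOrCreate())
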
Module 5

Introduction To Data Lakes

Q: List differences between Data Warehouses and Data Lakes. Answer (data warehouse vs. data lake):
  1. Data form: tabular vs. all formats
  2. Data value: high only vs. high- or medium-value, or still to be discovered
  3. Ingestion: ETL vs. ELT
  4. Data model: star & snowflake schemas with conformed dimensions or data marts and OLAP cubes vs. star & snowflake schemas and OLAP cubes are possible, but so are other ad-hoc representations
  5. Schema: known before ingestion (schema-on-write) vs. on-the-fly at the time of analysis (schema-on-read)
  6. Technology: expensive MPP databases with expensive disks and connectivity vs. commodity hardware with parallelism as a first principle
  7. Data quality: high, with effort for consistency and clear rules for accessibility vs. mixed; some data remains in raw format, some is transformed to higher quality
  8. Users: business analysts vs. data scientists, business analysts & ML engineers
  9. Analytics: reports and business intelligence visualizations vs. machine learning, graph analytics, and data exploration

Q: What is schema-on-read? Answer:
  • Schema-on-read refers to a data analysis strategy used in newer data-handling tools like Hadoop and other more involved database technologies.
  • In schema-on-read, the data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes in (see the sketch below).

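In Spark this is the default behavior when reading semi-structured files: the schema is inferred at read time rather than declared up front. A sketch (the file path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()
    # No schema is declared beforehand; Spark infers it while scanning the file
    df = spark.read.json("events.json")
    df.printSchema()
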
Q: Difference between data warehouses and data lakes. Answer (data warehouse vs. data lake):
  • Data: relational data from transactional systems, operational databases, and line-of-business applications vs. non-relational and relational data from IoT devices, web sites, mobile apps, social media, and corporate applications
  • Schema: designed prior to the DW implementation (schema-on-write) vs. written at the time of analysis (schema-on-read)
  • Price/performance: fastest query results using higher-cost storage vs. query results getting faster using low-cost storage
  • Data quality: highly curated data that serves as the central version of the truth vs. any data that may or may not be curated (i.e. raw data)
  • Users: business analysts vs. data scientists, data developers, and business analysts (using curated data)
  • Analytics: batch reporting, BI, and visualizations vs. machine learning, predictive analytics, data discovery and profiling

Q: What is a data lake? Answer:
  • A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
  • You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Q: Describe the bottled water vs. data lake analogy. Answer:
  • A data warehouse is like a producer of bottled water: you are handed water in a particular size and shape.
  • In contrast, a data lake is a lake into which many streams flow, and everyone is free to take the water in whatever way they want.

Q: What are issues of data lakes? Answer: Data lakes are prone to becoming a chaotic data garbage dump (a "data swamp"). To prevent this, detailed metadata (e.g. a data catalog) should be put in place.

Acronyms

RDD: Resilient Distributed Dataset

JVM: Java Virtual Machine

CPU: Central Processing Unit

RAM: Random Access Memory

SSD: Solid State Drive

LAN: Local Area Network

HDFS: Hadoop Distributed File System