Q: What's the CPU?Answer: The CPU is the "brain" of the computer.
Remarks:
Every process on your computer is eventually handled by your CPU.
This includes calculations and also instructions for the other components of the computer.
Q: What's the memory (RAM)?Answer:
When your program runs, data gets temporarily stored in memory before getting sent to the CPU.
Memory is ephemeral storage - when your computer shuts down, the data in the memory is lost.
Q: What's the storage (SSD or Magnetic Disk)?Answer:
Storage is used for keeping data over long periods of time.
When a
program runs, the CPU will direct the memory to temporarily load data
from long-term storage.
Q: What's the Network (LAN or Internet)?Answer:
The network is the gateway for anything that you need that isn't stored on
your computer.
The network could connect to other computers in the same
room (a Local Area Network) or to a computer on the other side of the
world, connected over the internet.
Q: Rank the following hardware components in order from fastest to slowest: Memory, Disk Storage, Network, CPU.Answer:
CPU
Memory (RAM)
Disk Storage (SSD)
Network
Remarks: CPU operations are fastest. Operations in memory (RAM) are the second
fastest. Then comes hard disk storage and finally transferring data
across a network. Keep these relative speeds in mind. They'll help you
understand the constraints when working with big data.
Q: What are the functions of the CPU?Answer:
It has a few different functions including directing other
components of a computer as well as running mathematical calculations.
The CPU can also store small amounts of data inside itself in what are
called registers.
Example:
For example, say you write a program that reads in a 40 MB data file
and then analyzes the file.
When you execute the code, the instructions
are loaded into the CPU.
The CPU then instructs the computer to take the
40 MB from disk and store the data in memory (RAM).
If you want to sum a
column of data, then the CPU will essentially take two numbers at a
time and sum them together.
The accumulation of the sum needs to be
stored somewhere while the CPU grabs the next number. This cumulative sum will be stored in a register.
The registers make
computations more efficient: the registers avoid having to send data
unnecessarily back and forth between memory (RAM) and the CPU.
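The accumulation described above can be sketched in plain Python, where the running total plays the role of the register value (the numbers are just illustrative):

```python
# The running total stands in for the CPU register holding the cumulative sum,
# so the partial result stays with the computation instead of being written
# back to memory (RAM) on every step.
column = [3, 1, 4, 1, 5, 9]

total = 0  # plays the role of the register
for value in column:
    total += value  # the CPU takes two numbers at a time and sums them

print(total)  # 23
```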
Q: What does it mean for a CPU to be 2.5 Gigahertz?Answer: It means that the CPU processes 2.5 billion operations per second.
Q: Knowing that tweets create approximately 104 billion bytes of data per
day, how long would it take the 2.5 GigaHertz CPU to analyze a full day
of tweets?Answer: Assuming the CPU handles 8 bytes per operation, it processes 2.5 billion operations/second * 8 bytes = 20 billion bytes per second, so:
104 billion bytes * (1 second / 20 billion bytes) = 5.2 seconds
Remarks:
Twitter generates about 6,000 tweets per second, and each tweet
contains 200 bytes. So in one day, Twitter generates data on the order
of:
(6000 tweets / second) x (86400 seconds / day) x (200 bytes / tweet) = 104 billion bytes / day
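The arithmetic in this card can be checked directly; note that the 20-billion-bytes-per-second rate assumes the 2.5 GHz CPU moves 8 bytes per operation:

```python
# Daily tweet volume, per the remark above.
tweets_per_second = 6000
seconds_per_day = 86400
bytes_per_tweet = 200

bytes_per_day = tweets_per_second * seconds_per_day * bytes_per_tweet
print(bytes_per_day)  # 103_680_000_000, on the order of 104 billion bytes

# Assumption: 8 bytes handled per operation, giving 20 billion bytes/second.
cpu_bytes_per_second = 2.5e9 * 8
seconds_to_process = 104e9 / cpu_bytes_per_second
print(seconds_to_process)  # 5.2
```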
Q: What are the limitations of memory (RAM)?Answer:
It's relatively expensive
It's ephemeral (data stored in RAM gets erased when the computer shuts down)
Remarks: However, it is efficient: operations in RAM are relatively fast compared to reading and writing from disk or moving data across a network.
Q: What is shuffling?Answer: Moving data back and forth between different nodes of a cluster.
Remarks: Since this is very time expensive, Spark tries to reduce shuffling.
Q: List the key ratios of processing speed between the major hardware components.Answer:
CPU: 200x faster than memory
Memory: 15x faster than SSD
SSD: 20x faster than network
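Chaining these ratios gives the end-to-end gaps, which is why shipping data over a network is so costly relative to keeping it near the CPU:

```python
# Pairwise ratios from the card above.
cpu_vs_memory = 200
memory_vs_ssd = 15
ssd_vs_network = 20

# Multiplying pairwise ratios yields end-to-end comparisons.
cpu_vs_ssd = cpu_vs_memory * memory_vs_ssd      # 3_000x
cpu_vs_network = cpu_vs_ssd * ssd_vs_network    # 60_000x
print(cpu_vs_ssd, cpu_vs_network)  # 3000 60000
```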
Q: What's a difference between parallel computing and distributed computing?Answer:
At a high level, distributed computing implies multiple CPUs each with
its own memory.
Parallel computing uses multiple CPUs sharing the same
memory.
Q: What are the four components of Hadoop?Answer:
Hadoop - an ecosystem of tools for big data storage and data analysis. Hadoop is an older system than Spark but is still used by many companies.
Hadoop MapReduce - a system for processing and analyzing large data sets in parallel.
Hadoop YARN - a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
Hadoop Distributed File System (HDFS) - a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.
Remarks: The major difference between Spark and Hadoop is how they use memory. Hadoop writes intermediate results to disk whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.
Q: How does Spark differ from Hadoop?Answer:
Spark is generally faster than Hadoop. This is because Hadoop writes
intermediate results to disk whereas Spark tries to keep intermediate
results in memory whenever possible.
The Hadoop ecosystem includes a distributed file storage system called
HDFS (Hadoop Distributed File System). Spark, on the other hand, does
not include a file storage system. You can use Spark on top of HDFS but
you do not have to. Spark can read in data from other sources as well
such as Amazon S3.
Q: What is MapReduce?Answer: MapReduce is a programming technique for manipulating large data sets.
Remarks: "Hadoop MapReduce" is a specific implementation of this programming technique.
Q: How does MapReduce work?Answer:
The technique works by first dividing up a large dataset and
distributing the data across a cluster.
In the map step, each record is
analyzed and converted into a (key, value) pair.
Then these key-value
pairs are shuffled across the cluster so that all keys are on the same
machine.
In the reduce step, the values with the same keys are combined
together.
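The map, shuffle, and reduce steps can be sketched on a single machine in plain Python, using word counting as the classic example (on a real cluster each stage would run in parallel across machines):

```python
from collections import defaultdict

records = ["spark hadoop spark", "hadoop yarn"]

# Map: each record is converted into (key, value) pairs.
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle: pairs are grouped so all values for a key end up together
# (on a cluster, on the same machine).
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: values sharing a key are combined.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'spark': 2, 'hadoop': 2, 'yarn': 1}
```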
Q: What happens in the shuffle step of MapReduce?Answer:
The shuffle step finds all of the data points across the cluster that share the
same key, and brings those data points to
the same network node for further analysis.
Q: What are the four modes to set up Spark?Answer:
Local mode (a single machine, no cluster manager) and three cluster modes, distinguished by their cluster manager: Spark Standalone, YARN, and Mesos.
Q: List differences between Data Warehouses and Data LakesAnswer:
Data form: tabular vs. all formats
Data value: high only vs. high- or medium-value, or to be discovered
Ingestion: ETL vs. ELT
Data model: Star & Snowflake schemas with conformed dimensions or data marts and OLAP cubes vs. Star & Snowflake schemas and OLAP cubes are possible, but so are other ad-hoc representations
Schema: Known before ingestion (schema-on-write) vs. On-the-fly at the time of analysis (schema-on-read)
Technology: Expensive MPP databases with expensive disks and connectivity vs. Commodity hardware with parallelism as first principle.
Data Quality: High with effort for consistency and clear rules for accessibility vs. mixed, some data remain in raw format, some data is transformed to higher quality
Users: Business analysts vs. Data scientists, business analysts & ML engineers
Analytics: Reports and business intelligence visualizations vs. Machine Learning, graph analytics and data exploration.
Q: What is schema-on-read?Answer:
Schema on read refers to an innovative data analysis strategy in new data-handling tools like Hadoop and other more involved database technologies.
In schema on read, data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes in.
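A minimal schema-on-read sketch in plain Python: raw records are stored as-is, and the schema (field selection and typing) is applied only when the data is read for analysis. The field names here are purely illustrative:

```python
import json

# Raw records are ingested and stored without any upfront schema.
raw_records = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": "7", "extra": "ignored"}',
]

def read_with_schema(raw):
    # Schema applied on read: pick fields and cast types at analysis
    # time, not at ingestion time.
    parsed = json.loads(raw)
    return {"user": parsed["user"], "clicks": int(parsed["clicks"])}

table = [read_with_schema(r) for r in raw_records]
print(table)  # [{'user': 'a', 'clicks': 3}, {'user': 'b', 'clicks': 7}]
```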
Q: Difference between data warehouses and data lakesAnswer:
Data: relational, from transactional systems, operational databases, and line-of-business applications vs. non-relational and relational, from IoT devices, web sites, mobile apps, social media, and corporate applications
Schema: designed prior to the DW implementation (schema-on-write) vs. written at the time of analysis (schema-on-read)
Price/Performance: fastest query results using higher-cost storage vs. query results getting faster using low-cost storage
Data Quality: highly curated data that serves as the central version of the truth vs. any data that may or may not be curated (i.e. raw data)
Users: business analysts vs. data scientists, data developers, and business analysts (using curated data)
Analytics: batch reporting, BI, and visualizations vs. machine learning, predictive analytics, data discovery, and profiling
Q: What is a data lake?Answer:
A data lake is a centralized repository that allows you to store all
your structured and unstructured data at any scale.
You can store your
data as-is, without having to first structure the data, and run
different types of analytics—from dashboards and visualizations to big
data processing, real-time analytics, and machine learning to guide
better decisions.
Q: Describe the bottled water vs. data lake analogyAnswer:
A data warehouse is like a producer of water where you are handed bottled water in a particular size and shape
In contrast, a data lake is a lake where many water streams flow into it and everyone is free to choose the water in the way they want to.
Q: What are issues of data lakes?Answer: Data lakes are prone to becoming a chaotic data garbage dump (a "data swamp", German: "Datensumpf"). To prevent this, detailed metadata (e.g. a data catalog) should be put in place.