Course: PySpark for Big Data

In the course PySpark for Big Data, participants learn to use Apache Spark from Python.

Spark Architecture

The course PySpark for Big Data discusses the architecture of Spark, the Spark Cluster Manager and the difference between Batch and Stream Processing.

Hadoop

The course PySpark for Big Data first covers the Hadoop Distributed File System and parallel operations, and then moves on to working with RDDs (Resilient Distributed Datasets). The configuration of PySpark applications via SparkConf and SparkContext is also explained.

MapReduce and SQL

Extensive attention is given to the operations available on RDDs, including map and reduce. The use of SQL in Spark, DataFrames and the GraphX library are also discussed, as are iterative algorithms.

MLlib library

Finally, the course PySpark for Big Data covers machine learning with the MLlib library.

Audience PySpark for Big Data

The course PySpark for Big Data is intended for developers and aspiring data analysts who want to learn how to use Apache Spark from Python.

Prerequisites course PySpark for Big Data

Some programming experience is helpful for following this course. Prior knowledge of Python or of big data handling with Apache Spark is not required.

Course format PySpark for Big Data

The theory is covered in presentations, and illustrative demos clarify the concepts discussed. Theory and practice alternate, with ample opportunity for hands-on exercises. Course hours are 9:30 am to 4:30 pm.

Certification course PySpark for Big Data

After successful completion of the course, participants receive an official PySpark for Big Data certificate.

Modules

Module 1 : Python Primer

  • Python Syntax
  • Python Data Types
  • Lists, Tuples, Dictionaries
  • Python Control Flow
  • Functions and Parameters
  • Modules and Packages
  • Comprehensions
  • Iterators and Generators
  • Python Classes
  • Anaconda Environment
  • Jupyter Notebooks
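
As an illustration of the kind of material in the Python primer, the following minimal sketch shows a list comprehension, a dict comprehension and a generator. The function names are hypothetical and chosen only for this example; they are not taken from the course material.

```python
from typing import Iterator


def squares_of_evens(numbers: list[int]) -> list[int]:
    """List comprehension: filter and transform in a single expression."""
    return [n * n for n in numbers if n % 2 == 0]


def word_lengths(words: list[str]) -> dict[str, int]:
    """Dict comprehension: map each word to its length."""
    return {w: len(w) for w in words}


def countdown(start: int) -> Iterator[int]:
    """Generator: produces values lazily, one per iteration step."""
    while start > 0:
        yield start
        start -= 1


if __name__ == "__main__":
    print(squares_of_evens([1, 2, 3, 4]))
    print(word_lengths(["spark", "rdd"]))
    print(list(countdown(3)))
```

Generators are worth practising before moving on to Spark, because lazy evaluation (values computed only when requested) is also how Spark treats transformations.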

Module 2 : Spark Intro

  • What is Apache Spark?
  • Spark and Python
  • PySpark
  • Py4j Library
  • Data Driven Documents
  • RDDs
  • Real Time Processing
  • Apache Hadoop MapReduce
  • Cluster Manager
  • Batch versus Stream Processing
  • PySpark Shell

Module 3 : HDFS

  • Hadoop Environment
  • Environment Setup
  • Hadoop Stack
  • Hadoop YARN
  • Hadoop Distributed File System
  • HDFS Architecture
  • Parallel Operations
  • Working with Partitions
  • RDD Partitions
  • HDFS Data Locality
  • DAG (Directed Acyclic Graph)

Module 4 : SparkConf

  • SparkConf Object
  • Setting Configuration Properties
  • Uploading Files
  • SparkContext.addFile
  • Logging Configuration
  • Storage Levels
  • Serialize RDD
  • Replicate RDD partitions
  • DISK_ONLY
  • MEMORY_AND_DISK
  • MEMORY_ONLY

Module 5 : SparkContext

  • Main Entry Point
  • Executor
  • Worker Nodes
  • LocalFS
  • SparkContext Parameters
  • Master
  • RDD serializer
  • batchSize
  • Gateway
  • JavaSparkContext instance
  • Profiler

Module 6 : RDDs

  • Resilient Distributed Datasets
  • Key-Value pair RDDs
  • Parallel Processing
  • Immutability and Fault Tolerance
  • Transformation Operations
  • Filter, groupBy and Map
  • Action Operations
  • Caching and persistence
  • PySpark RDD Class
  • count, collect, foreach, filter
  • map, reduce, join, cache
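
The idea behind the filter, map and reduce operations listed above can be sketched in plain Python, without a Spark installation. This is only a local analogy under stated assumptions: the data is split into list chunks that stand in for RDD partitions, and `split_into_partitions` and `sum_of_even_squares` are hypothetical helpers, not Spark APIs. In PySpark itself the corresponding steps would use `SparkContext.parallelize`, `RDD.filter`, `RDD.map` and `RDD.reduce`.

```python
from functools import reduce
from operator import add


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks (stand-in for RDD partitions)."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_even_squares(data, num_partitions=3):
    """Filter, map and reduce each chunk, then combine the partial results."""
    partials = []
    for part in split_into_partitions(data, num_partitions):
        evens = filter(lambda x: x % 2 == 0, part)   # transformation: keep even values
        squared = map(lambda x: x * x, evens)        # transformation: square each value
        partials.append(reduce(add, squared, 0))     # per-chunk reduction
    return reduce(add, partials, 0)                  # combine the partial sums


if __name__ == "__main__":
    print(split_into_partitions(list(range(1, 11)), 3))
    print(sum_of_even_squares(list(range(1, 11))))
```

The two-level reduction only gives the right answer because addition is associative and commutative; the function passed to `RDD.reduce` must satisfy the same condition, since partial results from different partitions are combined in no guaranteed order.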

Module 7 : Spark Processing

  • SQL support in Spark
  • Spark 2.0 DataFrames
  • Defining tables
  • Importing datasets
  • Querying DataFrames using SQL
  • Storage formats
  • JSON / Parquet
  • GraphX
  • GraphX library overview
  • GraphX APIs

Module 8 : Broadcast and Accumulator

  • Performance Tuning
  • Serialization
  • Network Traffic
  • Disk Persistence
  • MarshalSerializer
  • Data Type Support
  • Python’s Pickle Serializer
  • DStreams
  • Sliding Window Operations
  • Multi Batch and State Operations

Module 9 : Algorithms

  • Iterative Algorithms
  • Graph Analysis
  • Machine Learning API
  • mllib.classification
  • Random Forest
  • Naive Bayes
  • Decision Tree
  • mllib.clustering
  • mllib.linalg
  • mllib.regression

Price: €2,450 excl. VAT
Offered by: SpiralTrain
Subject: Big Data
Level: (not listed)
Duration: 3 days
Total run time: 18 days
Language: English
Product type: course
Format: Classroom
Number of participants: max. 12
Time of day: Daytime

Dates and locations

  • Mon 15 Jun 2026: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 17 Aug 2026: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 12 Oct 2026: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 14 Dec 2026: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 15 Feb 2027: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 12 Apr 2027: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 14 Jun 2027: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 16 Aug 2027: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 11 Oct 2027: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 13 Dec 2027: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 14 Feb 2028: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 17 Apr 2028: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle
  • Mon 12 Jun 2028: Amsterdam, Eindhoven, Houten, Rotterdam, Utrecht, Zwolle

Provider quality marks: NRTO, UWV training voucher (UWV scholingsvoucher)