Azure Data Engineer Associate Boot Camp
DreamsPlus offers a comprehensive Azure Data Engineer Associate Boot Camp in Chennai and online, designed to provide hands-on experience and prepare you for the Microsoft certification in data engineering.
Design and implement data storage (15–20%)
Implement a partition strategy
- Create a partition strategy for files
- Create a partition strategy for analytical workloads
- Create a partition strategy for streaming workloads
- Create a partition strategy for Azure Synapse Analytics
- Identify when partitioning is needed in Azure Data Lake Storage Gen2 (see the sketch after this list)
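As a concrete illustration, here is a minimal PySpark sketch of a date-based file partition strategy in Azure Data Lake Storage Gen2. The storage account, container paths, and the sales columns are hypothetical placeholders, not part of the official objectives.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, col

spark = SparkSession.builder.appName("partition-strategy-sketch").getOrCreate()

# Hypothetical sales dataset already staged in the lake (paths are placeholders).
sales = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/sales/")

# Derive partition columns and write the data partitioned by year and month,
# so analytical queries that filter on date can prune whole folders.
(sales
 .withColumn("sale_year", year(col("sale_date")))
 .withColumn("sale_month", month(col("sale_date")))
 .write.mode("overwrite")
 .partitionBy("sale_year", "sale_month")
 .parquet("abfss://curated@<storage-account>.dfs.core.windows.net/sales/"))
```

The same folder-per-partition layout is what Synapse serverless SQL and Spark pools rely on for partition pruning.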
Design and implement the data exploration layer
- Create and execute queries by using a compute solution that leverages SQL serverless and Spark clusters (a sketch follows this list)
- Recommend and implement Azure Synapse Analytics database templates
- Push new or updated data lineage to Microsoft Purview
- Browse and search metadata in the Microsoft Purview Data Catalog
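A minimal exploration-layer sketch using a Spark cluster; the trip dataset, its columns, and the lake path are hypothetical. The same aggregate could instead be run against the files from a Synapse serverless SQL pool.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exploration-sketch").getOrCreate()

# Hypothetical trip data landed in the lake as Parquet (placeholder path).
trips = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/trips/")
trips.createOrReplaceTempView("trips")

# Exploratory aggregate over the raw data using Spark SQL.
spark.sql("""
    SELECT pickup_borough,
           COUNT(*)         AS trip_count,
           AVG(fare_amount) AS avg_fare
    FROM trips
    GROUP BY pickup_borough
    ORDER BY trip_count DESC
""").show()
```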
Develop data processing (40–45%)
Ingest and transform data
- Design and implement incremental loads
- Transform data by using Apache Spark (see the sketch after this list)
- Transform data by using Transact-SQL (T-SQL)
- Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory
- Transform data by using Azure Stream Analytics
- Cleanse data
- Handle duplicate data
- Handle missing data
- Handle late-arriving data
- Split data
- Shred JSON
- Encode and decode data
- Configure error handling for a transformation
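A minimal PySpark transformation sketch covering duplicate handling, missing data, and JSON shredding; the order records and payload schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

# Hypothetical raw orders with a duplicate row, a null, and a JSON payload column.
raw = spark.createDataFrame(
    [
        ("o1", "2024-01-05", None, '{"sku": "A1", "qty": 2.0}'),
        ("o1", "2024-01-05", None, '{"sku": "A1", "qty": 2.0}'),   # duplicate
        ("o2", "2024-01-06", "IN", '{"sku": "B7", "qty": 1.0}'),
    ],
    ["order_id", "order_date", "country", "payload"],
)

payload_schema = StructType([
    StructField("sku", StringType()),
    StructField("qty", DoubleType()),
])

cleaned = (raw
           .dropDuplicates(["order_id"])                          # handle duplicate data
           .fillna({"country": "UNKNOWN"})                        # handle missing data
           .withColumn("payload", from_json(col("payload"), payload_schema))
           .select("order_id", "order_date", "country",
                   "payload.sku", "payload.qty"))                  # shred JSON

cleaned.show()
```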
Develop a batch processing solution
- Develop batch processing solutions by using Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory
- Create data pipelines
- Scale resources
- Configure the batch size
- Integrate Jupyter or Python notebooks into a data pipeline
- Upsert data
- Revert data to a previous state
- Configure exception handling
- Configure batch retention
- Read from and write to a delta lake (see the sketch after this list)
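A minimal sketch of upserting into a delta lake table and reverting to a previous state, assuming a Delta-enabled Spark session (delta-spark installed) and a hypothetical customers table path.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert-sketch").getOrCreate()

# Hypothetical path to an existing Delta table in the lake.
TARGET_PATH = "abfss://curated@<storage-account>.dfs.core.windows.net/customers_delta/"

updates = spark.createDataFrame(
    [(1, "alice@example.com"), (3, "carol@example.com")],
    ["customer_id", "email"],
)

target = DeltaTable.forPath(spark, TARGET_PATH)

# Upsert: update matching rows, insert new ones.
(target.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Revert to a previous state by reading an earlier table version (time travel).
previous = spark.read.format("delta").option("versionAsOf", 0).load(TARGET_PATH)
```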
Develop a stream processing solution
- Create a stream processing solution by using Azure Event Hubs and Azure Stream Analytics
- Process data by using Spark structured streaming
- Create windowed aggregates (see the sketch after this list)
- Handle schema drift
- Process time series data
- Process data across partitions
- Process within one partition
- Scale resources
- Create tests for data pipelines
- Optimise pipelines for analytical or transactional purposes
- Handle interruptions
- Configure exception handling
- Upsert data
- Replay archived stream data
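A minimal Spark structured streaming sketch of a windowed aggregate with a watermark for late-arriving data. The built-in rate source stands in for an Event Hubs or Kafka stream, and the column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, avg

spark = SparkSession.builder.appName("windowed-agg-sketch").getOrCreate()

# The built-in "rate" source is used here purely for illustration.
events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
          .withColumnRenamed("value", "reading"))

# Tumbling 1-minute windows; the 5-minute watermark bounds how late data may arrive.
aggregated = (events
              .withWatermark("timestamp", "5 minutes")
              .groupBy(window(col("timestamp"), "1 minute"))
              .agg(avg("reading").alias("avg_reading")))

query = (aggregated.writeStream
         .outputMode("append")
         .format("console")
         .option("truncate", False)
         .start())
# query.awaitTermination()
```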
Manage batches and pipelines
- Manage data pipelines in Azure Data Factory or Azure Synapse Pipelines
- Handle failed batch loads
- Validate batch loads
- Schedule data pipelines by using Data Factory or Azure Synapse Pipelines (see the sketch after this list)
- Manage Spark jobs in a pipeline
- Apply version control for pipeline artefacts
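A minimal sketch of managing a pipeline run programmatically with the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are hypothetical placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Hypothetical identifiers; substitute your own.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data-platform"
FACTORY_NAME = "adf-bootcamp"
PIPELINE_NAME = "pl_daily_batch"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Trigger an on-demand run of the pipeline.
run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME,
                                  parameters={})

# Poll the run status, e.g. to detect and handle a failed batch load.
status = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status
print(f"Pipeline run {run.run_id} is {status}")
```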
Secure, monitor, and optimize data storage and data processing (30–35%)
Implement data security
- Implement data masking
- Implement row-level and column-level security
- Implement Azure role-based access control (RBAC)
- Implement POSIX-like access control lists (ACLs) for Data Lake Storage Gen2
- Encrypt data at rest and in transit
- Implement a data retention policy
- Implement secure endpoints (private and public)
- Implement resource tokens in Azure Databricks
- Load a DataFrame with sensitive information
- Write encrypted data to tables or Parquet files (see the sketch after this list)
- Manage sensitive information
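A minimal PySpark sketch of protecting a sensitive column before writing to Parquet. It uses salted hashing (pseudonymisation) rather than encryption, and relies on the storage layer for encryption at rest; the customer records and salt are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col, lit, concat

spark = SparkSession.builder.appName("mask-sensitive-sketch").getOrCreate()

# Hypothetical customer records containing a sensitive column.
customers = spark.createDataFrame(
    [(1, "alice@example.com", "IN"), (2, "bob@example.com", "US")],
    ["customer_id", "email", "country"],
)

# Pseudonymise the email with a salted SHA-256 hash so it can still be joined on.
salt = lit("demo-salt")  # in practice, keep the salt in a secret store such as Azure Key Vault
masked = (customers
          .withColumn("email_hash", sha2(concat(col("email"), salt), 256))
          .drop("email"))

# Write to Parquet; at-rest encryption is provided by ADLS Gen2 storage-service encryption.
masked.write.mode("overwrite").parquet("/tmp/masked_customers")
```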
Monitor data storage and data processing
- Implement logging used by Azure Monitor
- Configure monitoring services
- Monitor stream processing
- Measure the performance of data movement
- Monitor and update statistics about data across a system
- Monitor data pipeline performance
- Measure query performance
- Schedule and monitor pipeline tests
- Interpret Azure Monitor metrics and logs (see the sketch after this list)
- Implement a pipeline alert strategy
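A minimal sketch of querying pipeline logs with the azure-monitor-query SDK. It assumes Data Factory diagnostic logs are routed to a Log Analytics workspace in resource-specific mode (so the ADFActivityRun table exists); the workspace ID is a placeholder.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Hypothetical Log Analytics workspace ID.
WORKSPACE_ID = "<log-analytics-workspace-id>"

client = LogsQueryClient(DefaultAzureCredential())

# KQL query counting failed pipeline activity runs over the last day.
query = """
ADFActivityRun
| where Status == 'Failed'
| summarize failures = count() by PipelineName
"""

response = client.query_workspace(WORKSPACE_ID, query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(list(row))
```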
Optimize and troubleshoot data storage and data processing
- Compact small files (see the sketch after this list)
- Optimise resource management
- Handle data skew
- Handle data spill
- Tune queries by using indexers
- Tune queries by using cache
- Troubleshoot a failed Spark job
- Troubleshoot a failed pipeline run, including activities executed in external services
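A minimal PySpark sketch of compacting small files by rewriting them into fewer, larger Parquet files; the lake paths and the target file count are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files-sketch").getOrCreate()

# Hypothetical landing path that has accumulated many small Parquet files.
SOURCE = "abfss://raw@<storage-account>.dfs.core.windows.net/events/"
TARGET = "abfss://curated@<storage-account>.dfs.core.windows.net/events_compacted/"

df = spark.read.parquet(SOURCE)

# Rewrite into a small number of larger files; in practice the target count would be
# derived from the data volume (for example, total size / 256 MB).
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet(TARGET))
```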