Cloudera-Authorized Apache Hadoop Data Analyst Training

Cloudera Data Analyst Training:
Using Pig, Hive and Impala with Hadoop

Take Your Knowledge to the Next Level with Cloudera's Apache Hadoop Training and Certification

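Cloudera University's three-day data analyst course teaches you to apply traditional data analytics and business intelligence skills to big data. Cloudera presents the tools data professionals need to access, manipulate, and analyze complex data sets using SQL and familiar scripting languages.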

Advance Your Ecosystem Expertise

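Apache Hive makes multi-structured data accessible to analysts, database administrators, and others without Java programming expertise. Apache Pig applies the fundamentals of familiar scripting languages to the Hadoop cluster. Cloudera Impala enables real-time, interactive analysis of the data stored in Hadoop through a native SQL environment.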

Using Hadoop

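Through instructor-led discussion and hands-on exercises, participants will work with the Hadoop ecosystem and learn the following:
The fundamentals of Apache Hadoop and data ETL (extract, transform, load), and how to use the related Hadoop tools
How to use Apache Pig to join multiple data sets and analyze disparate data
How to use Apache Hive to organize data by defining appropriate tables, perform data transformations, and simplify complex queries
How to use Impala to run real-time, interactive analytical queries on large-scale data stored in HDFS
How to choose the right tool for a given data analysis task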

Course Information

Duration: 3 days

Prerequisites

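This course is intended for data analysts, business analysts, and system administrators who have experience with SQL and basic UNIX/Linux commands. No prior Apache Hadoop experience is required.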

Course Format

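The course combines instructor lectures with hands-on lab work. Lab exercises follow each major topic, so participants can apply what they have just learned right away, reinforce the new concepts, and turn them into practical skills. Participants are encouraged to ask questions freely in class and interact with the instructor to get the most out of the course.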

Course Outline

Hadoop Fundamentals

The Motivation for Hadoop
Hadoop Overview
HDFS
MapReduce
The Hadoop Ecosystem
Lab Scenario Explanation
Hands-On Exercise: Data Ingest with

Introduction to Pig

What Is Pig?
Pig’s Features
Pig Use Cases
Interacting with Pig

Basic Data Analysis with Pig

Pig Latin Syntax
Loading Data
Simple Data Types
Field Definitions
Data Output
Viewing the Schema
Filtering and Sorting Data
Commonly-Used Functions
Hands-On Exercise: Using Pig for ETL Processing

Processing Complex Data with Pig

Storage Formats
Complex/Nested Data Types
Grouping
Built-in Functions for Complex Data
Iterating Grouped Data
Hands-On Exercise: Analyzing Ad Campaign Data with Pig

Multi-Dataset Operations with Pig

Techniques for Combining Data Sets
Joining Data Sets in Pig
Set Operations
Splitting Data Sets
Hands-On Exercise: Analyzing Disparate Data Sets with Pig

Extending Pig

Adding Flexibility with Parameters
Macros and Imports
UDFs
Contributed Functions
Using Other Languages to Process Data with Pig
Hands-On Exercise: Extending Pig with Streaming and UDFs

Pig Troubleshooting and Optimization

Troubleshooting Pig
Logging
Using Hadoop’s Web UI
Optional Demo: Troubleshooting a Failed Job with the Web UI
Data Sampling and Debugging
Performance Overview
Understanding the Execution Plan
Tips for Improving the Performance of Your Pig Jobs

Introduction to Hive

What Is Hive?
Hive Schema and Data Storage
Comparing Hive to Traditional Databases
Hive vs. Pig
Hive Use Cases
Interacting with Hive

Data Analysis with Hive

Hive Databases and Tables
Basic HiveQL Syntax
Data Types
Joining Data Sets
Common Built-in Functions
Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue

Hive Data Management

Hive Data Formats
Creating Databases and Hive-Managed Tables
Loading Data into Hive
Altering Databases and Tables
Self-Managed Tables
Simplifying Queries with Views
Storing Query Results
Controlling Access to Data
Hands-On Exercise: Data Management

Text Processing with Hive

Overview of Text Processing
Important String Functions
Using Regular Expressions in Hive
Sentiment Analysis and N-Grams
Hands-On Exercise (Optional): Gaining Insight with Sentiment Analysis

Hive Optimization

Understanding Query Performance
Controlling Job Execution Plan
Partitioning
Bucketing
Indexing Data

Extending Hive

SerDes
Data Transformation with Custom Scripts
User-Defined Functions
Parameterized Queries
Hands-On Exercise: Data Transformation with Hive

Introduction to Impala

What Is Impala?
How Impala Differs from Hive and Pig
How Impala Differs from Relational Databases
Limitations and Future Directions
Using the Impala Shell

Analyzing Data with Impala

Basic Syntax
Data Types
Filtering, Sorting, and Limiting Results
Joining and Grouping Data
Improving Impala Performance
Hands-On Exercise: Interactive Analysis with Impala

Choosing a Data Analysis Tool

Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
Which to Choose?