Big Data - Concepts, Technology and Architecture. 1
Book Description. 11
1.1 Understanding Big Data. 13
1.2 Evolution of Big Data. 14
1.3 Failure of Traditional database in handling Big Data. 15
1.3 (a) Data Mining Vs Big Data. 16
1.4 3 Vs of Big Data. 17
1.4.1 Volume. 17
1.4.2 Velocity. 18
1.4.3 Variety. 19
1.5 Sources of Big Data. 19
1.6 Different Types of Data. 21
1.6.1 Structured Data. 22
1.6.2 Unstructured Data. 22
1.6.3 Semi-Structured Data. 23
1.7 Big Data Infrastructure. 24
1.8 Big Data Life Cycle. 25
1.8.1 Big Data Generation. 26
1.8.2 Data Aggregation. 26
1.8.3 Data Preprocessing. 27
1.7.3 Big Data Analytics. 31
1.7.4 Visualizing Big Data. 32
1.8 Big Data Technology. 32
1.8.1 Challenges faced by Big Data technology. 34
1.8.1 Heterogeneity and incompleteness. 34
1.8.2 Volume and velocity of the Data. 35
1.8.3 Data Storage. 35
1.8.4 Data Privacy. 36
1.9 Big Data Applications. 36
1.10 Big Data Use Cases. 37
1.9.1 Healthcare. 37
1.9.2 Telecom. 38
1.9.3 Financial Services. 39
Chapter 1 refresher. 40
Conceptual short Questions with answers. 43
Frequently asked Interview questions. 45
Chapter Objective. 46
Big Data Storage Concepts. 46
2.1 Cluster computing. 47
2.1.1 Types of cluster. 49
2.1.1.1 High availability cluster. 50
2.1.1.2 Load balancing cluster. 50
2.1.2 Cluster structure. 51
2.3 Distribution Models. 53
2.3.1 Sharding. 54
2.3.2 Data Replication. 56
2.3.2.1 Master-Slave model. 57
2.3.2.2 Peer-to-Peer model. 58
2.3.3 Sharding and Replication. 59
2.4 Distributed file system. 60
2.5 Relational and Non Relational Databases. 61
CoursesOffered. 62
Figure 2.12 Data divided across multiple related tables. 62
2.4.2 RDBMS Databases. 63
2.4.3 NoSQL Databases. 63
2.4.4 NewSQL Databases. 64
2.5 Scaling Up and Scaling Out Storage. 65
Chapter 2 refresher. 67
Conceptual short questions with answers. 69
Chapter Objective. 72
3.1 Introduction to NoSQL. 72
3.2 Why NoSQL. 72
3.3 CAP theorem. 73
3.4 ACID. 75
3.5 BASE. 76
3.6 Schemaless Database. 77
3.7 NoSQL (Not Only SQL). 77
3.7.1 NoSQL Vs RDBMS. 78
3.7.2 Features of NoSQL database. 79
3.7.3 Types of NoSQL Technologies. 80
3.7.3.1 Key-Value store database. 81
3.7.3.2 Column-store database. 82
3.7.3.3 Document Oriented Database. 84
3.7.3.4 Graph-oriented Database. 86
3.7.4 NoSQL Operations. 93
3.9 Migrating from RDBMS to NoSQL. 98
Chapter 3 refresher. 99
Conceptual short questions with answers. 102
Chapter Objective. 104
4.1 Data Processing. 104
4.2 Shared Everything Architecture. 106
4.2.1 Symmetric multiprocessing architecture. 107
4.2.2 Distributed Shared memory. 108
4.3 Shared nothing architecture. 109
4.4 Batch Processing. 110
4.5 Real-Time Data Processing. 111
4.6 Parallel Computing. 112
4.7 Distributed Computing. 113
4.8 Big Data Virtualization. 113
4.8.1 Attributes of Virtualization. 114
4.8.1.1 Encapsulation. 115
4.8.1.2 Partitioning. 115
4.8.1.3 Isolation. 115
4.8.2 Big Data Server Virtualization. 116
4.9 Introduction. 116
4.10 Cloud computing types. 118
4.11 Cloud Services. 120
4.12 Cloud Storage. 121
4.12.1 Architecture of GFS. 121
4.12.1.1 Master. 123
4.12.1.2 Client. 123
4.13 Cloud Architecture. 127
Cloud Challenges. 129
Chapter 4 Refresher. 130
Conceptual short questions with answers. 133
Chapter Objective. 139
5.1 Apache Hadoop. 139
5.1.1 Architecture of Apache Hadoop. 140
5.1.2 Hadoop Ecosystem Components Overview. 140
5.2 Hadoop Storage. 142
5.2.1 HDFS (Hadoop Distributed File System). 142
5.2.2 Why HDFS? 143
5.2.3 HDFS Architecture. 143
5.2.4 HDFS Read/Write Operation. 146
5.2.5 Rack Awareness. 148
5.2.6 Features of HDFS. 149
5.2.6.1 Cost-effective. 149
5.2.6.2 Distributed storage. 149
5.2.6.3 Data Replication. 149
5.3 Hadoop Computation. 149
5.3.1 MapReduce. 149
5.3.1.1 Mapper. 151
5.3.1.2 Combiner. 151
5.3.1.3 Reducer. 152
5.3.1.4 JobTracker and TaskTracker. 153
5.3.2 MapReduce Input Formats. 154
5.3.3 MapReduce Example. 156
5.3.4 MapReduce Processing. 157
5.3.5 MapReduce Algorithm. 160
5.3.6 Limitations of MapReduce. 161
5.4 Hadoop 2.0. 161
5.4.1 Hadoop 1.0 limitations. 162
5.4.2 Features of Hadoop 2.0. 163
5.4.3 Yet Another Resource Negotiator (YARN). 164
5.4.3 Core components of YARN. 165
5.4.3.1 ResourceManager. 165
5.4.3.2 NodeManager. 166
5.4.4 YARN Scheduler. 169
5.4.4.1 FIFO scheduler. 169
5.4.4.2 Capacity Scheduler. 170
5.4.4.3 Fair Scheduler. 170
5.4.5 Failures in YARN. 171
5.4.5.1 ResourceManager failure. 171
5.4.5.2 ApplicationMaster failure. 172
5.4.5.3 NodeManager Failure. 172
5.4.5.4 Container Failure. 172
5.3 HBASE. 173
5.4 Apache Cassandra. 176
5.5 SQOOP. 177
5.6 Flume. 179
5.6.1 Flume Architecture. 179
5.6.1.1 Event. 180
5.6.1.2 Agent. 180
5.7 Apache Avro. 181
5.8 Apache Pig. 182
5.9 Apache Mahout. 183
5.10 Apache Oozie. 183
5.10.1 Oozie Workflow. 184
5.10.2 Oozie Coordinators. 186
5.10.3 Oozie Bundles. 187
5.11 Apache Hive. 187
Hive Architecture. 189
Hadoop Distributions. 190
Chapter 5 refresher. 191
Conceptual short questions with answers. 194
Frequently asked Interview Questions. 199
Chapter Objective. 200
6.1 Terminologies of Big Data Analytics. 201
Data Warehouse. 201
Business Intelligence. 201
Analytics. 202
6.2 Big Data Analytics. 202
6.2.1 Descriptive Analytics. 204
6.2.2 Diagnostic Analytics. 205
6.2.3 Predictive Analytics. 205
6.2.4 Prescriptive Analytics. 205
6.3 Data Analytics Lifecycle. 207
6.3.1 Business case evaluation and Identify the source data. 208
6.3.2 Data preparation. 209
6.3.3 Data Extraction and Transformation. 210
6.3.4 Data Analysis and visualization. 211
6.3.5 Analytics application. 212
6.4 Big Data Analytics Techniques. 212
6.4.1 Quantitative Analysis. 212
6.4.3 Statistical analysis. 214
6.4.3.1 A/B testing. 214
6.4.3.2 Correlation. 215
6.4.3.3 Regression. 218
6.5 Semantic Analysis. 220
6.5.1 Natural Language Processing. 220
6.5.2 Text Analytics. 221
6.7 Big Data Business Intelligence. 222
6.7.1 Online Transaction Processing (OLTP). 223
6.7.2 Online Analytical Processing (OLAP). 223
6.7.3 Real-Time Analytics Platform (RTAP). 224
6.6 Big Data Real Time Analytics Processing. 225
6.7 Enterprise Data Warehouse. 227
Chapter 6 Refresher. 228
Conceptual short questions with answers. 230
Chapter Objective. 233
7.1 Introduction to Machine learning. 233
7.2 Machine learning use cases. 234
7.3 Types of Machine learning. 235
7.3.1 Supervised machine learning algorithm. 236
7.3.1.1 Classification. 237
7.3.1.2 Regression. 238
Support vector machines (SVM). 239
Big Data Analytics Practical Application. 244
Chapter 7 Refresher. 245
Conceptual short questions with answers. 247
Chapter Objective. 249
8.1 Itemset Mining. 249
8.2 Association Rules. 255
8.3 Frequent itemset generation. 259
8.4 Itemset Mining Algorithms. 260
8.4.1 Apriori Algorithm. 260
8.4.1.2 Frequent Itemset generation using Apriori Algorithm. 266
8.4.2 Eclat Algorithm - Equivalence Class Transformation Algorithm. 268
8.4.3 FP growth algorithm. 271
8.5 Maximal and Closed Frequent Itemset. 278
Mining Closed Frequent Itemsets: Charm Algorithm. 284
CHARM Algorithm implementation. 285
Data Mining Methods. 287
8.8 Prediction. 288
8.8.2 Classification techniques. 289
8.8.2.1 Bayesian Network. 289
8.8.2.2 K-Nearest Neighbor Algorithm. 294
8.8.2.2.1 The Distance metric. 296
8.8.2.2.2 The parameter selection - cross validation. 296
8.8.2.3 Decision tree classifier. 297
Density based clustering algorithm. 299
DBSCAN. 299
Kernel Density Estimation. 303
8.9.3 Artificial Neural Network. 303
The Biological Neural Network. 303
8.11 Mining Data Streams. 305
Time Series Forecasting. 306
9.1 Clustering. 308
Application of Hierarchical methods. 315
Kernel k-means clustering. 321
Expectation Maximization Clustering Algorithm. 323
Methods of determining the Number of clusters. 327
Outlier detection. 327
Types of Outliers. 329
Outlier detection techniques. 332
Training dataset based outlier detection. 332
Assumption based outlier detection. 333
Applications of outlier detection. 334
9.6.3 Optimization Algorithm. 335
Choosing the Number of Clusters. 339
Bayesian Analysis of Mixtures. 342
Fuzzy Clustering. 342
10.1 Big Data Visualization. 345
10.2 Conventional Data Visualization Techniques. 346
10.2.1 Line Chart. 346
10.2.2 Bar Chart. 347
10.2.3 Pie Chart. 348
10.2.4 Scatter Plot. 349
10.2.5 Bubble plot. 350
Tableau. 350
Connecting to data. 354
Connecting to data in Cloud. 355
Connect to a file. 356
Scatter plot in tableau. 362
Histogram using Tableau. 365
Bar chart in tableau. 365
Line Chart. 367
Pie chart. 368
Bubble chart. 369
Box Plot. 370
Tableau Use Cases. 371
Airlines. 371
Office Supplies. 372
Sports. 374
Science - Earthquake Analysis. 375
Tableau is used to analyze the magnitude of earthquakes and their frequency of occurrence over the years. 375
Installing R and Getting Ready. 377
R Basic commands. 378
Assigning value to a variable. 378
Data Structures in R. 379
Vector. 379
Coercion. 380
Length, Mean and median. 381
Matrix. 382
Arrays. 385
Data frames. 387
Lists. 390
Importing data from a file. 392
Importing data from a delimited text file. 394
Control Structures in R. 394
If-else. 395
Nested if-else. 395
for loops. 396
Example. 396
while loops. 397
Break. 398
Basic Graphs in R. 398
Pie Charts. 398
3D - Pie Charts. 399
Bar Charts. 400
Boxplots. 401
Histograms. 402
Line charts. 403
Scatter plots. 405