
Achieve New Updated (September) Cloudera CCD-470 Examination Questions 51-60

September 24, 2015

Ensurepass

 

QUESTION 51

What is the difference between a failed task attempt and a killed task attempt?

 

A.

A failed task attempt is a task attempt that threw an unhandled exception. A killed task attempt is one that was terminated by the JobTracker.

B.

A failed task attempt is a task attempt that did not generate any key-value pairs. A killed task attempt is a task attempt that threw an exception, and was thus killed by the execution framework.

C.

A failed task attempt is a task attempt that completed, but with an unexpected status value. A killed task attempt is a duplicate copy of a task attempt that was started as part of speculative execution.

D.

A failed task attempt is a task attempt that threw a RuntimeException (i.e., the task fails). A killed task attempt is a task attempt that threw any other type of exception (e.g., IOException); the execution framework catches these exceptions and reports them as killed.

 

Answer: C

Explanation: Note:

* Hadoop uses “speculative execution.” The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.

 

Failed tasks are tasks that error out.

 

* There are a few reasons Hadoop can kill tasks on its own:

a) The task does not report progress within the timeout (default is 10 minutes).

b) The FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).

c) Speculative execution makes the task's result unnecessary because a duplicate attempt has already completed elsewhere.

 

Reference: Difference failed tasks vs killed tasks

 

 

QUESTION 52

For each input key-value pair, mappers can emit:

 

A.

As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).

B.

As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair.

C.

One intermediate key-value pair, of a different type.

D.

One intermediate key-value pair, but of the same type.

E.

As many intermediate key-value pairs as designed, as long as all the keys have the same types and all the values have the same type.

 

 

 

 

 

Answer: E

Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs.

 

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

 

Reference: Hadoop Map-Reduce Tutorial
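To make option E concrete, here is a minimal sketch (hypothetical class name, written against the org.apache.hadoop.mapreduce API) of a mapper that emits zero, one, or many intermediate key-value pairs per input record, while every key is a Text and every value an IntWritable:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: each input line may produce zero, one, or many
// intermediate pairs, but all keys share one type and all values share one type.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {        // an empty line emits nothing at all
                word.set(token);
                context.write(word, ONE);  // one pair per token found
            }
        }
    }
}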

 

 

QUESTION 53

What happens in a MapReduce job when you set the number of reducers to zero?

 

A.

No reducer executes, but the mappers generate no output.

B.

No reducer executes, and the output of each mapper is written to a separate file in HDFS.

C.

No reducer executes, but the outputs of all the mappers are gathered together and written to a single file in HDFS.

D.

Setting the number of reducers to zero is invalid, and an exception is thrown.

 

Answer: B

Explanation: * It is legal to set the number of reduce-tasks to zero if no reduction is desired.

 

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
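A minimal driver sketch for such a map-only job (hypothetical class names and paths; assumes the Hadoop 2 Job.getInstance API). With setNumReduceTasks(0), each mapper's output goes straight to the output path, one file per map task (typically named part-m-NNNNN in current releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(TokenMapper.class); // hypothetical mapper class
        job.setNumReduceTasks(0);              // no reducers: map output is the job output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}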

 

 

QUESTION 54

You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?

 

A.

HDFS command

B.

Pig LOAD command

C.

Sqoop import

D.

Hive LOAD DATA command

E.

Ingest with Flume agents

F.

Ingest with Hadoop Streaming

 

Answer: B

Explanation: Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs.

 

We use Pig scripts to sift through the data and extract useful information from the Web logs. We load the log file into Pig using the LOAD command:

raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);

 

Note 1:

Data Flow and Components

* Content will be created by multiple Web servers and logged on local hard disks. This content is then pushed to HDFS using the Flume framework. Flume has agents running on the Web servers; intermediate collector machines gather the data from these agents and finally push it to HDFS.

* Pig scripts are scheduled to run using a job scheduler (which could be cron or a more sophisticated batch-job solution). These scripts analyze the logs along various dimensions and extract the results. Results from Pig are by default written to HDFS, but we can also use storage implementations for other repositories such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig scripts can either push this data to HDFS, in which case MR jobs are required to read it and push it into HBase, or they can push the data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the applicability of the Pig framework for log analysis at large scale.

* The database HBase will have the data processed by Pig scripts ready for reporting and further slicing and dicing.

* The data-access Web service is a REST-based service that eases access and integration with data clients. A client in any language can access the REST-based API; these clients could be BI- or UI-based clients.

 

 

 

 

 

Note 2:

The Log Analysis Software Stack

* Hadoop is an open source framework that allows users to process very large data sets in parallel. It is based on the framework that supports the Google search engine. The Hadoop core is mainly divided into two modules:

1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.

2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is coupled with HDFS.

* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, since we can keep historical processed data for reporting purposes. HBase is an open source columnar (NoSQL) database that uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets: HBase can store very large tables with millions of rows. It is a distributed database and can also keep multiple versions of a single row.

* The Pig framework is an open source platform for analyzing large data sets, implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers, since code written directly against Map-Reduce needs to be in Java; Pig instead enables users to write code in a scripting language.

* Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data (src: flume-wiki). It was built to push large logs into Hadoop HDFS for further processing. It is a data-flow solution in which each node has an originator and a destination, divided into Agent and Collector tiers for collecting logs and pushing them to destination storage.

 

Reference: Hadoop and Pig for Large-Scale Web Log Analysis

 

 

QUESTION 55

Which MapReduce daemon runs on each slave node and participates in job execution?

 

A.

TaskTracker

B.

JobTracker

 

 

 

 

C.

NameNode

D.

Secondary NameNode

 

Answer: A

Explanation: A single instance of the TaskTracker runs on each slave node. The TaskTracker runs as a separate JVM process.

 

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?

 

http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (see answer to question no. 5)

 

 

QUESTION 56

Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

 

A.

Oozie

B.

Sqoop

C.

Flume

D.

Hadoop Streaming

E.

mapred

 

Answer: D

Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)

 

 

QUESTION 57

Which of the following best describes the map method input and output?

 

 

 

 

 

A.

It accepts a single key-value pair as input and can emit only one key-value pair as output.

B.

It accepts a list of key-value pairs as input but can emit only one key-value pair as output.

C.

It accepts a single key-value pair as input and emits a single key and a list of corresponding values as output.

D.

It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.

 

Answer: D

Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object

Maps input key/value pairs to a set of intermediate key/value pairs.

 

Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

 

Reference: org.apache.hadoop.mapreduce, Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

 

 

QUESTION 58

The Hadoop framework provides a mechanism for coping with machine issues such as a faulty configuration or impending hardware failure. MapReduce detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously and the one that finishes first is used. This is called:

 

A.

Combine

B.

IdentityMapper

C.

IdentityReducer

D.

Default Partitioner

E.

Speculative Execution

 

Answer: E

Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

 

Reference: Apache Hadoop, Module 4: MapReduce
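Speculative execution can be switched on or off per job. A hedged sketch follows (assuming the Hadoop 2 Job.getInstance API; the property names vary by release, with the MRv1-era names used here and newer versions spelling them mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        // Allow speculative (duplicate) map attempts, but not reduce attempts.
        // MRv1-era property names; adjust for the Hadoop version in use.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return Job.getInstance(conf, "speculation example");
    }
}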

 

Note:

 

* Hadoop uses “speculative execution.” The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.

 

Failed tasks are tasks that error out.

 

* There are a few reasons Hadoop can kill tasks on its own:

a) The task does not report progress within the timeout (default is 10 minutes).

b) The FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).

c) Speculative execution makes the task's result unnecessary because a duplicate attempt has already completed elsewhere.

 

Reference: Difference failed tasks vs killed tasks

 

 

QUESTION 59

 

You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.

 

A.

There is no difference in output between the two settings.

B.

With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.

C.

With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.

D.

With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.

 

Answer: D

Explanation: * It is legal to set the number of reduce-tasks to zero if no reduction is desired.

 

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

 

Note:

Reduce

In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.

 

The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).

 

Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

 

The output of the Reducer is not sorted.
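For the scenario in this question, a hedged sketch of the mapper (hypothetical class name and regular expression): TextInputFormat supplies the byte offset as the key and the line as the value, the file name is recovered from the input split, and one pair is emitted per match. Whether those pairs end up in one output file per map task (zero reducers) or in a single reducer output file (one reducer) is then decided solely by the reducer count:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical mapper for Question 59: key = matching text,
// value = "filename:byte offset" of the line the match came from.
public class PatternMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Pattern PATTERN = Pattern.compile("ERROR \\w+"); // illustrative regex

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        Matcher m = PATTERN.matcher(line.toString());
        while (m.find()) {
            context.write(new Text(m.group()), new Text(fileName + ":" + offset.get()));
        }
    }
}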

 

 

 

 

 

 

QUESTION 60

How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?

 

A.

Keys are presented to reducer in sorted order; values for a given key are not sorted.

B.

Keys are presented to reducer in sorted order; values for a given key are sorted in ascending order.

C.

Keys are presented to a reducer in random order; values for a given key are not sorted.

D.

Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

 

Answer: A

Explanation: Reducer has 3 primary phases:

 

1. Shuffle

The Reducer copies the sorted output from each Mapper using HTTP across the network.

 

2. Sort

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

 

The shuffle and sort phases occur simultaneously; while outputs are being fetched they are merged.

 

SecondarySort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. (A sketch of such a grouping comparator appears at the end of this question.)

 

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

 

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

 

The output of the Reducer is not re-sorted.

 

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
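For the secondary-sort note above, a minimal sketch of the grouping-comparator piece, assuming a composite Text key of the form "naturalKey#secondaryKey" (the composite-key writable and the full-key sort comparator are omitted). Records are still sorted on the entire composite key, but this comparator groups them by the natural key only, so all values for one natural key arrive in a single reduce() call:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite keys of the form "naturalKey#secondaryKey" by their natural part.
public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(Text.class, true); // true = instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String naturalA = a.toString().split("#", 2)[0];
        String naturalB = b.toString().split("#", 2)[0];
        return naturalA.compareTo(naturalB);
    }
}

It would be wired in with job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class), alongside a sort comparator that orders records by the full composite key.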

 
