Achieve New Updated (September) Cloudera DS-200 Examination Questions 11-20

September 24, 2015

Ensurepass

 

QUESTION 11

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).

 

For the moment, assume that your data matrix M is 500 x 2. The figure below shows a plot of the data.

[Figure: plot of the data (image not reproduced)]

Which line represents the second principal component?

 

A. Blue
B. Yellow

 

Answer: A
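For a 500 x 2 matrix, the second principal component is the direction orthogonal to the first, and it carries the smaller share of the variance. The sketch below illustrates this on synthetic data, not on the data in the figure. The slope, the noise levels, and the helper names `principal_axes_2d` and `variance_along` are assumptions made for illustration, and the closed-form angle only applies to the 2-D case.

```python
import math
import random

def principal_axes_2d(points):
    """Return (first, second) unit principal directions of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2 x 2 covariance matrix of the centered data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form angle of the leading eigenvector of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))  # first, rotated by 90 degrees
    return first, second

def variance_along(points, direction):
    """Sample variance of the points projected onto a unit direction."""
    projected = [x * direction[0] + y * direction[1] for x, y in points]
    mean = sum(projected) / len(projected)
    return sum((p - mean) ** 2 for p in projected) / (len(projected) - 1)

# Synthetic stand-in for the 500 x 2 matrix M (assumed, elongated point cloud).
rng = random.Random(0)
points = []
for _ in range(500):
    x = rng.gauss(0.0, 3.0)
    points.append((x, 0.5 * x + rng.gauss(0.0, 0.5)))

first, second = principal_axes_2d(points)
print("first component direction: ", first)
print("second component direction:", second)
print("variance along first: ", variance_along(points, first))
print("variance along second:", variance_along(points, second))
```

In a plot like the one described, the line running along the long axis of the point cloud would be the first component, and the line crossing it at a right angle would be the second.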

 

 

QUESTION 12

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:

[Figure: table showing the makeup of the ALL and AML groups (image not reproduced)]

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a continuous value between -1 and 1.

You’ve built your model for discriminating between AML and ALL patients and you find that it works quite well on your current data. One month later, a collaborator tells you she has fresh data from 100 new AML/ALL patients. You run the samples through your model, and it turns out your model has very poor predictive accuracy on the new samples; specifically, your model predicts that all males have ALL. What is the most reliable way to fix this problem?

 

 

 

 

 

A. Change the distance metric
B. Reduce the number of dimensions
C. Use a Gibbs sampler on a Bayesian network
D. Perform matched sampling across other provided variables

 

Answer: D
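Matched sampling balances a nuisance variable, such as sex, across the classes so that a model cannot use it as a shortcut for the diagnosis. The sketch below shows one simple way to do this. Because the group makeup figure is not reproduced here, the counts are hypothetical, and the field names `diagnosis` and `sex` and the helper `matched_sample` are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def matched_sample(patients, label_key="diagnosis", match_key="sex", seed=0):
    """Keep equally many patients of every label within each stratum of match_key."""
    rng = random.Random(seed)
    labels = {p[label_key] for p in patients}
    strata = defaultdict(lambda: defaultdict(list))
    for p in patients:
        strata[p[match_key]][p[label_key]].append(p)
    matched = []
    for by_label in strata.values():
        # The rarest label in this stratum limits how many matches can be drawn.
        size = min(len(by_label.get(label, [])) for label in labels)
        for label in sorted(labels):
            matched.extend(rng.sample(by_label.get(label, []), size))
    return matched

def make(diagnosis, sex, count):
    """Build `count` placeholder patient records (gene values omitted)."""
    return [{"diagnosis": diagnosis, "sex": sex} for _ in range(count)]

# Hypothetical makeup: 20 ALL and 32 AML patients with a strong sex imbalance.
patients = (make("ALL", "M", 17) + make("ALL", "F", 3)
            + make("AML", "M", 5) + make("AML", "F", 27))

balanced = matched_sample(patients)
counts = Counter((p["sex"], p["diagnosis"]) for p in balanced)
print(counts)
```

After matching, each sex contributes the same number of ALL and AML patients, at the cost of discarding unmatched records.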

 

 

QUESTION 13

You need to analyze 60,000,000 images stored in JPEG format, each of which is approximately 25 KB. Because your Hadoop cluster isn’t optimized for storing and processing many small files, you decide to do the following actions:

1. Group the individual images into a set of larger files

2. Use the set of larger files as input for a MapReduce job that processes them directly with Python using Hadoop streaming

 

Which data serialization system gives you the flexibility to do this?

 

A. CSV
B. XML
C. HTML
D. Avro
E. Sequence Files
F. JSON

 

Answer: BF
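The motivation for grouping the images can be made concrete with some quick arithmetic. The sketch below uses the image count and size from the question; the 128 MB container size is an assumption chosen for illustration and is not stated in the question.

```python
import math

image_count = 60_000_000
image_bytes = 25 * 1024               # "approximately 25 KB" per image
container_bytes = 128 * 1024 * 1024   # assumed target size per grouped file

total_bytes = image_count * image_bytes
images_per_container = container_bytes // image_bytes
container_count = math.ceil(image_count / images_per_container)

print(f"total data: {total_bytes / 1e12:.2f} TB")
print(f"images per container file: {images_per_container}")
print(f"container files needed: {container_count}")
```

Under these assumptions, the job reads on the order of ten thousand large files instead of sixty million small ones.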

 

 

QUESTION 14

From historical data, you know that 50% of students who take Cloudera’s Introduction to Data Science: Building Recommender Systems training course pass this exam, while only 25% of students who did not take the training course pass this exam. You also know that 50% of this exam’s candidates also take Cloudera’s Introduction to Data Science: Building Recommender Systems training course.

If we know that a person has passed this exam, what is the probability that they took Cloudera’s Introduction to Data Science: Building Recommender Systems training course?

 

 

 

 

 

A. 2/3
B. 1/2
C. 3/4
D. 3/5

 

Answer: A
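The probability asked for here follows from Bayes' rule: P(course | pass) = P(pass | course) P(course) / P(pass). A minimal check with exact fractions, using only the percentages stated in the question (the variable names are illustrative):

```python
from fractions import Fraction

p_course = Fraction(1, 2)                # 50% of candidates take the course
p_pass_given_course = Fraction(1, 2)     # 50% of course takers pass
p_pass_given_no_course = Fraction(1, 4)  # 25% of the others pass

# Law of total probability: overall chance of passing.
p_pass = (p_pass_given_course * p_course
          + p_pass_given_no_course * (1 - p_course))

# Bayes' rule: chance of having taken the course, given a pass.
p_course_given_pass = p_pass_given_course * p_course / p_pass

print(p_pass)               # 3/8
print(p_course_given_pass)  # 2/3
```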

 

 

QUESTION 15

You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web servers’ access logs into your Hadoop cluster for analysis?

 

A. Sample the web server logs from the web servers and copy them into HDFS using curl
B. Channel these click streams into Hadoop using Hadoop Streaming
C. Write a MapReduce job with the web servers for mappers and the Hadoop cluster nodes for reducers
D. Import all user clicks from your OLTP databases into Hadoop using Sqoop
E. Ingest the server web logs into HDFS using Flume

 

Answer: E

 

 

QUESTION 16

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML), both variants of a blood cancer.

 

The makeup of the groups is as follows:

[Figure: table showing the makeup of the ALL and AML groups (image not reproduced)]

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a continuous value between -1 and 1.

You want to use the data from the 52 patients in the scenario to improve the ability of doctors to distinguish between ALL and AML. What type of data science problem is this?

 

A. Classification
B. Regression
C. Clustering
D. Filtering

 

Answer: A

 

 

QUESTION 17

A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week.

One engineer points out that some bugs are more difficult to fix than others. What metric should you use to estimate how hard a particular bug is to fix?

 

 

 

 

 

A. The tech lead’s estimate of how many hours would be needed to fix the bug
B. The priority of the bug according to the project manager
C. The number of years that the engineer who was assigned the bug has worked at the company
D. The number of bugs that had been found in each sub-component of the project

 

Answer: D

 

 

QUESTION 18

What is the best way to determine the learning rate parameters for stochastic gradient descent when the distribution of the input data shifts over time?

 

A. The learning rate should be adjusted periodically based on the setting that optimizes the objective function over a sample of recent observations
B. The learning rate should be a fixed number that decays as the number of observations in the data set increases
C. The learning rate should be the value that optimizes the value of the objective function over the first N samples in the dataset
D. The learning rate should be a fixed number with a constant decay factor
E. The learning rate should be continuously adjusted based on the value that optimizes the objective function for the most recent observation from the input data

 

Answer: A
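The idea of re-tuning the learning rate periodically on a sample of recent observations can be sketched as follows. This is a toy illustration under stated assumptions, not a reference implementation: the model is a one-parameter regression y ~ weight * x, the candidate rates, the window of 50 observations, and the drifting stream (whose true slope changes from 2 to -1 halfway through) are all invented for the example.

```python
import random

def sgd_pass(weight, samples, rate):
    """Run one SGD pass for y ~ weight * x with squared loss; return the new weight."""
    for x, y in samples:
        gradient = 2.0 * (weight * x - y) * x
        weight -= rate * gradient
    return weight

def mean_squared_error(weight, samples):
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)

def pick_rate(weight, recent, candidates):
    """Pick the candidate rate whose trial pass gives the lowest loss on recent data."""
    return min(candidates,
               key=lambda r: mean_squared_error(sgd_pass(weight, recent, r), recent))

def run_stream(stream, candidates=(0.001, 0.01, 0.1, 0.5), window=50):
    weight, rate, recent = 0.0, candidates[0], []
    for i, sample in enumerate(stream, start=1):
        weight = sgd_pass(weight, [sample], rate)   # ordinary online update
        recent = (recent + [sample])[-window:]      # keep only recent observations
        if i % window == 0:                         # periodic re-tuning step
            rate = pick_rate(weight, recent, candidates)
    return weight, rate

def drifting_stream(n=1000, seed=0):
    """Synthetic stream whose true slope shifts from 2.0 to -1.0 halfway through."""
    rng = random.Random(seed)
    for i in range(n):
        slope = 2.0 if i < n // 2 else -1.0
        x = rng.uniform(-1.0, 1.0)
        yield x, slope * x + rng.gauss(0.0, 0.05)

final_weight, final_rate = run_stream(drifting_stream())
print("final weight:", final_weight)
print("final rate:", final_rate)
```

Because the rate is re-selected from recent data only, the model can speed up again after the shift instead of staying stuck with a rate that suited the old distribution.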

 

 

QUESTION 19

You have acquired a new data source of millions of customer records, and you’ve loaded this data into HDFS. Prior to analysis, you want to change all customer registration dates to the same date format, make all addresses uppercase, and remove all customer names (for anonymization). Which process will accomplish all three objectives?

 

A. Adapt the data cleansing module in Mahout to your data, and invoke the Mahout library when you run your analysis
B. Pull this data into an RDBMS using Sqoop and scrub records using stored procedures
C. Write a script that receives records on stdin, corrects them, and then writes them to stdout. Then, invoke this script in a map-only Hadoop Streaming job
D. Write a MapReduce job with a mapper to change words to uppercase and to reduce different forms of dates to a single form

 

 

 

 

 

Answer: C
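A stdin-to-stdout cleaning script of the kind described in option C might look like the sketch below. The record layout (tab-separated name, registration date, address), the list of accepted date formats, and the function names are assumptions made for illustration; real data would need its own parsing rules.

```python
import sys
from datetime import datetime

# Assumed input layout (hypothetical): name <TAB> registration date <TAB> address
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")

def normalize_date(text):
    """Rewrite a date in any assumed format as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text.strip()  # leave unrecognized dates untouched

def clean_record(line):
    """Drop the name, normalize the date, and uppercase the address."""
    name, registered, address = line.rstrip("\n").split("\t")
    return "\t".join((normalize_date(registered), address.upper()))

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        if line.strip():
            stdout.write(clean_record(line) + "\n")

# Only consume stdin when records are actually piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Such a script would be passed to Hadoop Streaming as the mapper with the number of reduce tasks set to zero (for example with `-numReduceTasks 0`), so each record is cleaned independently and no shuffle or reduce phase runs.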

 

 

QUESTION 20

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).

You performed singular value decomposition (SVD; also called principal components analysis or PCA) on your data matrix but you did not center your data first. What does your first singular component describe?

 

A. The mean of the data set
B. The variance of the data set
C. The standard deviation of the data set
D. The maximum of the data set
E. The median of the data set

 

Answer: A
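The effect of skipping the centering step can be checked numerically. The sketch below uses synthetic 2-D data clustered around an assumed point (10, 5), far from the origin, and compares the first right singular vector with the direction of the data mean. The helper names are illustrative, and the closed-form angle only works for the 2-D case.

```python
import math
import random

def first_right_singular_vector_2d(rows):
    """Leading right singular vector of an n x 2 matrix, via its 2 x 2 Gram matrix."""
    gxx = sum(x * x for x, _ in rows)
    gyy = sum(y * y for _, y in rows)
    gxy = sum(x * y for x, y in rows)
    # Closed-form angle of the leading eigenvector of the symmetric Gram matrix.
    theta = 0.5 * math.atan2(2.0 * gxy, gxx - gyy)
    return math.cos(theta), math.sin(theta)

def cosine(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))

rng = random.Random(0)
# Synthetic, uncentered data: a cloud around the assumed point (10, 5).
rows = [(10.0 + rng.gauss(0, 1), 5.0 + rng.gauss(0, 1)) for _ in range(500)]
mean = (sum(x for x, _ in rows) / len(rows), sum(y for _, y in rows) / len(rows))

uncentered = first_right_singular_vector_2d(rows)
centered = first_right_singular_vector_2d([(x - mean[0], y - mean[1]) for x, y in rows])

print("alignment with mean, uncentered:", abs(cosine(uncentered, mean)))
print("alignment with mean, centered:  ", abs(cosine(centered, mean)))
```

When the mean is far from the origin relative to the spread of the data, the first singular vector of the uncentered matrix points almost exactly along the mean, while the centered version reflects only the shape of the cloud.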
