Cloudera CCA175 practice test

CCA Spark and Hadoop Developer Exam


Question 1

Problem Scenario 96 : Your Spark application requires the extra Java options below:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Please replace the XXX value correctly.
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false --conf
XXX hadoopexam.jar

Answer:

Solution
"spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Here, --conf is used to pass the Spark-related configs that the application needs at runtime, such as
a specific property (e.g. executor memory), or to override a default property that is set in
spark-defaults.conf.
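The same options can also be set from application code rather than on the command line. Below is a
minimal sketch, assuming a standalone Scala application (the object name MyApp is illustrative, not
part of the scenario):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Equivalent of --conf "spark.executor.extraJavaOptions=..." on the spark-submit command line
    val conf = new SparkConf()
      .setAppName("My app")
      .setMaster("local[4]")
      .set("spark.eventLog.enabled", "false")
      .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)
    // ... application logic ...
    sc.stop()
  }
}

Properties set directly on SparkConf take the highest precedence, then flags passed to spark-submit,
then values in spark-defaults.conf.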

Discussions

Question 2

Problem Scenario 95 : You have to run your Spark application on YARN with the maximum heap size of
each executor set to 512MB, the number of processor cores to allocate on each executor set to 1, and
your main application requiring three input arguments V1 V2 V3.
Please replace XXX, YYY, ZZZ
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-
memory 512m XXX YYY lib/hadoopexam.jar ZZZ

Answer:

Solution
--executor-memory 512m
--executor-cores 1
V1 V2 V3
spark-submit options on YARN:

--archives : Comma-separated list of archives to be extracted into the working directory of each
executor. The path must be globally visible inside your cluster; see Advanced Dependency
Management.
--executor-cores : Number of processor cores to allocate on each executor. Alternatively, you can use
the spark.executor.cores property.
--executor-memory : Maximum heap size to allocate to each executor. Alternatively, you can use the
spark.executor.memory property.
--num-executors : Total number of YARN containers to allocate for this application. Alternatively, you
can use the spark.executor.instances property.
--queue : YARN queue to submit to. The default queue is named "default".
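Substituting the values above into the given command, the completed submission would be:

./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 \
--driver-memory 512m --executor-memory 512m --executor-cores 1 lib/hadoopexam.jar V1 V2 V3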

Discussions

Question 3

Problem Scenario 94 : You have to run your Spark application on YARN with 20GB of memory per
executor and the number of executors set to 50. Please replace XXX, YYY, ZZZ
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class com.hadoopexam.MyTask \
XXX \
--deploy-mode cluster \ # can be client for client mode
YYY \
ZZZ \
/path/to/hadoopexam.jar \
1000

Answer:

Solution
--master yarn
--executor-memory 20G
--num-executors 50
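For reference, the completed submission looks like the following (the HADOOP_CONF_DIR value is an
assumption; point it at your cluster's Hadoop client configuration directory, e.g. /etc/hadoop/conf on
a typical Cloudera installation):

export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit \
--class com.hadoopexam.MyTask \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/hadoopexam.jar \
1000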

Discussions

Question 4

Problem Scenario 93 : You have to run your Spark application locally with 8 threads, i.e. locally on 8
cores. Replace XXX with the correct value.
spark-submit --class com.hadoopexam.MyTask XXX \
--deploy-mode cluster \
$SPARK_HOME/lib/hadoopexam.jar 10

Answer:

Solution
--master local[8]

Master URL - Meaning
local : Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your
machine).
local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be
whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your
cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use
mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to
connect to the MesosClusterDispatcher.
yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode.
The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
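Substituting the answer into the question's command gives:

spark-submit --class com.hadoopexam.MyTask --master local[8] \
--deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10

Note that spark-submit does not actually accept --deploy-mode cluster together with a local[...]
master; in local mode the driver always runs in the submitting JVM, so in practice the deploy-mode
flag would be dropped or set to client.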

Discussions

Question 5

Problem Scenario 92 : You have been given a Spark Scala application, which is bundled in a jar named
hadoopexam.jar.
Your application class name is com.hadoopexam.MyTask
You want the submission to launch the driver on one of the cluster nodes.
Please complete the following command to submit the application.
spark-submit XXX --master yarn \
YYY $SPARK_HOME/lib/hadoopexam.jar 10

Answer:

Solution
--class com.hadoopexam.MyTask
--deploy-mode cluster
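In YARN cluster mode the driver runs inside the ApplicationMaster on a cluster node, which is exactly
what the scenario asks for, so the completed command is:

spark-submit --class com.hadoopexam.MyTask --master yarn \
--deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10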

Discussions

Question 6

Problem Scenario 91 : You have been given data in json format as below.
{"first_name":"Ankit", "last_name":"Jain"}
{"first_name":"Amir", "last_name":"Khan"}
{"first_name":"Rajesh", "last_name":"Khanna"}
{"first_name":"Priynka", "last_name":"Chopra"}
{"first_name":"Kareena", "last_name":"Kapoor"}
{"first_name":"Lokesh", "last_name":"Yadav"}
Do the following activity:
1. Create an employee.json file locally.
2. Load this file onto HDFS.
3. Register this data as a temp table in Spark using Python.
4. Write a select query and print this data.
5. Now save this selected data back in JSON format.

Answer:

Create the employee.json file locally:
vi employee.json (press insert), paste the content.
Upload this file to HDFS (default location):
hadoop fs -put employee.json
Now in the Spark shell:
val employee = sqlContext.read.json("/user/cloudera/employee.json")
employee.write.parquet("employee.parquet")
val parq_data = sqlContext.read.parquet("employee.parquet")
parq_data.registerTempTable("employee")
val allemployee = sqlContext.sql("SELECT * FROM employee")
allemployee.show()
import org.apache.spark.sql.SaveMode
employee.write.format("orc").saveAsTable("product_orc_table")
// Change the compression codec.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
employee.write.mode(SaveMode.Overwrite).parquet("employee.parquet")
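The last step of the scenario asks for the selected data to be written back in JSON format, which the
answer above does not show explicitly. A minimal sketch of that step (the output path is illustrative):

val selected = sqlContext.sql("SELECT first_name, last_name FROM employee")
// DataFrameWriter.json writes one JSON object per line, mirroring the input format
selected.write.mode(SaveMode.Overwrite).json("employee_out.json")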

Discussions

Question 7

Problem Scenario 90 : You have been given below two files
course.txt
id,course
1,Hadoop
2,Spark
3,HBase
fee.txt
id,fee
2,3900
3,4200
4,2900
Accomplish the following activities.
1. Select all the courses and their fees, whether or not a fee is listed.
2. Select all the available fees and their respective courses. Even if a course does not exist, still list
the fee.
3. Select all the courses and their fees, whether or not a fee is listed, but ignore records having a
null fee.

Answer:

hdfs dfs -mkdir sparksql4
hdfs dfs -put course.txt sparksql4/
hdfs dfs -put fee.txt sparksql4/
Now in spark shell
// load the data into a new RDD
val course = sc.textFile("sparksql4/course.txt")
val fee = sc.textFile("sparksql4/fee.txt")
// Return the first element in this RDD
course.first()
fee.first()

// define the schema using case classes
// (if the files include the header row, filter it out before converting fields to Int)
case class Course(id: Integer, name: String)
case class Fee(id: Integer, fee: Integer)
// create RDDs of Course and Fee objects
val courseRDD = course.map(_.split(",")).map(c => Course(c(0).toInt, c(1)))
val feeRDD = fee.map(_.split(",")).map(c => Fee(c(0).toInt, c(1).toInt))
courseRDD.first()
courseRDD.count()
feeRDD.first()
feeRDD.count()
// change the RDDs of case-class objects to DataFrames
val courseDF = courseRDD.toDF()
val feeDF = feeRDD.toDF()
// register the DataFrames as temp tables
courseDF.registerTempTable("course")
feeDF.registerTempTable("fee")
// Select data from the tables
val results = sqlContext.sql("""SELECT * FROM course""")
results.show()
val results = sqlContext.sql("""SELECT * FROM fee""")
results.show()
val results = sqlContext.sql("""SELECT * FROM course LEFT JOIN fee ON course.id = fee.id""")
results.show()
val results = sqlContext.sql("""SELECT * FROM course RIGHT JOIN fee ON course.id = fee.id""")
results.show()
// Left join, ignoring records where the fee is null
val results = sqlContext.sql("""SELECT * FROM course LEFT JOIN fee ON course.id = fee.id WHERE
fee.id IS NOT NULL""")
results.show()
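The same joins can also be expressed with the DataFrame API instead of SQL text. A minimal sketch
using the courseDF and feeDF DataFrames built above (join types as named in Spark 1.x):

// Courses with their fee, whether or not a fee is listed
courseDF.join(feeDF, courseDF("id") === feeDF("id"), "left_outer").show()
// Fees with their course, even when the course is missing
courseDF.join(feeDF, courseDF("id") === feeDF("id"), "right_outer").show()
// Left join, dropping rows where no fee was found
courseDF.join(feeDF, courseDF("id") === feeDF("id"), "left_outer")
  .filter(feeDF("id").isNotNull).show()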

Discussions

Question 8

Problem Scenario 89 : You have been given below patient data in csv format,
patientID,name,dateOfBirth,lastVisitDate
1001,Ah Teck,1991-12-31,2012-01-20
1002,Kumar,2011-10-29,2012-09-20
1003,Ali,2011-01-30,2012-10-21
Accomplish the following activities.
1. Find all the patients whose lastVisitDate is between the current time and '2012-09-15'
2. Find all the patients who were born in 2011
3. Find the age of all patients
4. List patients whose last visit was more than 60 days ago
5. Select patients who are 18 years old or younger

Answer:

hdfs dfs -mkdir sparksql3
hdfs dfs -put patients.csv sparksql3/
Now in spark shell
// SQLContext entry point for working with structured data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// load the data into a new RDD
val patients = sc.textFile("sparksql3/patients.csv")
// Return the first element in this RDD
patients.first()
// define the schema using a case class
case class Patient(patientID: Integer, name: String, dateOfBirth: String, lastVisitDate: String)
// create an RDD of Patient objects
val patRDD = patients.map(_.split(",")).map(p => Patient(p(0).toInt, p(1), p(2), p(3)))
patRDD.first()
patRDD.count()
// change the RDD of Patient objects to a DataFrame
val patDF = patRDD.toDF()
// register the DataFrame as a temp table
patDF.registerTempTable("patients")
// Select data from the table
val results = sqlContext.sql("""SELECT * FROM patients""")
// display dataframe in a tabular format
results.show()
//Find all the patients whose lastVisitDate between current time and '2012-09-15'
val results = sqlContext.sql("""SELECT * FROM patients WHERE
TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP)) BETWEEN '2012-09-15'
AND current_timestamp() ORDER BY lastVisitDate""")
results.show()
// Find all the patients who were born in 2011
val results = sqlContext.sql("""SELECT * FROM patients WHERE
YEAR(TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) = 2011""")
results.show()
// Find the age of all patients
val results = sqlContext.sql("""SELECT name, dateOfBirth,
datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS
TIMESTAMP)))/365 AS age
FROM patients""")
results.show()
// List patients whose last visit was more than 60 days ago
val results = sqlContext.sql("""SELECT name, lastVisitDate FROM patients WHERE
datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS
TIMESTAMP))) > 60""")
results.show()
// Select patients who are 18 years old or younger. In MySQL-style SQL this would be:
// SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS
// TIMESTAMP)) > DATE_SUB(current_date(), INTERVAL 18 YEAR);
// Spark SQL's DATE_SUB takes a number of days, so 18 years is approximated as 18*365 days.
val results = sqlContext.sql("""SELECT * FROM patients WHERE
TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)) >
DATE_SUB(current_date(), 18*365)""")
results.show()
val results = sqlContext.sql("""SELECT DATE_SUB(current_date(), 18*365) FROM patients""")
results.show()
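The same date filters can also be written with the DataFrame API using the functions in
org.apache.spark.sql.functions; a minimal sketch against the patDF DataFrame built above (column
names follow the Patient case class):

import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

// Patients whose last visit was more than 60 days ago, without writing SQL text
patDF.filter(datediff(current_date(), to_date(col("lastVisitDate"))) > 60)
  .select("name", "lastVisitDate")
  .show()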

Discussions

Question 9

Problem Scenario 88 : You have been given below three files
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution.
1. It is possible that the same product can be supplied by multiple suppliers. Find each product and
its price according to each supplier.
2. Find all the supplier names who are supplying 'Pencil 3B'.
3. Find all the products which are supplied by ABC Traders.

Answer:

These queries assume the products, suppliers and products_suppliers temp tables have already been
registered, as shown in Problem Scenario 87 below.
1. It is possible that the same product can be supplied by multiple suppliers. Find each product and
its price according to each supplier.
val results = sqlContext.sql("""SELECT products.name AS product_name, price, suppliers.name AS
supplier_name
FROM products_suppliers
JOIN products ON products_suppliers.productID = products.productID
JOIN suppliers ON products_suppliers.supplierID = suppliers.supplierID""")
results.show()
2. Find all the supplier names who are supplying 'Pencil 3B'.
val results = sqlContext.sql("""SELECT p.name AS product_name, s.name AS supplier_name
FROM products_suppliers AS ps
JOIN products AS p ON ps.productID = p.productID
JOIN suppliers AS s ON ps.supplierID = s.supplierID
WHERE p.name = 'Pencil 3B'""")
results.show()
3. Find all the products which are supplied by ABC Traders.
val results = sqlContext.sql("""SELECT p.name AS product_name, s.name AS supplier_name
FROM products AS p, products_suppliers AS ps, suppliers AS s
WHERE p.productID = ps.productID AND ps.supplierID = s.supplierID
AND s.name = 'ABC Traders'""")
results.show()

Discussions

Question 10

Problem Scenario 87 : You have been given below three files
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution.
Select each product, its price and its supplier name where the product price is less than 0.6, using Spark SQL.

Answer:

hdfs dfs -mkdir sparksql2
hdfs dfs -put product.csv sparksql2/
hdfs dfs -put supplier.csv sparksql2/
hdfs dfs -put products_suppliers.csv sparksql2/
Now in spark shell
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// load the data into a new RDD
val products = sc.textFile("sparksql2/product.csv")
val supplier = sc.textFile("sparksql2/supplier.csv")
val prdsup = sc.textFile("sparksql2/products_suppliers.csv")
// Return the first element in this RDD
products.first()
supplier.first()
prdsup.first()
// define the schema using case classes
// (if the files include the header row, filter it out before converting fields to Int)
case class Product(productID: Integer, productCode: String, name: String, quantity: Integer,
price: Float, supplierid: Integer)
case class Suplier(supplierid: Integer, name: String, phone: String)
case class PRDSUP(productID: Integer, supplierID: Integer)
// create RDDs of Product, Suplier and PRDSUP objects
val prdRDD = products.map(_.split(",")).map(p =>
Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt))
val supRDD = supplier.map(_.split(",")).map(p => Suplier(p(0).toInt, p(1), p(2)))
val prdsupRDD = prdsup.map(_.split(",")).map(p => PRDSUP(p(0).toInt, p(1).toInt))
prdRDD.first()
prdRDD.count()
supRDD.first()
supRDD.count()
prdsupRDD.first()
prdsupRDD.count()
// change the RDDs of case-class objects to DataFrames
val prdDF = prdRDD.toDF()
val supDF = supRDD.toDF()
val prdsupDF = prdsupRDD.toDF()
// register the DataFrames as temp tables
prdDF.registerTempTable("products")
supDF.registerTempTable("suppliers")
prdsupDF.registerTempTable("products_suppliers")
// Select each product, its price and its supplier name where the product price is less than 0.6
val results = sqlContext.sql("""SELECT products.name, price, suppliers.name AS sup_name FROM
products JOIN suppliers ON products.supplierid = suppliers.supplierid WHERE price < 0.6""")
results.show()
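The same result can also be produced without SQL text, using the DataFrame API on the DataFrames
built above (a minimal sketch):

// Products cheaper than 0.6 together with their supplier name
prdDF.join(supDF, prdDF("supplierid") === supDF("supplierid"))
  .filter(prdDF("price") < 0.6)
  .select(prdDF("name"), prdDF("price"), supDF("name").alias("sup_name"))
  .show()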

Discussions