You want to ingest log files into HDFS. Which tool would you use?
Which of the following tools was designed to import data from a relational database into HDFS?
Which HDFS command copies an HDFS file named foo to the local filesystem as localFoo?
Which HDFS command displays the contents of the file x in the user's HDFS home directory?
What is a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
SequenceFile is a flat file consisting of binary key/value pairs. It supports three formats:
Uncompressed key/value records.
Record-compressed key/value records: only the values are compressed.
Block-compressed key/value records: both keys and values are collected in blocks separately and compressed. The size of the block is configurable.
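To make the flat key/value layout concrete, here is a toy Java sketch that writes and then reads length-prefixed key/value records in memory. It illustrates only the general idea of a flat binary key/value file; it is not Hadoop's actual SequenceFile on-disk format, which adds a header, sync markers, and the compression variants described above.

```java
import java.io.*;

public class FlatKeyValueDemo {
    // Write one key/value record, each part as a length-prefixed UTF-8 string.
    static void writeRecord(DataOutputStream out, String key, String value) throws IOException {
        out.writeUTF(key);
        out.writeUTF(value);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeRecord(out, "2024-01-01", "INFO start");
            writeRecord(out, "2024-01-02", "WARN disk low");
        }
        // Records come back in the order they were written.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        while (in.available() > 0) {
            System.out.println(in.readUTF() + " -> " + in.readUTF());
        }
    }
}
```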
Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. How many Mappers will run?
Each file is split in two because the block size (64 MB) is less than the file size (100 MB), so 200 mappers will run.
If you're not compressing the files, then Hadoop will process your large files (say 10 GB) with a number of mappers related to the block size of the file.
Say your block size is 64 MB; then you will have ~160 mappers processing this 10 GB file (160 * 64 MB ≈ 10 GB). Depending on how CPU-intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256, or 512 MB; the actual size depends on how you intend to process the data).
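The split arithmetic above can be sketched in plain Java. The helper name is illustrative; Hadoop's actual split computation in FileInputFormat also honors configurable minimum and maximum split sizes, but with the defaults it reduces to one split per full or partial block:

```java
public class SplitCount {
    // Number of input splits (and thus map tasks) for one file:
    // one split per full or partial block (ceiling division).
    static long splitsForFile(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 100 files of 100 MB with a 64 MB block size -> 2 splits each -> 200 mappers.
        long mappers = 100 * splitsForFile(100 * mb, 64 * mb);
        System.out.println(mappers); // prints 200
        // One 10 GB file with 64 MB blocks -> 160 splits.
        System.out.println(splitsForFile(10L * 1024 * mb, 64 * mb)); // prints 160
    }
}
```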
You want to run Hadoop jobs on your development workstation for testing before you submit them
to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a
production cluster while using a single machine?
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce, but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
Reference: Hadoop binary files processing introduced by image duplicates finder
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
A. When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value, and when the reduce operation is both commutative and associative.
B. When the signature of the reduce method matches the signature of the combine method.
C. Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
D. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
E. Never. Combiners and reducers must be implemented separately because they serve different purposes.
You can use your reducer code as a combiner if the operation performed is commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are
combiners? When should I use a combiner in my MapReduce Job?
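A minimal pure-Java sketch (no Hadoop dependency; the names are illustrative) of why this works: for a commutative and associative operation like summing, reducing the partial outputs of a combiner gives the same result as reducing all values at once.

```java
import java.util.*;

public class CombinerDemo {
    // A sum "reducer": the operation is commutative and associative.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, 1, 4, 1, 5, 9);

        // Direct reduce over all values for one key.
        int direct = reduce(values);

        // Run the same function as a combiner on two map-side partitions,
        // then reduce the partial results.
        int partial1 = reduce(values.subList(0, 3));
        int partial2 = reduce(values.subList(3, 6));
        int combined = reduce(Arrays.asList(partial1, partial2));

        System.out.println(direct + " == " + combined); // both are 23
    }
}
```

An average, by contrast, cannot be composed this way (the mean of partial means is not the overall mean in general), which is why a mean-computing reducer cannot be reused unchanged as a combiner.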
Which best describes what the map method accepts and emits?
A. It accepts a single key-value pair as input and emits a single key and a list of corresponding values as output.
B. It accepts a single key-value pair as input and can emit only one key-value pair as output.
C. It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The
transformed intermediate records need not be of the same type as the input records. A given input
pair may map to zero or many output pairs.
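A minimal pure-Java sketch of a word-count-style map function (no Hadoop dependency; in a real job you would extend org.apache.hadoop.mapreduce.Mapper and emit pairs via context.write rather than returning a list). It shows a single input pair mapping to zero or many output pairs:

```java
import java.util.*;

public class MapDemo {
    // Word-count style map: one (offset, line) input pair -> one (word, 1)
    // output pair per token. An empty line emits nothing.
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be").size()); // 6 pairs
        System.out.println(map(24L, "").size());                  // 0 pairs
    }
}
```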