Key Big Data Terms You Should Know

Below is a list of key Big Data terms you should know, each with a very brief explanation in simple language. Hope you find it useful.

1. Hadoop: System for processing very large data sets
2. HDFS or Hadoop Distributed File System: For storage of large volumes of data (key elements – Namenode, Datanodes and Secondary Namenode; the Tasktracker belongs to MapReduce, not HDFS)
3. MapReduce: Think of it as the assembly-level language of distributed computing. It is the programming model used for computation in Hadoop
4. Pig: Developed by Yahoo. It is a higher level language than MapReduce
5. Hive: Higher level language developed by Facebook with SQL like syntax
6. Apache HBase: For real-time access to Hadoop data
7. Accumulo: Similar to HBase (both are modeled on Google's Bigtable), with extra features like cell level security
8. Avro: Data serialization format (comparable to Protocol Buffers, Thrift etc.)
9. Apache ZooKeeper: Distributed co-ordination system
10. HCatalog: Table and metadata layer that makes the Hive meta store available to Pig and MapReduce
11. Oozie: Scheduling system developed by Yahoo
12. Flume: Log aggregation system
13. Whirr: For automating the deployment of Hadoop clusters on cloud services
14. Sqoop: For transferring structured data between relational databases and Hadoop
15. Mahout: Machine learning on top of MapReduce
16. Bigtop: Packages and tests multiple Hadoop sub-systems so that they work together as a whole
17. Crunch: Java API that runs on top of MapReduce and simplifies tedious tasks like joins and data aggregation
18. Giraph: Used for large scale distributed graph processing
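
To make item 3 a little more concrete, here is a minimal sketch of the MapReduce idea in plain Python. It does not use Hadoop at all; it only mimics the three steps (map, shuffle, reduce) on a single machine with the classic word count example. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data pipelines"]))
```

In a real Hadoop job the mappers and reducers run on different machines and the framework handles the shuffle step, but the shape of the computation is the same. Tools like Pig, Hive and Crunch exist largely so that you do not have to write these steps by hand.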

Also, embedded below is an excellent TechTalk by Jakob Homan of LinkedIn on the subject explaining these tech terms.