Friday 22 August 2014

BIG DATA...What's The Big Deal?

Big Data...the most talked-about term in the IT industry these days. What is it? What's in it for a software professional? Have a glance here!

The Definition

Gartner defines the term as "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization". Confused? Well, let's understand it.

High volumes of data

Big data is the science of processing information when the data comes in huge volumes, on the order of petabytes or exabytes (1 petabyte = 1024 terabytes and 1 exabyte = 1024 petabytes). In other words, one exabyte is roughly 100 crore (about a billion) GB of data. As individuals we may only ever deal with terabytes (TB) of data; at the enterprise level, data sizes range from petabytes to exabytes. Wondering which enterprise generates this much data? Think about Facebook. Millions of people keep posting to their timelines at any point of time, anything from a single comment to an album upload of a personal event. Put together, all this data places a huge load on Facebook's servers. Why should Facebook worry? Just keep huge clusters available to store all this data? No, it's not just about storing data. Facebook has to analyze this data and find trends: what people are talking about, what the most discussed news or event is, and so on. Business decisions at the organization level are taken based on such analysis. If you think about remote sensing data, the genome code, etc., the situation is even more complex.
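
To put those units in perspective, here is a tiny Python sketch of the same arithmetic (purely illustrative):

  # Binary units: 1 GB = 2**30 bytes, and each step up multiplies by 1024.
  GB = 2 ** 30
  TB = 1024 * GB
  PB = 1024 * TB
  EB = 1024 * PB

  print(PB // TB)   # 1024 terabytes in a petabyte
  print(EB // PB)   # 1024 petabytes in an exabyte
  print(EB // GB)   # 1073741824 -> roughly 100 crore (about a billion) GB in an exabyte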

DBMS/RDBMS systems

Existing data solutions can process only structured/organised information, so new techniques are needed to process unstructured data. Also, processing in a DBMS/RDBMS happens only when the complete set of data is available. In this competitive world, business decisions have to be taken so quickly that the data must be processed while it is still flowing in.
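
To make the contrast concrete, here is a minimal Python sketch (the file name is made up) of batch-style processing, which waits for the complete data set, versus stream-style processing, which handles records while they flow in:

  # Batch style (DBMS/RDBMS mindset): the complete data set must be available first.
  def batch_total(path):
      rows = open(path).readlines()        # load everything into memory before processing
      return sum(len(row) for row in rows)

  # Streaming style (big data mindset): process each record as it arrives.
  def streaming_total(path):
      total = 0
      for row in open(path):               # one record at a time, nothing held back
          total += len(row)
      return total

  print(streaming_total("clickstream.log"))   # "clickstream.log" is just an example name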

Big Data

Hence big data works on huge sets of structured/unstructured data, processing it while it is flowing in if required, and provides the analysis and/or reports needed. The different phases of data processing in big data include capture (collecting the data), curation (extracting the important data and preserving it), storage (saving the data to disk/a cluster), search (finding the required patterns), sharing, and analysis (generating the data trends).
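
The following toy Python sketch (the records and field names are made up) walks through those phases end to end, from capture through curation, storage, search and analysis:

  import json

  # Capture: raw records as they arrive (one of them is deliberately noisy).
  raw_events = [
      '{"user": "a", "comment": "big data is trending"}',
      'corrupted line that cannot be parsed',
      '{"user": "b", "comment": "learning hadoop"}',
  ]

  # Curation: keep only the records that parse cleanly and preserve the useful fields.
  def curate(lines):
      for line in lines:
          try:
              yield json.loads(line)
          except ValueError:
              continue

  store = list(curate(raw_events))                              # storage (kept in memory here)
  matches = [e for e in store if "hadoop" in e["comment"]]      # search for a required pattern
  print(len(store), "stored,", len(matches), "mention hadoop")  # analysis: simple trend counts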

Some big facts ...

  • 90% of the data in the world is unstructured
  • Decoding the human genome originally took 10 years; using big data, it can be processed in one week
  • Data production in 2020 is expected to be 44 times (4,400%) what it was in 2009
  • In a day,
    1. The largest telescope, expected to be operational by 2024, will gather 14 exabytes of data
    2. Surveillance cameras around the globe collect 413 petabytes of video data
    3. 46 petabytes of data flow through AT&T's networks
    4. Google processes 24 petabytes
    5. Walmart generates 2.5 petabytes of data
    6. Facebook processes 0.5 petabytes

Technical Frameworks

There are many big data frameworks available, and Hadoop is the most famous of them. It is an open source framework, widely used across different products and in the enterprise editions of different companies. Other frameworks include Spark, Storm, HPCC, etc.
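
As a small taste of the MapReduce style that Hadoop popularised, here is the classic word count written as a Hadoop Streaming script in Python; the streaming jar path and the input/output locations would depend entirely on your own cluster setup:

  import sys

  def mapper():
      # Map phase: emit "word<TAB>1" for every word in the input split.
      for line in sys.stdin:
          for word in line.split():
              print(word + "\t1")

  def reducer():
      # Reduce phase: input arrives sorted by key, so counts can be summed per word in one pass.
      current, count = None, 0
      for line in sys.stdin:
          word, value = line.rstrip("\n").split("\t", 1)
          if word != current:
              if current is not None:
                  print(current + "\t" + str(count))
              current, count = word, 0
          count += int(value)
      if current is not None:
          print(current + "\t" + str(count))

  if __name__ == "__main__":
      # Run as "python wordcount.py map" for the mapper, "python wordcount.py reduce" for the reducer.
      if sys.argv[1:] == ["map"]:
          mapper()
      else:
          reducer()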

Programming Languages used in Big Data

One of the most widely used languages for data processing is R. Python also plays a key role as a data processing language, and NoSQL databases are widely used alongside them for storing and querying big data.
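
As a tiny, made-up illustration of everyday data processing in Python, here is a quick trend count over a handful of comments, in the spirit of the Facebook example earlier:

  from collections import Counter

  comments = ["hadoop rocks", "big data is everywhere", "hadoop at big scale"]  # sample data
  topics = Counter(word for c in comments for word in c.split())
  print(topics.most_common(3))   # the most discussed terms, e.g. [('hadoop', 2), ('big', 2), ...]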

Enterprise solutions

Most of the global IT giants have seized this opportunity and turned big data science into solutions that fulfill the needs of different organizations, based on their data analysis models. A few such solutions are:

  1. Google BigQuery
  2. SAP Hana
  3. IBM Infosphere 
  4. Microsoft Azure
  5. Oracle Big Data Appliance
  6. EMC Alpine, Pivotal
  7. HP Vertica, HAVEn
and many more!


Job Opportunities

Being an emerging technology, big data has a lot of scope to create job opportunities in the IT space. The roles include Data Engineer, Data Scientist, Data Analytics Engineer, Data Architect, Data Warehouse Analyst, Business Intelligence Analyst, Data Modeler, etc.

Certifications

Here is a list of the top 10 certifications from leading institutions:

  1. Certified Analytics Professional - The Institute for Operations Research and the Management Sciences (INFORMS)
  2. Certification of Professional Achievement in Data Sciences - Columbia University
  3. Certificate in Engineering Excellence Big Data Analytics and Optimization - International School of Engineering (INSOFE)
  4. Mining Massive Data Sets Graduate Certificate - Stanford University
  5. Certificate in Analytics: Optimizing Big Data - University of Delaware
  6. EMC Data Scientist Associate (EMCDSA) - EMC
  7. Cloudera Certified Professional: Data Scientist - Cloudera Inc
  8. Cloudera Certified Developer for Apache Hadoop - Cloudera Inc
  9. Cloudera Certified Administrator for Apache Hadoop - Cloudera Inc
  10. Revolution R Enterprise Professional - Revolution Analytics

Other Certifications

1. HP Vertica Certification
http://www.vertica.com/customer-experience/certification/client-certification

2. Oracle Business Intelligence Foundation Suite 11g Certified Implementation Specialist
http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=654&get_params=p_id:166

3. SAS Certified Predictive Modeler using SAS Enterprise Miner 7 Credential
http://support.sas.com/certify/creds/pm.html

For more info ...

Big Data Wiki
https://en.wikipedia.org/wiki/Big_data

R Data Processing Language
www.r-project.org

Apache Hadoop
http://hadoop.apache.org



