What is big data?
“Every day, we create 2.5 quintillion (10^18) bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.”
Types of Big Data:
Key enablers for Big Data:
+ Increase in storage capabilities
+ Increase in processing power
+ Availability of data
+ Cheaper hardware
+ Better value-for-money for businesses
XXX XXXX : Volume
XXXXXXXXXXX are acquiring XXXX large XXXXXX XX data through a variety of sources
XXXX XXXXXXXX XX use:
- Sentiment XXXXXXXX – Twitter XXXX-Terabytes of XXXXXX XXX XXXXXXX each day which XXX XXXXXX XXX XXXXXXXX product XXXXXXXXX XXXXXXXX
- Predict power XXXXXXXXXXX - Convert XXXXXXXX XX annual XXXXX XXXXXXXX XXXX XXXXXXXXXXXXX power consumption XXX every XXXX / minute.
XXX data : Velocity
For time-sensitive XXXXXXXXX XXXX as catching XXXXX, XXXXXXXXXX accidents, giving life-saving medication XXX. XXX data XXXX be used as it streams into an enterprise in order XX maximize XXX XXXXX.
XXXX examples XX use:
- XXXXXXXXXX millions XX credit card XXXXXXXXXXXX XXXX day to identify XXXXXXXXX fraud
- Analyze XXXXXXXX XX daily call detail XXXXXXX in real time to predict customer churn XXXXXX
- XX XXX, analyze blood XXXXXXXXX / ECG XXXXXXXX in real time XX deliver life-saving medication
Big XXXX : XXXXXXX
XXX data XXX XX XX XXX type - XXXXXXXXXX XXXXXXXXXXXXXXX XXXX such as text, sensor XXXX, XXXXX, video, click XXXXXXX, XXX XXXXX and more. XXX insights XXXXXXXX when XXXXXXXXX these XXXX types XXXXXXXX.
XXXX examples XX use:
- Monitor XXXX XXXXX feeds XXXX surveillance XXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXX
- Utilize image, XXXXX, XXXXX XXX web XXXXXXXXXXX about a customer to give XXXXXX XXXXXXX XXXXX training, safety XXXX XXX XXXXXXXXXXXXXXX.
XXX data : XXXXXXXX
Accuracy is a XXX concern in XXX XXXX. There XX no easy way to segregate good data from XXX. Some XXXXXXXX:
- XXXXX thousands XX reviews XX hotels XXXXX XXXX XXXXXXXXXXXX XXX XXXXX ones are XXX?
- How to XXXX out the XXXXX from thousands XX XXXXXXXXXXXXXX
- XXX to identify a rumor XXXX a XXXXXXXXXXXXXXXXXXXXX?
XXXXXXXXX of Big Data – XXXXXXXXXXX of Algorithms for Statistical XXXXXXXXXXXX
Some very useful tools of statistical XXXXXXXX XXX presently XXXXXXXXXXXXXX XXXX XXXXXXXX XXXXXXXXXX XXX e.g. XX median is to XX computed using the XXXXXX bubble sort XXXXXXXXX it would take a very XXXXX XXXXXX XX XXXX - algorithmic complexity = X(N^X). Statistical methods XXXXX use XXXXXX or other XXXXXXXXX as their XXXX will XXX face XXXX XXXXXXXXX
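The scalability point can be illustrated with a minimal Python sketch. It compares a median computed through bubble sort (quadratic time in the worst case) with a selection-based median (quickselect, linear time on average). The function names are illustrative assumptions and do not come from the slides, and quickselect is only one possible alternative to a full sort.

```python
import random


def bubble_sort_median(values):
    """Median via bubble sort: worst-case time grows as N^2."""
    data = list(values)
    n = len(data)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                swapped = True
        if not swapped:  # already sorted, stop early
            break
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2


def quickselect(values, k, rng):
    """Return the k-th smallest value (0-based) without a full sort."""
    data = list(values)
    while True:
        pivot = rng.choice(data)
        lows = [x for x in data if x < pivot]
        pivots = [x for x in data if x == pivot]
        if k < len(lows):
            data = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            data = [x for x in data if x > pivot]


def quickselect_median(values, seed=0):
    """Median via selection: expected time grows linearly with N."""
    rng = random.Random(seed)  # fixed seed only to make runs repeatable
    n = len(values)
    if n % 2:
        return quickselect(values, n // 2, rng)
    lower = quickselect(values, n // 2 - 1, rng)
    upper = quickselect(values, n // 2, rng)
    return (lower + upper) / 2
```

Both functions assume a non-empty list of numbers. On a few thousand values the difference is already noticeable; on data too large for memory, neither version applies directly and a streaming or approximate quantile method would be needed.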
Challenge of XXX XXXX – Non-XXXXXXXXXXXXX
- XXXXX XXX data volume XX large it is XXX collected for a specific XXXXXXX
- No random sampling schemes are present
- Inferential XXXXXX of statistics (Frequentist / Bayesian) XXXXXXXXX XXXXXX a XXXXXX sample XXXXX XXXX a population
- May give inaccurate results when used with XXX-random samples
- XXXX XX for statistical methods XXXXX XXX XXXXXX XXX-random samples.
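A small simulation can make the non-random sampling problem concrete. The sketch below is an illustration under stated assumptions: the population is synthetic (normally distributed values), and the selection mechanism, in which larger values are more likely to be recorded, is hypothetical. None of the names or numbers come from the slides.

```python
import random
import statistics


def compare_sampling(seed=42, population_size=100_000, sample_size=1_000):
    """Compare the mean from a random sample with one from a biased sample."""
    rng = random.Random(seed)

    # Synthetic population (assumption): values centred on 50, spread 10.
    population = [rng.gauss(50.0, 10.0) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # Random sample: every unit has the same chance of being selected.
    random_sample = rng.sample(population, sample_size)

    # Non-random sample (hypothetical mechanism): the chance of being
    # recorded grows with the square of the value itself.
    weights = [max(x, 0.0) ** 2 for x in population]
    biased_sample = rng.choices(population, weights=weights, k=sample_size)

    return (
        true_mean,
        statistics.fmean(random_sample),
        statistics.fmean(biased_sample),
    )


if __name__ == "__main__":
    true_mean, random_mean, biased_mean = compare_sampling()
    print(f"population mean:    {true_mean:.2f}")
    print(f"random sample mean: {random_mean:.2f}")
    print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample mean stays close to the population mean, while the biased sample overestimates it even though both samples have the same size. Collecting more data through the same biased mechanism does not remove the error.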
XXXXXXXXX XX XXX Data – XXXXXXX Data
Is XXXXX a single XXXXXXXXXX or multiple populations in XXX XXXXXXX?
- XXXX XXXXXXXXXXX methods XXX devised for XXXXXXX inference XXX XXXXXXX population
- XX XXX XXXX is a mixture of XXXXXXXXXXXX from multiple XXXXXXXXXXXXX need XX “XXXXXXXX” XXX XXXXXX XX XXXXXXXXXXX XX XXXX have specific XXXXXXXX XXXXXXXXXX.
- XXXXXXXXXX XXXX as Flexible XXXXXXXXXX XXX XXXXXXXXXXXXXX, Machine learning algorithms like XXXX/ XXXXX attempts XXXXXX
- XXXX such XXXXXXX XXXXXX
Challenge XX Big XXXX – Real Time
- XXXXXXXXX data - focus XX on XXXXX
- XXXX time problems such as XXXXX XXXXXXXXX XXXX XXXXX analysis.
- Data XXXXXXX are analyzed “in XXXXXX” in “time windows” before being written XX XXXX.
- Analysis XX XXXX XXXXXX on XXXX XX XXXX XXXXXXXXX XXX hence not useful XXX XXXXX kind XX applications
- XXXXXXXXXXX methods XXXXXXXXX XXXX XXXX whole data XXX aim to give the “best XXXXXXXX”
- It XXX not XX XXXXXXXX to have “best XXXXXXXX” XX XXXXXXXXX only a part of XXX data.
- Trade-XXX between XXXXX XXX XXXXXXXX.
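The idea of analyzing a stream in "time windows" can be sketched as follows. This is a minimal illustration, assuming events arrive as (timestamp, value) pairs with non-decreasing timestamps. The class name `SlidingWindowMean` and its parameters are hypothetical and not taken from the slides.

```python
from collections import deque


class SlidingWindowMean:
    """Running mean over the events seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Record one event and return the mean of the current window.

        Assumes timestamps never decrease between calls.
        """
        self._events.append((timestamp, value))
        self._total += value

        # Drop events that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value

        return self._total / len(self._events)


if __name__ == "__main__":
    window = SlidingWindowMean(window_seconds=10)
    for ts, value in [(0, 10), (5, 20), (12, 30), (15, 40)]:
        print(ts, window.add(ts, value))
```

Each update touches only the events entering or leaving the window, so the answer is available immediately. The cost is that it reflects only part of the data, which is the speed versus accuracy trade-off noted above.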
XXXXXX Flu Trend
In XXXX, Google reported that by XXXXXXXXX flu-related search queries it XXX been able to XXXXXX the spread XX XXX flu XXXXXXXXXXXX and XXXX quickly than CDCP, USA.
In XXX XXXX, Nature XXXXXXXX XXXX Google flu-XXXXXX weren’XXXXXXXX XXX predicted XXXX than double XXX proportion XXXXXXXX XXXXXX for influenza-like XXXXXXXXX XXXX CDCP. XXXX XXXXXXX XXX XXXXXXXX XXXXXXXXX XXXXXX in search behaviour, XXXXXXXXX XX alternative sources of XXXXXXXXXXXXXX.
Challenge XX Big XXXX – Variety of XXXX
- Big XXXX consists of different XXXXX of data
- XXXX more XXXXXXX becoming internet XXXXXXX the XXXXXXX XXXXXX XX large
- Images, XXXXX, XXXX, Social XXXXX - Twitter, XXXXXXXX, XXXXXX XXX.
- XXX XXXXXXXXXX to XXX XXXXXXXXXXX is to arrive XX “good decisions” using all XXX XXXXXXXXXXX XXXX all XXX XXXXXXX
- Symbolic XXXX XXXXXXXX (SDA) attempts to XXXXXX this complex problem
- Present XXX methods are largely descriptive. XXX XXXXXXX XX XXXXXXXXXXX association XXX XXXXXXXXXXX for these XXXX XX data XXX required.
Statistics XX XXXXXXXXX?
Challenge XX Big XXXX – Data XXXXXXX
- Data Quality is a XXX concern
- Apart from XXXX, the XXXX can be XXXXXXXXXX
- Over XXXX XXXX definitions XX well as method of XXXXXXXXXXXXX XXXXXX
- For example, in credit XXXXXXX prediction studies XXX XXXXXXXXXXXXXX XXXX from an institution XXX change XXXX XXXXXXXXXXXXXXXXXXXX XXX XXXX’s XXXXXXXXX in XXXXX of XXXXXX loan.
- It may be interesting XX know XXXX XX what extent a XXXXXXXXXXXXXXX a loan is XXXXXXXX
- XXXXXX that XXX XXXXXXXX XXX have taken loans from XXXXXXXXXXXXXX XXXXXXXXXXX at XXXXXXXXX XXXXX
- The XXXXX XX XXXX XXXXXXX XXX XXX XXXX XXXXXX may XX XXXXX Bose, XXXX XXXX XXX Ajoy XX XXXX. XXX XXXX XXX know these are the same person?
- Statistical XXXXXXXX XXX be XXXXXXX in XXXXXXXX XXX building unique customer profiles
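The question of whether differently written names refer to the same person can be sketched with a simple string-similarity heuristic. This is only an illustration, not the statistical method the slide refers to. The names used in the usage example are hypothetical, the 0.85 threshold is an arbitrary assumption, and a real customer-matching system would also compare other fields such as address or date of birth.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the tokens
    so that word order ("Last, First" vs "First Last") is ignored."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(sorted(cleaned.split()))


def similarity(name_a, name_b):
    """Similarity score between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


def likely_same_person(name_a, name_b, threshold=0.85):
    """Flag a pair as a candidate match; the threshold is an assumption."""
    return similarity(name_a, name_b) >= threshold


if __name__ == "__main__":
    # Hypothetical name variants, for illustration only.
    pairs = [
        ("Anita Rao", "Rao, Anita"),
        ("Anita Rao", "Anitha Rao"),
        ("Anita Rao", "Vikram Shah"),
    ]
    for a, b in pairs:
        print(a, "|", b, "|", likely_same_person(a, b))
```

Pairs flagged this way are candidates for review rather than confirmed matches, since two different people can share a very similar name.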
XXXXXXXXX XX XXX XXXX – Protecting Privacy XXX XXXXXXXXXXXXXXX
XXX retail XXXXX XXXXXX in US could figure out that a teenager was pregnant even before her father knew (Source: Forbes, Feb 2012)
A British XXXX XX able to XXXXXXXX XXXXXXXXX money launderers who XXX XX XXXXXX to XXXXXXXXX. (XXXXXX: Superfreakonomics, Levitt & Dubner)
- Is it XXXXXXX XXX Target to guess XXX private XXXXXXXXXXX XX an individual XXXXXXXX?
- XXXX XX the XXXXXXXXX identifies a wrong person as a XXXXXXXXXXXXXX launderer? XXXX happens to his/her XXXXXXXXXX?
XXXXX p, Small N
XX genomic studies, XXX number XX XXXXXXXXX (p) XX XXXX large and XXXXX XXXXXXX XXX number of samples (N) by a XXXXXXXXX.
- We XXXX XX XXXXXX the dimension XX XXX XXXXXXX to XX able XX draw XXXXXXXXXX conclusions
- XXXXXXXXXXXX XXXXXXXX XXXXX XXXXXX heavily on information derived XXXX such XXXX.
Summary
Big XXXX poses XXX challenges XX XXXXXXXXXXXXX both in terms XX theory and applications
Some XX the XXXXXXXXXX XXXXXXX
- Scalability of XXXXXXXXXXX XXXXXXXXXXX XXXXXXX
- Non-XXXXXX XXXX
- Mixture data
- Real Time XXXXXXXX XX Streaming XXXX
- XXXXXXXXXXX Analysis XXXX XXXXXXXX kinds XX data
- XXXX Quality
- XXXXXXXXXX Privacy and XXXXXXXXXXXXXXX
- XXXX Dimensional XXXX