
What Is Data Science?

Mike Loukides

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Published by O’Reilly Media, Inc
...

Revision History for the :
See http://oreilly
...
csp?isbn=9781449327552 for release details
...
The future belongs to the companies and people that turn data
into products
What is data science?
Where data comes from
Working with data at scale
Making data tell its story
Data scientists


What is data science?

The future belongs to the companies and people that
turn data into products
We’ve all heard it: according to Hal Varian, statistics is the next sexy job
...
0, Tim O’Reilly said that “data is the next Intel
Inside
...

What is data science?
The web is full of “data-driven apps
...
There’s a database behind a web front end, and
middleware that talks to a number of other databases and data services (credit
card processing companies, banks, and so on)
...
” A data application acquires its value
from the data itself, and creates more data as a result
...
Data science enables the creation of data products
...
The
developers of CDDB realized that any CD had a unique signature, based on
the exact length (in samples) of each track on the CD
...
If you’ve ever used iTunes to rip a CD, you’ve taken
advantage of this database
...
If you have a
CD that’s not in the database (including a CD you’ve made yourself), you can
create an entry for an unknown album
...
Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be “data products”)
...
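The signature idea described above can be sketched in a few lines. This is only an illustration: hashing the track lengths with SHA-1 and keeping albums in an in-memory dictionary are assumptions made for the example, not the way CDDB actually computes or stores its disc identifiers, and the names disc_signature, register_album, and identify_album are hypothetical.

```python
import hashlib

def disc_signature(track_lengths_in_samples):
    """Derive a lookup key from the exact length of each track, in order."""
    joined = ",".join(str(n) for n in track_lengths_in_samples)
    return hashlib.sha1(joined.encode("ascii")).hexdigest()

# Hypothetical in-memory stand-in for the album database.
album_db = {}

def register_album(track_lengths, title):
    """Create an entry for an album the database has not seen before."""
    album_db[disc_signature(track_lengths)] = title

def identify_album(track_lengths):
    """Look the signature up; fall back to a placeholder when it is missing."""
    return album_db.get(disc_signature(track_lengths), "unknown album")

# Made-up track lengths, in samples.
register_album([10_584_000, 9_702_000, 12_348_000], "Example Album")
print(identify_album([10_584_000, 9_702_000, 12_348_000]))
print(identify_album([10_584_000, 9_702_001, 12_348_000]))
```

A one-sample difference in any track produces a different key, which is why the lookup needs no audio analysis at all.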

Google is a master at creating data products
...
Google’s PageRank algorithm was among
the first to use data outside of the page itself, in particular, the number of
links pointing to a page
...

• Spell checking isn’t a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate
...

• Speech recognition has always been a hard problem, and it remains difficult
...

• During the Swine Flu epidemic of 2009, Google was able to track the
progress of the epidemic by following searches for flu-related topics
...
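To make the idea of ranking a page by the links pointing to it concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production system; the damping factor of 0.85, the iteration count, and the four-page toy web are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best)
```

The page with the most inbound links ends up on top even though nothing about its own content was examined.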


Google isn’t the only company that knows how to use data
...
Amazon
saves your searches, correlates what you search for with what other users
search for, and uses it to create surprisingly appropriate recommendations
...
They come about because Amazon understands that
a book isn’t just a book, a camera isn’t just a camera, and a customer isn’t just
a customer; customers generate a trail of “data exhaust” that can be mined
and put to use, and a camera is a cloud of data that can be correlated with the
customers’ behavior, the data they leave every time they visit the site
...
Whether that data is search terms, voice
samples, or product reviews, the users are in a feedback loop in which they
contribute to the products they use
...
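One simple way to correlate what one user looks at with what other users look at is to count which items appear together. The sketch below is an assumption about how such a feature could be approached, not a description of Amazon's system; the sessions and the function names are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(sessions):
    """Count how often each pair of items shows up in the same session."""
    co = defaultdict(Counter)
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def recommend(co, item, k=2):
    """Suggest the k items most often seen alongside the given item."""
    return [other for other, _ in co[item].most_common(k)]

# Hypothetical "data exhaust": what four customers looked at in one visit.
sessions = [
    ["camera", "tripod", "memory card"],
    ["camera", "memory card"],
    ["camera", "tripod", "lens"],
    ["novel", "bookmark"],
]
co = build_cooccurrence(sessions)
print(recommend(co, "camera"))
```

Each new visit adds to the counts, so the users who consume the recommendations are also the ones improving them.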

In the last few years, there has been an explosion in the amount of data that’s
available
...
And it’s not just companies using their own data, or the data contributed by their users
...
“Data Mashups in R” analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff’s office, extracting addresses and using Yahoo to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and group them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors
...

Using data effectively requires something different from traditional statistics,
where actuaries in business suits perform arcane but fairly well-defined kinds
of analysis
...
We’re increasingly finding data in the wild, and data
scientists are involved with gathering data, massaging it into a tractable form,
making it tell its story, and presenting that story to others
...


Where data comes from
Data is everywhere: your government, your web server, your business partners,
even your body
...
At O’Reilly, we frequently
combine publishing industry data from Nielsen BookScan with our own sales
data, publicly available Amazon data, and even job data to see what’s happening in the publishing industry
...
Factual enlists users to update
and improve its datasets, which cover topics as diverse as endocrinologists and
hiking trails
...
It has a 5 MB capacity and
it’s stored in a cabinet roughly the size of a luxury refrigerator
...
5 gram
...
Disk drive on display at IBM Almaden Research


Much of the data we currently work with is the direct consequence of Web
2
...
The web has people spending more
time online, and leaving a trail of data wherever they go
...
Point-of-sale
devices and frequent-shopper’s cards make it possible to capture all of your
retail transactions, not just the ones you make online
...
Since
the early ’80s, processor speed has increased from 10 MHz to 3
...

But we’ve seen much bigger increases in storage capacity, on every level
...
Hitachi
made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds;
now terabyte drives are consumer equipment, and a 32 GB microSD card
weighs about half a gram
...

The importance of Moore’s law as applied to data isn’t just geek pyrotechnics
...
The more storage is available,
the more data you will find to put into it
...
Increased
storage capacity demands increased sophistication in the analysis and use of
that data
...

So, how do we make that data useful? The first step of any data analysis project
is “data conditioning,” or getting data into a state where it’s usable
...
But old-style screen scraping hasn’t died,
and isn’t going to die
...
They
aren’t well-behaved XML files with all the metadata nicely in place
...
This data was presented as an HTML
file that was probably generated automatically from a spreadsheet
...
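Pulling rows out of an HTML file like the one described above might look like the following sketch. It uses Python's standard-library HTMLParser so that it runs without extra packages; the TableRows class and the sample markup are hypothetical, and the sketch assumes every table cell is properly closed.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse stray whitespace and line breaks inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up markup in the style of a spreadsheet export.
messy_html = """
<table><tr><th>Address</th><th>Amount</th></tr>
<tr><td>  12 Example
 St </td><td><font size=2>$1,200</font></td></tr></table>
"""
parser = TableRows()
parser.feed(messy_html)
rows = parser.rows
print(rows)
```

Presentation tags such as font are ignored, and only the cell text survives.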

Data conditioning can involve cleaning up messy HTML with tools like Beautiful Soup, natural language processing to parse plain text in English and other
languages, or even getting humans to do the dirty work
...
It would be nice if
there was a standard set of tools to do the job, but there isn’t
...
Scripting languages, such as Perl and Python, are essential
...
Data is frequently missing or incongruous
...
If data is incongruous, do you decide that something is wrong with badly behaved data (after
all, equipment fails), or that the incongruous data is telling its own story, which
may be more interesting? It’s reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low1
...
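One way to avoid silently discarding readings is to flag them and keep them. The sketch below marks suspect values with a modified z-score based on the median absolute deviation; the 0.6745 constant and the 3.5 threshold are conventional choices assumed here, and the readings are invented for the example.

```python
from statistics import median

def flag_outliers(readings, threshold=3.5):
    """Return (value, is_suspect) pairs; nothing is ever dropped."""
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return [(x, False) for x in readings]
    flagged = []
    for x in readings:
        modified_z = 0.6745 * (x - med) / mad
        flagged.append((x, abs(modified_z) > threshold))
    return flagged

# Hypothetical instrument readings with one unusually low value.
readings = [301, 298, 305, 299, 302, 180, 300]
for value, suspect in flag_outliers(readings):
    print(value, "suspect" if suspect else "ok")
```

Because the low reading is labelled instead of deleted, a person can still decide whether it is a failure or a finding.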
It’s usually impossible to get “better” data, and you have no
alternative but to work with the data at hand
...
Roger Magoulas, who runs the data analysis group
at O’Reilly, was recently searching a database for Apple job listings requiring
geolocation skills
...
To
do it well you need to understand the grammatical structure of a job posting;
you need to be able to parse the English
...
Try using Google Trends to figure out what’s happening
with the Cassandra database or the Python language, and you’ll get a sense of
the problem
...

Disambiguation is never an easy task, but tools like the Natural Language
Toolkit library can make it simpler
...
That’s where services like Amazon’s Mechanical
Turk come in
...
For example, if you’re looking at job listings, and want to know which
originated with Apple, you can have real people do the classification for
roughly $0
...
If you have already reduced the set to 10,000 postings with
the word “Apple,” paying humans $0
...

1
...
” Whether humans or software decided to ignore anomalous
data, it appears that data was ignored
...
Oil
companies, telecommunications companies, and other data-centric industries
have had huge datasets for a long time
...
” The most meaningful definition I’ve heard: “big data” is when the size
of the data itself becomes part of the problem
...
At some point, traditional techniques for working with data run out of steam
...
Information platforms are similar to traditional data warehouses,
but different
...
They
accept all data formats, including the most messy, and their schemas evolve
as the understanding of the data changes
...
Traditional relational database systems stop being effective at this scale
...
The need to
define a schema in advance conflicts with the reality of multiple, unstructured data
sources, in which you may not know what’s important until after you’ve analyzed the data
...
While rock-solid consistency is crucial to many applications, it’s not really necessary for the kind of analysis we’re discussing here
...
Most data analysis is comparative: if you’re asking whether sales
to Northern Europe are increasing faster than sales to Southern Europe, you
aren’t concerned about the difference between 5
...
93 percent
...
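The contrast with a schema defined in advance can be sketched as follows: records are stored as they arrive, and the set of fields is discovered afterwards. This is an illustrative assumption about what "schemas evolve" can mean in practice, not a description of any particular platform; the log lines and function names are hypothetical.

```python
import json
from collections import Counter

# Made-up raw input: records with different fields, plus one bad line.
raw_lines = [
    '{"user": "a", "query": "hadoop"}',
    '{"user": "b", "query": "cassandra", "referrer": "search"}',
    '{"user": "c", "clicked": true, "geo": {"lat": 39.95, "lon": -75.16}}',
    'not json at all',
]

def load_records(lines):
    """Keep every parseable record as-is; set aside lines that do not parse."""
    records, rejects = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejects.append(line)
            continue
        if isinstance(record, dict):
            records.append(record)
        else:
            rejects.append(line)
    return records, rejects

def observed_schema(records):
    """Count how often each field appears, discovered after loading."""
    return Counter(field for record in records for field in record)

records, rejects = load_records(raw_lines)
schema = observed_schema(records)
print(schema.most_common())
```

A field that shows up later, such as geo here, needs no migration: it simply appears in the observed schema.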

These are frequently called NoSQL databases, or Non-Relational databases,
though neither term is very useful
...
Many of these databases are
the logical descendants of Google’s BigTable and Amazon’s Dynamo, and are
designed to be distributed across many nodes, to provide “eventual consis2
...
While
there are two dozen or so products available (almost all of them open source),
a few leaders have established themselves:
• Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites
...
It has a very flexible data
model
...

• HBase: Part of the Apache Hadoop project, and modelled on Google’s
BigTable
...
Along with Hadoop,
commercial support is provided by Cloudera
...
Data is only useful
if you can do something with it, and enormous datasets present computational
problems
...
In the “map” stage, a programming task
is divided into a number of identical subtasks, which are then distributed
across many processors; the intermediate results are then combined by a single
reduce task
...
It’s easy to distribute a search
across thousands of processors, and then combine the results into a single set
of answers
...
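The map and reduce stages described above can be sketched on a single machine. Here a thread pool stands in for the many processors of a real cluster, and counting words is an assumed example task; the function names are hypothetical and nothing in this sketch uses Hadoop itself.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_count_words(chunk):
    """Map step: the same subtask runs independently on every chunk."""
    return Counter(chunk.lower().split())

def reduce_counts(left, right):
    """Reduce step: merge two intermediate results."""
    return left + right

def word_count(chunks, workers=4):
    # Distribute the identical subtasks, then combine the partial counts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_count_words, chunks))
    return reduce(reduce_counts, partials, Counter())

# Hypothetical input already split into chunks.
chunks = ["data science data", "big data", "science of data products"]
totals = word_count(chunks)
print(totals.most_common(2))
```

Because each map subtask touches only its own chunk, adding more workers does not change the result, only how the work is spread out.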

The most popular open source implementation of MapReduce is the Hadoop
project
...
Many of the key Hadoop developers have found a home at Cloudera,
which provides commercial support
...
You
can allocate and de-allocate processors as needed, paying only for the time you
use them
...
It incorporates
HDFS, a distributed filesystem designed for the performance and reliability
requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components
...


Hadoop has been instrumental in enabling “agile” data analysis
...
Traditional data
analysis has been hampered by extremely long turn-around times
...
But Hadoop (and
particularly Elastic MapReduce) makes it easy to build clusters that can perform
computations on long datasets quickly
...
It’s
easier to consult with clients to figure out whether you’re asking the right
questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time
...
Hadoop processes
data as it arrives, and delivers intermediate results in (near) real-time
...
These features only require soft real-time; reports on trending topics don’t
require millisecond accuracy
...
According to Hilary Mason (@hmason), data scientist at
bit
...

Machine learning is another essential tool for the data scientist
...
You don’t have to look at many modern web applications to see
classification, error detection, image matching (behind Google Goggles and
SnapTell) and even face detection -- an ill-advised mobile application lets you
take someone’s picture with a cell phone, and look up that person’s identity
using photos available online
...

There are many libraries available for machine learning: PyBrain in Python,
Elefant, Weka in Java, and Mahout (coupled to Hadoop)
...
For computer vision, the
OpenCV library is a de-facto standard
...
Machine learning
almost always requires a “training set,” or a significant body of known data
with which to develop and tune the application
...
Once you’ve collected your training data (perhaps a
large collection of public photos from Twitter), you can have humans classify
them inexpensively -- possibly sorting them into categories, possibly drawing
circles around faces, cars, or whatever interests you
...
Even a relatively large job only costs a few hundred dollars
...
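The role of a training set can be illustrated with a very small classifier. The sketch below uses a nearest-centroid rule, which is an assumption chosen for brevity and is not taken from any of the libraries named above; the feature vectors and labels stand in for data that humans have already classified.

```python
from collections import defaultdict
from math import dist

def train_centroids(training_set):
    """training_set is a list of (feature_vector, label) pairs."""
    grouped = defaultdict(list)
    for features, label in training_set:
        grouped[label].append(features)
    # The "model" is simply the average feature vector of each label.
    return {
        label: tuple(sum(col) / len(col) for col in zip(*vectors))
        for label, vectors in grouped.items()
    }

def classify(centroids, features):
    """Assign the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical human-labelled examples with two numeric features each.
training_set = [
    ((1.0, 1.2), "small"), ((0.8, 1.0), "small"), ((1.1, 0.9), "small"),
    ((4.0, 4.2), "large"), ((4.3, 3.9), "large"), ((3.8, 4.1), "large"),
]
centroids = train_centroids(training_set)
print(classify(centroids, (0.9, 1.1)))
print(classify(centroids, (4.1, 4.0)))
```

Everything the classifier knows comes from the labelled examples, which is why the quality of the training set matters so much.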
According to Mike Driscoll (@dataspora), statistics is the “grammar of data science
...
” We’ve all heard the joke that eating pickles causes death,
because everyone who dies has eaten pickles
...
More to the point, it’s easy to notice that
one advertisement for R in a Nutshell generated 2 percent more conversions
than another
...
Data science isn’t just about the existence
of data, or making guesses about what that data might mean; it’s about testing
hypotheses and making sure that the conclusions you’re drawing from the data
are valid
...
Statistics has
become a basic skill
...
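Whether a 2 percent difference in conversions means anything depends on how much data is behind it. One standard way to check is a two-proportion z-test, sketched below with only the standard library. The visitor and conversion counts are invented for illustration, and the function name is hypothetical.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical campaigns: ad A converts 2 percent more often than ad B.
z_small, p_small = two_proportion_z_test(51, 1000, 50, 1000)
z_large, p_large = two_proportion_z_test(5100, 100000, 5000, 100000)
print(round(p_small, 3), round(p_large, 3))
```

With the same relative lift, the p-value shrinks as the sample grows, so the observed difference alone says little until the sample size is taken into account.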

While there are many commercial statistical packages, the open source R language -- and its comprehensive package library, CRAN -- is an essential tool
...
It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions extend R into
distributed computing
...

Making data tell its story
A picture may or may not be worth a thousand words, but a picture is certainly
worth a thousand numbers
...
To understand what the numbers mean,
the stories they are really telling, you need to generate a graph
...
But that’s not really
what concerns us here
...
According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how
bad your data is, try plotting it
...
Hilary Mason says that when she gets a new data set, she starts by
making a dozen or more scatter plots, trying to get a sense of what might be
interesting
...

There are many packages for plotting and presenting data
...
At IBM’s Many Eyes, many
of the visualizations are full-fledged interactive applications
...
One of my favorites is this animation of the growth of Walmart over
time
...
Does it look like the spread of
cancer throughout a body? Or the spread of a flu virus through a population?
Making data tell its story isn’t just a matter of presenting results; it involves
making connections, then going back to other data sources to verify them
...
There was insufficient computing power, the
data was all locked up in proprietary sources, and the tools for working with
the data were insufficient
...

Data scientists
Data science requires skills ranging from traditional computer science to
mathematics to art
...
on any given day, a team member could author a multistage processing
pipeline in Python, design a hypothesis test, perform a regression analysis over
data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses
to other members of the organization3

Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors
...
They
have to think about the big picture, the big problem
...
“Information Platforms as Dataspaces,” by Jeff Hammerbacher (in Beautiful Data)


as clean as you’d like
...
You need some creativity for when the story the data is telling isn’t what you think it’s telling
...

Patil described the process of creating the group recommendation feature at
LinkedIn
...
But the process worked quite differently: it started out with a
relatively small, simple program that looked at members’ profiles and made
recommendations accordingly
...
It then branched out incrementally
...
Then at books members had in their
libraries
...
It started small, and added value iteratively
...

Hiring trends for data science

It’s not easy to get a handle on jobs in data science
...
This
graph shows the increase in Cassandra jobs, and the companies listing Cassandra
positions, over time
...
CDDB is
a great example of data jiujitsu: identifying music by analyzing an audio stream
directly is a very difficult problem (though not unsolvable—see midomi, for
example)
...
Computing a signature based on
track lengths, and then looking up that signature in a database, is trivially
simple
...
Patil’s first flippant answer to
“what kind of person are you looking for when you hire a data scientist?” was
“someone you would start a company with
...
We don’t yet know
what those products are, but we do know that the winners will be the people,
and the companies, that find those products
...
Her job as scientist at bit
...
ly is generating, and find out how to build interesting products from it
...
In addition to being physicists, mathematicians, programmers, and artists, they’re entrepreneurs
...
They are inherently interdisciplinary
...
They can think outside the box to come up with new
ways to view the problem, or to work with very broadly defined problems:
“here’s a lot of data, what can you make from it?”
The future belongs to the companies that figure out how to collect and use
data successfully
...
They were the
vanguard, but newer companies like bit
...
Whether
it’s mining your personal biology, building maps from the shared experience
of millions of travellers, or studying the URLs that people pass to others, the
next generation of successful businesses will be built around data
...

Data is indeed the new Intel Inside
