Data Structures and Algorithms
Data
- Data today comes in all types of formats, both structured and unstructured (e.g., text documents, email, video, audio, and twitts).
- Big data is a popular term used to describe the exponential growth and availability of data. Big data is important to business (e.g., the recommender systems) and society (big data from medical and biological sciences is helping scientists find cures to many diseases), as more data may lead to more accurate analyses.
In 2001, industry analyst Doug Laney articulated the now mainstream definition of big data as the 3Vs: volume, velocity and variety (referring to the amount of data, the speed of data processing, and the number of types of data, respectively):
- Volume. Many factors contribute to the increase in data volume. Think about all the social media.
- Velocity. Data streams in at unprecedented speed and needs to be dealt with in a timely manner.
- Variety. Data can be structured, or unstructured, involving many different types of data.
(Image from: http://whatis.techtarget.com/definition/3Vs)
- Data Science
- Opportunities and challenges
- Most powerful Big Data companies: IBM, HP, Teradata, Oracle, SAP, EMC, Amazon, Microsoft, Google, VMware, Cloudera, Hortonworks, Splunk, 10Gen, and MapR(slides)
- Google BigQuery is a web service that lets users do interactive analysis of massive datasets--up to billions of rows, on demand.
The Need for Data Structures
- What's data structure? A data structure is any data representation and its associated operations. Even an integer or floating point number on the computer can be viewed as a simple data structure.
- Abstract data types & data structures
- A type is a collection of values. E.g., the Boolean type consists of the values true and false.
- A data type is a type together with a collection of operations to manipulate the type.
- An abstract data type (ADT) is the realization of a data type as a software component.
An ADT does NOT specify how the data type is implemented. These implementation details are hidden from the user of the ADT and protected from outside access, a concept referred to as encapsulation. - A data structure is the implementation for an ADT. In Java, an ADT and its implementation together make up a class. Each operation associated with the ADT is implemented by a member function or method.
- Example: An ADT for a list of integers might specify the following operations (Example 1.4):
- Insert a new integer at a particular position in the list.
- Return true if the list is empty.
- Reinitialize the list.
- Return the number of integers currently in the list.
- Delete the integer at a particular in the list.
- A type is a collection of values. E.g., the Boolean type consists of the values true and false.
- How to select a data structure?
- Determine the basic operations that must be supported (e.g., inserting/deleting/finding a data item).
- Quantify the resource constraints (total space available to store the data and the time allowed to perform each subtask).
- Select the data structure that best meets these requirements.
- Where is the data? Main memory, on a single disk, or distributed among computers? They may also arrive as one or more continuous data streams. Algorithms for processing the data may vary, depending on where the data lives. For example, external sorting algorithms may be required for sorting data too large to fit in main memory. There are also streaming algorithms that produce approximate answers based on a summary or "sketch" of the data stream in memory.
Algorithms
- Problem, algorithm, and program A computational problem is a funtion or a mapping of inputs to outputs.
Example 1: Input: A list of names of people Output: The same list sorted alphabetically Example 2: Input: A list of items that a person purchased online Output: A recommendation (the next item the person is likely to purchase) Example 3: Input: Two strings Output: The minimum number of operations (insertion, deletion, and replacement) needed to transform one string into the other one.An algorithm is an exact specification of how to solve a computational problem. The steps for sovling the problem must be concrete and unambiguous. There can be many different algorithms for each computational problem.
A program is an instantiation of an algorithm in a programming language.
- Space & running time: Big O notation; trade-off between space and time
- Efficiency and accuracy