Pandas and the dataframe
Data is often labeled, and data scientists often work with multiple streams of data at the same time.
Pandas is a library that handles series of labeled, tabular data. It is often the core library used to
import data from common formats such as .csv and .tsv, and from proprietary formats such as Microsoft
Excel, to enable programmatic processing. The core datatype that Pandas provides for this is
the DataFrame.
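To make "labeled, tabular data" concrete, here is a minimal sketch that builds a small DataFrame by hand. The variable names are illustrative only, and the values are copied from the first two months of the Netflix dataset used in the rest of this section.

```python
import pandas as pd

# Two providers (row labels) and two months (column labels).
speeds = pd.DataFrame(
    {"Nov-12": [2.17, 2.07], "Dec-12": [2.10, 2.00]},
    index=pd.Index(["Comcast", "Cox"], name="Provider"),
)

# Label-based lookup: row label first, then column label.
comcast_nov = speeds.loc["Comcast", "Nov-12"]

# Each column is itself a labeled Series that shares the row index.
december = speeds["Dec-12"]

print(speeds)
```

Values are addressed by their labels rather than by position, which is what makes it practical to line up several data streams at once.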
For this and the following slides, we will be using download speed data provided by Netflix back in 2015.
For convenience, the dataset is provided in this repository as data/us-graph-2015-02.csv
"The Netflix ISP Speed Index." 27 06 2018. Netflix. csv. 27 06 2018 <https://ispspeedindex.netflix.com/country/us>
A cursory glance at the raw data
Provider ,Nov-12,Dec-12,Jan-13,Feb-13,Mar-13,Apr-13,May-13,Jun-13,Jul-13,Aug-13,Sep-13,Oct-13,Nov-13,Dec-13,Jan-14,Feb-14,Mar-14,Apr-14,May-14,Jun-14,Jul-14,Aug-14,Sep-14,Oct-14,Nov-14,Dec-14,Jan-15,Feb-15
Comcast ,2.17 ,2.1 ,2.01 ,2.06 ,2.09 ,2.11 ,2.13 ,2.1 ,2.09 ,2.04 ,2.11 ,2.07 ,1.82 ,1.63 ,1.51 ,1.68 ,2.5 ,2.77 ,2.72 ,2.61 ,2.82 ,2.9 ,2.92 ,3.05 ,3.11 ,3.24 ,3.28 ,3.36
Cox ,2.07 ,2 ,1.96 ,2.12 ,2.25 ,2.31 ,2.36 ,2.39 ,2.44 ,2.47 ,2.5 ,2.56 ,2.56 ,2.62 ,2.69 ,2.81 ,2.84 ,2.9 ,2.94 ,2.99 ,3 ,3.03 ,3.04 ,3.09 ,3.11 ,3.21 ,3.32 ,3.38
AT&T - U-verse,1.94 ,1.92 ,1.83 ,1.91 ,1.97 ,2 ,1.99 ,1.91 ,1.9 ,1.92 ,1.97 ,1.87 ,1.75 ,1.75 ,1.59 ,1.66 ,1.73 ,1.76 ,1.7 ,1.5 ,1.44 ,2.61 ,2.77 ,2.86 ,2.94 ,3.02 ,3.03 ,3.11
Verizon - FiOS,2.19 ,2.1 ,2.04 ,2.1 ,2.15 ,2.17 ,2.17 ,2.15 ,2.15 ,2.14 ,2.2 ,2.22 ,2.2 ,2.11 ,1.82 ,1.76 ,1.91 ,1.99 ,1.9 ,1.58 ,1.61 ,2.41 ,3.17 ,3.24 ,3.27 ,3.36 ,3.43 ,3.53
AT&T - DSL ,1.42 ,1.41 ,1.37 ,1.43 ,1.46 ,1.48 ,1.47 ,1.4 ,1.38 ,1.39 ,1.41 ,1.25 ,1.2 ,1.2 ,1.07 ,1.12 ,1.17 ,1.29 ,1.26 ,1.13 ,1.11 ,1.81 ,1.91 ,1.96 ,2.01 ,2.07 ,2.13 ,2.2
- This data is a table of numerical values, but it's intended for a machine to consume.
- This data isn't exactly human-readable.
- A visual inspection suggests this is a comma-separated values (CSV) file, which pandas has a method for parsing.
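The parsing method in question is `pandas.read_csv`. The sketch below is not the example file used in this repository. It is a self-contained illustration that reads a three-month excerpt of the raw data from an in-memory string, so it runs without the CSV on disk. Reading every field as text and stripping the stray whitespace explicitly is one possible way to handle padded cells such as `Comcast ` and `2.17 `; the variable names are assumptions.

```python
import io

import pandas as pd

# Excerpt of the raw file: first three months, first three providers.
RAW = """Provider ,Nov-12,Dec-12,Jan-13
Comcast ,2.17 ,2.1 ,2.01
Cox ,2.07 ,2 ,1.96
AT&T - U-verse,1.94 ,1.92 ,1.83
"""

# Read every field as text so the padding can be cleaned up explicitly.
raw = pd.read_csv(io.StringIO(RAW), dtype=str)

# Strip trailing spaces from the header ("Provider " -> "Provider") ...
raw.columns = raw.columns.str.strip()
# ... and from every cell ("Comcast " -> "Comcast", "2.17 " -> "2.17").
raw = raw.apply(lambda column: column.str.strip())

# Use the provider name as the row label and convert the speeds to floats.
speeds = raw.set_index("Provider").astype(float)

print(speeds)
```

With the real file, the `io.StringIO(RAW)` argument would be replaced by the path to the CSV.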
Loading the data into pandas
{{ # include ../../../uci_bootcamp_2021/examples/pandas_example.py:6:20 }}
At this stage, the DataFrame looks like this:
2012-11-01 2012-12-01 ... 2015-01-01 2015-02-01
Provider ...
Comcast 2.17 2.10 ... 3.28 3.36
Cox 2.07 2.00 ... 3.32 3.38
AT&T - U-verse 1.94 1.92 ... 3.03 3.11
Verizon - FiOS 2.19 2.10 ... 3.43 3.53
AT&T - DSL 1.42 1.41 ... 2.13 2.20
[5 rows x 28 columns]
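The column labels in this output are dates rather than the `Nov-12` style strings of the raw file. One way to get from one to the other, sketched here on a two-column stand-in for the full table, is `pandas.to_datetime` with an explicit format string. The format `"%b-%y"` and the variable names are assumptions made for this illustration, not necessarily what the example file does.

```python
import pandas as pd

# Stand-in for the loaded table: first and last month for two providers.
speeds = pd.DataFrame(
    {"Nov-12": [2.17, 2.07], "Feb-15": [3.36, 3.38]},
    index=pd.Index(["Comcast", "Cox"], name="Provider"),
)

# Parse "Nov-12" style labels (abbreviated month, two-digit year) into timestamps.
speeds.columns = pd.to_datetime(speeds.columns, format="%b-%y")

# Columns can now be addressed by date instead of by string label.
cox_feb = speeds.loc["Cox", pd.Timestamp("2015-02-01")]

print(speeds.columns)
```

Date-typed labels sort chronologically and work with pandas's time-series tools, which plain strings like `Nov-12` do not.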
For the full analysis, please see
notebooks/Netflix.ipynb
Further reading
This presentation covers, in a nutshell, how I typically use pandas. For further reading, please consult the pandas user guide: