Pandas and the dataframe

Data is often labeled, and data scientists often work with multiple streams of data at the same time.

Pandas is a library for working with labeled, tabular data. It is often the core library used to import data from common formats such as .csv and .tsv, and from proprietary formats such as Microsoft Excel, to enable programmatic processing. The core data type that Pandas provides for this is the DataFrame.
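To make "labeled, tabular data" concrete, here is a minimal sketch of a DataFrame built by hand. The values are a two-by-two excerpt of the ISP speed data used on these slides; the variable names `speeds` and `comcast_nov` are illustrative only.

```python
import pandas as pd

# Rows are labeled by provider, columns by month.
speeds = pd.DataFrame(
    {"Nov-12": [2.17, 2.07], "Dec-12": [2.10, 2.00]},
    index=pd.Index(["Comcast", "Cox"], name="Provider"),
)

# Look up a single value by its row label and column label.
comcast_nov = speeds.loc["Comcast", "Nov-12"]

print(speeds)
print(comcast_nov)
```

The point of the labels is that values are addressed by name ("Comcast", "Nov-12") rather than by position.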

For this and the following slides, we will be using download speed data provided by Netflix back in 2015. For convenience, the dataset is provided in this repository as data/us-graph-2015-02.csv.

"The Netflix ISP Speed Index." 27 06 2018. Netflix. csv. 27 06 2018 <https://ispspeedindex.netflix.com/country/us>

A cursory glance at the raw data

Provider      ,Nov-12,Dec-12,Jan-13,Feb-13,Mar-13,Apr-13,May-13,Jun-13,Jul-13,Aug-13,Sep-13,Oct-13,Nov-13,Dec-13,Jan-14,Feb-14,Mar-14,Apr-14,May-14,Jun-14,Jul-14,Aug-14,Sep-14,Oct-14,Nov-14,Dec-14,Jan-15,Feb-15
Comcast       ,2.17  ,2.1   ,2.01  ,2.06  ,2.09  ,2.11  ,2.13  ,2.1   ,2.09  ,2.04  ,2.11  ,2.07  ,1.82  ,1.63  ,1.51  ,1.68  ,2.5   ,2.77  ,2.72  ,2.61  ,2.82  ,2.9   ,2.92  ,3.05  ,3.11  ,3.24  ,3.28  ,3.36
Cox           ,2.07  ,2     ,1.96  ,2.12  ,2.25  ,2.31  ,2.36  ,2.39  ,2.44  ,2.47  ,2.5   ,2.56  ,2.56  ,2.62  ,2.69  ,2.81  ,2.84  ,2.9   ,2.94  ,2.99  ,3     ,3.03  ,3.04  ,3.09  ,3.11  ,3.21  ,3.32  ,3.38
AT&T - U-verse,1.94  ,1.92  ,1.83  ,1.91  ,1.97  ,2     ,1.99  ,1.91  ,1.9   ,1.92  ,1.97  ,1.87  ,1.75  ,1.75  ,1.59  ,1.66  ,1.73  ,1.76  ,1.7   ,1.5   ,1.44  ,2.61  ,2.77  ,2.86  ,2.94  ,3.02  ,3.03  ,3.11
Verizon - FiOS,2.19  ,2.1   ,2.04  ,2.1   ,2.15  ,2.17  ,2.17  ,2.15  ,2.15  ,2.14  ,2.2   ,2.22  ,2.2   ,2.11  ,1.82  ,1.76  ,1.91  ,1.99  ,1.9   ,1.58  ,1.61  ,2.41  ,3.17  ,3.24  ,3.27  ,3.36  ,3.43  ,3.53
AT&T - DSL    ,1.42  ,1.41  ,1.37  ,1.43  ,1.46  ,1.48  ,1.47  ,1.4   ,1.38  ,1.39  ,1.41  ,1.25  ,1.2   ,1.2   ,1.07  ,1.12  ,1.17  ,1.29  ,1.26  ,1.13  ,1.11  ,1.81  ,1.91  ,1.96  ,2.01  ,2.07  ,2.13  ,2.2
  • This data is a table of numerical values, but it's intended for a machine to consume.
  • This data isn't exactly human-readable.
  • A visual inspection confirms that this appears to be a comma-separated values (CSV) file, which pandas has a method for parsing.
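The parsing method in question is `pandas.read_csv`. The sketch below is not the example file included on the next slide; it parses a three-month excerpt of the raw data from an in-memory string, and it assumes the space padding shown above is really present in the file. The regular-expression separator is one way to absorb that padding, and the names `raw` and `df` are illustrative.

```python
import io

import pandas as pd

# A three-month excerpt of the raw data shown above, padding included.
raw = """\
Provider      ,Nov-12,Dec-12,Jan-13
Comcast       ,2.17  ,2.1   ,2.01
Cox           ,2.07  ,2     ,1.96
AT&T - U-verse,1.94  ,1.92  ,1.83
"""

# The separator matches a comma plus any spaces around it, so the padding is
# discarded. A regex separator requires the Python parsing engine.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s*,\s*",
    engine="python",
    index_col="Provider",  # use the provider names as row labels
)

print(df)
```

If the real file has no padding, a plain `pd.read_csv(path, index_col="Provider")` would likely be enough.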

Loading the data into pandas

{{  # include ../../../uci_bootcamp_2021/examples/pandas_example.py:6:20 }}

At this stage, the DataFrame looks like this:

                2012-11-01  2012-12-01  ...  2015-01-01  2015-02-01
Provider                                ...                        
Comcast               2.17        2.10  ...        3.28        3.36
Cox                   2.07        2.00  ...        3.32        3.38
AT&T - U-verse        1.94        1.92  ...        3.03        3.11
Verizon - FiOS        2.19        2.10  ...        3.43        3.53
AT&T - DSL            1.42        1.41  ...        2.13        2.20
[5 rows x 28 columns]
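The column labels in this output are full dates rather than the "Nov-12" style labels of the raw file, which suggests the labels were parsed into timestamps. As a sketch of one way to do that (an assumption, not necessarily what the included example does), `pd.to_datetime` can convert the labels with an explicit format. The two-provider frame below is an excerpt built by hand for illustration.

```python
import pandas as pd

# Hand-built excerpt: the last two months for two providers.
df = pd.DataFrame(
    {"Jan-15": [3.28, 3.32], "Feb-15": [3.36, 3.38]},
    index=pd.Index(["Comcast", "Cox"], name="Provider"),
)

# "%b-%y" matches labels such as "Feb-15": abbreviated month, two-digit year.
# With no day given, each label resolves to the first of the month.
df.columns = pd.to_datetime(df.columns, format="%b-%y")

# Each row can now be treated as a time series indexed by month.
comcast = df.loc["Comcast"]
latest = comcast.loc[pd.Timestamp("2015-02-01")]

print(df)
print(latest)
```

Note that `%b` depends on the locale's month abbreviations, so this assumes an English locale.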

For the full analysis, please see notebooks/Netflix.ipynb

Further reading

This presentation is, in a nutshell, how I typically use pandas. For further reading, please consult the pandas user guide: