Data Loading, Storage, Handling and All That

Introduction to Data Science S2 and Kotlin

We need data before we can do any data analysis. Depending on what the data source is, there are different ways to read in the data.

Reading Google Sheets

Suppose we have a Google Sheet (gsheet) in our Google Drive, such as this one.

You can access this gsheet using this URL:

https://docs.google.com/spreadsheets/d/1jRFfcjrk-qRATGwhs9K2x8tSF7O2kHMfK5_kNxrUdyc

Note that the seemingly gibberish string at the end of the URL, namely “1jRFfcjrk-qRATGwhs9K2x8tSF7O2kHMfK5_kNxrUdyc”, is in fact the ID of the gsheet. We will use that later.
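As a minimal sketch of how the ID fits into the URL, the snippet below pulls the ID out of a sheet URL and builds the CSV export link in the form used by the reading code in this lesson. The helper names `extractSheetId` and `csvExportUrl` are hypothetical (they are not part of krangl or S2), and the regular expression assumes the ID is the path segment right after `/d/`.

```kotlin
// Hypothetical helpers for illustration; not part of krangl or S2.

// Extracts the sheet ID, assuming it is the path segment that follows "/spreadsheets/d/".
// Returns null when the URL does not have that shape.
fun extractSheetId(url: String): String? =
    Regex("/spreadsheets/d/([A-Za-z0-9_-]+)").find(url)?.groupValues?.get(1)

// Builds the CSV export URL in the same form used by the reading code in this lesson.
fun csvExportUrl(sheetId: String): String =
    "https://docs.google.com/spreadsheets/d/$sheetId/gviz/tq?tqx=out:csv"

val sheetUrl = "https://docs.google.com/spreadsheets/d/1jRFfcjrk-qRATGwhs9K2x8tSF7O2kHMfK5_kNxrUdyc"
val sheetId = extractSheetId(sheetUrl)

println(sheetId)                            // the ID part of the URL
println(sheetId?.let { csvExportUrl(it) })  // the CSV export link for that ID
```

Keeping the ID in its own variable makes it easy to point the same code at a different sheet.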

Before our code can read this gsheet, we need to make it public so that anyone, including your code, can access it. To do this, open the “Share” settings.

Then share the sheet with “Anyone with the link”.

The Kotlin function “DataFrame.readCSV” in the package “krangl” takes the URL of the gsheet as input and creates a data frame from it. The following code reads this gsheet.

				
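%use krangl

val gsheetID = "1a7H90lq7OpFjKXlLRHYSHYxgViykyVsXWpw229IVjh8" // the gsheet ID
// Ask Google Sheets for a CSV export of the sheet and read it into a data frame
val df = DataFrame.readCSV("https://docs.google.com/spreadsheets/d/${gsheetID}/gviz/tq?tqx=out:csv")
df // display the data frame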

The output is:

A data frame is not just a data structure for storage. It comes with a number of utility functions for processing the data. For example, we can use “filterByRow” to keep only those rows whose “Source” is “GISTEMP”.

				
val df1 = df.filterByRow { it["Source"] == "GISTEMP" }
df1

The output is:

We can also sort the data frame. The following code sorts the data by “Date”.
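A minimal sketch of the sort, assuming krangl's `sortedBy` and `sortedByDescending` functions. To keep the example independent of the network, it uses a small in-memory data frame whose rows and whose “Mean” column are made up for illustration; with the sheet loaded above, the call would be `df.sortedBy("Date")`.

```kotlin
%use krangl

// A small in-memory data frame with made-up values (not the real gsheet data).
val sample = dataFrameOf("Source", "Date", "Mean")(
    "GISTEMP", "2016-12-06", 0.81,
    "GCAG", "2016-10-06", 0.73,
    "GISTEMP", "2016-11-06", 0.93
)

// Sort the rows by the "Date" column in ascending order.
val sortedAsc = sample.sortedBy("Date")

// Sort the rows by the "Date" column in descending order instead.
val sortedDesc = sample.sortedByDescending("Date")

sortedAsc // display the sorted data frame
```

If the “Date” column holds text in a year-month-day form such as the sample above, text order and chronological order coincide; other date formats would need to be parsed before sorting.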

Reading a CSV File through GitHub

Enter your data in Microsoft Excel and save the file with a .csv extension. For instance, consider the following file.

To upload the CSV file to GitHub, you need a public repository in your GitHub account. It can be a new repository or an existing one. Open the repository, click Add File, and then Upload Files. Below is the image of an existing repository.

Now, drag and drop the CSV file and click on Commit Changes as shown below. Your file will be uploaded!

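Once the file is uploaded, open it on GitHub and click “Raw” to get the URL of the file contents. The URL shown in the address bar of the file page (the one containing “/blob/”) points to an HTML page rather than the CSV data, so use the raw URL instead. Now, open the S2/Kotlin IDE and type the following code.

				
%use s2

// Read the CSV file from its raw GitHub URL
val df = DataFrame.readCSV("https://raw.githubusercontent.com/nmltd/s2-data-sets/main/multiple-regression-dataset1.csv")
println(df)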

On running the above code, your data set is loaded into a data frame and you are ready to go!
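GitHub serves an HTML page at a file's `.../blob/...` address, while the file contents themselves are served from `raw.githubusercontent.com`. As a minimal sketch, the hypothetical helper `toRawGitHubUrl` below (not part of krangl or S2) converts the first form into the second, assuming the URL follows the usual `github.com/<owner>/<repo>/blob/<branch>/<path>` layout.

```kotlin
// Hypothetical helper for illustration; not part of krangl or S2.
// Converts a GitHub file-page URL of the form
//   https://github.com/<owner>/<repo>/blob/<branch>/<path>
// into the corresponding raw-content URL
//   https://raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>
// URLs that do not match this form are returned unchanged.
fun toRawGitHubUrl(url: String): String {
    val pattern = Regex("https://github\\.com/([^/]+)/([^/]+)/blob/(.+)")
    val match = pattern.matchEntire(url) ?: return url
    val (owner, repo, rest) = match.destructured
    return "https://raw.githubusercontent.com/$owner/$repo/$rest"
}

val pageUrl = "https://github.com/nmltd/s2-data-sets/blob/main/multiple-regression-dataset1.csv"
println(toRawGitHubUrl(pageUrl))
```

The result can be passed straight to `DataFrame.readCSV`.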
