Beginning Data Science with Jupyter Notebook and Kotlin
This tutorial introduces the concepts of Data Science, using Jupyter Notebook and Kotlin. You’ll learn how to set up a Jupyter notebook, load krangl for Kotlin and use it in data science utilizing a built-in sample data. By Joey deVilla.
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Contents
Beginning Data Science with Jupyter Notebook and Kotlin
35 mins
- Getting Started
- Working with krangl’s Built-In Data Frames
- Getting a Data Frame’s First and Last Rows
- Extracting a slice() from the Data Frame
- Learning that slice() Indexes Start at 1, not 0
- Exploring the Data
- Sorting Data
- Filtering
- Learning the Basics of Filtering
- Filtering With Negation
- Filtering With Multiple Criteria
- Fancier Text Filtering
- Removing Columns
- Performing Complex Data Operations
- Calculating Column Statistics
- Grouping
- Summarizing
- Importing Data
- Reading .csv Data
- Reading .tsv Data
- Where to Go From Here?
Filtering With Negation
Imagine you wanted a data frame containing the animals that aren’t herbivores? You may have noticed that there isn’t a notEqualTo
or ne
in the list of comparison functions.
krangl’s approach for cases like this is to extend BooleanArray
to give it a not()
method. When called, this method negates every element in the array: true
values become false
, and false
values become true
.
Enter the following in a new code cell to see how you can use not()
to create a data frame of all the non-herbivores:
val nonHerbivores = sleepData.filter { (it["vore"] eq "herbi").not() } nonHerbivores
This works because it["vore"] eq "herbi"
creates a BooleanArray
specifying which rows contain herbivores. Calling not()
on that array inverts its contents, making it a BooleanArray
specifying which rows do not contain herbivores.
The output shows some carnivores, insectivores, omnivores, and one animal for which the vore value was not provided, the vesper mouse.
Filtering With Multiple Criteria
In addition to not()
, krangl extends BooleanArray
with the AND
and OR
infix functions. Using AND
, you could have created and displayed a dataset of heavy herbivores by running this code instead:
val alsoHeavyHerbivores = sleepData .filter { (it["vore"] eq "herbi") AND (it["bodywt"] gt 30) } alsoHeavyHerbivores
Fancier Text Filtering
In addition to seeing if the contents of a column are equal to, greater than or less than a given value, you can do some fancier text filtering with the help of DataCol
’s isMatching()
method.
This is another method that’s easier to demonstrate than explain, so run the following in a new code cell:
val monkeys = sleepData.filter { it["name"].isMatching<String> { contains("monkey") } } monkeys
This will result in a data frame containing only those animals with “monkey” in their name. The lambda that you provide to isMatching()
should accept a string argument and return a boolean. You’ll find functions such as startsWith()
, endsWith()
, contains()
and matches()
useful within that lambda.
Removing Columns
With filtering, you were removing rows from a data frame. You’ll often find that removing columns whose information isn’t needed at the moment is also useful. DataFrame
has a couple of methods that you can use to create a new data frame from an existing one, but with fewer columns.
select()
creates a new data frame from an existing one, but only with the columns whose names you specify.
Run the following code in a new cell to see select()
in action:
val simplifiedSleepData = sleepData.select("name", "vore", "sleep_total", "sleep_rem") simplifiedSleepData
You’ll see with a table that has only the columns whose names you specified: name, vore, sleep_total, and sleep_rem.
remove()
creates a new data frame from an existing one, but without the columns whose names you specify.
Run the following code in a new cell to see remove()
in action:
val evenSimplerSleepData = simplifiedSleepData.remove("sleep_rem") evenSimplerSleepData
You’ll see this result:
Performing Complex Data Operations
So far, you’ve been working with operations that work with observations on an individual basis. While useful, the real power of data science comes from aggregating data — collecting many units of data into one.
Calculating Column Statistics
DataCol
provides a number of handy math and statistics methods for often-performed calculations on numerical columns.
The simplest way to demonstrate how they work is to show you these functions in action on a given column, such as sleep_total.
Run the following code in a new code cell to see these functions in action:
val sleepCol = sleepData["sleep_total"] println("The mean sleep period is ${sleepCol.mean(removeNA=true)} hours.") println("The median sleep period is ${sleepCol.median(removeNA=true)} hours.") println("The standard deviation for the sleep periods is ${sleepCol.sd(removeNA=true)} hours.") println("The shortest sleep period is ${sleepCol.min(removeNA=true)} hours.") println("The longest sleep period is ${sleepCol.max(removeNA=true)} hours.")
These methods take a boolean argument, removeNA
, which specifies whether to exclude missing values from the calculation. The default value is false
, but it’s generally better to set this value to true
.
Grouping
In the previous exercise, you calculated the mean, median, and other statistical information about the entire set of animals in the dataset. While those results might be useful, you’re likely to get more insights by doing the same calculations on smaller sets of animals with something in common. Grouping allows you to create these smaller sets.
Given one or more column names, DataFrame
’s group()
method creates a new data frame from an existing one, but organized into a set of smaller data frames. called groups. Each group consists of rows that had the same values for the specified columns.
Don’t worry if the previous paragraph confused you. This is yet another one of those situations where showing is better than telling.
Break sleepData
‘s animals into groups based on the food they eat. There should be a carnivore group, an omnivore group, a herbivore group, etc. The vore column specifies what an animal eats, so you’ll provide that column name to the group()
method.
Run the following in a new code cell:
val groupedData = sleepData.groupBy("vore")
This creates a new data frame named groupedData
.
If you try to display groupedData
‘s contents by entering its name into a new code cell and running it, the output will be confusing:
Grouped by: *[vore] A DataFrame: 5 x 11 name genus vore order conservation sleep_total sleep_rem 1 Cheetah Acinonyx carni Carnivora lc 12.1 2 Northern fur seal Callorhinus carni Carnivora vu 8.7 1.4 3 Dog Canis carni Carnivora domesticated 10.1 2.9 4 Long-nosed armadillo Dasypus carni Cingulata lc 17.4 3.1 5 Domestic cat Felis carni Carnivora domesticated 12.5 3.2 and 4 more variables: awake, brainwt, bodywt
The output says that the data frame is grouped by vore and has 5 rows and 11 columns. The problem is that the default way to display the contents of a data frame works only for ungrouped data frames. You have to do a little more work to display the contents of a grouped data frame.
Run and enter the following in a new code cell:
groupedData.groups()
groups()
returns the list of the data frame’s groups, and remember, groups, are just data frames.
You’ll see 5 data frames in a row. Each of these data frames contains rows with a common value in their vore column. There’s one group for “carni”, one called “omni”, and so on.
DataFrame
has a count()
method that’s very useful for groups; it returns a data frame containing the count of rows for each group.
Run and enter the following in a new code cell:
groupedData.count()
This is the result:
Any data frame operation performed on a grouped data frame is done on a per-group basis.
For example, sorting a grouped data frame creates a new data frame with individually sorted groups. You can see this for yourself by running the following in a new code cell:
val sortedGroupedData = groupedData.sortedBy("name") sortedGroupedData.groups()