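I picked up a copy of O'Reilly's "Machine Learning for Hackers" (the Japanese edition, 入門機械学習), so I am going to learn by working through it hands-on.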

First up is Chapter 1: installing R and learning its basic usage.

Installing R

I installed it on a CentOS 7 machine I had on hand.

$ cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core) 
$ sudo yum install epel-release
$ sudo yum install R
$ R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
...

Downloading the sample code

Next, fetch the sample code for "Machine Learning for Hackers", which is published on GitHub.

$ cd ~
$ git clone https://github.com/johnmyleswhite/ML_for_Hackers.git ml_for_hackers
$ cd ml_for_hackers

Installing the required packages

Install the packages needed to run the sample code.

$ R
> source("package_installer.R")

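When run as a regular user, R reports that the default install location is not writable. Answer y to install into a personal library under the home directory instead. It takes a fair while, so wait...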

...and a few of the packages failed to install.

1: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘rgl’ had non-zero exit status
2: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘XML’ had non-zero exit status
3: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘RCurl’ had non-zero exit status
4: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘Rpoppler’ had non-zero exit status
5: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘XML’ had non-zero exit status
6: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘RCurl’ had non-zero exit status
7: In install.packages(p, dependencies = TRUE, type = "source") :
  installation of package ‘XML’ had non-zero exit status

Installing one of them on its own suggests that the cause is missing system libraries that the packages depend on.

> install.packages("XML", dependencies=TRUE)

Cannot find xml2-config
ERROR: configuration failed for package ‘XML’
* removing ‘/home/yamautim/R/x86_64-redhat-linux-gnu-library/3.2/XML’

Install the missing development packages:

$ sudo yum -y install libxml2-devel curl-devel poppler-glib-devel freeglut-devel

Then retry:

> install.packages("rgl", dependencies=TRUE)
> install.packages("Rpoppler", dependencies=TRUE)
> install.packages("XML", dependencies=TRUE)
> install.packages("RCurl", dependencies=TRUE)
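To confirm that the retry really fixed everything, a quick check along these lines could be used. This is only a sketch: the names `needed` and `still.missing` are my own, and it assumes the four packages above are the only ones that failed.

```r
# Packages that failed in the first run (taken from the warnings above).
needed <- c("rgl", "Rpoppler", "XML", "RCurl")

# installed.packages() returns a matrix whose row names are package names.
still.missing <- setdiff(needed, rownames(installed.packages()))

if (length(still.missing) == 0) {
  message("All packages are installed.")
} else {
  message("Still missing: ", paste(still.missing, collapse = ", "))
}
```

If anything is still listed, the compiler output from install.packages usually names the missing header or config script.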

Loading the libraries and data for the basic exercises

Load the libraries and the data (UFO sighting reports) used in the basic exercises.

> setwd("01-Introduction/")
> library("ggplot2")
> library(plyr)
> library(scales)
> ufo <- read.delim("data/ufo/ufo_awesome.tsv", sep="\t", stringsAsFactors=FALSE, header=FALSE, na.strings="")
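To see what the read.delim arguments above do, here is a small sketch on inline data. The two rows are modeled on the sample output, but `raw` and `toy` are hypothetical names of mine, and the `text=` argument is used here only so that the example does not need the real file.

```r
# Two tab-separated records with empty fields, similar to the UFO file.
raw <- "19951009\t19951009\tIowa City, IA\t\t\n19951010\t19951011\tMilwaukee, WI\t\t2 min."

toy <- read.delim(text = raw,
                  sep = "\t",
                  header = FALSE,           # no header row, so columns become V1, V2, ...
                  na.strings = "",          # treat empty fields as NA
                  stringsAsFactors = FALSE) # keep text as character, not factor

print(toy)
print(is.na(toy$V5))
```

With na.strings = "", the empty duration in the first row is read as NA rather than as an empty string, which matches the <NA> entries in the output of head(ufo).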

Display the first and last six rows.

> head(ufo)
        V1       V2                    V3   V4      V5
1 19951009 19951009         Iowa City, IA <NA>    <NA>
2 19951010 19951011         Milwaukee, WI <NA>  2 min.
3 19950101 19950103           Shelton, WA <NA>    <NA>
...
> tail(ufo)
            V1       V2                   V3        V4                V5
61865 20100828 20100828      Los Angeles, CA      disk        40 seconds
61866 20090424 20100820         Hartwell, GA      oval            10 min
61867 20100821 20100826  Franklin Square, NY  fireball        20 minutes
61868 20100827 20100827         Brighton, CO    circle    at lest 45 min
61869 20100818 20100821  Dryden (Canada), ON     other 5 Min. maybe more
61870 20050502 20100824        Fort Knox, KY  triangle        15 seconds

Cleaning the data

Convert the loaded data into a form that is easier to work with.

Naming the data columns

> names(ufo) <- c("DateOccurred", "DateReported", "Location", "ShortDescription",
"Duration", "LongDescription")
> head(ufo)
  DateOccurred DateReported              Location ShortDescription Duration
1     19951009     19951009         Iowa City, IA             <NA>     <NA>
2     19951010     19951011         Milwaukee, WI             <NA>   2 min.
3     19950101     19950103           Shelton, WA             <NA>     <NA>
4     19950510     19950510          Columbia, MO             <NA>   2 min.
5     19950611     19950614           Seattle, WA             <NA>     <NA>
6     19951025     19951024  Brunswick County, ND             <NA>  30 min.

The columns that were labeled V1 and so on now have descriptive names such as DateOccurred.

Converting DateOccurred and DateReported to dates

The DateOccurred column holds strings that represent dates, so convert it to Date objects.

> ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
Error in strptime(x, format, tz = "GMT") : input string is too long

That failed. Some rows seem to contain overly long strings in the date columns, so let's find them.

> head(ufo[which(nchar(ufo$DateOccurred)!=8|nchar(ufo$DateReported)!=8),1])
[1] "ler@gnv.ifas.ufl.edu" 
[2] "0000"
[3] "Callers report sighting a number of soft white  balls of lights headingin an easterly directing then changing direction to the west beforespeeding off to the north west."
[4] "0000"
[5] "0000"
[6] "0000"

There is some odd data in there. So, remove the rows whose date fields are not exactly 8 characters long.

# Build a logical vector (good.rows) that records, for each row, whether both DateOccurred and DateReported are 8 characters long
> good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 |
 nchar(ufo$DateReported) != 8,
 FALSE, TRUE)
# Count the FALSE entries (rows where DateOccurred or DateReported is not 8 characters long).
> length(which(!good.rows))
[1] 731
# Keep only the rows where good.rows is TRUE
> ufo <- ufo[good.rows, ]
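The length check can be tried on a handful of values first. The following is a minimal sketch with made-up input (the vector `dates` is hypothetical, loosely modeled on the bad values found above). It also shows that the comparison already yields a logical vector, so the ifelse wrapper is not strictly needed.

```r
# A mix of well-formed and malformed date strings.
dates <- c("19951009", "0000", "ler@gnv.ifas.ufl.edu", "20100828")

# nchar() is vectorized, so this gives one TRUE/FALSE per element.
good <- nchar(dates) == 8
print(good)  # TRUE FALSE FALSE TRUE

# Only the well-formed strings are passed to as.Date().
parsed <- as.Date(dates[good], format = "%Y%m%d")
print(parsed)
```

For two columns, the same idea would be `nchar(a) == 8 & nchar(b) == 8`, which should select the same rows as the ifelse version above.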

Retry the conversion. This time both columns convert to the Date type.

> ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
> ufo$DateReported <- as.Date(ufo$DateReported, format = "%Y%m%d")
> head(ufo)
  DateOccurred DateReported              Location ShortDescription Duration
1   1995-10-09   1995-10-09         Iowa City, IA             <NA>     <NA>
2   1995-10-10   1995-10-11         Milwaukee, WI             <NA>   2 min.
3   1995-01-01   1995-01-03           Shelton, WA             <NA>     <NA>

Writing a function to split the Location column into city and state

The values in the Location column are strings in "city, state" form, such as Iowa City, IA. Use a function to split them into a city name and a state.

First, write the function that does the splitting.

> get.location <- function(l) {
  split.location <- tryCatch(strsplit(l, ",")[[1]], error = function(e) return(c(NA, NA)))
  clean.location <- gsub("^ ","",split.location)
  if (length(clean.location) > 2) {
    return(c(NA,NA))
  } else {
    return(clean.location)
  }
}

Use lapply to build a list holding the result of applying the function to each value.

> city.state <- lapply(ufo$Location, get.location)
> head(city.state)
[[1]]
[1] "Iowa City" "IA"       

[[2]]
[1] "Milwaukee" "WI"

Now add this to ufo. First, use do.call to convert the list into a matrix.

> location.matrix <- do.call(rbind, city.state)
> head(location.matrix)
     [,1]               [,2]
[1,] "Iowa City"        "IA"
[2,] "Milwaukee"        "WI"
[3,] "Shelton"          "WA"
[4,] "Columbia"         "MO"
[5,] "Seattle"          "WA"
[6,] "Brunswick County" "ND"
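The split-then-rbind pattern above can be reproduced on three values. This is a simplified sketch: it skips the tryCatch and the length check that get.location performs, and the names `locs`, `parts` and `m` are mine.

```r
locs <- c("Iowa City, IA", "Milwaukee, WI", "Dryden (Canada), ON")

# strsplit() returns a list; [[1]] takes the pieces for the single input string.
# gsub("^ ", "", ...) removes the one leading space left after the comma.
parts <- lapply(locs, function(l) gsub("^ ", "", strsplit(l, ",")[[1]]))

# do.call(rbind, parts) is the same as rbind(parts[[1]], parts[[2]], parts[[3]]),
# which stacks the length-2 vectors into a 3 x 2 character matrix.
m <- do.call(rbind, parts)
print(m)
```

Each list element becomes one row, so column 1 holds the city and column 2 the state, which is what the transform step relies on.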

Add the two columns to ufo with transform.

> ufo <- transform(ufo, USCity = location.matrix[, 1], USState = location.matrix[, 2], stringsAsFactors = FALSE)

The data also includes sightings from Canada, so remove those as well. Delete the rows whose USState is not a US state abbreviation.

> ufo$USState <- state.abb[match(ufo$USState, state.abb)]
> ufo.us <- subset(ufo, !is.na(USState))
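How the match against state.abb turns non-US values into NA can be seen on a few inputs. A minimal sketch, with a hypothetical input vector `states`:

```r
# state.abb is a built-in character vector of the 50 US state abbreviations.
states <- c("IA", "ON", "WA", NA, "ia")

# match() gives the position in state.abb, or NA when there is no match.
# Indexing state.abb with NA yields NA, so unknown values become NA.
matched <- state.abb[match(states, state.abb)]
print(matched)  # "IA" NA "WA" NA NA
```

Note that the match is case-sensitive, so a lowercase "ia" is also dropped. The subset with !is.na(USState) then keeps only the rows that matched.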

That completes the data cleaning.

> summary(ufo.us)
  DateOccurred         DateReported          Location        
 Min.   :1400-06-30   Min.   :1905-06-23   Length:51636      
 1st Qu.:1999-09-07   1st Qu.:2002-04-14   Class :character  
 Median :2004-01-10   Median :2005-03-27   Mode  :character  
 Mean   :2001-02-13   Mean   :2004-11-30                     
 3rd Qu.:2007-07-27   3rd Qu.:2008-01-20                     
 Max.   :2010-08-30   Max.   :2010-08-30                     
 ShortDescription     Duration         LongDescription       USCity         
 Length:51636       Length:51636       Length:51636       Length:51636      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   USState         
 Length:51636      
 Class :character  
 Mode  :character  
                   
                   
                   
> head(ufo.us)
  DateOccurred DateReported              Location ShortDescription Duration
1   1995-10-09   1995-10-09         Iowa City, IA             <NA>     <NA>
2   1995-10-10   1995-10-11         Milwaukee, WI             <NA>   2 min.
3   1995-01-01   1995-01-03           Shelton, WA             <NA>     <NA>
4   1995-05-10   1995-05-10          Columbia, MO             <NA>   2 min.
5   1995-06-11   1995-06-14           Seattle, WA             <NA>     <NA>
6   1995-10-25   1995-10-24  Brunswick County, ND             <NA>  30 min.

Analyzing the data

Use the cleaned data to look at trends in sightings by state and month.

Examining the spread of DateOccurred

> summary(ufo.us$DateOccurred) 
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.         NA's 
"1400-06-30" "1999-09-07" "2004-01-10" "2001-02-13" "2007-07-27" "2010-08-30"          "1"

There is a record from the year 1400... Let's look at the distribution with a histogram.

> quick.hist <- ggplot(ufo.us, aes(x = DateOccurred)) +
  geom_histogram() + scale_x_date(date_breaks = "50 years", date_labels = "%Y")
> ggsave(plot = quick.hist,
 filename = file.path("images", "quick_hist.png"),
 height = 6,
 width = 8)

[Figure: histogram of DateOccurred (images/quick_hist.png)]

Most of the sightings appear to be concentrated in the last 20 years. To focus the analysis on that range, drop the older data.

> ufo.us <- subset(ufo.us, DateOccurred >= as.Date("1990-01-01"))
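The summary above also reported one NA in DateOccurred. As a small sketch on made-up data (the data frame `d` and column `when` are hypothetical), subset treats a missing condition as false, so the date filter drops that kind of row too:

```r
d <- data.frame(when = as.Date(c("1400-06-30", "1995-10-09", NA, "2010-08-30")))

# Rows where the comparison is FALSE or NA are both removed.
recent <- subset(d, when >= as.Date("1990-01-01"))
print(recent)
print(nrow(recent))  # 2
```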

Aggregating the data by state and month

First, create a column that holds DateOccurred converted to "year-month" form.

> ufo.us$YearMonth <- strftime(ufo.us$DateOccurred, format = "%Y-%m")

Next, group the data by state and month and count the rows in each group.

> sightings.counts <- ddply(ufo.us, .(USState,YearMonth), nrow)
> head(sightings.counts)
  USState YearMonth V1
1      AK   1990-01  1
2      AK   1990-03  1
3      AK   1990-05  1
4      AK   1993-11  1
5      AK   1994-11  1
6      AK   1995-01  1

On its own, this leaves out any month that has no sightings at all, so fill those in. Use seq.Date to create a sequence of months.

> date.range <- seq.Date(from = as.Date(min(ufo.us$DateOccurred)),
                         to = as.Date(max(ufo.us$DateOccurred)),
                         by = "month")
> date.strings <- strftime(date.range, "%Y-%m")
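The month sequence can be checked on a short range. A minimal sketch with made-up endpoints (`toy.range` and `toy.strings` are my names):

```r
# One date per month, starting at `from` and stopping at or before `to`.
toy.range <- seq.Date(from = as.Date("1990-01-01"),
                      to = as.Date("1990-04-20"),
                      by = "month")

# Reduce each date to its "year-month" label.
toy.strings <- strftime(toy.range, "%Y-%m")
print(toy.strings)  # "1990-01" "1990-02" "1990-03" "1990-04"
```

As far as I understand, when the start date falls on day 29 to 31 the monthly steps can roll over into the following month, so it may be worth checking that the earliest DateOccurred is early in its month. In the data here the sequence starts on 1990-01-01, so that does not seem to be a problem.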

Cross the sequence with the list of states to build a table with one row per state and month.

> states.dates <- lapply(state.abb, function(s) cbind(s, date.strings))
> states.dates <- data.frame(do.call(rbind, states.dates),
 stringsAsFactors = FALSE)
> head(states.dates)
   s date.strings
1 AL      1990-01
2 AL      1990-02
3 AL      1990-03
4 AL      1990-04
5 AL      1990-05
6 AL      1990-06
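As an aside, base R's expand.grid can build the same kind of state and month grid in one call. This is a different technique from the lapply and cbind approach above, not what the book's code does, and the inputs here are shortened for illustration.

```r
months <- c("1990-01", "1990-02")

# expand.grid() returns every combination; the first argument varies fastest,
# so putting the months first keeps the rows grouped by state.
grid <- expand.grid(date.strings = months,
                    s = c("AL", "AK"),
                    stringsAsFactors = FALSE)

# Reorder the columns to match states.dates (state first, then month).
grid <- grid[, c("s", "date.strings")]
print(grid)
```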

Then merge in sightings.counts to produce aggregated data with no missing months.

> all.sightings <- merge(states.dates,
 sightings.counts,
 by.x = c("s", "date.strings"),
 by.y = c("USState", "YearMonth"),
 all = TRUE)
> head(all.sightings)
   s date.strings V1
1 AK      1990-01  1
2 AK      1990-02 NA
3 AK      1990-03  1
4 AK      1990-04 NA
5 AK      1990-05  1
6 AK      1990-06 NA

Give the columns clearer names. Also convert NA to 0 and make a few other adjustments so that the data is easier to aggregate.

> names(all.sightings) <- c("State", "YearMonth", "Sightings")
> all.sightings$Sightings[is.na(all.sightings$Sightings)] <- 0
> all.sightings$YearMonth <- as.Date(rep(date.range, length(state.abb)))
> all.sightings$State <- as.factor(all.sightings$State)
> head(all.sightings)
  State  YearMonth Sightings
1    AK 1990-01-01         1
2    AK 1990-02-01         0
3    AK 1990-03-01         1
4    AK 1990-04-01         0
5    AK 1990-05-01         1
6    AK 1990-06-01         0
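The merge and the NA replacement above can be traced on a tiny example. The sketch below uses made-up counts for two states and three months; the data frames `grid` and `counts` stand in for states.dates and sightings.counts.

```r
# Full grid of state/month combinations.
grid <- data.frame(s = rep(c("AK", "AL"), each = 3),
                   date.strings = rep(c("1990-01", "1990-02", "1990-03"), 2),
                   stringsAsFactors = FALSE)

# Counts exist only for months that had at least one sighting.
counts <- data.frame(USState = c("AK", "AK", "AL"),
                     YearMonth = c("1990-01", "1990-03", "1990-02"),
                     V1 = c(1L, 2L, 5L),
                     stringsAsFactors = FALSE)

# all = TRUE keeps grid rows that have no partner in counts; their V1 is NA.
merged <- merge(grid, counts,
                by.x = c("s", "date.strings"),
                by.y = c("USState", "YearMonth"),
                all = TRUE)

# Months with no sightings become 0 instead of NA.
merged$V1[is.na(merged$V1)] <- 0
print(merged)
```

All six state/month combinations survive the merge, and the three that had no count end up with 0.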

That completes the data set for analysis.

Plotting the number of sightings by state and month

> state.plot <- ggplot(all.sightings, aes(x = YearMonth,y = Sightings)) +
  geom_line(aes(color = "darkblue")) +
  facet_wrap(~State, nrow = 10, ncol = 5) + 
  theme_bw() + 
  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +
  scale_x_date(date_breaks = "5 years", date_labels = '%Y') +
  xlab("Years") +
  ylab("Number of Sightings") +
  ggtitle("Number of UFO sightings by Month-Year and U.S. State (1990-2010)")
> ggsave(plot = state.plot,
       filename = file.path("images", "ufo_sightings.png"),
       width = 14,
       height = 8.5)