Computing for data analysis (01) 的筆記

2017-02-07

本文是computing for data analysis課程的摘要。也有一些上課的心得。

Data types and Basic Oprations

basic or ‘atomic’ classes

  • character
  • numeric
  • integer
  • complex
  • logical(True/False)

The most basic object is a vector

Numbers

  • as numeric objects
  • 1 is a numeric object; 1L is an integer
  • NaN represent undefined value: 0/0
  • Inf (infinity): 1/0 = Inf and 1/Inf = 0

Attributes R objects

  • names, dimnames
  • dimensions
  • class
  • length
  • other user-definded attributes

Matrices

  • vectors with a dimension attribute. dimension attribute is an ieteger vetor
1
2
3
4
5
6
> m<-matrix(nrow=2,ncol=3)
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

List

1
2
3
4
5
6
7
8
9
10
11
12
13
> x <- list(1,"a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1 + 4i

NA

1
2
3
4
> is.na(NAN)
[1] TRUE
> is.nan(NA)
[1] FALSE

Data Frames

  • 創建data frame 通常用read.table(), read.csv()
  • 或者由matrix來創建 data.matrix()
1
2
3
4
5
> x <- data.frame(foo=1:4, bar=c(T,T,F,F))
> nrow(x)
[1] 4
> ncol(x)
[1] 2

Data Type and Basic Operations

subsetting list

  • [ 永遠會傳回跟原本物件一樣class的物件
  • [[ 用來截取list或data frame的元素,回傳的物件不一定是list或data frame
  • $ 用name來截取list或data frame的元素,很像是[[
1
2
3
4
5
6
7
8
9
10
11
12
> x <- list(foo=1:4, bar=0.6)
> x[1] # 回傳一個list
$foo
[1] 1 2 3 4
> x[[1]] # 回傳vector
[1] 1 2 3 4
> x$bar
[1] 0.6
> x[["bar"]]
[1] 0.6
> x["bar"]
[1] 0.6
1
2
3
4
5
6
7
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1,3)]
$foo
[1] 1 2 3 4

$baz
[1] "hello"

Subsetting Nestined Elements of a List

  • [[ take an integer sequnece
1
2
3
4
5
6
7
8
9
> x <- list(a = list(10,12,14), b = c(3.14, 2.81))
> x[[c(1,3)]]
[1] 14
> x[[1]][[3]]
[1] 14
> x[[c(2,1)]]
[1] 3.14
> x[[2]][[1]]
[1] 3.14
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
> x1 = 1:4; x2 = 1:4 + 10; x3 = 1:4 + 20; x4 = 1:4 + 30
> y <- list( ya = list(x1,x2) , yb = list(x3,x4) )
> y
$ya
$ya[[1]]
[1] 1 2 3 4

$ya[[2]]
[1] 11 12 13 14

$yb
$yb[[1]]
[1] 21 22 23 24

$yb[[2]]
[1] 31 32 33 34

> y[[1]][[2]][[3]]
[1] 13

> y[[c(1,2,3)]]
[1] 13

> y$ya[[2]]
[1] 11 12 13 14
> y$ya[[2]][3]
[1] 13

Removing NA values

1
2
3
> airquality[1:6,]
> good <- complete.case(airquality)
> airquality[good,][1:6,]

Vectorized Operations

1
2
3
4
5
> x <- matrix(1:4, 2, 2); y <- matrix(rep(10,4), 2, 2)
> x * y
> x / y
# true matrix multiplication
> x %*% y

Reading and Writing Data

reading

  • read.table, read.csv
  • readLines, 從文字檔案讀進來
  • source, 讀R code files
  • dget, 讀R code files
  • load, 讀入廚吋的workspaecs
  • unserialize, 從binary form讀入R物件

writing

  • write.table,
  • writeLines
  • dump
  • dput
  • save
  • serialize

使用 read.tables讀大檔時,稍微計算記憶體會不會爆炸掉,指定colClasses會讀得比較快,如果檔案裡面沒有註解,那就把comment.char="",這樣也會比較快

1
2
3
4
# 讀前100行看看型態
> initial <- read.table("datatable.txt", nrows = 100)
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)