Lecture 3

In [ ]:
setwd("e:/Dropbox (Personal)/FSEGA/cursuri/2018-2019/R/cursuri")
In [62]:
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
In [63]:
library(ggplot2)
options(repr.plot.width=5, repr.plot.height=3) # plot size for on-screen display

Histograms

In [64]:
qplot(x=friend_count, data=pf)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

setting the axes

with xlim() we control which portion of the axis is shown

In [65]:
qplot(x=friend_count, data=pf, xlim=c(1,1000))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 4913 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 2 rows containing missing values (geom_bar)."
In [66]:
qplot(x=friend_count, data=pf, xlim=c(500,1000))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 93599 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 2 rows containing missing values (geom_bar)."
In [67]:
qplot(x=friend_count, data=pf)+
  scale_x_continuous(limits=c(500,1000))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 93599 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 2 rows containing missing values (geom_bar)."
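The "Removed ... rows containing non-finite values" warnings report how many observations fall outside the requested axis limits: they are dropped before the bins are counted. A minimal sketch of that count in base R, using a hypothetical toy vector rather than the real pseudo_facebook.tsv data:

```r
# Hypothetical friend counts (not taken from pseudo_facebook.tsv)
friend_count <- c(0, 120, 500, 750, 1000, 2300)

# Limits as in xlim=c(500,1000)
lower <- 500
upper <- 1000

# Observations outside the limits are the ones the warning counts as removed
removed <- sum(friend_count < lower | friend_count > upper)
kept    <- friend_count[friend_count >= lower & friend_count <= upper]

print(removed)
print(kept)
```

Running the same count on pf$friend_count should match the number reported in the warning, assuming the column contains no other missing values.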

tick marks (breaks) for the x axis

In [68]:
qplot(x=friend_count, data=pf)+
  scale_x_continuous(limits=c(500,1000), breaks=seq(0,1000,50))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 93599 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 2 rows containing missing values (geom_bar)."

split by the values of another variable

In [69]:
qplot(x=friend_count, data=subset(pf, !is.na(gender)))+
  scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,250))+
  facet_wrap(~gender, ncol=2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 2949 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 4 rows containing missing values (geom_bar)."
In [70]:
qplot(x=friend_count, data=na.omit(pf))+
        scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,250))+
        facet_wrap(~gender, ncol=2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 2949 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 4 rows containing missing values (geom_bar)."

na.omit() removes every row that has an NA in any variable, not only in gender...
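The difference between the two filters can be sketched on a small data frame. The values below are hypothetical and only illustrate the behavior; the column names mirror those used in the notebook.

```r
# Toy data frame (hypothetical values, not taken from pseudo_facebook.tsv)
toy <- data.frame(
  gender       = c("male", "female", NA, "female", "male"),
  friend_count = c(10, 25, 7, 40, 3),
  tenure       = c(100, NA, 50, 365, 20)
)

# Keeps rows where gender is known, even if other columns contain NA
by_gender <- subset(toy, !is.na(gender))

# Drops every row that has an NA in any column
complete <- na.omit(toy)

print(nrow(by_gender))
print(nrow(complete))
```

Here the row with a missing tenure survives subset() but not na.omit(), which is why na.omit() can silently discard more observations than intended.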

Numeric

In [71]:
table(pf$gender)
female   male 
 40254  58574 
In [72]:
by(pf$friend_count, pf$gender, summary)
pf$gender: female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      37      96     242     244    4923 
------------------------------------------------------------ 
pf$gender: male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      27      74     165     182    4917 

where:

  • we analyze the variable pf$friend_count split by the values of pf$gender, showing the values reported by the summary function
  • we can use other functions instead of summary: mean, median, etc.
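As a minimal sketch of the same split-and-apply idea, the example below runs by() with median on a hypothetical toy data frame and shows tapply(), a base R alternative that returns a plain named array.

```r
# Toy data (hypothetical values, not taken from pseudo_facebook.tsv)
toy <- data.frame(
  gender       = c("female", "male", "female", "male", "female"),
  friend_count = c(10, 20, 30, 60, 80)
)

# by(): apply a function to friend_count within each gender group
medians <- by(toy$friend_count, toy$gender, median)
print(medians)

# tapply(): the same grouping, returned as a named array
means <- tapply(toy$friend_count, toy$gender, mean)
print(means)
```

tapply() is convenient when the result feeds further computation, while by() prints a more readable per-group report.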
In [73]:
by(pf$friend_count, pf$gender, mean)
pf$gender: female
[1] 241.9699
------------------------------------------------------------ 
pf$gender: male
[1] 165.0355
In [74]:
by(pf$friend_count, pf$gender, max)
pf$gender: female
[1] 4923
------------------------------------------------------------ 
pf$gender: male
[1] 4917

Adding colors

We analyze the tenure variable: the number of days the user has been on Facebook

In [75]:
qplot(x=tenure, data=pf, color=I('black'), fill=I('#099009'))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 2 rows containing non-finite values (stat_bin)."

Changes:

Tenure is expressed in days; we convert it to years:

In [76]:
qplot(x=tenure/365, data=pf, color=I('black'), fill=I('#099009'))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 2 rows containing non-finite values (stat_bin)."

Customization

In [77]:
qplot(x=tenure/365, data=pf, color=I('black'), fill=I('#099009'))+
  xlab('a suitable label for this axis')+
  ylab('an explanation is mandatory here!')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 2 rows containing non-finite values (stat_bin)."
In [78]:
qplot(x=age, data=pf, color=I('grey'), fill=I('grey'), binwidth=1)+
  scale_x_continuous(limits=c(18,100), breaks=seq(20,100,10))
Warning message:
"Removed 15083 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 2 rows containing missing values (geom_bar)."

Arranging multiple plots

In [79]:
install.packages('gridExtra', repos = "http://cran.us.r-project.org")
Installing package into 'C:/Users/ro/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
Warning message:
"package 'gridExtra' is in use and will not be installed"
In [80]:
library(gridExtra)

we prepare several plots

In [81]:
p1=qplot(pf$friend_count)
p2=qplot(pf$likes)
p3=qplot(pf$www_likes)
p4=qplot(pf$mobile_likes)
In [82]:
grid.arrange(p1,p2,p3,p4, ncol=2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another way to build plots: ggplot()

In [83]:
p8 <- ggplot(aes(x=friend_count), data=pf)+
  geom_histogram()
p8
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

on top of which we can add various options:

In [84]:
p9 <- p8+scale_x_sqrt()
p9
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In [85]:
p10 <- p8+scale_x_log10()
p10
Warning message:
"Transformation introduced infinite values in continuous x-axis"
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 1962 rows containing non-finite values (stat_bin)."
In [86]:
grid.arrange(p8,p9,p10, ncol=1)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Transformation introduced infinite values in continuous x-axis"
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 1962 rows containing non-finite values (stat_bin)."
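The "infinite values" warning on the log10 scale comes from zeros in the data: log10(0) is -Inf, and such rows are dropped before binning (presumably the 1962 removed rows are users with friend_count equal to 0). A minimal base R sketch on hypothetical values; the +1 shift at the end is a common workaround, not something the notebook itself applies:

```r
# Hypothetical friend counts, including zeros
friend_count <- c(0, 0, 1, 10, 100, 1000)

# Zeros become -Inf under log10
log_counts <- log10(friend_count)
print(log_counts)

# Number of values that would be dropped as non-finite
dropped <- sum(!is.finite(log_counts))
print(dropped)

# Assumed workaround: shift by one so that zero maps to log10(1) = 0
shifted <- log10(friend_count + 1)
print(all(is.finite(shifted)))
```

scale_x_sqrt() does not have this problem, since sqrt(0) is 0, which is consistent with the square-root plot above producing no such warning.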

Frequency polygon

In [87]:
p11 <- qplot(x=friend_count, data=subset(pf, !is.na(gender)), binwidth=10, geom='freqpoly')+
  scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,100))
p11
Warning message:
"Removed 2949 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 2 rows containing missing values (geom_path)."

split by gender

In [88]:
p12 <- qplot(x=friend_count, data=subset(pf, !is.na(gender)), binwidth=10, geom='freqpoly', color=gender)+
  scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,100))
p12
Warning message:
"Removed 2949 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 4 rows containing missing values (geom_path)."
In [89]:
p13 <- qplot(x=www_likes, data=subset(pf, !is.na(gender)), geom='freqpoly', color=gender)
p13
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In [90]:
p13+scale_x_continuous(lim=c(0,2000))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
"Removed 344 rows containing non-finite values (stat_bin)."
Warning message:
"Removed 4 rows containing missing values (geom_path)."

numerically again

In [91]:
by(pf$www_likes, pf$gender, summary)
pf$gender: female
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     0.00     0.00    87.14    25.00 14865.00 
------------------------------------------------------------ 
pf$gender: male
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     0.00     0.00    24.42     2.00 12903.00 
In [92]:
by(pf$www_likes, pf$gender, sum)
pf$gender: female
[1] 3507665
------------------------------------------------------------ 
pf$gender: male
[1] 1430175

Boxplots

In [93]:
qplot(x=gender, y=friend_count, data=subset(pf, !is.na(gender)), geom='boxplot')

adjusting the axes: Option 1

In [94]:
qplot(x=gender, y=friend_count, data=subset(pf, !is.na(gender)), geom='boxplot')+
  scale_y_continuous(lim=c(0,1000))
Warning message:
"Removed 2949 rows containing non-finite values (stat_boxplot)."

adjusting the axes: Option 2

In [95]:
qplot(x=gender, y=friend_count, data=subset(pf, !is.na(gender)), geom='boxplot', ylim=c(0,1000))
Warning message:
"Removed 2949 rows containing non-finite values (stat_boxplot)."

scale_y_continuous:

  • watch the warning messages from R!
  • the statistics are recomputed on fewer data points, so the shape of the boxes changes

ylim

  • same problem :(

the solution:

coord_cartesian()
In [96]:
qplot(x=gender, y=friend_count, data=subset(pf, !is.na(gender)), geom='boxplot')+
  coord_cartesian(ylim=c(0,1000))
In [97]:
qplot(x=gender, y=friend_count, data=subset(pf, !is.na(gender)), geom='boxplot')+
  coord_cartesian(ylim=c(0,260))
In [98]:
by(pf$friend_count, pf$gender, summary)
pf$gender: female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      37      96     242     244    4923 
------------------------------------------------------------ 
pf$gender: male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      27      74     165     182    4917
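Why the boxes change shape under scale_y_continuous(lim=...) or ylim, but not under coord_cartesian(), can be sketched in base R. Scale limits discard out-of-range observations before the quartiles are computed, whereas coord_cartesian() only zooms the view and leaves the statistics untouched. The values below are hypothetical:

```r
# Hypothetical friend counts with a long right tail
friend_count <- c(5, 20, 60, 150, 400, 900, 1500, 3000, 4900)

# coord_cartesian(ylim=c(0,1000)): statistics use all the data
full_median <- median(friend_count)
full_quartiles <- quantile(friend_count, c(0.25, 0.5, 0.75))

# Scale limits of c(0,1000): values above 1000 are dropped first
trimmed <- friend_count[friend_count <= 1000]
trimmed_median <- median(trimmed)

print(full_quartiles)
print(full_median)
print(trimmed_median)
```

With coord_cartesian() the box drawn for each gender should agree with the quartiles reported by by(pf$friend_count, pf$gender, summary).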