Lecture 2

Rodica Ioana Lung

Data table

  • reddit.csv
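# In R >= 4.0, stringsAsFactors = TRUE is needed to get the Factor columns shown below
redd <- read.csv("D:/Dropbox/FSEGA/cursuri/2016-2017/semestrul 2/R/date/reddit.csv", stringsAsFactors = TRUE)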
str(redd)
'data.frame':   32754 obs. of  14 variables:
 $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender           : int  0 0 1 0 1 0 0 0 0 0 ...
 $ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
 $ marital.status   : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
 $ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
 $ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
 $ children         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ education        : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
 $ country          : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
 $ state            : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
 $ income.range     : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
 $ fav.reddit       : Factor w/ 1834 levels "","'home' page (or front page if you prefer)",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
 $ dog.cat          : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
 $ cheese           : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...

The FACTOR type

  • qualitative (categorical) variables
  • levels: the values the variable can take
  • we can inspect them with table(), which counts the observations in each level:
table(redd$employment.status)

                   Employed full time 
                                14814 
                            Freelance 
                                 1948 
Not employed and not looking for work 
                                  682 
   Not employed, but looking for work 
                                 2087 
                              Retired 
                                   85 
                              Student 
                                12987 
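
Internally, a factor stores one integer code per observation plus a vector of level labels, which is why str() prints numbers after the level names. A minimal sketch on a made-up vector (status is a hypothetical name, not a column of redd):

```r
# Hypothetical toy data, not taken from reddit.csv
status <- factor(c("Student", "Retired", "Student", "Freelance"))

levels(status)      # the distinct values, sorted alphabetically
nlevels(status)     # how many levels there are
as.integer(status)  # the integer code stored for each observation
table(status)       # number of observations per level
```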

Checking the data

summary(redd)
       id            gender          age.range    
 Min.   :    1   Min.   :0.0000   18-24   :15802  
 1st Qu.: 8189   1st Qu.:0.0000   25-34   :11575  
 Median :16380   Median :0.0000   Under 18: 2330  
 Mean   :16379   Mean   :0.1885   35-44   : 2257  
 3rd Qu.:24568   3rd Qu.:0.0000   45-54   :  502  
 Max.   :32756   Max.   :1.0000   (Other) :  200  
                 NA's   :201      NA's    :   88  
                                  marital.status 
 Engaged                                 : 1109  
 Forever Alone                           : 5850  
 In a relationship                       : 9828  
 Married/civil union/domestic partnership: 5490  
 Single                                  :10428  
 Widowed                                 :   44  
 NA's                                    :    5  
                             employment.status military.service
 Employed full time                   :14814   No  :30526      
 Freelance                            : 1948   Yes : 2223      
 Not employed and not looking for work:  682   NA's:    5      
 Not employed, but looking for work   : 2087                   
 Retired                              :   85                   
 Student                              :12987                   
 NA's                                 :  151                   
 children                                  education    
 No  :27488   Bachelor's degree                 :11046  
 Yes : 5047   Some college                      : 9600  
 NA's:  219   Graduate or professional degree   : 4722  
              High school graduate or equivalent: 3272  
              Some high school                  : 1924  
              (Other)                           : 2046  
              NA's                              :  144  
           country             state                    income.range 
 United States :20967             :11908   Under $20,000      :7892  
 Canada        : 2888   California: 3401   $50,000 - $69,999  :4133  
 United Kingdom: 1782   Texas     : 1541   $70,000 - $99,999  :4101  
 Australia     : 1051   New York  : 1418   $100,000 - $149,999:3522  
 Germany       :  407   Illinois  :  976   $20,000 - $29,999  :3206  
 (Other)       : 5482   Washington:  910   (Other)            :8285  
 NA's          :  177   (Other)   :12600   NA's               :1615  
               fav.reddit               dog.cat            cheese    
                    : 4335   I like cats.   :11156   Other    :6563  
 askreddit          : 2123   I like dogs.   :17151   Cheddar  :6102  
 fffffffuuuuuuuuuuuu: 1746   I like turtles.: 4442   Brie     :3742  
 pics               : 1651   NA's           :    5   Provolone:3456  
 trees              : 1311                           Swiss    :3214  
 (Other)            :21562                           (Other)  :9672  
 NA's               :   26                           NA's     :   5  
levels(redd$age.range)
[1] "18-24"       "25-34"       "35-44"       "45-54"       "55-64"      
[6] "65 or Above" "Under 18"   
table(redd$age.range)

      18-24       25-34       35-44       45-54       55-64 65 or Above 
      15802       11575        2257         502         140          60 
   Under 18 
       2330 
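
Note that table() drops missing values by default, whereas summary() reports them (88 NA's for age.range above). A small sketch on a hypothetical vector shows the difference; useNA is a documented argument of table():

```r
# Hypothetical toy factor with one missing value
age <- factor(c("18-24", "25-34", NA, "18-24"))

table(age)                   # NA is not counted
table(age, useNA = "ifany")  # adds an NA column when any values are missing
sum(is.na(age))              # number of missing values
```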

Packages for plotting

install.packages('ggplot2', repos = "http://cran.us.r-project.org")
library('ggplot2')
qplot(data=redd, x=age.range)

[Figure: bar chart of age.range]
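
In ggplot2 3.4.0 and later, qplot() is deprecated and emits a warning. The same kind of bar chart can be written with ggplot() and geom_bar(). The sketch below uses a hypothetical toy data frame (toy stands in for redd):

```r
library(ggplot2)

# Hypothetical stand-in for redd with a single factor column
toy <- data.frame(age.range = factor(c("18-24", "25-34", "18-24", "Under 18")))

# One bar per level; bar height is the number of rows in that level
p <- ggplot(toy, aes(x = age.range)) +
  geom_bar()
print(p)
```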

Ordering? The levels are sorted alphabetically, so "Under 18" ends up last.

qplot(data=redd, x=income.range)

[Figure: bar chart of income.range]

We want to order the categories (in a custom order or the default one)!

The factor type in R

intreg <- sample(0:1, 20, replace = TRUE)
?sample
intreg
 [1] 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 0 1 0 1
intregF <- factor(intreg, labels=c('privat','public'))
intregF
 [1] public privat privat public public public privat public privat public
[11] privat privat public privat privat public privat public privat public
Levels: privat public
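
When only labels is given, factor() pairs the labels with the sorted distinct values, so here 0 becomes 'privat' and 1 becomes 'public'. Spelling out levels as well makes the pairing explicit. A sketch with a fixed vector v (hypothetical) in place of the random sample:

```r
# Fixed input so the result is reproducible (hypothetical values)
v <- c(1, 0, 0, 1)

# levels lists the raw values; labels names each of them, in the same order
f <- factor(v, levels = c(0, 1), labels = c("privat", "public"))
f
as.integer(f)  # internal codes: 1 for 'privat', 2 for 'public'
```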

The factor type (continued)

is.factor(intregF)
[1] TRUE
ses <- c('low','middle','low','high','low')
is.factor(ses)
[1] FALSE
is.character(ses)
[1] TRUE

Plots

qplot(ses)

[Figure: bar chart of ses]

By default, the levels are ordered alphabetically:

sesF1 <- factor(ses)
sesF1
[1] low    middle low    high   low   
Levels: high low middle
qplot(sesF1)

[Figure: bar chart of sesF1, bars in the order high, low, middle]

Factor with a custom level order

sesF <- factor(ses, levels=c('low','middle','high'))
sesF
[1] low    middle low    high   low   
Levels: low middle high
qplot(sesF)

[Figure: bar chart of sesF, bars in the order low, middle, high]

And the last variant: an ordered factor

sesFordonat <- ordered(ses, levels=c('low','middle','high'))
sesFordonat
[1] low    middle low    high   low   
Levels: low < middle < high
qplot(sesFordonat)

[Figure: bar chart of sesFordonat]
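
The two variants plot the same way. The practical difference is that an ordered factor supports comparisons between levels. A brief sketch reusing the ses values from above:

```r
ses <- c("low", "middle", "low", "high", "low")
sesFordonat <- ordered(ses, levels = c("low", "middle", "high"))

sesFordonat < "high"     # element-wise comparison against a level
max(sesFordonat)         # the highest level present
is.ordered(sesFordonat)  # TRUE for ordered factors

# For an unordered factor the same comparison returns NA (with a warning)
sesF <- factor(ses, levels = c("low", "middle", "high"))
suppressWarnings(sesF < "high")
```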

Back to reddit:

We order redd$age.range:

redd$age.range <- ordered(redd$age.range, levels=c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
qplot(redd$age.range)

[Figure: bar chart of redd$age.range, bars ordered from "Under 18" to "65 or Above"]
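
One caveat when listing levels by hand: any value that does not match one of the supplied levels exactly becomes NA, and R gives no warning. A quick check on a hypothetical vector:

```r
# Hypothetical values; note the misspelled level in the second call
x <- c("Under 18", "18-24", "25-34")

ok  <- ordered(x, levels = c("Under 18", "18-24", "25-34"))
bad <- ordered(x, levels = c("under 18", "18-24", "25-34"))  # lower-case 'u'

sum(is.na(ok))   # no values lost
sum(is.na(bad))  # 'Under 18' did not match any level
```

Comparing table() before and after the reordering is a cheap way to confirm that no observations were lost.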

Exercise:

Order the income.range variable and plot it in the same way.

Analysing a single variable. The pseudo-Facebook table.

getwd()
[1] "D:/Dropbox/FSEGA/cursuri/2016-2017/semestrul 2/R/curs1"
pf <- read.csv('D:/Dropbox/FSEGA/cursuri/2016-2017/semestrul 2/R/date/pseudo_facebook.tsv', sep='\t')
names(pf)
 [1] "userid"                "age"                  
 [3] "dob_day"               "dob_year"             
 [5] "dob_month"             "gender"               
 [7] "tenure"                "friend_count"         
 [9] "friendships_initiated" "likes"                
[11] "likes_received"        "mobile_likes"         
[13] "mobile_likes_received" "www_likes"            
[15] "www_likes_received"   

We analyse dob_day, the day of birth

qplot(x=dob_day, data=pf)

[Figure: histogram of dob_day]

Adding labels to the x axis

qplot(x=dob_day, data=pf)+
  scale_x_continuous(breaks=1:31)

[Figure: histogram of dob_day with x-axis ticks at 1 to 31]

Splitting by month

qplot(x=dob_day, data=pf)+
  scale_x_continuous(breaks=c(1,5,10,15,20,25,30))+
  facet_wrap(~dob_month,ncol=3)

[Figure: histograms of dob_day, one panel per dob_month, in 3 columns]
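
For reference, the same kind of faceted histogram can be written with ggplot() instead of qplot(). This is a sketch on simulated data: sim is a hypothetical stand-in for pf with only the two columns used. The binwidth = 1 argument is an addition so that each day gets its own bar; without it, geom_histogram() falls back to its default of 30 bins.

```r
library(ggplot2)

# Simulated stand-in for pf (hypothetical data, uniform days and months)
set.seed(1)
sim <- data.frame(
  dob_day   = sample(1:31, 500, replace = TRUE),
  dob_month = sample(1:12, 500, replace = TRUE)
)

p <- ggplot(sim, aes(x = dob_day)) +
  geom_histogram(binwidth = 1) +                            # one bar per day
  scale_x_continuous(breaks = c(1, 5, 10, 15, 20, 25, 30)) +
  facet_wrap(~dob_month, ncol = 3)                          # one panel per month
print(p)
```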

Controlling the number of bins

qplot(x=dob_day, data=pf, binwidth=2)+
  scale_x_continuous(breaks=seq(1,31,2))

[Figure: histogram of dob_day with binwidth 2]

Aren't there too many people born in January?
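
One way to check this numerically rather than by eye is to tabulate the birth months, overall and for a single day. A sketch on simulated data (sim is hypothetical; with the real table, use pf in its place):

```r
# Simulated stand-in for pf (hypothetical data, uniform days and months)
set.seed(1)
sim <- data.frame(
  dob_day   = sample(1:31, 500, replace = TRUE),
  dob_month = sample(1:12, 500, replace = TRUE)
)

by_month      <- table(sim$dob_month)                   # users per birth month
day1_by_month <- table(sim$dob_month[sim$dob_day == 1]) # born on day 1, per month

by_month
day1_by_month
```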

Exercises:

  1. Create a variable x of length 100 that takes the values 1, 2, or 3 at random (with replacement).
  2. Assign the labels 'satisfactory', 'good', 'very good' to the values of x using the factor() function.
  3. Plot a histogram of the factor variable.
  4. Starting from x, build an ordered factor in which 'satisfactory' < 'good' < 'very good'.
  5. Build a variable y that contains only the values of x corresponding to 'good' and 'very good'.
  6. Plot the histogram of y.

Exercises:

Using the pseudo-Facebook table

  1. Display the usual descriptive statistics (summary) for the variable likes.
  2. Draw a histogram of likes.
  3. Try four different values for binwidth.
  4. Split the histogram into two plots (by gender).
  5. Split the original histogram into 12 plots (by month) arranged in 4 columns.
  6. Label the x axis appropriately in each of the plots above.

Repeat the steps above for the variables likes_received and mobile_likes.