Course 4

In [1]:
setwd("e:/Dropbox (Personal)/FSEGA/cursuri/2018-2019/R/cursuri")
In [2]:
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
In [3]:
library(ggplot2)
library(gridExtra)
options(repr.plot.width=4.5, repr.plot.height=2.5)
Warning message:
"package 'ggplot2' was built under R version 3.5.2"
Warning message:
"package 'gridExtra' was built under R version 3.5.3"

Logical variables

Example: we want to know whether or not a user has accessed Facebook from a mobile device.

In [4]:
summary(pf$mobile_likes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     4.0   106.1    46.0 25111.0 
In [5]:
summary(pf$mobile_likes>0)
   Mode   FALSE    TRUE 
logical   35056   63947 
In [4]:
mobile_check_in <- NA
mobile_check_in <- pf$mobile_likes>0
summary(mobile_check_in)
   Mode   FALSE    TRUE 
logical   35056   63947 
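
Because R treats TRUE as 1 and FALSE as 0, the mean of a logical vector is the share of TRUE values. A minimal sketch on a made-up vector (`likes` is illustrative data, not taken from pseudo_facebook.tsv):

```r
# Hypothetical toy data standing in for pf$mobile_likes
likes <- c(0, 3, 0, 12, 7)

used_mobile <- likes > 0   # logical vector: FALSE TRUE FALSE TRUE TRUE
sum(used_mobile)           # number of TRUE values
mean(used_mobile)          # proportion of TRUE values
```

Applied to pf, mean(pf$mobile_likes > 0) would give the share of users with at least one mobile like (63947 out of 99003 in the summary above).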

or directly in pf:

In [7]:
pf$mobile_check_in <- ifelse(pf$mobile_likes>0,1,0)
str(pf)
'data.frame':	99003 obs. of  16 variables:
 $ userid               : int  2094382 1192601 2083884 1203168 1733186 1524765 1136133 1680361 1365174 1712567 ...
 $ age                  : int  14 14 14 14 14 14 13 13 13 13 ...
 $ dob_day              : int  19 2 16 25 4 1 14 4 1 2 ...
 $ dob_year             : int  1999 1999 1999 1999 1999 1999 2000 2000 2000 2000 ...
 $ dob_month            : int  11 11 11 12 12 12 1 1 1 2 ...
 $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 1 2 2 ...
 $ tenure               : int  266 6 13 93 82 15 12 0 81 171 ...
 $ friend_count         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ friendships_initiated: int  0 0 0 0 0 0 0 0 0 0 ...
 $ likes                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ likes_received       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_likes         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_likes_received: int  0 0 0 0 0 0 0 0 0 0 ...
 $ www_likes            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ www_likes_received   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_check_in      : num  0 0 0 0 0 0 0 0 0 0 ...
In [5]:
?ifelse

and then convert it to a factor:

In [9]:
pf$mobile_check_in <- factor(pf$mobile_check_in, labels=c('no','yes'))
In [10]:
summary(pf$mobile_check_in)
   no   yes 
35056 63947 
In [11]:
levels(pf$mobile_check_in)
  1. 'no'
  2. 'yes'
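
The intermediate 0/1 step is optional. As a sketch, a logical vector can be converted to a factor in a single call by listing the levels explicitly (`likes` and `check_in` are made-up names used only for this illustration):

```r
# Hypothetical toy data standing in for pf$mobile_likes
likes <- c(0, 3, 0, 12, 7)

check_in <- factor(likes > 0,
                   levels = c(FALSE, TRUE),   # fix the level order explicitly
                   labels = c('no', 'yes'))   # FALSE -> 'no', TRUE -> 'yes'
levels(check_in)
table(check_in)
```

Fixing the level order explicitly helps avoid attaching a label to the wrong value.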

Scatter plots

In [12]:
qplot(x=age, y=friend_count, data=pf)
In [13]:
qplot(age, friend_count, data=pf)
In [14]:
qplot(friend_count, age, data=pf)

or with ggplot:

In [20]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point()
In [13]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10)
In [14]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/30)

alpha = 1/20 sets the opacity of the points: a single point is drawn at 1/20 opacity, so it takes 20 overlapping points to produce a fully colored spot.

In [16]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/20)+
  xlim(13,90)
Warning message:
"Removed 4906 rows containing missing values (geom_point)."
In [27]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/20)+
  xlim(13,90)+
  geom_jitter(alpha=1/20)
Warning message:
"Removed 4906 rows containing missing values (geom_point)."
Warning message:
"Removed 5188 rows containing missing values (geom_point)."

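geom_jitter() adds a small amount of random noise to the point positions, which reduces the 'striped' look caused by integer ages. Note that in the cell above it is added as a second layer on top of geom_point(), so both the original and the jittered points are drawn (hence the two warnings).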
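
A possible variant, shown as a sketch on simulated data (the data frame `toy` and the `width`/`height` values are assumptions chosen for illustration): geom_jitter() is used as the only point layer, and the noise is applied horizontally only, so the friend counts themselves are not altered.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for pf: integer ages, non-negative counts
toy <- data.frame(age = sample(13:90, 500, replace = TRUE),
                  friend_count = rpois(500, lambda = 200))

p <- ggplot(aes(x = age, y = friend_count), data = toy) +
  geom_jitter(alpha = 1/20, width = 0.4, height = 0)  # noise on x only
p
```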

In [28]:
# same thing, different variable
ggplot(aes(x=age, y=friendships_initiated), data=pf)+
  geom_point()+
  xlim(13,90)
Warning message:
"Removed 4906 rows containing missing values (geom_point)."
In [29]:
ggplot(aes(x=age, y=friendships_initiated), data=pf)+
  geom_point(color="coral")+
  xlim(13,90)
Warning message:
"Removed 4906 rows containing missing values (geom_point)."
In [20]:
ggplot(aes(x=age, y=friendships_initiated), data=pf)+
  geom_point(aes(color=gender))+
  xlim(13,90)+
  xlab('x-axis label')+
  ylab('y-axis label')+
  ggtitle('plot title', subtitle = 'subtitle')+
  labs(caption="and one more note")
Warning message:
"Removed 4906 rows containing missing values (geom_point)."

Data manipulation - the dplyr package

install.packages('dplyr', repos = "http://cran.us.r-project.org", dependencies=TRUE)
In [17]:
library(dplyr)
Warning message:
"package 'dplyr' was built under R version 3.5.3"
Attaching package: 'dplyr'

The following object is masked from 'package:gridExtra':

    combine

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Group the data:

In [15]:
c4 <- read.csv('curs4.csv')
c4
Var1  Var2
A     1
A     2
B     3
B     4
B     5
C     6
C     7
C     8
C     9
D     10
D     11
In [18]:
grupat <- group_by(c4,Var1)
grupat
str(grupat)
Var1  Var2
A     1
A     2
B     3
B     4
B     5
C     6
C     7
C     8
C     9
D     10
D     11
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	11 obs. of  2 variables:
 $ Var1: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 2 3 3 3 3 4 ...
 $ Var2: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame':	4 obs. of  2 variables:
  ..$ Var1 : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
  ..$ .rows:List of 4
  .. ..$ : int  1 2
  .. ..$ : int  3 4 5
  .. ..$ : int  6 7 8 9
  .. ..$ : int  10 11
  ..- attr(*, ".drop")= logi TRUE
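
Instead of reading the full str() output, the grouping can be inspected with a few dplyr helpers. A small sketch, assuming the table is recreated by hand from the output above rather than read from curs4.csv:

```r
library(dplyr)

# Recreate the small table shown above (values copied from the output)
c4 <- data.frame(Var1 = c('A','A','B','B','B','C','C','C','C','D','D'),
                 Var2 = 1:11)

grupat <- group_by(c4, Var1)
group_vars(grupat)    # name(s) of the grouping variable(s)
n_groups(grupat)      # number of groups
group_size(grupat)    # number of rows in each group
```

group_by() only records which rows belong to which group; the values in the table stay the same.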
In [20]:
grupe_varsta <- group_by(pf,age)
str(grupe_varsta)
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	99003 obs. of  16 variables:
 $ userid               : int  2094382 1192601 2083884 1203168 1733186 1524765 1136133 1680361 1365174 1712567 ...
 $ age                  : int  14 14 14 14 14 14 13 13 13 13 ...
 $ dob_day              : int  19 2 16 25 4 1 14 4 1 2 ...
 $ dob_year             : int  1999 1999 1999 1999 1999 1999 2000 2000 2000 2000 ...
 $ dob_month            : int  11 11 11 12 12 12 1 1 1 2 ...
 $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 1 2 2 ...
 $ tenure               : int  266 6 13 93 82 15 12 0 81 171 ...
 $ friend_count         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ friendships_initiated: int  0 0 0 0 0 0 0 0 0 0 ...
 $ likes                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ likes_received       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_likes         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_likes_received: int  0 0 0 0 0 0 0 0 0 0 ...
 $ www_likes            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ www_likes_received   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_check_in      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame':	101 obs. of  2 variables:
  ..$ age  : int  13 14 15 16 17 18 19 20 21 22 ...
  ..$ .rows:List of 101
  .. ..$ : int  7 8 9 10 11 12 13 14 15 16 ...
  .. ..$ : int  1 2 3 4 5 6 32 33 34 35 ...
  .. ..$ : int  24 25 26 27 28 29 30 31 74 75 ...
  .. ..$ : int  68 69 70 71 72 73 118 119 120 121 ...
  .. ..$ : int  113 114 115 116 117 166 167 168 169 170 ...
  .. ..$ : int  154 155 156 157 158 159 160 161 162 163 ...
  .. ..$ : int  195 196 197 198 199 200 201 202 203 204 ...
  .. ..$ : int  259 260 261 262 308 309 310 311 312 313 ...
  .. ..$ : int  303 304 305 306 307 351 352 353 354 355 ...
  .. ..$ : int  348 349 350 404 405 406 407 408 409 410 ...
  .. ..$ : int  395 396 397 398 399 400 401 402 403 452 ...
  .. ..$ : int  446 447 448 449 450 451 549 550 551 552 ...
  .. ..$ : int  540 541 542 543 544 545 546 547 548 615 ...
  .. ..$ : int  612 613 614 732 733 734 735 736 737 738 ...
  .. ..$ : int  723 724 725 726 727 728 729 730 731 800 ...
  .. ..$ : int  788 789 790 791 792 793 794 795 796 797 ...
  .. ..$ : int  844 845 846 847 848 849 850 851 915 916 ...
  .. ..$ : int  907 908 909 910 911 912 913 914 954 955 ...
  .. ..$ : int  948 949 950 951 952 953 999 1000 1001 1002 ...
  .. ..$ : int  993 994 995 996 997 998 1041 1042 1043 1044 ...
  .. ..$ : int  1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 ...
  .. ..$ : int  1081 1082 1143 1144 1145 1146 1147 1148 1149 1150 ...
  .. ..$ : int  1136 1137 1138 1139 1140 1141 1142 1171 1172 1173 ...
  .. ..$ : int  1167 1168 1169 1170 1201 1202 1203 1204 1205 1206 ...
  .. ..$ : int  1199 1200 1230 1231 1232 1233 1234 1235 1236 1237 ...
  .. ..$ : int  1227 1228 1229 1256 1257 1258 1259 1260 1261 1262 ...
  .. ..$ : int  1255 1283 1284 1285 1286 1287 1288 1289 1290 1291 ...
  .. ..$ : int  1279 1280 1281 1282 1302 1303 1304 1305 1306 1307 ...
  .. ..$ : int  1301 1324 1325 1326 1327 1328 1329 1330 1331 1332 ...
  .. ..$ : int  1322 1323 1357 1358 1359 1360 1361 1362 1363 1364 ...
  .. ..$ : int  1354 1355 1356 1379 1380 1381 1382 1383 1384 1385 ...
  .. ..$ : int  1377 1378 1405 1406 1407 1408 1409 1410 1411 1412 ...
  .. ..$ : int  1399 1400 1401 1402 1403 1404 1420 1421 1422 1423 ...
  .. ..$ : int  1418 1419 1438 1439 1440 1441 1442 1443 1444 1445 ...
  .. ..$ : int  1436 1437 1455 1456 1457 1458 1459 1460 1461 1462 ...
  .. ..$ : int  1451 1452 1453 1454 1472 1473 1474 1475 1476 1477 ...
  .. ..$ : int  1488 1489 1490 1491 1492 1493 1756 1777 1838 1914 ...
  .. ..$ : int  1487 1495 1496 1497 1498 1499 1500 1501 1502 1503 ...
  .. ..$ : int  1494 1507 1508 1509 1510 1511 1512 1513 1514 1515 ...
  .. ..$ : int  1505 1506 1528 1529 1530 1531 1532 1533 1534 1535 ...
  .. ..$ : int  1525 1526 1527 1543 1544 1545 1546 1547 1548 1549 ...
  .. ..$ : int  1542 1554 1555 1556 1557 1558 1559 1560 1561 1562 ...
  .. ..$ : int  1552 1553 1569 1570 1571 1572 1573 1574 1575 1576 ...
  .. ..$ : int  1568 1580 1581 1582 3063 3064 3065 3184 3362 3380 ...
  .. ..$ : int  1578 1579 1584 1585 1586 1587 1588 1589 1590 1591 ...
  .. ..$ : int  1583 1593 1594 1595 1596 1597 1598 1599 1759 1791 ...
  .. ..$ : int  1592 1601 1602 1603 1604 1605 1606 1607 1608 1609 ...
  .. ..$ : int  1600 1613 1614 1615 1616 1617 1618 1619 3079 3080 ...
  .. ..$ : int  1611 1612 1620 1621 1622 1623 1624 1625 1626 1627 ...
  .. ..$ : int  1630 1631 1632 1633 1634 1760 3089 3090 3173 3272 ...
  .. ..$ : int  1628 1629 1636 1637 1638 1639 1640 1641 1878 1924 ...
  .. ..$ : int  1635 1643 1827 1906 1910 3097 3098 4438 4439 4499 ...
  .. ..$ : int  1642 1644 1645 1646 1647 1648 1726 3096 3101 3102 ...
  .. ..$ : int  1652 1653 1654 1655 1656 1657 1868 3099 3100 3105 ...
  .. ..$ : int  1649 1650 1651 1659 1660 1661 1761 3108 3109 3397 ...
  .. ..$ : int  1658 1662 1663 1664 1665 1666 3111 3112 3601 4447 ...
  .. ..$ : int  1667 1668 3110 3113 3114 4450 4849 5488 5683 5728 ...
  .. ..$ : int  1669 1670 1671 1943 3115 4449 4452 4453 5530 6456 ...
  .. ..$ : int  1672 1731 1929 3333 4451 4454 4877 6169 6170 6418 ...
  .. ..$ : int  1674 1727 3116 3117 3398 4455 4456 5379 5380 5381 ...
  .. ..$ : int  1673 1677 1678 1679 1680 1681 3273 4457 4458 4815 ...
  .. ..$ : int  1675 1676 1682 1683 1684 3118 3119 3120 4459 4826 ...
  .. ..$ : int  3121 6173 6174 7091 7166 7957 9000 9001 9104 10045 ...
  .. ..$ : int  1685 3122 4461 6901 7959 7960 8518 10562 11466 11780 ...
  .. ..$ : int  1829 3123 4460 4462 7958 7961 8010 8091 8136 8271 ...
  .. ..$ : int  1686 4463 4464 5383 5725 6175 6323 7963 9002 10049 ...
  .. ..$ : int  1687 3195 3463 5384 7112 7962 7964 7965 8277 12939 ...
  .. ..$ : int  1881 3124 5585 9358 10051 10052 11393 13796 13797 13828 ...
  .. ..$ : int  1762 4465 7061 9375 10075 10093 10218 10432 11104 11782 ...
  .. ..$ : int  3125 3126 4466 10053 10195 14161 18771 18772 18833 19032 ...
  .. ..$ : int  4467 4468 8291 9003 9388 12663 13132 13283 13798 13799 ...
  .. ..$ : int  1688 3127 4469 5385 6176 6177 7966 11783 14704 16870 ...
  .. ..$ : int  3175 7210 7258 10094 13034 13341 13800 13977 16722 16871 ...
  .. ..$ : int  7967 14866 16721 18776 19558 19965 25495 25584 26787 27149 ...
  .. ..$ : int  5386 6902 10580 13167 18243 18283 18777 21561 21562 22017 ...
  .. ..$ : int  1689 6201 6333 12664 21612 24072 24434 25695 27478 30035 ...
  .. ..$ : int  5548 10076 13405 14705 18934 22725 25759 33669 35901 36020 ...
  .. ..$ : int  1690 1958 7243 12665 14020 15833 21563 22172 22644 25496 ...
  .. ..$ : int  6903 6968 9004 11192 11784 11785 12782 13801 13923 17053 ...
  .. ..$ : int  1691 8274 9264 12105 16109 19018 25498 26788 31513 31514 ...
  .. ..$ : int  1728 3128 8240 8478 8515 10054 11105 14706 16896 17146 ...
  .. ..$ : int  1692 4822 6178 9378 10331 12666 12667 12994 13873 20222 ...
  .. ..$ : int  1693 4891 5697 6334 16723 21142 24226 34023 35440 35666 ...
  .. ..$ : int  7004 12799 17913 18333 19993 24312 26789 28137 28227 29387 ...
  .. ..$ : int  1694 24443 33159 35667 36075 36671 37889 40498 41012 43070 ...
  .. ..$ : int  3326 5733 9164 10727 14990 17592 19270 22380 22463 25041 ...
  .. ..$ : int  17273 17997 20888 21564 22989 33028 35032 39092 43231 43338 ...
  .. ..$ : int  1695 5387 5388 5608 6179 8177 8512 9399 10613 13235 ...
  .. ..$ : int  3129 6465 8409 12668 13006 13342 16724 19361 21005 24840 ...
  .. ..$ : int  1792 3130 6180 6410 8120 10368 11131 13113 15506 17500 ...
  .. ..$ : int  3131 3132 4508 5389 5390 6283 7968 8572 9337 10472 ...
  .. ..$ : int  14234 15410 16726 27219 27303 29135 32649 34378 34898 35097 ...
  .. ..$ : int  15192 15426 23541 24129 24681 31862 33675 33968 42803 43142 ...
  .. ..$ : int  3775 6181 7114 17161 24682 29906 32794 33420 36066 38575 ...
  .. ..$ : int  1696 3133 5679 6494 11106 12021 14996 16862 19647 26075 ...
  .. ..$ : int  1697 1779 3134 3328 3691 4509 4610 4709 4777 4843 ...
  .. ..$ : int  27127 37719 37723 58455 59366 72373 77413 86623 87354
  .. ..$ : int  30532 41590 44626 46125 56710 67876 78561 78660 85244 88065 ...
  .. ..$ : int  23787 23834 29187 32057 33702 38934 54859 58069 65938 71794 ...
  .. .. [list output truncated]
  ..- attr(*, ".drop")= logi TRUE

and then compute the desired statistics for each group:

In [19]:
c4sumar <- summarise(grupat,
                    media=mean(Var2),
                    numar=n(),
                    suma=sum(Var2),
                    mediana=median(Var2)
                    )
c4sumar
str(c4sumar)
Var1  media  numar  suma  mediana
A       1.5      2     3      1.5
B       4.0      3    12      4.0
C       7.5      4    30      7.5
D      10.5      2    21     10.5
Classes 'tbl_df', 'tbl' and 'data.frame':	4 obs. of  5 variables:
 $ Var1   : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
 $ media  : num  1.5 4 7.5 10.5
 $ numar  : int  2 3 4 2
 $ suma   : int  3 12 30 21
 $ mediana: num  1.5 4 7.5 10.5
In [22]:
pf.fc_age <- summarise(grupe_varsta, 
                       friend_count_media=mean(friend_count),
                       friend_count_mediana=median(friend_count),
                       n=n())
str(pf.fc_age)
Classes 'tbl_df', 'tbl' and 'data.frame':	101 obs. of  4 variables:
 $ age                 : int  13 14 15 16 17 18 19 20 21 22 ...
 $ friend_count_media  : num  165 251 348 352 350 ...
 $ friend_count_mediana: num  74 132 161 172 156 ...
 $ n                   : int  484 1925 2618 3086 3283 5196 4391 3769 3671 3032 ...
In [35]:
head(pf.fc_age)
age  friend_count_media  friend_count_mediana     n
13             164.7500                  74.0   484
14             251.3901                 132.0  1925
15             347.6921                 161.0  2618
16             351.9371                 171.5  3086
17             350.3006                 156.0  3283
18             331.1663                 162.0  5196

If needed, we sort it:

In [37]:
c4sumar <- arrange(c4sumar,numar)
c4sumar
Var1  media  numar  suma
A       1.5      2     3
D      10.5      2    21
B       4.0      3    12
C       7.5      4    30
In [38]:
c4sumar <- arrange(c4sumar,desc(numar))
c4sumar
Var1  media  numar  suma
C       7.5      4    30
B       4.0      3    12
A       1.5      2     3
D      10.5      2    21
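
In the sorted output above, groups A and D are tied at numar = 2, so the sort key alone does not decide their relative order. A sketch of adding a second sort key to break such ties (the table is recreated by hand from the output above; `ordonat` is a hypothetical name):

```r
library(dplyr)

# Summary table copied from the output above
c4sumar <- data.frame(Var1  = c('A', 'B', 'C', 'D'),
                      media = c(1.5, 4, 7.5, 10.5),
                      numar = c(2L, 3L, 4L, 2L),
                      suma  = c(3L, 12L, 30L, 21L))

# Largest groups first; among equal sizes, the larger sum first
ordonat <- arrange(c4sumar, desc(numar), desc(suma))
ordonat
```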
In [23]:
pf.fc_age <- arrange(pf.fc_age,age)
tail(pf.fc_age)
age  friend_count_media  friend_count_mediana     n
108            369.2426                 213.0  1661
109            172.8889                 120.0     9
110            336.7333                 243.0    15
111            240.2222                 166.0    18
112            484.9444                 120.5    18
113            334.6683                 206.0   202

Alternatively, we can do everything in a single statement:

In [40]:
pf.fc_age_nou <- pf %>%
  group_by(age) %>%
  summarise(friend_count_medie=mean(friend_count),
            friend_count_mediana=median(friend_count),
            n=n()) %>%
  arrange(age)

head(pf.fc_age_nou)
age  friend_count_medie  friend_count_mediana     n
13             164.7500                  74.0   484
14             251.3901                 132.0  1925
15             347.6921                 161.0  2618
16             351.9371                 171.5  3086
17             350.3006                 156.0  3283
18             331.1663                 162.0  5196
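
The pipe operator %>% passes the result on its left as the first argument of the call on its right, so x %>% f(y) is read as f(x, y). A small sketch comparing the nested and the piped form on the c4 table (recreated by hand here; `nested` and `piped` are hypothetical names):

```r
library(dplyr)

# Small table recreated from the earlier output
c4 <- data.frame(Var1 = c('A','A','B','B','B','C','C','C','C','D','D'),
                 Var2 = 1:11)

# Nested calls: read from the inside out
nested <- arrange(summarise(group_by(c4, Var1),
                            media = mean(Var2), n = n()),
                  Var1)

# Piped calls: read from top to bottom
piped <- c4 %>%
  group_by(Var1) %>%
  summarise(media = mean(Var2), n = n()) %>%
  arrange(Var1)

all.equal(nested, piped)   # both forms give the same table
```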

Now we can plot the median and the mean friend_count for each age:

In [24]:
ggplot(aes(x=age, y=friend_count_mediana), data=pf.fc_age)+
  geom_point()
In [42]:
ggplot(aes(x=age, y=friend_count_media), data=pf.fc_age)+
  geom_line()
In [26]:
ggplot(aes(x=age, y=friend_count_media), data=pf.fc_age)+
  geom_line(color='coral')

Scatter plots again

we add the mean to the plot

In [45]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10, color='orange')+
  xlim(13,90)+
  geom_line(stat='summary', fun.y=mean)
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing missing values (geom_point)."

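In newer ggplot2 releases (3.3.0 and later, to our knowledge) the fun.y argument is deprecated in favor of fun. A sketch of the same idea under that assumption, on simulated data (`toy` is made up for the illustration):

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for pf: 8 ages, 25 users each
toy <- data.frame(age = rep(13:20, each = 25),
                  friend_count = rpois(200, lambda = 150))

p <- ggplot(aes(x = age, y = friend_count), data = toy) +
  geom_point(alpha = 1/10, color = 'orange') +
  geom_line(stat = 'summary', fun = mean)   # 'fun' instead of 'fun.y'
p
```
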
or the median

In [46]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10, color='orange')+
  xlim(13,90)+
  geom_line(stat='summary', fun.y=median)
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing missing values (geom_point)."

or both

In [48]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10, color='orange')+
  xlim(13,90)+
  geom_line(stat='summary', fun.y=median)+
  geom_line(stat='summary', fun.y=mean, color='red')
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing missing values (geom_point)."

and, on top of them, quantiles

In [49]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10, color='orange')+
  xlim(13,90)+
  geom_line(stat='summary', fun.y=median)+
  geom_line(stat='summary', fun.y=mean, color='red')+
  geom_line(stat='summary', fun.y=quantile, fun.args=list(probs=0.1), linetype=2, color='blue')
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing missing values (geom_point)."
In [50]:
summary(pf$friend_count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    31.0    82.0   196.4   206.0  4923.0 
In [51]:
quantile(pf$friend_count, probs=0.25)
25%: 31
In [52]:
quantile(pf$friend_count, probs=c(0.1,0.25,0.5,0.75,0.90))
10% 25% 50% 75% 90% 
  9  31  82 206 440 
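
The quantile() calls above use all users at once, whereas the dashed lines on the plot are computed separately for each age. A sketch of the per-age computation with dplyr, on simulated data (`toy` and `quantiles_by_age` are made-up names):

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for pf: 3 ages, 100 users each
toy <- data.frame(age = rep(13:15, each = 100),
                  friend_count = rpois(300, lambda = 150))

quantiles_by_age <- toy %>%
  group_by(age) %>%
  summarise(q10     = quantile(friend_count, probs = 0.1),
            mediana = median(friend_count),
            q90     = quantile(friend_count, probs = 0.9))
quantiles_by_age
```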
In [53]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10, color='orange')+
  xlim(13,90)+
  geom_line(stat='summary', fun.y=median)+
  geom_line(stat='summary', fun.y=mean, color='red')+
  geom_line(stat='summary', fun.y=quantile, fun.args=list(probs=0.1), linetype=2, color='blue')+
  geom_line(stat='summary', fun.y=quantile, fun.args=list(probs=0.9), linetype=2, color='blue')
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing non-finite values (stat_summary)."
Warning message:
"Removed 4906 rows containing missing values (geom_point)."
In [54]:
ggplot(aes(x=age, y=friend_count), data=pf)+
  geom_point(alpha=1/10, color='orange')+
  geom_line(stat='summary', fun.y=median)+
  geom_line(stat='summary', fun.y=mean, color='red')+
  geom_line(stat='summary', fun.y=quantile, fun.args=list(probs=0.1), linetype=2, color='blue')+
  geom_line(stat='summary', fun.y=quantile, fun.args=list(probs=0.9), linetype=2, color='blue')+
  coord_cartesian(ylim=c(0,1000), xlim=c(13,90))
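
Unlike xlim()/ylim(), coord_cartesian() only zooms the view: no rows are removed, which is why the last cell produces no warnings. The difference matters for summaries, because limits set with xlim()/ylim() discard out-of-range values before the statistics are computed. A sketch on made-up data (`toy`, `cut_mean`, `zoom_mean` are hypothetical names; the `fun` argument assumes ggplot2 3.3.0 or later):

```r
library(ggplot2)

# Made-up data: a single x value with one very large y
toy <- data.frame(x = c(1, 1, 1), y = c(1, 2, 100))

base <- ggplot(aes(x = x, y = y), data = toy) +
  stat_summary(fun = mean, geom = 'point')   # assumes ggplot2 >= 3.3.0

# ylim() drops y = 100 before the mean is computed (a warning is issued,
# suppressed here to keep the output short)
cut_mean <- suppressWarnings(ggplot_build(base + ylim(0, 10))$data[[1]]$y)

# coord_cartesian() keeps every row and only restricts the visible range
zoom_mean <- ggplot_build(base + coord_cartesian(ylim = c(0, 10)))$data[[1]]$y

c(cut_mean, zoom_mean)
```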

Exercises

Using the pseudo_facebook table:

  1. Draw a scatter plot of the variables friend_count and friendships_initiated.
  2. Try 3 values of alpha in the plot above.
  3. Color the points in a color of your choice.
  4. Color the points by gender.
  5. Draw a scatter plot of the variables dob_year (horizontal) and mobile_likes (vertical).
  6. Try 3 values of alpha in the plot above.
  7. Color the points in a color of your choice.
  8. Color the points by gender.
  9. Group the data in pf by year of birth.
  10. For each year of birth, save the mean, the median and the number of observations of the variable mobile_likes in a new table.
  11. Plot the mean and the median of mobile_likes for each year (as a line).
  12. Plot the number of mobile_likes for each year.
  13. Arrange the last two plots together.
  14. Draw a scatter plot of dob_year and likes_received.
  15. Add to the plot above the mean of likes_received for each dob_year (as a line).
  16. Add to the plot above the median and the 10% and 90% quantiles.
  17. Go back to all the plots above and add suitable axis labels, titles, subtitles and captions.
In [55]:
str(pf)
'data.frame':	99003 obs. of  16 variables:
 $ userid               : int  2094382 1192601 2083884 1203168 1733186 1524765 1136133 1680361 1365174 1712567 ...
 $ age                  : int  14 14 14 14 14 14 13 13 13 13 ...
 $ dob_day              : int  19 2 16 25 4 1 14 4 1 2 ...
 $ dob_year             : int  1999 1999 1999 1999 1999 1999 2000 2000 2000 2000 ...
 $ dob_month            : int  11 11 11 12 12 12 1 1 1 2 ...
 $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 1 2 2 ...
 $ tenure               : int  266 6 13 93 82 15 12 0 81 171 ...
 $ friend_count         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ friendships_initiated: int  0 0 0 0 0 0 0 0 0 0 ...
 $ likes                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ likes_received       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_likes         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_likes_received: int  0 0 0 0 0 0 0 0 0 0 ...
 $ www_likes            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ www_likes_received   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobile_check_in      : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 1 1 1 ...