Lab #5: Basics of Social Network Analysis

In this lab we will be practicing the fundamentals of network analysis.

See the “Instructions” section of the Introduction to Lab Assignments page for more information about the labs. That page also gives descriptions for the datasets we will be using.

Required reading:

Theoretical:

Chapter 3: Integrated Network Analysis Tutorial (see Slack)
Social Network Analysis Wikipedia
Network Centrality Measures on Wikipedia
Using Metadata to find Paul Revere

Practical:

Optional resources:

Official documentation for igraph package
Official documentation for ggraph package
Introduction to ggraph layouts
Intro to Network Analysis with R, by Jesse Sadler
Network analysis with R and igraph: NetSci X Tutorial (Parts 2-7), by Katya Ognyanova
R4DS Chapters 17-21: Programming
Bipartite/Two-Mode Networks in igraph by Phil Murphy & Brendan Knapp: specifically, the sections “Loading and configuring two-mode data” and “Another way to produce an overlap count in igraph bipartite_projection()”.
Bipartite Graph Wikipedia

Lab Dataset

For most of this lab we will use the common_subcomm data frame in addition to our other datasets. Each row gives the number of subcommittees that a pair of congress members both sit on. The “from” and “to” columns hold the bioguide_id of each member of the pair, and common_subcomm is the number of subcommittees they share.

common_subcomm %>% head()

##      from      to common_subcomm
## 1 S001209 L000578              1
## 2 S001209 V000132              1
## 3 S001209 D000616              2
## 4 S001209 P000597              1
## 5 S001209 A000372              1
## 6 S001209 K000382              1

common_subcomm %>% summary()

##      from                to            common_subcomm  
##  Length:13851       Length:13851       Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 1.000  
##                                        Mean   : 1.576  
##                                        3rd Qu.: 2.000  
##                                        Max.   :12.000
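To make the shape of this data concrete, here is a minimal sketch using a toy data frame with the same three columns. The IDs "A", "B", and "C" and the counts are made up for illustration; only the column names come from the lab data. It shows that graph_from_data_frame treats the first two columns as edge endpoints and keeps any remaining columns as edge attributes.

```r
library(igraph)

# Hypothetical toy edge list with the same three columns as common_subcomm.
toy_edges <- data.frame(
  from = c("A", "A", "B"),
  to = c("B", "C", "C"),
  common_subcomm = c(1, 3, 2)
)

# The first two columns become the edge endpoints; every other column
# is stored as an edge attribute under its column name.
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE)

gorder(toy_g)             # number of vertices
gsize(toy_g)              # number of edges
V(toy_g)$name             # vertex names taken from the from/to columns
E(toy_g)$common_subcomm   # edge attribute carried over from the third column
```

This is why expressions such as E(h)$common_subcomm are available on graphs built from the lab data frame.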

Example Questions

ex1. Create an igraph graph from the data frame common_subcomm, and print the number of vertices and edges it has.

g <- common_subcomm %>% graph_from_data_frame()
V(g) %>% length()

## [1] 517

E(g) %>% length()

## [1] 13851

ex2. Create an igraph graph where edges only exist between congress members who are on at least 3 subcommittees together. Output the number of edges that result.

# two ways to do this: filter before creating graph, or filter edges using igraph

# solution 1: filter before creating graph
g <- common_subcomm %>% 
  filter(common_subcomm >= 3) %>% 
  graph_from_data_frame()

# solution 2: filter edges using igraph
h <- common_subcomm %>% graph_from_data_frame()
edges_to_remove <- E(h)[E(h)$common_subcomm < 3]
h <- h %>% delete_edges(edges_to_remove)

E(g) %>% length()

## [1] 1611

E(h) %>% length()

## [1] 1611

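# optional: attach full names as a vertex attribute (match on vertex names, and assign the result back to g)
#node_attr <- congress %>% filter(bioguide_id %in% V(g)$name)
#g <- g %>% set_vertex_attr('full_name', index=node_attr$bioguide_id, value=node_attr$full_name)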

ex3. Make a simple visualization of the previous network using ggraph.

g <- common_subcomm %>% 
  filter(common_subcomm >= 3) %>% 
  graph_from_data_frame()

g %>% ggraph() +
  geom_edge_link(aes(alpha=common_subcomm)) +
  geom_node_point()

## Using `sugiyama` as default layout
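The message above appears because no layout was requested, so ggraph picked one. As a minimal sketch, the layout can be named explicitly through the layout argument. The toy ring graph and its edge values below are made up for illustration, and "fr" (Fruchterman-Reingold) is just one of the available igraph layouts, not necessarily the best choice for the lab data.

```r
library(igraph)
library(ggraph)

# Hypothetical toy graph: a ring of 6 vertices with made-up edge values.
toy_g <- make_ring(6)
E(toy_g)$common_subcomm <- c(3, 4, 3, 5, 3, 4)

# Naming the layout avoids the "Using ... as default layout" message.
p <- ggraph(toy_g, layout = "fr") +
  geom_edge_link(aes(alpha = common_subcomm)) +
  geom_node_point()

p
```

Force-directed layouts such as "fr" start from random positions, so the picture can differ between runs unless a seed is set with set.seed().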

ex4. Make a network that includes only senators, not representatives.

# two fine ways to do this - filter the dataframe before making the graph, or 
# remove nodes after creating the graph

##################### solution 1 - filter dataframe first #####################
senators <- congress %>% filter(type=='sen') %>% pull(bioguide_id)
g <- common_subcomm %>% 
  filter((to %in% senators) & (from %in% senators)) %>% 
  graph_from_data_frame()

V(g) %>% length()

## [1] 99

E(g) %>% length()

## [1] 2578

##################### solution 2 - filter in graph #####################

# get ids of node set
node_ids <- c(common_subcomm$from, common_subcomm$to) %>% unique()

# get only rows of congress that are in the network - this will be used as node data
node_data <- congress %>% filter(bioguide_id %in% node_ids)

########### graph creation method 1: use the vertices argument in graph_from_data_frame to add all the vertex info
# this works because the first column of node_data is the bioguide_id.
h1 <- common_subcomm %>% 
  graph_from_data_frame(vertices=node_data)

########### graph creation method 2: use set_vertex_attr to add the specific attribute to the graph after creating it
h2 <- common_subcomm %>% 
  graph_from_data_frame()
h2 <- h2 %>% set_vertex_attr('type', index=node_data$bioguide_id, value=node_data$type)

# h1 and h2 should be the same
print(paste(gsize(h1), gsize(h2), gorder(h1), gorder(h2)))

## [1] "13851 13851 517 517"

# remove the vertices that are not senators
vertices_to_remove <- V(h1)[V(h1)$type!='sen']
h1 <- h1 %>% delete_vertices(vertices_to_remove)

# same as in solution 1
V(h1) %>% length()

## [1] 99

E(h1) %>% length()

## [1] 2578
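A third option, sketched here on a toy graph, is to keep the vertices you want rather than delete the ones you don't, using igraph's induced_subgraph. The IDs and types below are made up for illustration. The sketch also shows one difference between the two solutions that is worth checking on real data: filtering inside the graph keeps vertices that end up with no edges, while filtering the data frame first never creates them.

```r
library(igraph)

# Hypothetical toy data: D is a senator whose only tie is to a representative.
toy_edges <- data.frame(from = c("A", "A", "B", "C"),
                        to   = c("B", "C", "C", "D"))
toy_nodes <- data.frame(id   = c("A", "B", "C", "D"),
                        type = c("sen", "sen", "rep", "sen"))

toy_g <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_nodes)

# Keep only the senators; edges touching removed vertices are dropped.
sen_g <- induced_subgraph(toy_g, V(toy_g)[V(toy_g)$type == "sen"])

gorder(sen_g)    # D is still a vertex, even though it has no edges left
gsize(sen_g)
degree(sen_g)
```

In the example above both solutions give 99 vertices, so this difference does not show up there, but it can appear once edges are filtered by a threshold.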

Questions

1. Describe the following concepts using the suggested readings or by searching on the web:

Basic Elements of Networks
  nodes (also called "vertices"): 
  edges (also called "ties" or "links"): 

Network Representations
  edge list: 
  adjacency matrix: 

Types of networks
  directed vs undirected network: 
  weighted vs unweighted network:

2. Using resources in the suggested readings and on the web, describe three different centrality measures that can be used to summarize the positions of specific nodes/vertices within a network: betweenness centrality, strength centrality, and eigenvector centrality. Give an example use case for each of these measures.

HINT: see required reading about centrality measures on Wikipedia to get some ideas.
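As a starting point for exploring these measures, here is a minimal sketch that computes all three on a toy star graph, where one hub is connected to four leaves. The graph and its weights are made up for illustration. The sketch assumes igraph's convention that an edge attribute named weight is used by default and that passing weights = NA ignores it.

```r
library(igraph)

# Hypothetical toy graph: vertex 1 is the hub of a 5-vertex star.
toy_g <- make_star(5, mode = "undirected")
E(toy_g)$weight <- c(1, 2, 3, 4)   # made-up weights on the four edges

# Betweenness, ignoring the weight attribute.
btw <- betweenness(toy_g, weights = NA)

# Strength: sum of the weights of the edges attached to each vertex.
str <- strength(toy_g)

# Eigenvector centrality (unweighted here); scores are in the $vector element.
eig <- eigen_centrality(toy_g, weights = NA)$vector

btw
str
eig
```

Note that the lab graphs store their counts in an edge attribute called common_subcomm, not weight, so these functions should treat them as unweighted unless you pass the counts through the weights argument yourself.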

1. 
2. 
3.

3. Describe the behavior of the following functions, including their outputs and the behavior of each argument/parameter.

graph_from_data_frame: 
graph_from_edgelist: 

E(): 
V(): 

strength: 
betweenness:

4. Examine the common_subcomm data frame. Is it an adjacency matrix or an edge list representation of a network? Is the network that can be constructed from it weighted or unweighted? Is it directed or undirected? Use your substantive understanding of the data to answer these questions.

Adjacency Matrix or Edge List?: 
weighted/unweighted?: 
directed/undirected?:

5. Create a visualization showing a network of senators (NOT representatives) where edges exist only between senators who are on at least three subcommittees together. Set node color based on the gender of the senators. Do you see any patterns visually?

HINT: see the example questions for some ideas of how to accomplish this.

# Your answer here.

6. Find the average betweenness centrality (ignoring edge weights) of senators ONLY, by gender, after removing edges with fewer than three common subcommittees.

HINT: see as_data_frame for creating dataframes from node or edge attributes.
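The following minimal sketch shows how as_data_frame turns vertex or edge attributes into ordinary data frames. The toy graph, the group column, and the deg attribute are made up for illustration. The igraph:: prefix is a precaution: other attached packages may export a function with the same name, and the prefix makes sure the igraph version is the one called.

```r
library(igraph)

# Hypothetical toy graph: a path A - B - C - D with a made-up group attribute.
toy_edges <- data.frame(from = c("A", "B", "C"), to = c("B", "C", "D"))
toy_nodes <- data.frame(name = c("A", "B", "C", "D"),
                        group = c("x", "x", "y", "y"))
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_nodes)

# Store a computed measure as a vertex attribute (degree, as an example).
V(toy_g)$deg <- degree(toy_g)

# One row per vertex, one column per vertex attribute.
node_df <- igraph::as_data_frame(toy_g, what = "vertices")

# One row per edge, with from/to columns plus any edge attributes.
edge_df <- igraph::as_data_frame(toy_g, what = "edges")

node_df
edge_df
```

Once the vertex attributes are in a data frame, the usual dplyr verbs such as group_by and summarize can be applied to them.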

7. Compute the correlation between birthyear and betweenness centrality (ignoring edge weights) of senators, after removing edges with fewer than three common subcommittees. What can you conclude from the sign (positive or negative) of this result?

# Your answer here

8. Compare the average shortest path length for the senator and representative networks after keeping only edges where congress members are on at least three common subcommittees. Why are they different?

# your answer here

Why are they different?

9. In last week’s lab exercise, you were asked to identify several possible datasets you could use for your final project. Now write two specific data science research questions and describe variables in that dataset that could allow you to answer the questions.

HINT: What is a good research question? A good data science research question specifies a relationship between two or more variables that you can measure. The question “why did the chicken cross the road?” is not a good research question because it does not explicitly describe the relationship between any variables. The question “do chickens cross the road more frequently than raccoons?” is good because it specifies a relationship between the type of animal (chickens and raccoons) and the frequency with which the animal crosses the road. Your question should be answerable given the specific variables available in your dataset.

# your answer here

Data Science and Society (Sociology 367)