Home Page
Figure 1: Home Page [PNG]

Assignment 3 Project Report

Visualization 1 - Email Content Word Cloud

Figure 2: Visualization 1 - Email Content Word Cloud [PNG]

We built clickable word clouds to visualize the text content of the emails from assignments 1 and 2. The intent is to compare the original 4,000-email corpus with the machine-generated email text and to understand how attackers phrase their messages. Each cloud shows the 150 most frequent words, sized in proportion to how often they appear in the emails, with punctuation and stop words removed. Many words appear often in both corpora, such as ‘bank’, ‘money’, ‘fund’, and ‘business’. It is unsurprising that phishing emails frequently mention money and funds, and this is closely connected to their attack types.
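The counting step behind such a word cloud can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used for the figure; the stop-word list, the `top_words` helper, and the sample emails are assumptions made up for the example.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "my", "your", "i", "you"}


def top_words(emails, limit=150):
    """Return the `limit` most frequent words across `emails` as (word, count) pairs."""
    # Translation table that deletes every ASCII punctuation character.
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in emails:
        for word in text.lower().translate(strip_punct).split():
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(limit)


# Made-up sample emails, for illustration only.
sample = [
    "Transfer the money to my bank.",
    "The bank holds the fund, and the money is safe.",
]
print(top_words(sample, limit=3))
```

The resulting (word, count) pairs are what a word cloud library would typically take as input to size each word.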

Visualization 2 - Attack type popularity and connectivity

Figure 3: Visualization 2 - Attack type popularity and connectivity [PNG]

At the start, we were asked to look for correlations between attack types and other features of the fraudulent emails. However, we believed that the relationships among the four attack types themselves might be overlooked, so we used a Venn diagram to show the popularity and connectivity of each attack type. “RE”, “SE”, “CP”, and “MA” stand for “Reconnaissance”, “Social Engineering”, “Credential Phishing”, and “Malware”, respectively. According to the diagram, social engineering is very likely to appear together with credential phishing, while malware is more likely to be an independent attack type.
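The region sizes of such a Venn diagram come down to counting how many emails carry each combination of labels. The sketch below shows one way to do that, assuming each email has already been tagged with a set of attack-type abbreviations; the helper names and the sample labels are hypothetical and not taken from the project data.

```python
from collections import Counter

# Abbreviations as used in the diagram.
ATTACK_TYPES = ("RE", "SE", "CP", "MA")


def venn_region_counts(labelled_emails):
    """Count emails per exact combination of attack types (one Venn region each)."""
    regions = Counter()
    for labels in labelled_emails:
        # Ignore any label that is not one of the four known attack types.
        region = frozenset(labels) & frozenset(ATTACK_TYPES)
        if region:
            regions[region] += 1
    return regions


def overlap(regions, first, second):
    """Number of emails labelled with both `first` and `second`."""
    return sum(n for region, n in regions.items() if {first, second} <= region)


# Made-up labels, for illustration only.
sample = [{"SE", "CP"}, {"SE", "CP"}, {"MA"}, {"RE", "SE"}, {"SE", "CP", "RE"}]
regions = venn_region_counts(sample)
print(overlap(regions, "SE", "CP"))
print(overlap(regions, "MA", "SE"))
```

A large pairwise overlap corresponds to two circles that intersect heavily, while a type with no overlaps appears as a separate circle.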

Visualization 3 - Multi-Line Chart of Date & Stock

Figure 4: Visualization 3 - Multi-Line Chart of Date & Stock [PNG]

We wanted to find out whether the number of fraudulent emails per day from 2003 to 2007 is related to the daily opening and closing stock prices. We chose a line chart because it clearly shows how data vary over time. There is only a small difference between the opening and closing prices, but the prices clearly keep rising over time. The line at the bottom (which can hardly be seen) shows that the number of fraudulent emails does not vary much. Thus, the stock prices seem to have little influence on the number of fraudulent emails.
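Because the email-count line is hard to read at the scale of the stock prices, a numeric check could complement the chart. The sketch below computes a Pearson correlation coefficient between two aligned daily series. It is an illustration only: the `pearson` helper and the sample values are assumptions, and no such figure was computed for this report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: prices rise steadily while email counts fluctuate.
closing_prices = [10.0, 11.0, 12.0, 13.0, 14.0]
email_counts = [3, 1, 4, 1, 3]
print(round(pearson(closing_prices, email_counts), 3))
```

A coefficient near zero would be consistent with the visual impression of little relationship, while values near 1 or -1 would suggest a strong linear relationship.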

Visualization 4 - Calendar View of Fraudulent Emails

Figure 5: Visualization 4 - Calendar View of Fraudulent Emails [PNG]

We wanted to find out how the stock market changed over the period covered by these fraudulent emails. In addition to the multi-line chart, which shows the overall stock price changes from 2003 to 2007, we also wanted to show how each day’s opening and closing prices relate to the previous day’s. A calendar view plot is a good fit for this. The greener a block is, the larger the daily change. Some blocks are white because we do not have prices for those dates. When there are multiple stock prices for the same day, we take their average. The largest changes occurred around the winter/spring of 2003 and the summer of 2007.
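The data preparation described above (averaging duplicate prices per day, then comparing each day with the previous one) can be sketched as follows. The report does not state the exact change formula, so this example assumes the absolute difference from the previous day that has data; the `daily_changes` helper and the sample quotes are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_changes(quotes):
    """Map each date to its absolute price change versus the previous available date.

    `quotes` is an iterable of (date, price) pairs; several prices on the
    same date are averaged first. Dates without any price are simply absent
    from the result (these would be the white blocks in the calendar).
    """
    by_day = defaultdict(list)
    for day, price in quotes:
        by_day[day].append(price)
    averaged = {day: mean(prices) for day, prices in by_day.items()}

    changes = {}
    previous = None
    for day in sorted(averaged):
        if previous is not None:
            # Assumption: "change" means absolute difference from the previous day with data.
            changes[day] = abs(averaged[day] - averaged[previous])
        previous = day
    return changes


# Made-up quotes, for illustration only (two prices on 2003-01-02).
sample = [
    (date(2003, 1, 2), 10.0),
    (date(2003, 1, 2), 12.0),
    (date(2003, 1, 3), 14.0),
    (date(2003, 1, 6), 13.0),
]
print(daily_changes(sample))
```

The first date has no predecessor and therefore no change value, which is one reason a calendar view can contain uncolored blocks even where prices exist.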

Visualization 5 - Bubble Map of Fraudulent Emails Attacker Locations

Figure 6: Visualization 5 - Bubble Map of Fraudulent Emails Attacker Locations [PNG]

This is a bubble map of the fraudulent email attackers’ locations or the countries mentioned in the email contents. The size of each bubble indicates the number of spam emails sent from a given location. The map shows that the majority of the attackers are located in North America, with smaller clusters in Africa and South America. Only a few attackers mentioned Asian countries in their emails or sent emails from Asian countries. We would not have been able to easily conclude where the majority of the spam emails came from just by extracting IP addresses or mentioned locations from the emails. A graph makes this more visually straightforward and convincing.
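Turning a list of extracted locations into bubble sizes can be sketched as below. The report does not say how the bubbles were scaled, so this example assumes area-proportional scaling (radius grows with the square root of the count), a common choice for bubble maps; the `bubble_radii` helper, the `max_radius` parameter, and the sample locations are hypothetical.

```python
import math
from collections import Counter


def bubble_radii(locations, max_radius=40.0):
    """Map each location to a bubble radius, with the largest count at `max_radius`."""
    # Normalize spelling lightly and drop empty entries before counting.
    counts = Counter(loc.strip().title() for loc in locations if loc and loc.strip())
    if not counts:
        return {}
    largest = max(counts.values())
    # Assumption: bubble *area* is proportional to the count, hence the square root.
    return {place: max_radius * math.sqrt(n / largest) for place, n in counts.items()}


# Made-up locations, for illustration only.
sample = [
    "Nigeria", "nigeria ", "United States", "United States",
    "United States", "United States", "", "Ghana",
]
for place, radius in sorted(bubble_radii(sample).items()):
    print(f"{place}: {radius:.1f}")
```

Scaling the radius directly with the count would instead exaggerate large locations, since perceived size follows area rather than radius.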

Contributions

Yuchen Zhang
  • Convert tsv to json using python3 compatible etllib
  • Pull request: https://github.com/Anthonyive/etllib.git
  • Set up Flask for team members’ visualizations
  • Visualization 4 - Calendar View of Fraudulent Emails
  • Run Image Space using face generator output from assignment 2
  • Help on Task 5 data conversions
Xiuwen Zhang
  • Visualization 3 - Multi-Line Chart of Date & Stock
  • Run and test GeoParser
Nat Zheng
  • Cleaned and reorganized the attacker locations from assignment 1 and stored them in a valid input csv file for visualization
  • Visualization 5 - Bubble map of fraudulent attackers’ location
Zixi Jiang
  • Made Visualization 1 - Clickable Word Cloud
  • Ran GeoParser and got the location graphs
Peizhen Li
  • Task 3 - Ingest Task 1 and Task 2 data into ElasticSearch
  • Task 5 - Use Elasticdump to generate new json files with Index for Task 3
  • Upload the folder and zip
Xiaoyu Wang
  • Visualization 2 - Attack type popularity and connectivity
  • Modify scripts for GeoParser, prepare datasets of geographic data from assignments 1 and 2, and test GeoParser