Instagram Data Analysis
Community Detection and Content Recommendation
Kimia Mavon
Francesca Morini
Karan Rajesh Motwani
Moreno Raimondo Vendra
Le Sirenuse Positano is a luxury 5 star hotel located in Positano, on the Amalfi Coast. The Emporio Sirenuse store stands in front of the hotel. Here, tourists and guests can find a collection of dresses and swimwear aligned with the Mediterranean style of the hotel. The collection, curated by the hotel owner Mrs. Carla Sersale, is also sold online on a dedicated website and on different e-commerce channels. Both the hotel and the store have an official Instagram account, though the hotel has a much wider influence. The goal of this project is to improve the online presence of the clothing brand by leveraging social media platforms, while protecting its heritage.
In this project, we address how a luxury brand may utilize a growing digital economy using big data, while maintaining its heritage. Our proposed tool, Frank, uses a human-in-the-loop model to identify relevant communities and suggest a post’s text, hashtags, and images. This system is housed in an interactive visualization interface designed for intuitive analysis.
We were given access to a dataset composed of Instagram data from the accounts of Emporio Sirenuse and its followers, along with the accounts of 9 other competitors and their followers. We want to:
This is a difficult task. Social media data is complex and composed of heterogeneous components (text, images, geo-tags, hashtags, mentions, etc.), and the signals it carries tend to be abstract and hard to pin down. Moreover, this heterogeneity translates into completely different feature detection methods, which are hard to integrate and validate in an unsupervised learning setting like ours.
We intend to cluster users into specific communities based on their Instagram activity. To do that, we first cluster all their Instagram posts into general clusters, which can be thought of as standard units, and then we cluster the users based on their representation in these standard units. This allows us to find communities of users who share interests and post about the same topics in the same proportions.
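The second step of this two-level scheme can be illustrated with a minimal sketch. It assumes every post has already been assigned to one of a fixed number of post clusters; the user names, labels and the helper `user_profile` are hypothetical and only illustrate how a user could be represented by the share of their posts in each cluster.

```python
from collections import Counter

def user_profile(post_clusters, n_clusters):
    """Return the share of a user's posts that falls in each post cluster."""
    counts = Counter(post_clusters)
    total = len(post_clusters)
    return [counts.get(c, 0) / total for c in range(n_clusters)]

# Hypothetical post-cluster labels for three users (5 post clusters).
posts_by_user = {
    "alice": [0, 0, 1, 0],
    "bob": [3, 3, 4],
    "carol": [0, 1, 0, 0],
}

# Each user becomes a vector of proportions over the post clusters.
profiles = {user: user_profile(labels, 5) for user, labels in posts_by_user.items()}
```

Users with similar proportion vectors (here "alice" and "carol") would end up close to each other when these vectors are fed to a second clustering step.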
This clustering system hinges on finding a latent space that accurately separates the posts based on their features and characteristics. Therefore, extracting the right features from images and text to create a large vector representation of an Instagram post is the most significant and complex task.
The initial implementation of image clustering involved using basic image features such as:
A large vector representation of the image was thus constructed, and K-Means was performed with K=5.
The initial implementation of text clustering involved:
A vector representation of the text is thus a count vector of the entire observed vocabulary.
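As an illustration of such a count vector, here is a minimal sketch using only the Python standard library. The tokenization rule and the example captions are assumptions made for this example, not the preprocessing used in the project.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def count_vectors(captions):
    """Build the observed vocabulary and one count vector per caption."""
    vocabulary = sorted({token for caption in captions for token in tokenize(caption)})
    vectors = []
    for caption in captions:
        counts = Counter(tokenize(caption))
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical captions.
captions = ["Sun and sea in Positano", "Sea, sea, and more sea"]
vocabulary, vectors = count_vectors(captions)
```

Every caption is mapped to a vector with one entry per vocabulary word, so the vector length grows with the size of the observed vocabulary.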
These features were easy to extract and very interpretable. However, they did not include any information about the objects or context in the image, or about style in the text; they did not account for correlation between text and image; and the features themselves were too simple to handle the stochasticity of social media data.
To capture the contents of the image, we tried Mask R-CNN, the object detection model designed by the Facebook AI Research team. This model builds on a regular Faster R-CNN implementation and adds a Fully Convolutional Network (FCN) branch which performs pixel-by-pixel segmentation of each Region of Interest (RoI). The model detects 70 classes of objects and thus the image is represented by a vector of dimension 1x70.
To add the information of image context, we trained a Scene Detection model on the Flickr dataset which consists of images and corresponding captions. This model uses a CNN for Image Feature Extraction combined with an LSTM for Text Embedding. The model is trained to learn the relationship between the two vectors. The model outputs a string caption for an image input. For clustering, this caption was encoded using Bag of Words into a count vector.
Person | Ski Blades | Hill |
---|---|---|
1 | 1 | 1 |
“A person skis off a ramp to perform a stunt”
Caption : “She can fly !!!”
This approach managed to find the following clusters: Cosmetics, Modelling, Lifestyle.
Given the lack of ground truth at our disposal, we curated a dataset to help quantitatively evaluate the results of clustering. We chose 4 hashtags and scraped 10,000 posts from Instagram for each of them. The hashtags were:
Note : Although the topics are well defined, only ~70% of the dataset scraped was relevant. The remaining posts were either sarcastic, non-contextual or abstract.
To obtain a Latent Representation of the Image, we trained a Neural Network to reproduce the original input using a symmetric architecture i.e. an Autoencoder.
An autoencoder neural network is an unsupervised Machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. An autoencoder is trained to attempt to copy its input to its output. Internally, it has a hidden layer that describes a code used to represent the input.
The parameters of this implementation were:
The images produced were blurry and did not represent the input well. This can be attributed to the degree of variability in the dataset and overall complexity of the network.
We can clearly observe that the clusters are not well separated and the cluster purity when matched to the ground truth is not high.
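Cluster purity against a ground truth can be computed as the fraction of posts that carry the majority label of their cluster. The following is a minimal sketch of that standard definition; the cluster assignments and topic labels are made up for illustration.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_labels, true_labels):
    """Fraction of items whose true label is the majority label of their cluster."""
    members = defaultdict(list)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster].append(truth)
    # For each cluster, count how many items carry its most common true label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)

# Hypothetical cluster assignments and hashtag-derived topics.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["rock climbing", "rock climbing", "nightlife",
         "nightlife", "nightlife", "rock climbing"]
purity = cluster_purity(clusters, truth)
```

A purity of 1.0 means every cluster contains posts from a single topic; values close to the share of the largest topic indicate that the clusters carry little topical information.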
To correctly classify the input images into topic classes by identifying distinct features of an image, we trained a neural network with convolutional layers i.e. Convolutional Neural Network.
A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a subsampling step) and then followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.
The parameters of this implementation were:
The classification accuracy obtained was 70%. The class which was hardest to predict was ‘Rock Climbing’.
We can clearly observe that the clusters are better separated than the Auto Encoder implementation. The cluster purity when matched to the ground truth is also relatively better.
To correctly classify the input text into topic classes by identifying distinct features such as style and grammar, we trained a neural network with a non-preset memory window i.e. a Long Short Term Memory based Recurrent Neural Network.
Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). A RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. Each of the three gates can be thought of as a "conventional" artificial neuron, as in a multi-layer (or feedforward) neural network: that is, they compute an activation (using an activation function) of a weighted sum.
The parameters of this implementation were:
The classification accuracy obtained was 87%. The class which was hardest to predict was ‘Nightlife’.
We can clearly observe that the clusters are well separated. The cluster purity when matched to the ground truth is very impressive.
After observing the results, we chose a CNN as the model for image embedding and an LSTM for text embedding.
At this point we need to find the optimal cluster count using the Elbow and Average Silhouette methods; we can clearly observe that the best cluster count is between 4 and 5.
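For reference, the average silhouette of a given clustering can be sketched in a few lines. This is a generic illustration with Euclidean distance and toy 2D points, assuming at least two clusters; in practice the score would be computed on the post embeddings for each candidate cluster count and the values compared.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (Euclidean distance).

    Assumes at least two clusters are present in `labels`.
    """
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two well separated groups of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

A higher average silhouette indicates tighter, better separated clusters, so the labelling that follows the two groups scores higher than the one that cuts across them.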
Feature extraction is the most important step for efficiently clustering data. Naive feature extraction is simple, but it is not accurate enough to find the more distinctive features of online communities. More specifically, when dealing with image data, Convolutional Neural Networks are very powerful feature extractors, whereas for text LSTMs are more robust than standard RNNs because they mitigate the vanishing gradient problem. Combining CNN and LSTM features provides an accurate representation of an Instagram post, and this allowed us to create meaningful user clusters.
From the clustering, we have an understanding of the underlying communities of users.
The intended user, a social media manager for example, will upload his or her pictures (from a photoshoot) to the tool. His/Her goal is to create a post that best appeals to a chosen community. In other words, the social media manager wants to optimize his/her image, caption, and hashtag for a given community.
A post can be optimized by two metrics:
The former requires an “engagement metric,” or a score that reflects the likelihood it will be received positively by the population; we address these two issues below.
In business, social media engagement measures the public shares, likes and comments for an online business' social media efforts. Engagement is a common metric for evaluating social media performance. While maximizing engagement is best, companies like Facebook, Twitter, and e-commerce retailers differ in how they calculate standardized engagement scores. Engagement scores typically include likes, comments, number of followers, and other metrics like views, mentions, and more.
We wanted to predict engagement from the quality of a photo:
Here the dependent variable, y, is the engagement score and the independent variables are the image features. Thus, this became a supervised machine learning problem.
We calculated an engagement score for every post within each community. The team debated at length how to calculate an engagement score. We wanted to normalize by the number of followers so that popular people (“influencers”) did not overwhelm the metric, while also crediting comments, which can often be outweighed by the number of likes. We experimented with z-scores and linear combinations of these variables.
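To make the idea of a follower-normalized score concrete, here is a minimal sketch. The formula, the `comment_weight` parameter and the example numbers are assumptions chosen for illustration; they are not the score the team finally adopted.

```python
def engagement_score(likes, comments, followers, comment_weight=2.0):
    """Hypothetical score: weighted interactions per follower.

    comment_weight > 1 credits comments more than likes (assumed value).
    """
    if followers <= 0:
        return 0.0
    return (likes + comment_weight * comments) / followers

# An influencer with many likes versus a regular user with few followers.
influencer = engagement_score(likes=5000, comments=100, followers=100_000)
regular = engagement_score(likes=90, comments=5, followers=1_000)
```

With this normalization the regular user scores higher than the influencer, even though the influencer collects far more likes in absolute terms.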
Proposed engagement scores were created using an iterative process between adjusting the composition and linear combinations of possible calculations with various supervised machine learning algorithms:
We began by using the engagement score as the target of linear regressions, but ultimately converted the score into a binary label and chose a logistic regression, which outperformed all other methods on the test set. The 0/1 label records whether an engagement score falls above or below the median. The final model was a logistic regression SVM, as this supervised machine learning algorithm had the highest test score, with an R-squared of .64.
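The median split can be sketched as follows. The scores are made up, and assigning a score exactly equal to the median to class 0 is an assumption, since the text does not say how ties were handled.

```python
from statistics import median

def binarize_by_median(scores):
    """Label each score 1 if it is above the median, else 0 (ties go to 0)."""
    cutoff = median(scores)
    return [1 if score > cutoff else 0 for score in scores]

# Hypothetical engagement scores for five posts.
scores = [0.02, 0.05, 0.10, 0.40, 0.07]
labels = binarize_by_median(scores)
```

Because the label only depends on which side of the median a post falls, an extreme score such as 0.40 has no more influence on the target than any other above-median score.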
One reason linear combinations with z-scores did not work is that outliers, while useful for the compilation of communities, severely affected engagement scores that used z-scores. As a line was attempting to project into a multidimensional space, all of the features were clustered into only a few points (a score of -8, or 40, for example). This caused very poor performance on the test set. We will return to the engagement metric later.
Now, looking back at the tool’s pipeline, a social media manager has uploaded pictures and would like suggested content. Each uploaded picture is passed through the clustering algorithm and receives, for each community, a probability score that the picture belongs to it. For a given picture, these probability scores sum to one. Each community also has its own distribution: the compilation of images that have been passed through the clustering algorithm.
Thus, we have probability distributions whose similarity we can measure using the Kullback-Leibler divergence (often shortened to KL divergence).
The KL divergence measures the difference between two probability distributions p(x) and q(x):
$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
Therefore, a lower KL divergence score suggests the two distributions are more similar.
Therefore, we can quantitatively measure the similarity or difference between an uploaded picture and the community. We iterate over the uploaded pictures and compute a KL divergence score for each by comparing the picture’s distribution with the target community’s distribution. The KL scores are then ranked, from lowest to highest, to suggest which pictures best match the target community’s images.
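The ranking step can be sketched as below, using the standard discrete definition D_KL(p || q) = sum over x of p(x) log(p(x) / q(x)). The file names and distributions are hypothetical, and taking the picture as p and the community as q is an assumption, since the text does not fix the direction of the divergence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q); eps guards against zeros in q."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p(x) = 0 contribute nothing
    )

# Hypothetical target community distribution over three post clusters.
community = [0.7, 0.2, 0.1]

# Hypothetical cluster-membership probabilities of the uploaded pictures.
pictures = {
    "beach.jpg": [0.6, 0.3, 0.1],
    "logo.jpg": [0.1, 0.1, 0.8],
    "dress.jpg": [0.7, 0.2, 0.1],
}

# Lowest divergence first: the best matching pictures come first.
ranking = sorted(pictures, key=lambda name: kl_divergence(pictures[name], community))
```

Note that the KL divergence is not symmetric, so swapping the roles of picture and community may change the ranking.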
Now, we would like to use our trained, and pickled, logistic SVM model to predict whether an incoming image will receive a 0 or a 1 (that is, whether it is predicted to be a post with ‘engagement’). Once this label is predicted, we use a linear combination of the label (0/1) and the KL divergence score. Because a low KL divergence score is better, we penalized pictures with predicted poor engagement (0) by increasing their final score: we added 1 to the KL score of pictures that were predicted to be less popular. This was a conservative decision, as many of the images uploaded by the social media manager are expected to be high quality, professional photos with strong lighting and composition. This was qualitatively tested on a sample of previous Le Sirenuse professional photographs.
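A minimal sketch of this combination is shown below. The function name, file names and scores are hypothetical; only the rule of adding 1 to the KL score when the predicted engagement label is 0 comes from the description above.

```python
def final_score(kl_score, predicted_engagement):
    """Combine the KL score with the 0/1 engagement prediction (lower is better)."""
    penalty = 0.0 if predicted_engagement == 1 else 1.0
    return kl_score + penalty

# Hypothetical (picture, KL score, predicted engagement label) triples.
candidates = [
    ("a.jpg", 0.10, 0),
    ("b.jpg", 0.30, 1),
    ("c.jpg", 0.90, 1),
]

ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2]))
order = [name for name, _, _ in ranked]
```

In this example "a.jpg" matches the community best by KL divergence, but the penalty for its predicted low engagement moves it behind the other two pictures.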
Now, we return to community characterization. To best target a community, one should know how to speak like them and talk about similar themes. Once an engagement metric was established (a standardized ‘likes’ score), an engagement score was calculated for each post per community. This score then signals which accounts in the community have the highest engagement. Our goal is to collect text (topics and hashtags) used by the entire community’s posts, but weight the words by the posts’ engagement score.
The post corpus was split into captions and hashtags. Each of these was then transformed into a numerical form using Word2vec, a two-layer neural net that processes text and outputs a set of vectors. Next, the words used in each post were counted and the counts multiplied by the engagement score of that post. These weighted counts were then summed per word across all posts to identify the most “popular” words used within a community, weighted by engagement. Therefore, words from more popular posts rank higher than words from posts with little engagement. We now have a compilation of popular words and hashtags used by each community.
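The weighting step can be sketched with plain word counts. This example leaves out the Word2vec transformation and uses hypothetical posts and engagement scores; it only shows how per-post counts are multiplied by the post's engagement and summed per word.

```python
from collections import Counter

def weighted_word_ranking(posts):
    """Rank words by their counts weighted by each post's engagement score.

    `posts` is a list of (words, engagement_score) pairs.
    """
    totals = Counter()
    for words, engagement in posts:
        for word, count in Counter(words).items():
            totals[word] += count * engagement
    return totals.most_common()

# Hypothetical tokenized posts from one community with their engagement scores.
posts = [
    (["summer", "positano", "summer"], 0.5),
    (["positano", "linen"], 2.0),
]
ranking = weighted_word_ranking(posts)
```

Here "linen" appears only once but outranks "summer", which appears twice, because it comes from a post with much higher engagement.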
We’re performing clustering and recommendation based on Instagram content (captions and pictures). Therefore, content must be relevant for the client.
Brands and retailers produce on average more than double the content of consumers. Moreover, their posts usually target specific topics. This may introduce biases in our training datasets of Instagram accounts, hiding more relevant insights about consumers' habits.
We were given a list of 685 hand labeled users, of which:
More specifically we were given Instagram data of the users accounts:
And of their posts:
How to distinguish brands and retailers accounts from consumers accounts? Which are the features, and characteristics that discriminate the two classes?
We're looking for different habits or strategies for posting content, such as differences in the number of posts per day, the distribution of posts over the time of day, or the usage of tags and mentions. The underlying assumption of this analysis is that, regardless of the actual content of the posts, the online behaviour of a user can be a good indicator of whether or not we are in the presence of a consumer. Further indications about a user's nature can be found in the content the user uses to present herself on Instagram, such as the profile picture and the biography.
To find the right features to distinguish users of the two classes, we started from the simplest and rawest data we had: account metadata. Some preliminary analyses were performed representing each user with all of the account metadata available, such as:
The results of these preliminary analyses were quite disappointing, so we started researching which features could be relevant and which could not.
Based on the data we had, we could compute two different post counts: the total post count, which is the one indicated in a user's profile, or the count of posts for which we had metadata, which are the posts that the user uploaded during 2017. The second choice seemed much more relevant as it introduces a kind of normalization, telling us how many posts each user had uploaded in a given time window.
The results were already interesting: brands and retailers produce double the number of posts compared with consumers, so this feature might be relevant for our task.
Keeping in mind our assumption we looked for a first very simple posting strategy: we tried to compare the times at which the two classes of users uploaded their posts. Brands and Retailers may try to concentrate their posts in specific times of the day in order to maximize the exposure of their content to consumers. These were the resulting distributions:
Number of posts per hour from consumers
Number of posts per hour from brands and retailers
In this case we can't observe a meaningful difference in the distribution of posts over time, so it is not going to serve us as a feature to distinguish the two classes.
As stated before, we are not looking directly at content, but rather trying to define a common behaviour. In this case we look for regularity in the usage of tags and mentions. Brands and retailers tend to use tags and mentions strategically in order to target a given audience, while normal consumers usually don’t. A very simple approach is the following: build a vector counting the number of times each tag was used by a user, then compute the mean and variance of that vector. Using the same approach with mentions and adding these 4 features to the representation of a user brought slight improvements to the classifier.
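These usage features can be sketched with the standard library. The tag list is hypothetical, and the use of the population variance is an assumption, since the text does not specify which variance estimator was used.

```python
from collections import Counter
from statistics import mean, pvariance

def usage_stats(items):
    """Mean and population variance of how often each distinct item is used."""
    counts = list(Counter(items).values())
    if not counts:  # a user who never used tags or mentions
        return 0.0, 0.0
    return mean(counts), pvariance(counts)

# Hypothetical tags collected from all posts of one user.
tags = ["#positano", "#positano", "#positano", "#sea", "#positano", "#sea"]
tag_mean, tag_var = usage_stats(tags)
```

The same function applied to a user's mentions yields the other two of the four features.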
In the case of profile pictures there were two completely different behaviours that could easily be recognized just by looking at the users' profiles: brands and retailers have a logo or text in their profile picture, whereas consumers have a picture of themselves, usually their face. By using the OpenCV library, more specifically Haar Cascades, we were able to perform face detection on the users' profile pictures and add the result as a feature. Moreover, we used Tesseract OCR to detect the presence of text in the picture. These two features were highly relevant and improved the classifier's performance.
As with profile pictures, bios are used by users to describe themselves, so it is to be expected that the vocabulary used by the two classes of users differs significantly. Moreover, Instagram bios are usually not complete sentences, but rather a sequence of words that describe the user. For this reason we used a Count Vectorizer, which takes the bio text as input and returns a vector counting the occurrences of each word. As expected, the results differed quite a lot!
Occurrences per word in consumers' bios
Occurrences per word in brands' and retailers' bios
Adding the presence of words from these vocabularies in the bios of the users showed improvements in the results of the classifier.
The following results were obtained using 3-fold cross-validation and grid search for parameter optimization.
Support Vector Classifier
Logistic Regression
Multi Layer Perceptron Classifier
Random Forest Classifier
Since the purpose of this classifier is to detect brand and retailer profiles in order to remove their posts from the clustering, we aim for the highest possible recall, as this removes the highest number of unwanted posts. False negatives in this case have a much higher cost than false positives (brands and retailers post on average 335 times per year, versus 160 for consumers). Thus, in this setting the Support Vector Classifier is the best performing classifier for the task.
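To make the role of recall explicit, here is a minimal sketch of precision and recall for the brand/retailer class. The labels are made up for illustration and are not results from the project.

```python
def precision_recall(true_labels, predicted_labels, positive="brand"):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # brands caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # consumers removed by mistake
    fn = sum(t == positive and p != positive for t, p in pairs)  # brands missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and predictions for six accounts.
truth = ["brand", "brand", "brand", "consumer", "consumer", "consumer"]
predicted = ["brand", "brand", "brand", "brand", "consumer", "consumer"]
precision, recall = precision_recall(truth, predicted)
```

In this example every brand is caught (recall 1.0) at the price of one consumer being discarded (precision 0.75), which is the trade-off preferred here because a missed brand contributes more posts to the clustering than a wrongly removed consumer takes away.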
This is the first step: Frank is designed to help you choose the perfect content for your Instagram account. Thanks to its algorithms, Frank will be able to give you suggestions starting from a set of pictures. Here's the first thing you should do.
Here's where the magic happens. As you can see, the screen is now split into two sections. On the left you have the interactive visualization of the Instagram user communities (we call it "The Squid"). On the right side of the screen you always have thumbnails for each picture you uploaded. By clicking on a single picture you can access some useful details.
By using the dropdown menu you can change the ranking of the pictures by setting a different cluster as default. NOTICE: This does not apply if you interact with the visualization first; in that case the algorithm will recalculate the pictures' scores.
This page will visualize different visual models for the selected picture. In this phase you will be able to go deep in your picture analysis and decision making process.
This section is a future development and is not yet available in Frank, but we wanted to let you know what we could work on in the future.
Frank is based on web technologies such as HTML5 and JavaScript. All of its visualizations are dynamic, based on the data it receives as input. The visualizations and interactions were developed with D3.js; the JavaScript frontend interacts with a Django web server, which is written in Python and allowed simple integration with the clustering and recommendation pipelines.
In this project, we address how a luxury brand may utilize a growing digital economy using big data, while maintaining its heritage. Our proposed tool, Frank, uses a human-in-the-loop model to identify relevant communities and suggest a post’s text, hashtags, and images while housed in an interactive visualization interface designed for intuitive analysis. Working with a client provided practical experience setting expectations, pitching an idea, learning in the wild, and overcoming team and managerial obstacles while maintaining momentum on an international team.
Interesting avenues for potential development include: