Seek Natural Language Processing

Using Natural Language Processing to predict salaries from scraped job listings.



This project was an assignment I completed as part of my General Assembly Data Science intensive. It required me to construct my own dataset of data-related job postings scraped from a job listings website. I chose to work with the Seek website. Once the dataset was collected, I set out to determine the following:

• The industry factors that are most important in predicting salaries for these data-related jobs.
• The factors that distinguish job categories and titles from each other.

Once I had gathered my data through web scraping, I used Natural Language Processing to give my dataset new, text-focussed features, which I then used to train a model to predict salary.
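To illustrate what "text-focussed features" can look like, here is a minimal bag-of-words sketch using only the Python standard library. It is a simplified stand-in for a proper vectorizer, not the code used in the project, and the function names and sample listings are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, min_count=1):
    """Return a sorted list of tokens that appear in at least min_count documents."""
    doc_freq = Counter()
    for doc in documents:
        # Use a set so each token is counted once per document.
        doc_freq.update(set(tokenize(doc)))
    return sorted(tok for tok, n in doc_freq.items() if n >= min_count)


def vectorize(text, vocabulary):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


# Hypothetical listing snippets, not taken from the scraped dataset.
listings = [
    "Senior data scientist with Python and machine learning experience",
    "Junior data analyst with SQL and Excel experience",
]

# Keep only tokens shared by both listings, then build one count row per listing.
shared_vocabulary = build_vocabulary(listings, min_count=2)
features = [vectorize(doc, shared_vocabulary) for doc in listings]
```

Each row of `features` could then sit alongside the other columns of the dataset as input to a regression model.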

To tackle the second part of the problem, I again used Natural Language Processing. This time I chose to focus on the verbs used in the job listings, which mostly corresponded to skills. I used this data to train another model to predict job categories based on these verbs.
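As a rough sketch of the verb-based idea, the snippet below counts skill verbs per job category. It makes a big assumption: a small hard-coded verb set (`SKILL_VERBS`, hypothetical) stands in for a real part-of-speech tagger, and the sample listings are invented. It shows the shape of the data, not the model used in the project.

```python
import re
from collections import Counter

# Hypothetical verb lexicon standing in for a real part-of-speech tagger.
SKILL_VERBS = {"analyse", "build", "design", "develop", "manage", "present"}


def verb_counts(text, verbs=SKILL_VERBS):
    """Count occurrences of the known verbs in a single listing."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tok for tok in tokens if tok in verbs)


def verb_profiles(listings_by_category):
    """Aggregate verb counts across all listings in each category."""
    profiles = {}
    for category, texts in listings_by_category.items():
        total = Counter()
        for text in texts:
            total.update(verb_counts(text))
        profiles[category] = total
    return profiles


# Invented example listings, grouped by an invented category label.
sample = {
    "Data Analyst": [
        "Analyse sales data and present findings",
        "Present reports and analyse trends",
    ],
    "Data Engineer": [
        "Build and design data pipelines",
        "Develop and build ETL jobs",
    ],
}

profiles = verb_profiles(sample)
```

Inflected forms such as "analysing" would not match this exact-word lookup; stemming the tokens first is one way to handle that.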

Technology Used

The project was completed with Python in a Python notebook. Here is a selection of the libraries used:

• BeautifulSoup / Requests for web scraping.
• re (regular expressions) for data cleaning.
• scikit-learn for building classification models, regression models, NLP and metrics.
• PyCaret for initial machine learning model comparisons.
• TextBlob / SnowballStemmer for NLP.
• Pandas for data frame management.
• NumPy / SciPy for crunching numbers.
• Matplotlib / Seaborn for plotting graphs.
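
As an example of the kind of data cleaning the re module is suited to, here is a hedged sketch that pulls a numeric salary out of free text. The pattern and the `parse_salary` helper are assumptions about how salaries might appear in a listing, not the cleaning code from the project, and hourly or daily rates are not handled.

```python
import re

# Matches dollar amounts such as "$80,000", "$120000" or "$90k".
SALARY_PATTERN = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+|\d+)(k)?", re.IGNORECASE)


def parse_salary(text):
    """Return the mean of all dollar amounts found in the text, or None if there are none."""
    amounts = []
    for digits, thousands in SALARY_PATTERN.findall(text):
        value = float(digits.replace(",", ""))
        if thousands:
            # A trailing "k" means the figure is in thousands.
            value *= 1000
        amounts.append(value)
    if not amounts:
        return None
    return sum(amounts) / len(amounts)


midpoint = parse_salary("$80,000 - $100,000 plus super")
```

Taking the midpoint of a quoted range is just one possible convention; the lower or upper bound could equally be used as the target.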


Below is an example of the output of the first stage of web scraping.



I’ve collected a few resources for further reading:

TowardsDataScience article by Raheel Shaikh on the topic of Natural Language Processing.
Article by Rahul Nayak about web scraping with Selenium and BeautifulSoup.

© Copyright 2020 Tom Gilmore. All Rights Reserved.