Data Scientist Malaysia Job Analytics

According to Oxford Languages, a Data Scientist is “a person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making.” In 2016, Glassdoor asserted that Data Scientist was the best job in America. In Glassdoor’s ranking for 2020, Data Scientist came in at number three, surpassed only by Java Developer and Front End Engineer.

According to Udacity, the top eight skills for Data Scientists are:

#1. Math and Statistics.
#2. Analytics and Modeling.
#3. Machine Learning Methods.
#4. Programming.
#5. Data Visualization.
#6. Intellectual Curiosity.
#7. Communication.
#8. Business Acumen.

As a data science practitioner in Malaysia, I am interested in finding out which skills and trends are most sought after in the industry right now. I understand that most companies look for an applicant who fits their vacancy and already has the skill set in the technologies they need, since this saves the company the time and money of providing training.

In this article, I will try to answer a few important questions that every data science jobseeker has in mind:

1. What are the top skills companies are looking for?
2. What are the top tools companies are looking for?
3. What are the companies that are actively offering jobs in this field?
4. What are the locations that have more openings?

To investigate the above research questions, we will analyze recently advertised Data Scientist jobs (as of 21 September 2021), extracted from a job portal using web scraping. The analysis can be grouped into several main tasks:

(1) Web scraping of Job Portal (Indeed)

I gathered all the relevant job information (job title, company name, location, job summary and descriptions) from Indeed, the main job portal in Malaysia, which almost every jobseeker and recruiter uses these days. I used BeautifulSoup (a Python library) for the web scraping, since this traditional approach works well with the portal’s website.

A sample job listing from Indeed Malaysia

The main Python libraries used are BeautifulSoup, Requests (a simple HTTP library for Python) and NLTK (a text-processing library covering tokenization, parsing, classification, stemming, tagging and semantic reasoning).

# installing required libraries
!pip install bs4
!pip install requests
!pip install nltk==3.6.2
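
The later sections also rely on a few more packages (wordcloud, plotly and textblob for the analysis, plus openpyxl for the Excel export). If you are running in a fresh environment, it may be worth installing them up front as well:

# additional libraries used later in this analysis
!pip install wordcloud
!pip install plotly
!pip install textblob
!pip install openpyxl  # engine used by pandas DataFrame.to_excel for .xlsx files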

We import the main libraries, including Pandas (an open-source data analysis and manipulation tool) to handle the data frames used for data processing.

import nltk
import pandas as pd

from datetime import datetime
import requests
from bs4 import BeautifulSoup
import sys, time

print('The Python version is {}.'.format(sys.version_info))
print('The nltk version is {}.'.format(nltk.__version__))

Scrape the Indeed Malaysia job portal using the “data scientist” keyword.

# prepare the link
def get_url(position, location):
  template = 'https://malaysia.indeed.com/jobs?q={}&l={}'
  url = template.format(position, location)
  return url
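
A side note: the search keyword contains a space, which strictly speaking should be URL-encoded. Requests generally copes with this, but if the portal is fussy about it, a variant of the helper (my own sketch, not part of the original run) can quote the parameters explicitly:

from urllib.parse import quote_plus

def get_url_encoded(position, location):
    # quote_plus turns 'data scientist' into 'data+scientist'
    template = 'https://malaysia.indeed.com/jobs?q={}&l={}'
    return template.format(quote_plus(position), quote_plus(location))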

# get job records from html tags using the BeautifulSoup library
def get_record(card):
  job_title = card.find('span', {'title': True}).text.strip()
  company_name = card.find('span', {'class': 'companyName'}).text.strip()
  company_location = card.find('div', {'class': 'companyLocation'}).text.strip()
  job_summary = card.find('div', {'class': 'job-snippet'}).text.strip()
  job_link = 'https://malaysia.indeed.com' + card.get('href')
  try:  # try to get the salary, if listed
    job_salary = card.find('span', {'class': 'salary-snippet'}).text.strip()
  except AttributeError:
    job_salary = ''
  try:  # try to get the job descriptions, if present
    job_desc = ''
    # creating request object
    req = requests.get(job_link)
    # creating soup object ('html.parser' for consistency with the search-page parsing)
    data = BeautifulSoup(req.text, 'html.parser')
    # finding all li tags in ul and saving the text within them (job descriptions)
    data1 = data.find('ul')
    for li in data1.find_all("li"):
        job_desc = job_desc + li.text
  except AttributeError:
    job_desc = ''
  # create the job record to be saved later
  record = (job_title, company_name, company_location, job_summary, job_salary, job_link, job_desc)

  return record

def jobsearch(position,location):
  records = []
  print("Job search:", position)
  print("Job location:", location)
  url = get_url(position, location)

  while True:
      response = requests.get(url)
      soup = BeautifulSoup(response.text,'html.parser')
      cards = soup.find_all('a' , {'data-hide-spinner':"true"})
      for card in cards:
        record = get_record(card)
        #print(record)
        records.append(record)
        #print(len(records))
      try:
        url = 'https://malaysia.indeed.com'+ soup.find('a', {'aria-label':'Next'}).get('href')
      except AttributeError:
        break
  
  # create a pandas dataframe to save the job search results
  # (built directly from the records list; the row-by-row DataFrame.append was removed in pandas 2.0)
  indeed_data = pd.DataFrame(records, columns=['job_title','company_name','company_location','job_summary','job_salary','job_link','job_desc'])
  
  # export pandas dataframe into excel
  indeed_data.to_excel('jobsData.xlsx')
  print('DataFrame is written to Excel File successfully.')

  return indeed_data

# search "data scientist" jobs in Malaysia
df_data = jobsearch('data scientist', 'Malaysia')
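
One caveat: the pagination loop above fires requests back to back, once per results page and once per job link. To be polite to the portal (and reduce the chance of being blocked), a short delay between requests helps; the time module is already imported. A hypothetical throttled wrapper that could replace the bare requests.get() calls:

import time
import requests

def fetch(url, delay=1.0):
    # pause before each request; the one-second delay is an assumption, not a tested rate limit
    time.sleep(delay)
    return requests.get(url)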

We obtained 191 job results from the portal website.

# print a sample of data frame
df_data.head()

Sample of Data Scientist data frame in Pandas

We can observe from the sample data above that some job titles may not be relevant to a data scientist role (row index 3 shows ‘Administrative Assistant’). We can use the Pandas filter method to keep only titles that include the string ‘Data Scientist’. From the total of 191 job results, we are left with 54 data scientist jobs for further analysis.

# Only titles that specifically contain 'Data Scientist' will be considered
df_data_scientist = df_data[df_data['job_title'].str.contains('Data Scientist')]
df_data_scientist.shape
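
Note that str.contains is case-sensitive by default and raises an error on missing values. If the scraped titles vary in capitalization, a more forgiving variant (my own suggestion, not used for the counts above) is:

# case-insensitive, NaN-safe variant of the same filter
df_ds_any_case = df_data[df_data['job_title'].str.contains('data scientist', case=False, na=False)]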

(2) Text Pre-processing of Raw Data

Next, we need to perform some text pre-processing to clean the data in the ‘job_desc’ column. The raw strings contain unusual text and symbols that need to be removed before the NLTK library can work with them. The task includes the main pre-processing techniques: normalization, tokenization and annotation.

# Clean the data for the job descriptions column
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
requirements = df_data_scientist['job_desc']
requirements_token = requirements.str.replace(', ', ' ').str.replace('; ', ' ').str.replace('\n', ' ').str.replace('/', ' ').str.lower().str.split()
requirements_clean = requirements_token.apply(lambda x: [w for w in x if w not in stop_words])
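
To make the pipeline concrete, here is what these steps do to one made-up requirement string (assuming the standard NLTK English stopword list):

sample = 'Build ML models; deploy to production, using Python/R'
tokens = sample.replace(', ', ' ').replace('; ', ' ').replace('/', ' ').lower().split()
print([w for w in tokens if w not in stop_words])
# ['build', 'ml', 'models', 'deploy', 'production', 'using', 'python', 'r']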

(3) Data Modeling using NLP Tools

For a quick analysis, we can create a word cloud to highlight popular words and phrases based on frequency and relevance. It provides quick and simple visual insights that can lead to more in-depth analysis.

# Find the most frequent words
unique_words = {}
for element in requirements_clean:
    for word in element:
        if word not in unique_words:
            unique_words[word] = 1
        else:
            unique_words[word] += 1
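
As an aside, collections.Counter (which the next cell imports anyway) can do the same counting in a single expression; the explicit loop above is kept for readability:

from collections import Counter

# equivalent one-liner for the word frequency count above
unique_words = Counter(word for element in requirements_clean for word in element)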

# Generate WordCloud
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def generate_wordcloud(text):
    wordcloud = WordCloud(background_color ='white',
                          stopwords = None,
                          collocations=False,
                          max_words = 30).generate_from_frequencies(text)
    plt.figure(figsize = (12, 12), facecolor = None) 
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

generate_wordcloud(unique_words)

Data Scientist Job Descriptions Word Cloud

Although we can extract some useful skills from the word cloud above, most of the words do not represent a skill on their own. For example, the most frequently mentioned words are models and data, which are not meaningful by themselves. What models and what data are the companies asking for? To answer this question, we need to extract combinations of two words.

from nltk.tokenize import word_tokenize
from nltk.util import bigrams
from nltk.corpus import stopwords
nltk.download('punkt')

# Clean the data
stop_words = stopwords.words('english')
requirements = df_data_scientist['job_desc'].fillna('').apply(str) # convert to string
requirements_token = requirements.str.replace(', ', ' ').str.replace('; ', ' ').str.replace('\n', ' ').str.replace('/', ' ').str.lower().apply(word_tokenize)
requirements_clean = requirements_token.apply(lambda x: [w for w in x if w not in stop_words])
requirements_clean_bigrams = requirements_clean.apply(nltk.bigrams).apply(list)

# Find the most frequent two-word combinations
# (Counter replaces the original nested counting loops; the result is the same)
bigram_counts = Counter()
for element in requirements_clean_bigrams:
    bigram_counts.update(element)

# print the 30 most frequent bigrams
for word, count in bigram_counts.most_common(30):
    print("%s: %d" % (word, count))

From here, we can extract some meaningful two-word combinations (bigrams), such as data science, machine learning and data mining.

('data', 'science'): 21
('machine', 'learning'): 14
('data', 'mining'): 11
('models', 'algorithms'): 9
('data', 'visualization'): 7
('use', 'data'): 6
('science', 'department'): 6
('marketing', 'data'): 6
('deep', 'learning'): 6
('data', 'sets'): 6
('techniques', '.'): 5
('python', 'r'): 5
('outcomes', '.'): 5
('data', 'sources'): 5
('business', 'problems'): 5
('various', 'data'): 4
('tools', 'monitor'): 4
('statistical', 'analysis'): 4
('processes', 'tools'): 4
('problems', 'using'): 4
('predictive', 'modelling'): 4
('performance', 'data'): 4
('multiple', 'data'): 4
('model', 'performance'): 4
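
As a side note, raw bigram frequency tends to favour generally common words. NLTK also ships a collocation finder that scores bigrams by association measures instead; a minimal sketch over the same cleaned tokens:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# flatten all cleaned job-description tokens into one sequence
all_tokens = [w for element in requirements_clean for w in element]

finder = BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
print(finder.nbest(BigramAssocMeasures.likelihood_ratio, 20))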

Now, let us dig deeper into the job description text by extracting the common data scientist tools and skills required by these companies.

# import TextBlob to perform more investigation into Job Descriptions
import numpy as np
from textblob import TextBlob

!python -m textblob.download_corpora # download textblob basic tools

# make a copy of original data and add new column 'job_description_analysis' for data manipulation.
df_datascience = df_data.copy()
df_datascience.dropna(inplace=True) # dropna() drops rows with null values
df_datascience['job_description_analysis'] = df_data['job_desc'].str.lower()
df_datascience['job_description_analysis'] = df_datascience['job_description_analysis'].apply(lambda row: TextBlob(row).words)

# remove irrelevant words from being considered
stop = stopwords.words('english')
additional_stop_words = ['business', 'new', 'team', 'company', 'equal', 'opportunity', 'gender', 'orientation', 'religion', 'sex', 'national', 'origin', 'sexual', 'identity', 'marital', 'status', 'race', 'color', 'employee', 'work', 'experience', 'product', 'role', 'organization', 'job', 'description', 'age', 'disability', 'use']
stop.extend(additional_stop_words)
df_datascience['job_description_analysis'] = df_datascience['job_description_analysis'].apply(lambda row: [wd for wd in row if wd not in stop])

# Set up keywords for search

# Educational Levels
degree_dict = {'bs': 1, 'bachelor': 1, 'bachelors': 1, 'undergraduate': 1, 
               'master': 2, 'graduate': 2, 'mba': 2, 
               'phd': 3, 'ba': 1, 'ma': 2, 'ms': 2, 'msc': 2,
               'postdoctoral': 4, 'postdoc': 4, 'doctorate': 3}

degree_phrases = {'advanced degree': 2, 'ms degree': 2, '4 year degree': 1, '4-year degree': 1, 'b.s.': 1, 'b.s': 1, 'm.s.': 2, 'm.s': 2, 'ph.d': 3, 'ph.d.': 3, 'master\'s': 2, 'bachelor\'s': 1, 'post-doc': 4, 'post doc': 4}

degree_keywords_set = set(degree_dict.keys())

# Tools Used
# note: the single letter 'c' is deliberately left out, since it would match too many stray tokens
tool_keywords = ['airflow', 'angular', 'aws', 'azure', 'bigquery', 'bigtable', 'birt', 'bokeh', 'caffe', 'cassandra', 'clojure', 'cloud', 'cntk', 'cognos', 'd3', 'dash', 'dask', 'django', 'docker', 'ec2', 'einstein', 'elasticsearch', 'excel', 'firebase', 'flask', 'gcp', 'git',
'github', 'go', 'h20', 'h2o', 'hadoop', 'hana', 'hbase', 'hive', 'java', 'javascript', 'jquery', 'julia', 'jupyter', 'kafka', 'keras', 'kubeflow', 'kubernetes', 'lambda', 'lightgbm', 'linux', 'looker', 'mahout', 'mapreduce', 'mariadb', 'matlab', 'matplotlib', 'mlflow', 'mllib', 'mongodb', 'mxnet',
'mysql', 'net', 'nltk', 'node', 'nodejs', 'nosql', 'numpy', 'oracle', 'pandas', 'pentaho', 'perl', 'php', 'pig', 'plotly', r'pl/sql', 'postgres', 'postgresql', 'postgressql', 'powerbi', 'powerpoint', 'pyspark', 'python', 'pytorch', 'qlik', 'r', 'rapidminer', 'react', 'redshift', 'rpython',
'rstudio', 'ruby', 'rust', 's3', 'sagemaker', 'salesforce', 'sas', 'scala', 'scikit', 'scikitlearn', 'scipy', 'seaborn', 'shiny', 'sklearn', 'snowflake', 'solr', 'spacy', 'spark', 'splunk', 'spss', 'sql', 'ssrs', 'swift', 'tableau', 'tensor', 'tensorflow', 'tf', 'theano',
'torch', 't-sql', 'unix', 'unixlinux', 'vba', 'xgboost', 'dplyr', 'ggplot2', 'esquisse', 'bioconductor', 'lubridate', 'knitr', 'mlr', 'quanteda', 'dt', 'rcrawler', 'caret', 'rmarkdown', 'leaflet', 'janitor', 'ggvis', 'rcharts', 'rbokeh', 'broom', 'stringr', 'magrittr', 'slidify', 'rvest',
'rmysql', 'rsqlite', 'prophet', 'glmnet', 'text2vec', 'snowballc', 'quantmod', 'rstan', 'swirl', 'datasciencer']

tool_phrases = ['amazon web services', 'c#', 'c++', 'google cloud', 'microsoft access', 'sql server']

tool_keywords_set = set(tool_keywords).union(set(tool_phrases))

# Skills Required
skill_keywords = set(['ai', 'analysis', 'anomaly', 'bandit', 'bayes', 'bayesian', 'bi', 'bioinformatics',
 'biostatistics', 'blockchain', 'boosting', 'causality', 'chatbot', 'chi', 'classification', 
 'cleaning', 'cleansing','cluster', 'clustering', 'cnn', 'convolutional', 'correlation', 'dashboard',
 'econometrics','etl', 'exploratory', 'forecasting', 'gbm', 'geospatial', 'gis', 'glm', 'graph', 
 'hyperparameter','hypothesis', 'kaggle', 'knn', 'knowledge', 'kpi', 'logistic', 'lstm', 'management','mathematics',
 'ml', 'nlp', 'nn',  'normalization', 'ocr','optimization', 'optimizing', 'outlier', 'pca', 'prediction', 'probability', 'recommendations', 'recommender', 'recurrent', 'regression', 'reinforcement', 'research', 'rnn', 'salesforce', 'salesforcecom', 'sdlc', 'segmentation', 'sequencing', 'statistics', 'story', 'storytelling', 'supervised', 'svd', 'svm', 'tuning', 'unsupervised', 'visualization'])

skill_phrases = ['a/b testing', 'ab testing', 'artificial intelligence', 'big data', 'business intelligence', 'chat bot', 'computer science', 'computer vision', 'data mining', 'data science', 'decision tree', 
 'deep learning', 'dimensionality reduction', 'domain knowledge', 'experimental design', 'graph theory', 'image analysis', 'k-means clustering', 'machine learning', 'machine vision',  'natural language processing', 'nearest neighbors', 'neural network', 'power bi', 'random forest', 'software development life cycle', 'support vector machine', 'text analytics', 'time series']

skill_keywords_set = set(skill_keywords).union(set(skill_phrases))
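
A quick note on matching strategy before counting: single-token keywords such as 'r' or 'go' must be matched as exact tokens against the tokenized description, while multi-word phrases are matched by substring against the raw text; otherwise 'r' would match inside nearly every word. A small illustration:

# exact token match: 'r' only hits the standalone token
tokens = {'experience', 'python', 'r', 'spark'}
print(tool_keywords_set.intersection(tokens))  # {'python', 'r', 'spark'}

# substring match is only safe for multi-word (or punctuated) phrases
desc = 'familiar with google cloud and c++'
print([p for p in tool_phrases if p in desc])  # ['c++', 'google cloud']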

Let us count the occurrences of tools, skills and education levels in the posted jobs.

# count number of tools, skills and education
def analyze_posts(df):
    tool_list = []
    skill_list = []
    degree_list = []

    num_postings = len(df.index)
    for i in range(num_postings):
        job_desc = df.iloc[i]['job_desc'].lower()
        job_desc_set = df.iloc[i]['job_description_analysis']

        # check if the keywords are in the job description. Look for exact match by token.
        tool_words = tool_keywords_set.intersection(job_desc_set)
        skill_words = skill_keywords_set.intersection(job_desc_set)
        degree_words = degree_keywords_set.intersection(job_desc_set)

        # check if longer keywords (more than one word) are in the job description. Match by substring.
        j = 0
        for tool_phrase in tool_phrases:
            # tool keywords.
            if tool_phrase in job_desc:
                tool_list.append(tool_phrase)
                j += 1

        k = 0
        for skill_phrase in skill_phrases:
            # skill keywords.
            if skill_phrase in job_desc:
                skill_list.append(skill_phrase)
                k += 1

        # search for the minimum education.
        min_education_level = 999
        for degree_word in degree_words:
            level = degree_dict[degree_word]
            min_education_level = min(min_education_level, level)

        for degree_phrase in degree_phrases:
            # longer keywords. Match by substring.
            if degree_phrase in job_desc:
                level = degree_phrases[degree_phrase]
                min_education_level = min(min_education_level, level)

        # label the job descriptions without any tool keywords.
        if len(tool_words) == 0 and j == 0:
            tool_list.append('nothing specified')        

        # label the job descriptions without any skill keywords.
        if len(skill_words) == 0 and k == 0:
            skill_list.append('nothing specified')

        # If none of the keywords were found, but the word degree is present, then assume it's a bachelors level.
        if min_education_level > 500:
            if 'degree' in job_desc:
                min_education_level = 1
            else:
                min_education_level = 0

        tool_list += list(tool_words)
        skill_list += list(skill_words)
        degree_list.append(min_education_level)
    return tool_list, skill_list, degree_list, num_postings

# get top 20 tools used
def get_topN_tools(tool_list, n=20):
    # create the list of tools.
    df_tool = pd.DataFrame(data={'cnt': tool_list})

    msk = np.logical_or(df_tool['cnt'] == 't-sql', df_tool['cnt']=='pl/sql')
    df_tool.loc[msk, 'cnt'] = 'sql'

    msk = np.logical_or(df_tool['cnt'] == 'scikit', df_tool['cnt'] == 'scikitlearn')
    df_tool.loc[msk,'cnt'] = 'sklearn'

    msk = np.logical_or(df_tool['cnt'] == 'tensor', df_tool['cnt']=='tf')
    df_tool.loc[msk, 'cnt'] = 'tensorflow'

    msk = df_tool['cnt'] == 'amazon web services'
    df_tool.loc[msk, 'cnt'] = 'aws'

    msk = df_tool['cnt'] == 'rstudio'
    df_tool.loc[msk, 'cnt'] = 'r'

    msk = df_tool['cnt'] == 'vba'
    df_tool.loc[msk, 'cnt'] = 'excel'

    msk = df_tool['cnt'] == 'h2o'
    df_tool.loc[msk, 'cnt'] = 'h20'

    msk = df_tool['cnt'] == 'github'
    df_tool.loc[msk, 'cnt'] = 'git'

    msk = df_tool['cnt'] == 'pytorch'
    df_tool.loc[msk, 'cnt'] = 'torch'

    msk = df_tool['cnt'] == 'pyspark'
    df_tool.loc[msk, 'cnt'] = 'spark'

    msk = df_tool['cnt'] == 'gcp'
    df_tool.loc[msk, 'cnt'] = 'google cloud'

    df_tool_topN = df_tool['cnt'].value_counts().reset_index().rename(columns={'index': 'tool'}).iloc[:n]

    return df_tool_topN

# gather all tools, skills and education
all_tools, all_skills, all_degrees, all_posts = analyze_posts(df_datascience)
top_20_tools = get_topN_tools(all_tools, 21)

# plot Chart of the Top 20 Tools
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go

def plot_top_tools(df, post_count):
    layout = dict(
        title='Tools For Data Scientists',
        yaxis=dict(
            title='% of job postings',
            tickformat=',.0%',
        )
    )

    fig = go.Figure(layout=layout)
    fig.add_trace(go.Bar(
        x=df['tool'],
        y=df['cnt']/post_count
    ))

    return iplot(fig)

plot_top_tools(top_20_tools, all_posts)

The top 20 Data Scientist tool requirements in Malaysia are:

1	python	16
2	cloud	15
3	sql	14
4	r	13
5	aws	8
6	sas	6
7	google cloud	5
8	oracle	4
9	azure	4
10	javascript	3
11	mongodb	3
12	java	3
13	nosql	3
14	spark	3
15	go	3
16	react	3
17	nodejs	2
18	tableau	2
19	snowflake	2
20	salesforce	2
Data Science Tools

# get the Top 20 Skills
def get_topN_skills(skill_list, n=20):
    # create the list of skills.
    df_skill = pd.DataFrame(data={'cnt': skill_list})

    msk = df_skill['cnt'] == 'ml'
    df_skill.loc[msk, 'cnt'] = 'machine learning'

    msk = df_skill['cnt'] == 'ai'
    df_skill.loc[msk, 'cnt'] = 'artificial intelligence'

    msk = df_skill['cnt'] == 'nlp'
    df_skill.loc[msk, 'cnt'] = 'natural language processing'
    
    msk = df_skill['cnt'] == 'bi'
    df_skill.loc[msk, 'cnt'] = 'business intelligence'

    df_skill_topN = df_skill['cnt'].value_counts().reset_index().rename(columns={'index': 'skill'}).iloc[:n]

return df_skill_topN

top_skills = get_topN_skills(all_skills, 21)

# plot Chart of the Top 20 Skills
def plot_wanted_skills(df_skill_topN, post_count):
    layout = dict(
        title='Skills For Data Scientists',
        yaxis=dict(
            title='% of job postings',
            tickformat=',.0%',
        )
    )

    fig = go.Figure(layout=layout)
    fig.add_trace(go.Bar(
        x=df_skill_topN['skill'],
        y=df_skill_topN['cnt']/post_count
    ))

    iplot(fig)

plot_wanted_skills(top_skills,all_posts)

The top 20 Data Scientist skill requirements in Malaysia are:

1	knowledge	25
2	machine learning	25
3	data science	22
4	optimization	17
5	management	17
6	research	12
7	big data	12
8	data mining	12
9	business intelligence	10
10	computer science	10
11	statistics	9
12	etl	9
13	visualization	7
14	recommendations	7
15	forecasting	6
16	optimizing	5
17	deep learning	4
18	artificial intelligence	4
19	mathematics	4
20	natural language processing	3

Data Science Skills

We can find the top locations of these companies using the Pandas value_counts method. As the results below show, the top locations for data scientist jobs are Kuala Lumpur, Petaling Jaya and Penang.

df_data['company_location'].value_counts()

Kuala Lumpur                       101
Petaling Jaya                        8
Penang                               5
Melaka                               4
Subang Jaya                          4
Bangsar South                        3
Butterworth                          3
Kota Damansara                       2
Shah Alam                            2
Batu Caves                           2
Penang Island                        2
Johor Bahru                          2
Seremban                             2
Bukit Jalil                          2
Selangor                             1
Bangsar                              1
Kulai                                1
Kuching                              1
Johor                                1
Brickfields                          1
Kulim                                1
George Town                          1
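
For consistency with the earlier charts, the same Plotly pattern can visualize these counts. A small sketch (my own addition) for the top ten locations, reusing the go and iplot imports from above:

# plot the top 10 job locations as a bar chart
loc_counts = df_data['company_location'].value_counts().iloc[:10]

fig = go.Figure(layout=dict(title='Top Locations For Data Scientist Jobs',
                            yaxis=dict(title='number of job postings')))
fig.add_trace(go.Bar(x=loc_counts.index, y=loc_counts.values))
iplot(fig)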
