Text Mining of Instagram Job Ads With NER and HDBSCAN for Surabaya Informal Job Segmentation

Qalbi, Qawiyah (2026) Text Mining of Instagram Job Ads With NER and HDBSCAN for Surabaya Informal Job Segmentation. Other thesis, Institut Teknologi Sepuluh Nopember.

5003221049-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Surabaya’s labour market is increasingly characterised by informal employment, with Instagram widely used as an informal channel for advertising vacancies. However, Instagram job captions are typically unstructured, noisy, and inconsistent, which limits their suitability for systematic labour-market analysis. This study develops a text-mining framework to transform Instagram-based informal job advertisements into structured, analysable data and to map the dominant segments of informal labour demand. Instagram job-advertisement captions posted between August and December 2025 were collected from Surabaya-based vacancy accounts. A knowledge-based, rule-driven Named Entity Recognition (NER) pipeline was applied to preprocess captions and extract key employment attributes, including job title, salary category, age requirement, education level, job type, marital status, gender preference, GPA requirement, experience preference, and soft and hard skills. To support segmentation, the extracted mixed-type attributes were encoded and clustered using density-based methods. HDBSCAN served as the primary clustering algorithm due to its robustness to irregular density structures, while DBSCAN was applied as a baseline for comparison. Model comparison relied on interpretability and on dissimilarity-based measures (intra-cluster dissimilarity, inter-cluster dissimilarity, and separation ratios). Results indicate that Instagram-based informal recruitment in Surabaya is not random but structured into two dominant and substantively interpretable segments. The first segment reflects creative and digital roles (e.g., content creation, digital marketing, visual design) that emphasise practical digital competencies and relatively flexible demographic conditions. The second segment reflects operational roles (e.g., sales, administration, front-office) characterised by routine task requirements, stronger work-readiness signals, and more restrictive access conditions.
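As an illustration of what a rule-driven extraction step of this kind might look like, the following is a minimal sketch using only Python's standard `re` module. The rule set, the attribute labels (`age_max`, `education`, `gender`, `job_type`), the function name `extract_attributes`, and the sample caption are all hypothetical; the thesis's actual rules and knowledge base are not given in this record.

```python
import re

# Hypothetical rules for illustration only; not the thesis's actual rule set.
# Each pattern captures one employment attribute from an Indonesian-language caption.
RULES = {
    "age_max": re.compile(
        r"(?:maks(?:imal)?|max)\.?\s*(?:usia\s*)?(\d{2})\s*(?:tahun|th)", re.I
    ),
    "education": re.compile(r"\b(SMA|SMK|D3|S1)\b", re.I),
    "gender": re.compile(r"\b(pria|wanita|laki-laki|perempuan)\b", re.I),
    "job_type": re.compile(r"\b(full\s*time|part\s*time|freelance|magang)\b", re.I),
}


def extract_attributes(caption: str) -> dict:
    """Return the first match of each rule in lowercase, or None if the rule does not fire."""
    text = re.sub(r"\s+", " ", caption)  # collapse line breaks and repeated spaces
    result = {}
    for label, pattern in RULES.items():
        match = pattern.search(text)
        result[label] = match.group(1).lower() if match else None
    return result


sample = "Dibutuhkan ADMIN, wanita, maks. usia 27 tahun, min. SMA/SMK, full time"
print(extract_attributes(sample))
```

Attributes that no rule matches are returned as `None`, which keeps missing information explicit for the later encoding step.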
Comparative evaluation shows that HDBSCAN produces a more parsimonious and interpretable segmentation than DBSCAN, which tends to generate fragmented clusters driven by local density variations. This study contributes a reproducible approach that integrates rule-driven NER with density-based clustering for noisy social-media recruitment text, and provides actionable insights for job seekers, SMEs (UMKM), and policymakers seeking a better understanding of the informal labour market, in line with SDG 8.
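The dissimilarity-based comparison mentioned in the abstract can be sketched as follows. This is an assumption-laden illustration, not the thesis's procedure: it assumes a Gower-style dissimilarity for mixed-type records, defines the separation ratio as mean inter-cluster dissimilarity divided by mean intra-cluster dissimilarity, and uses made-up records and labels. The function names `gower` and `separation_summary` are hypothetical. Points labelled `-1` are treated as noise, following the usual convention of DBSCAN and HDBSCAN implementations.

```python
from itertools import combinations


def gower(a: dict, b: dict, numeric_ranges: dict) -> float:
    """Gower-style dissimilarity: range-scaled absolute difference for numeric
    fields, simple mismatch (0 or 1) for categorical fields, averaged over fields."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def separation_summary(records, labels, numeric_ranges):
    """Return (mean intra-cluster, mean inter-cluster, inter/intra ratio).
    The ratio definition is an assumption made for this sketch."""
    intra, inter = [], []
    for i, j in combinations(range(len(records)), 2):
        if labels[i] == -1 or labels[j] == -1:
            continue  # skip pairs involving noise points
        d = gower(records[i], records[j], numeric_ranges)
        (intra if labels[i] == labels[j] else inter).append(d)
    mean_intra = sum(intra) / len(intra) if intra else 0.0
    mean_inter = sum(inter) / len(inter) if inter else 0.0
    ratio = mean_inter / mean_intra if mean_intra > 0 else float("inf")
    return mean_intra, mean_inter, ratio


# Made-up records and cluster labels, purely for illustration.
records = [
    {"job_type": "freelance", "education": "sma", "age_max": 30},
    {"job_type": "freelance", "education": "sma", "age_max": 35},
    {"job_type": "full time", "education": "s1", "age_max": 25},
    {"job_type": "full time", "education": "s1", "age_max": 27},
]
labels = [0, 0, 1, 1]
print(separation_summary(records, labels, numeric_ranges={"age_max": 10}))
```

Under this definition, a ratio well above 1 suggests that clusters are further from each other than their members are from one another, which is one way to compare a compact segmentation against a fragmented one.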

Item Type: Thesis (Other)
Uncontrolled Keywords: Text Mining, Named Entity Recognition, HDBSCAN, Social Media, Job Market Segmentation
Subjects: Q Science > QA Mathematics > QA76.9.D343 Data mining. Querying (Computer science)
Divisions: Faculty of Science and Data Analytics (SCIENTICS) > Statistics > 49201-(S1) Undergraduate Thesis
Depositing User: Qawiyah Qalbi
Date Deposited: 28 Jan 2026 02:21
Last Modified: 28 Jan 2026 02:21
URI: http://repository.its.ac.id/id/eprint/129853
