Detecting Traffic Event Related Blog Posts by Using Traffic Related Named Entities
by A. Dundar Unsal, Pinar Karagoz, Hediye Tuydes-Yaman
Traffic management systems, which aim to provide measures for sustainable and optimal traffic on the road network, require monitoring of the traffic flow. Traffic flow is commonly monitored using sensors deployed on roads, such as inductive loops or camera systems. Due to the high cost of installing and maintaining such sensor systems, it is impractical to cover traffic networks with an adequate number of sensors. The information shared on social media by users, commonly referred to as “human sensors”, can complement or replace the data provided by physical sensors as a sustainable source of data. Twitter, where 313 million active users send an average of around 500 million tweets a day, is used as the data source to evaluate the proposed methods. Our observations show that, among other topics and daily issues, tweets also cover road and traffic flow conditions.
In this work, a cost-effective solution for traffic monitoring is proposed based on social media data, more specifically the Twitter stream. We propose a supervised learning based method, that is, a machine learning approach in which a model is trained on manually annotated data. The proposed solution aims to classify blog posts as traffic event related or not. In order to improve the classification performance, blog posts are preprocessed using Natural Language Processing (NLP) operations, and traffic event related named entities are extracted using a customized Named Entity Recognition (NER) model based on Conditional Random Fields (CRF). The case study is conducted on blog posts in Turkish, which is a morphologically complex language. This complexity leads to lower accuracy for Turkish texts than for English texts in NLP tasks such as NER. However, working on languages other than English, especially morphologically complex ones such as Turkish, helps to show the applicability of the approach in different countries.
The initial steps of the method consist of data collection, preprocessing, and morphological analysis. For data collection, we used the Twitter Search API to run queries with a set of predefined terms and retrieve potentially relevant posts. In the preprocessing step, clean-up, tokenization and segmentation tasks are applied. Each token is then analyzed morphologically in order to extract roots and a set of inflection groups using the morphological analysis tool TRMorph.
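The clean-up and tokenization steps can be illustrated with a minimal sketch. The specific rules below (dropping URLs, user mentions, the retweet marker and hash signs) and the function names `clean_tweet` and `tokenize` are assumptions made for illustration; the source does not list the exact clean-up rules, and segmentation and morphological analysis are not covered here.

```python
import re
from typing import List

# Hypothetical clean-up rules; the original work does not enumerate its own.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\bRT\b")
# A token is a run of word characters, optionally joined by apostrophes,
# so that suffixed proper nouns such as "Ankara'da" stay in one piece.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*")


def clean_tweet(text: str) -> str:
    """Strip URLs, user mentions, the retweet marker and hash signs."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = text.replace("#", " ")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split cleaned text into tokens, dropping punctuation.

    Case is left untouched: naive lowercasing mishandles the Turkish
    dotted and dotless i, so case handling is left to later stages.
    """
    return TOKEN_RE.findall(text)


raw = "RT @user: Eskişehir yolunda kaza var! https://t.co/abc #trafik"
print(tokenize(clean_tweet(raw)))
```

In a full pipeline, each resulting token would then be passed to the morphological analyzer.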
Traffic event related named entity recognition and traffic event related post classification constitute the heart of the method. Initially, a customized set of traffic related named entities is defined and organized into entity groups, such as incident related, location related and direction related named entities. For named entity recognition, we developed a model based on Conditional Random Fields (CRF), a supervised learning technique commonly used for learning tags over sequence data. Tagging is performed on subtokens. Each subtoken is represented by its surface text, the morphological tags assigned during morphological analysis, and its tag annotation. The entities recognized by the model are used as features in the classification step in order to improve classification performance.
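The subtoken representation described above can be sketched as per-subtoken feature dictionaries, which is the input shape accepted by common CRF toolkits (sklearn-crfsuite, for example, takes one list of feature dicts per sequence). This is a sketch under stated assumptions, not the authors' feature set: the morphological tag names, the entity labels, the one-step context window and the example segmentation are all hypothetical.

```python
from typing import Dict, List, Tuple

# A subtoken is a (surface_text, morph_tag) pair. The tag names used below
# are hypothetical placeholders, not the actual tag set produced by TRMorph.
Subtoken = Tuple[str, str]


def subtoken_features(seq: List[Subtoken], i: int) -> Dict[str, str]:
    """Features for subtoken i: its own surface text and morphological tag,
    plus those of its immediate neighbours (assumed context window of 1)."""
    surface, morph = seq[i]
    feats = {"bias": "1", "surface": surface, "morph": morph}
    if i > 0:
        feats["prev_surface"], feats["prev_morph"] = seq[i - 1]
    else:
        feats["BOS"] = "1"  # beginning of sequence
    if i < len(seq) - 1:
        feats["next_surface"], feats["next_morph"] = seq[i + 1]
    else:
        feats["EOS"] = "1"  # end of sequence
    return feats


def sequence_features(seq: List[Subtoken]) -> List[Dict[str, str]]:
    """Feature dicts for every subtoken in one post."""
    return [subtoken_features(seq, i) for i in range(len(seq))]


# Hypothetical segmentation and annotation of "Eskişehir yolunda kaza".
example = [("Eskişehir", "PROP"), ("yol", "NOUN"), ("unda", "SUFFIX"), ("kaza", "NOUN")]
labels = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "B-INCIDENT"]

X = sequence_features(example)
for feats, label in zip(X, labels):
    print(label, feats["surface"], feats["morph"])
```

A CRF trainer would consume many such `(X, labels)` pairs; at prediction time, the predicted labels become the named entity features used by the classifier.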
To identify the relevant posts within the tweet collection, which is obtained via keyword search using the Search API, we constructed a classification model by comparing Support Vector Machine (SVM), Naive Bayes (NB) and Decision Tree based classifiers, which are commonly applied in incident detection problems.
Our data set is a tweet collection retrieved with a location filter defining a 50-mile radius around the city center of Ankara. The collection consists of 21,077 tweets posted from January 1st to January 31st, 2017. For ground truth construction, the tweets are manually annotated for both traffic entity recognition and classification. Two classes are defined to categorize the stream: Direct Traffic Report (DTR) and Other. DTR denotes that the post is a direct and immediate report of an incident or road condition that might affect the traffic flow. Other denotes that the post does not meet the criteria of DTR. Of the collected tweets, 649 are labeled as DTR. Figure 1 shows the locations of some of the reported traffic events on a map of Ankara.
Figure 1. Locations of DTR tweets around Eskişehir Road, Ankara
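The 50-mile location criterion can be illustrated with a great-circle distance check. This sketch only shows the geometric test; it does not reproduce how the Search API applies its location filter. The coordinates used for the Ankara city center are an approximate assumption, and the function names are hypothetical.

```python
import math

# Assumed approximate coordinates of the Ankara city center (latitude, longitude).
ANKARA_CENTER = (39.9334, 32.8597)
RADIUS_MILES = 50.0
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in miles (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(lat: float, lon: float,
                  center: tuple = ANKARA_CENTER,
                  radius: float = RADIUS_MILES) -> bool:
    """True if the point lies within `radius` miles of `center`."""
    return haversine_miles(lat, lon, center[0], center[1]) <= radius


print(within_radius(39.79, 32.80))      # a point a few miles south of the center
print(within_radius(41.0082, 28.9784))  # Istanbul, far outside the radius
```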
The feature vector corresponding to a tweet consists of two feature types: stems and named entities. Stems are the stemmed forms of the words, generated in the morphological analysis step. In order to observe the contribution of the extracted morphological features to the classification task, the experiments are conducted under different feature settings. The SVM classifier performed best, with an F1-score of 70.2%, using the subtoken-based feature set that includes stems and all named entities. It reached this score using the top 25% of all terms sorted by their frequency across all documents. This shows that morphological analysis is effective for detecting traffic event related posts and that all named entities contribute to the model.
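A minimal sketch of this feature construction follows, assuming that "frequency in all documents" means document frequency and that features are binary; neither detail is stated in the source. The `NE=` prefix for named entity features and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def select_top_terms(docs: List[List[str]], fraction: float = 0.25) -> List[str]:
    """Keep the top `fraction` of distinct terms, ranked by document frequency.

    Ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


def vectorize(doc: List[str], vocab: List[str]) -> List[int]:
    """Binary feature vector: 1 if the vocabulary term occurs in the document."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


# Each document mixes stems with named entity features (hypothetical "NE=" prefix).
docs = [
    ["yol", "kaza", "var", "NE=INCIDENT", "NE=LOCATION"],
    ["yol", "trafik", "yoğun", "NE=LOCATION"],
    ["hava", "güzel"],
    ["yol", "kaza", "NE=INCIDENT"],
]

vocab = select_top_terms(docs, fraction=0.25)
print(vocab)
print([vectorize(doc, vocab) for doc in docs])
```

The resulting vectors, paired with the DTR/Other labels, would be the input to whichever classifier is being evaluated.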
The C4.5 algorithm-based classifier achieved its highest F1-score of 63.9% using the subtoken-based feature set consisting of stems and the top traffic related named entities. The classifier was able to detect 66.6% of all traffic related events (recall), while 61.3% of the events detected as traffic related were actually traffic related (precision).
The Naive Bayes classifier reached its highest F1-score of 68.1% with subtoken-based models employing stems and all entities. Models employing subtoken-based feature sets scored better in all classification tests.
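For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision (61.3%) and recall (66.6%) reported for the C4.5-based classifier; the result is about 0.638, which agrees with the reported 63.9% up to rounding of the inputs. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported for the C4.5-based classifier.
print(round(f1_score(0.613, 0.666), 3))
```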
Several challenges have been observed due to the strict inclusion criteria for the DTR class. The DTR label is used only for immediate and direct reports. Indirect reports of incidents, which refer to other sources or news articles and do not reflect a recent incident, are labeled as Other. Such tweets negatively affected the performance of the classification task, due to the lack of features that can model subtle semantic differences.
In this study, we aimed to investigate the effectiveness of social network postings as human sensors to be used within a traffic management system. Our observations on the Twitter data set collected within the scope of this work show that the number of postings on road traffic incidents and the variety of incidents covered are sufficient for the postings to serve as sensors. The obtained results are promising for augmenting the physical sensors used to monitor the road network. Furthermore, the textual content may provide additional information about the event that cannot be obtained through physical sensors. As a novel contribution, we particularly focus on two aspects. The first is the application of the approach to a morphologically complex language, Turkish, to which NLP is harder to apply than to English. The performance obtained on such a language indicates the applicability of the approach to a variety of languages, and hence to a variety of countries. The second contribution is the use of traffic related named entities and the types of named entities devised within this work. Classification of blog posts using such named entities is shown to detect traffic related posts. Furthermore, after a traffic related post is detected, analysis of the named entities within the post can also reveal the type of the incident.
Twitter, “About Company,” 2016. [Online]. Available: http://about.twitter.com/company
R. Krikorian, “New Tweets per second record, and how!,” 2013. [Online]. Available: https://blog.twitter.com/2013/new-tweets-per-second-record-and-how (Visited on 22 Feb 2019)
G. A. Şeker and G. Eryiğit, “Initial explorations on using CRFs for Turkish named entity recognition,” in Proc. of Int. Conf. on Computational Linguistics (COLING 2012), Mumbai, India, pp. 2459–2474, 2012.
C. Çöltekin, “A set of open source tools for Turkish natural language processing,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland, pp. 1079–1086, 2014.
A. D. Unsal, H. Tuydes-Yaman, P. Karagoz, “Event related blog post classification by using traffic related named entities,” in Proc. of Int. Smart Cities Conf. (ISC2), Kansas City, USA, Sept 16-19, 2018.
A. Dundar Unsal is a PhD student at Middle East Technical University (METU), Department of Geodetic and Geographic Information Technologies. His main areas of study are Geographical Information Systems, Intelligent Transportation Systems, Information Retrieval and Natural Language Processing. He holds a master's degree from the same department, with a focus on Intelligent Transportation Systems. He has been working on Geographical Information Systems and other information systems fields in R&D companies for over 18 years.
Dr. Pinar Karagoz is a full professor at Middle East Technical University (METU), Computer Engineering Department. She received her PhD from the same department. During her doctoral studies she worked as a researcher at SUNY Stony Brook University in New York, USA. She had research visits at MIT CSAIL in the USA, the Computer Science Department of Ostrava University in the Czech Republic, and the Computer Engineering Department of Aalto University in Finland. Her research focuses on data mining, machine learning algorithms, information retrieval, and social media analysis and mining. She has publications in internationally recognized and indexed journals including IEEE TKDE, ACM TWEB, IEEE TII and The Computer Journal, and she has about 100 papers in international conferences in the area. In 2016, she received the best paper award in IEEE Transactions on Industrial Informatics. In 2017, her paper was nominated for a Wilkes Award of The Computer Journal.
Dr. Hediye Tuydes-Yaman is a graduate of the Civil Engineering (CE) Department at Middle East Technical University (METU), Ankara, Turkey. She completed her MS and PhD programs in transportation at Northwestern University, IL, USA, during which she studied travel demand forecasting and network traffic modeling. Since 2006, she has been a full-time faculty member at METU, where she also currently acts as the Head of the Intelligent Transportation Systems (ITS) Unit under the METU-BILTIR Research Center. She has publications in prestigious transportation and civil engineering journals, including TRR, ASCE, and Accident Analysis & Prevention, covering various topics in the fields of network management, traffic safety and ITS. She has conducted various projects focused on urban transportation monitoring and management, as well as an advocacy project on safe urban speed management funded by the Global Road Safety Partnership (GRSP). More recently, she has served on the executive committee of the “METU Smart Campus” project funded by the United States Trade & Development Agency (USTDA), which aims to develop a road map for the transformation of the METU Campus into a smart and sustainable one.
1 August 2019