Non-Standard Chinese Building Address Standardization in Smart City

Xue-feng Xi, Bao-chuan Fu and Victor S. Sheng

Abstract: Because non-standard building addresses are numerous and addresses expressed in natural-language Chinese are semantically ambiguous, traditional methods based on string matching struggle to meet practical requirements. To address these problems, we propose an innovative joint learning approach, based on the hash map principle and word frequency theory, for standardizing non-standard Chinese building addresses. Experimental results on a real-world dataset constructed via crowdsourcing show that our approach achieves high accuracy and adapts well to data from different sources.

1. Introduction

Unique building address representation is critical to a smart city [1, 2]. Unlike in other countries, Chinese building addresses do not follow a strict uniform format. Moreover, addresses come from various sources, such as user registration addresses held by different companies (e.g., utilities such as water and gas companies) and registration addresses held by public security authorities. As a result, the same building in China may be represented in several different ways. The lack of a unique building address representation causes a series of problems: information from different sources cannot be linked, related services cannot be displayed on the map, and different data resources cannot be fully integrated. To solve these problems, we propose a feasible method that constructs a standardized address database and then uniformly maps other addresses onto the standard addresses.

2. Problem Statement

We define the problem of non-standard Chinese building address standardization as follows.

Let X = {x1, x2, ..., xn} denote a dataset of non-standard Chinese building addresses and Y = {y1, y2, ..., ym} denote the standard address dataset. The aim of non-standard Chinese building address standardization is to find a mapping function f that satisfies Y = f(X), so that each non-standard address in X is mapped to a standard address in Y.

Fig. 1: An Example of Address Mapping.

However, this task poses two key challenges: (I) The dataset is large, with more than 900,000 records, so manual retrieval is time-consuming, labor-intensive, and therefore infeasible. (II) Because address strings are related by semantic similarity rather than exact textual equality, traditional retrieval methods based on string matching cannot be applied to this problem.

3. Our Model

Some existing methods, such as those based on a tree structure [3] and a fish structure [4], partially solve the above problems. However, these methods do not perform well, and their complexity makes them computationally expensive [5].

Inspired by these methods, we propose a joint learning approach for standardizing non-standard Chinese building addresses, which automatically matches non-standard addresses to standard addresses based on hash mapping and word frequency.

The address matching procedure takes a non-standard address as input and outputs a set of standard addresses sorted in descending order of similarity. Automatic matching of non-standard addresses consists of three major steps: (i) build a standard address dictionary (Step A); (ii) reformat each non-standard building address (Step B); (iii) match non-standard addresses (Step C). The architecture is shown in Fig. 2.

Fig. 2: The Architecture of Our Non-standard Chinese Building Address Standardization Model.
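The input/output contract described above (one non-standard address in, standard addresses out in descending order of similarity) can be illustrated with a minimal sketch. It is not the model itself: the similarity measure used here, cosine similarity over character frequency counts, is only an assumed stand-in for the word-frequency component, and the function names and example addresses are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_frequency(address: str) -> Counter:
    """Count how often each character occurs (assumed stand-in for word frequency)."""
    return Counter(address.replace(" ", ""))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_standard_addresses(query: str, standard_addresses: list[str]) -> list[tuple[float, str]]:
    """Return (similarity, standard address) pairs, most similar first."""
    query_freq = char_frequency(query)
    scored = [(cosine_similarity(query_freq, char_frequency(s)), s)
              for s in standard_addresses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical example data, not taken from the dataset used in this work.
STANDARD = [
    "苏州市姑苏区人民路1号",
    "苏州市虎丘区学府路99号",
    "苏州市虎丘区学府路1号",
]

if __name__ == "__main__":
    for score, address in rank_standard_addresses("虎丘学府路1号", STANDARD):
        print(f"{score:.3f}  {address}")
```

For the hypothetical query above, the first entry of the ranked list is the 学府路1号 address, because it shares the most characters with the input.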

4. Discussions

First, we model standard addresses with a hash-based standard address dictionary. The standard addresses are used to create a dictionary file that supports fast lookup. Its main structure is a hash structure, which is used in the subsequent non-standard address matching steps. Second, the address panning procedure reduces the candidate set of standard addresses. Finally, non-standard address matching is performed: for an input non-standard address, the matching standard address is found among the candidates from the standard address dataset.
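The three stages discussed above can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: the dictionary is assumed to be keyed by road name, candidate reduction is approximated by looking up substrings of the input in that hash dictionary, and the final match uses Python's `difflib.SequenceMatcher` as a placeholder similarity. All function names, the city prefix, and the records are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical standard records: (district, road, house number).
STANDARD_RECORDS = [
    ("虎丘区", "学府路", "1号"),
    ("虎丘区", "学府路", "99号"),
    ("姑苏区", "人民路", "1号"),
]


def build_dictionary(records):
    """Build a hash dictionary: road name -> full standard addresses on that road."""
    dictionary = defaultdict(list)
    for district, road, number in records:
        dictionary[road].append(f"苏州市{district}{road}{number}")
    return dict(dictionary)


def reduce_candidates(dictionary, raw_address, max_len=4):
    """Shrink the candidate set by hash lookups on substrings of the input."""
    candidates, seen = [], set()
    for size in range(2, max_len + 1):
        for start in range(len(raw_address) - size + 1):
            key = raw_address[start:start + size]
            if key in dictionary and key not in seen:
                seen.add(key)
                candidates.extend(dictionary[key])
    return candidates


def match_address(dictionary, raw_address):
    """Return the most similar candidate, or None when no candidate survives."""
    candidates = reduce_candidates(dictionary, raw_address)
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: SequenceMatcher(None, raw_address, c).ratio())


if __name__ == "__main__":
    dictionary = build_dictionary(STANDARD_RECORDS)
    print(match_address(dictionary, "虎丘学府路99号楼"))
```

Keying the dictionary by a coarse component means each lookup is a constant-time hash probe, so only the few addresses sharing that component need to be compared in the final step.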

Experimental results on the dataset constructed via crowdsourcing show that our model achieves an accuracy of 97.71% and an F1-measure of 98.33%.
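For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a matching task. The counting convention is an assumption, since it is not spelled out here: a prediction of `None` means the model returned no match, precision is computed over answered queries only, and recall over all queries. The function name and sample labels are hypothetical.

```python
def accuracy_and_f1(predictions, gold):
    """Compute accuracy and F1 for top-1 address matching.

    predictions[i] is the predicted standard address (or None if no match);
    gold[i] is the correct standard address for the same query.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")

    total = len(gold)
    answered = sum(1 for p in predictions if p is not None)
    correct = sum(1 for p, g in zip(predictions, gold)
                  if p is not None and p == g)

    accuracy = correct / total if total else 0.0
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0  # assumed: every query has a gold match
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


if __name__ == "__main__":
    # Hypothetical labels, used only to exercise the function.
    acc, f1 = accuracy_and_f1(["A", "B", None, "D"], ["A", "B", "C", "X"])
    print(f"accuracy={acc:.4f}, F1={f1:.4f}")
```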

5. Conclusions

We proposed a joint learning approach for standardizing non-standard building addresses. Briefly, we first constructed a standard address dictionary. Using this dictionary, we applied address panning to reduce the candidate set of standard addresses. We then implemented our model with two different methods, MOR (Matching via One-Hot Representation) and MWE (Matching via Word Embedding), to automatically match a non-standard address against the standard address dataset. Our experimental results showed that the MOR-based model performs well in terms of both accuracy and generalization ability. However, the MOR-based model cannot handle semantic phenomena such as address aliases and abbreviations. Therefore, we used deep learning to build an address matching model via word embedding. Our experimental results showed that the MWE-based model is competitive to a degree, but its performance is still far from practical requirements.

In future work, we plan to expand the neural network training data to obtain better performance. Moreover, the computation speed of this model needs further improvement. We plan to migrate the model to a high-performance cloud computing platform and adopt a distributed computing architecture to increase the computation speed, so that it can provide basic support for practical applications.

 References

  1. Delmastro F, Arnaboldi V, Conti M. People-centric computing and communications in smart cities. IEEE Communications Magazine, 2016, 54(7):122-128.
  2. Wang J, Li C, Xiong Z, et al. Survey of data-centric smart city. Journal of Computer Research and Development (In Chinese), 2014, 51(2):237-259.
  3. Ying S, Li W, He B, Wang W, Zhao C. Address Text Matching Method Based on City Address Tree. Geomatics World (In Chinese), 2017(6).
  4. Chen M. Study on Construction and Application of Standard Address Database in Shanghai. Geomatics & Spatial Information Technology (In Chinese), 2017(3):86-89.
  5. Cheng C, Yu B. A Rule-Based Segmenting and Matching Method for Fuzzy Chinese Addresses. Geography and Geo-Information Science (In Chinese), 2011, 27(3):26-29.

 

Contributors

  • Xue-feng Xi, PhD - Associate Professor at College of Electronic and Information Engineering, Suzhou University of Science and Technology, China. Senior Data Scientist of Smart City Research Institute of Suzhou, China.
  • Bao-chuan Fu, PhD - Professor at College of Electronic and Information Engineering, Suzhou University of Science and Technology, China. Senior Research Scientist of Smart City Research Institute of Suzhou, China.
  • Victor S. Sheng, PhD - Associate Professor at Arkansas State University, USA. Senior Visiting Research Scientist of Smart City Research Institute of Suzhou, China.