The study aims to formulate a solution for identifying the safest route between any two input geographical locations. Using the New York City dataset, which provides location-tagged crime statistics, we implement several clustering algorithms and analyse their results comparatively to discover the best-suited one. The results reveal that the K-Means algorithm best suits our needs and delivers the best results. Building on this comparative analysis, we have developed a user-friendly application that provides safe routes to users. A successful implementation would help curb the ever-increasing crime rate, as it gives the user beforehand knowledge of the route they are about to take: a warning that a path ranks high on the danger index gives the user the hint needed to decide which path to prefer, thereby addressing a social problem that needs to be eradicated from our modern era.

Safe route detection remains a problem that needs to be investigated. An initial approach was video surveillance, which is an important research area in computer vision [

To achieve this, we explore various approaches that align with the proposed problem statement. The problem is relatively little explored, and the availability of related resources is constrained. To counter this, we have documented the results achieved from as many variations as possible and built the primary stage with the best results obtained, which not only proves the feasibility of the approach but also suggests that much more could be done to counter the steadily increasing problem of crime.

Section 2 of this article discusses related work; Section 3 gives details about the data and methods used. Section 4 carries out the experiments of the proposed schemes and presents the results achieved with them, including the calculation of the Danger Index. Section 5 presents the results of the proposed schemes. Finally, Section 6 concludes the paper.

The study of crime prevention has been explored in different areas and in several community environments [

Communities have not let the topic fade away. A research study [

Trippin seemed a probable solution, giving the user a danger-index insight for all the possible routes, but it fails to provide alternatives to the user. Geographical Information System Based Safe Path Recommender [

While working with geo-location and crime datasets, data scarcity is a big issue. The dataset used for the proposed work covers New York City and is taken from the online resource NYC Open Data. It is a well-built and regularly updated resource with detailed information that reveals no personal data and is open for public use. Two datasets, NYPD Arrest Data [

Calculation of the Danger Index for each data point: for each latitude and longitude value, the datasets provide parameters such as “Number of Deaths of Cyclists” and “Number of Injured Pedestrians”, relating to the deaths and injuries of motorcyclists, pedestrians and cyclists in road accidents. The data additionally provides a field describing the crime committed at the spot. Although these fields were present, we wanted to combine them into a single column that could represent them fairly: the danger index for that geo-coordinate. To achieve this aim, we followed the approach suggested in [

where,

Likewise, for the crime data, we have given each crime a rating to reflect its weightage. The reasoning is that rape, a far more severe and heinous crime, cannot be placed in the same bucket as chain snatching. Hence, each crime receives a score based on severity, and that weightage is passed into the final danger index. The weightage is presented in

Crime | Weightage | Crime | Weightage |
---|---|---|---|
Rape | 15 | Dangerous weapons | 7 |
Other sex crimes | 15 | Felony assault | 6 |
Manslaughter | 14 | Unclassified felony | 6 |
Prostitution & related offenses | 13 | For other authorities | 6 |
Offenses related to children | 13 | Burglary | 5 |
Kidnapping & related offenses | 12 | Burglar's tools | 5 |
Child abandonment/non-support | 11 | Intoxicated/impaired driving | 4 |
Gambling | 10 | Theft-fraud | 4 |
Disorderly conduct | 10 | Moving infractions | 3 |
Theft of services | 9 | Vehicle and traffic laws | 3 |
Arson | 9 | Forgery/fraudulent accosting | 2 |
Other state laws | 8 | Grand larceny | 1 |

For calculation of the Crime Score:

For calculation of the Accident Score:
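The exact score formulas are the paper's own; as a minimal sketch of the scheme they describe, the following assumes the crime score sums the severity weightages of Table 1 at a coordinate, and that the accident score is a weighted sum of deaths and injuries. All weights and function names here are illustrative, not the paper's.

```python
# Hypothetical sketch of the per-point scoring; the severity values below
# are a subset of Table 1, while the accident weights are assumptions.
CRIME_WEIGHTAGE = {"RAPE": 15, "FELONY ASSAULT": 6, "BURGLARY": 5}

def crime_score(crimes_at_point):
    """Sum the severity weightage of every crime recorded at a coordinate."""
    return sum(CRIME_WEIGHTAGE.get(c, 0) for c in crimes_at_point)

def accident_score(deaths, injuries, death_weight=10, injury_weight=3):
    """Weighted sum of deaths and injuries; the weights are assumed."""
    return death_weight * deaths + injury_weight * injuries

def danger_index(crimes_at_point, deaths, injuries):
    # Combining the two scores by a plain sum is likewise an assumption.
    return crime_score(crimes_at_point) + accident_score(deaths, injuries)

print(danger_index(["BURGLARY", "FELONY ASSAULT"], deaths=0, injuries=2))  # 17
```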

Several approaches are available for clustering the data points, but the choice chiefly depends on the type of dataset being worked on. The geo-location dataset does not yield good results with all the algorithms [

In this clustering technique, we initially consider all the points in the dataset to be individual clusters. In each iteration, clusters similar to each other are merged until only the required number of K clusters remains [

For a dataset d1, d2, d3 … dn having n elements, we have calculated a distance matrix represented by “dis_mat” using:

where

Initially, each data point in the dataset is a “singleton cluster.” We repeat the following process until only a single cluster remains: (1) the two clusters at minimum distance from each other are merged; (2) the distance matrix is updated. We demonstrate an example of this in

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
0 | 2.0 | 0.0 | 2.0/2.0 |
1 | 0.0 | 0.0 | 0.0/0.0 |
2 | 5.734622 | 0.441923 | 5.0/6.0 |
3 | 4.0 | 0.0 | 4.0/4.0 |
4 | 12.208333 | 0.406540 | 12.0/13.0 |
5 | 7.797909 | 0.402261 | 7.0/8.0 |
6 | 9.937500 | 0.243975 | 9.0/10.0 |
7 | 1.0 | 0.0 | 1.0/1.0 |
8 | 14.896552 | 0.985681 | 14.0/15.0 |
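The merge-and-update procedure described above can be sketched with SciPy's hierarchical clustering routines. The toy data and the single-linkage choice are assumptions for illustration; the paper does not state its linkage criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy stand-in for the danger-index column.
data = np.array([[0.0], [0.1], [5.0], [5.2], [12.0], [12.3]])

# Step 1: the pairwise distance matrix ("dis_mat"), in condensed form.
dis_mat = pdist(data, metric="euclidean")

# Steps 2-3: repeatedly merge the two closest clusters and update distances;
# single linkage is an assumption here.
merges = linkage(dis_mat, method="single")

# Cut the merge tree at K = 3 clusters.
labels = fcluster(merges, t=3, criterion="maxclust")
```

On this toy data the cut groups the pairs {0.0, 0.1}, {5.0, 5.2} and {12.0, 12.3} together.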

This shifts attention to graph theory, where each data point is treated as a graph node, making the problem one of graph partitioning. The algorithm tries to identify clusters as groups of nodes, judging them by the edges connecting them.

The Spectral algorithm can be abstracted into the following steps: (1) The data points
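As a minimal sketch of this scheme, scikit-learn's `SpectralClustering` builds the similarity (affinity) matrix over the points and partitions the resulting graph via the Laplacian's eigenvectors. The toy data and the rbf affinity are assumptions; the paper does not state its affinity choice.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy stand-in for the danger-index column.
X = np.array([[0.0], [0.2], [6.0], [6.1], [12.0], [12.1]])

# Treat each point as a graph node, build an affinity matrix, and
# partition the graph spectrally into 3 clusters.
model = SpectralClustering(n_clusters=3, affinity="rbf", random_state=0)
labels = model.fit_predict(X)
```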

The evaluated results are shown in

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
0 | 0.471196 | 1.016488 | 0.0/5.0 |
1 | 6.116883 | 0.322329 | 6.0/7.0 |
2 | 12.0 | 0.0 | 12.0/12.0 |
3 | 8.0 | 0.0 | 8.0/8.0 |
4 | 10.0 | 0.0 | 10.0/10.0 |
5 | 16.0 | 0.0 | 16.0/16.0 |
6 | 13.0 | 0.0 | 13.0/13.0 |
7 | 14.0 | 0.0 | 14.0/14.0 |
8 | 15.0 | 0.0 | 15.0/15.0 |

We can discard spectral clustering straight away, as its results at first sight do not represent all the data points: they vary minimally overall but lump together the low values, which contain far more data points than the high values and should be well segregated, while splitting the high values finely. Agglomerative clustering, by contrast, cannot be dismissed so directly. Its deviation is not significant, but it was overruled by K-means because, as the dataset grows, the agglomerative approach becomes increasingly time-consuming and resource-bound,

An approach, “Density-Based Spatial Clustering of Applications with Noise” (DBSCAN), was also tried and implemented to determine the clusters [

It deals with enormous datasets by first creating a compact synopsis comprising as much distribution information as it is possible to keep; it is then this summary of the data, rather than the entire dataset, that is clustered. This procedure saves a great deal of computation. For BIRCH clustering [

The result is a height-balanced tree. The further steps include scanning through all the nodes of the CF tree, building smaller CF trees to remove any outliers present, and grouping nearby sub-clusters into larger ones. The distance calculation for two clusters
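The CF-tree summarisation described above can be sketched with scikit-learn's `Birch` estimator. The toy data, threshold and branching factor are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import Birch

# Toy stand-in for the danger-index column.
X = np.array([[0.0], [0.1], [6.0], [6.2], [12.0], [12.4]])

# BIRCH condenses the data into a CF (clustering-feature) tree first, then
# clusters the tree's leaf summaries rather than every raw point.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
```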

Another algorithm that showed similar deviation results was BIRCH clustering. The results looked promising at first sight, as seen in

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
0 | 2.034162 | 0.181654 | 2.0/3.0 |
1 | 0.008131 | 0.089807 | 0.0/1.0 |
2 | 6.128019 | 0.334246 | 6.0/7.0 |
3 | 4.151251 | 0.358359 | 4.0/5.0 |
4 | 12.254160 | 0.507776 | 11.0/14.0 |
5 | 8.392374 | 0.783003 | 8.0/10.0 |
6 | 15.121212 | 0.328035 | 15.0/16.0 |
7 | 18.615385 | 0.960769 | 18.0/20.0 |
8 | 26.0 | 0.0 | 26.0/26.0 |

The algorithm uses distance as the criterion for grouping the data points. It originated in signal processing and uses vector quantization, which divides the n points into k clusters by calculating the distance and assigning each point to the cluster with the nearest mean. The idea behind K-means clustering is as follows: (1) choosing the best value of k for the proposed work; (2) initially working with some random centroids; here,

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
1 | 0.0 | 0.0 | 0.0/0.0 |
2 | 5.722000 | 0.448163 | 5.0/6.0 |
3 | 12.197338 | 0.398145 | 12.0/13.0 |
4 | 7.802484 | 0.398372 | 7.0/8.0 |
5 | 1.0 | 0.0 | 1.0/1.0 |
6 | 9.923077 | 0.308607 | 9.0/10.0 |
7 | 15.103226 | 1.233589 | 14.0/20.0 |
8 | 26.666667 | 1.154701 | 26.0/28.0 |
9 | 4.0 | 0.0 | 4.0/4.0 |

It is concluded that K-means shows the least deviation for the data points. The variation between the minimum and maximum values is also the smallest, and it represents the data points very well.
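The K-means run and the per-cluster statistics reported in the tables can be sketched as follows. The data is a toy subset of danger-index values and k = 4 is chosen only for the illustration; the paper's final run used k = 10 on the full column.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy subset of danger-index values for illustration.
X = np.array([[0.0], [1.0], [5.0], [6.0], [7.0], [8.0], [12.0], [13.0], [26.0]])

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Per-cluster statistics in the same shape as the tables: mean, std, min/max.
for c in range(4):
    members = X[labels == c].ravel()
    print(c, members.mean(), round(members.std(), 6), members.min(), members.max())
```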

After the calculation of the danger index, we obtained 19 unique values for this column. We needed to group these values so that the standard deviation does not rise too high while we also avoid having too many clusters to deal with. Through experimentation, 10 proved to be the optimal value for this approach. Some results achieved during this experimentation are listed below in

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
1 | 1.959945 | 0.196095 | 1.0/2.0 |
2 | 0.0 | 0.0 | 0.0/0.0 |
3 | 3.846906 | 0.360144 | 3.0/4.0 |
4 | 5.722000 | 0.448163 | 5.0/6.0 |
5 | 12.197338 | 0.398145 | 12.0/13.0 |
6 | 7.802484 | 0.398372 | 7.0/8.0 |
7 | 9.923077 | 0.308607 | 9.0/10.0 |
8 | 15.322785 | 2.004037 | 15.0/28.0 |

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
1 | 0.0 | 0.0 | 0.0/0.0 |
2 | 5.722000 | 0.448163 | 5.0/6.0 |
3 | 12.197338 | 0.398145 | 12.0/13.0 |
4 | 7.802484 | 0.398372 | 7.0/8.0 |
5 | 1.0 | 0.0 | 1.0/1.0 |
6 | 9.923077 | 0.308607 | 9.0/10.0 |
7 | 15.103226 | 1.233589 | 14.0/20.0 |
8 | 26.666667 | 1.154701 | 26.0/28.0 |
9 | 4.0 | 0.0 | 4.0/4.0 |
10 | 2.034162 | 0.181654 | 2.0/3.0 |

Cluster point | Mean | Standard deviation | Min/Max |
---|---|---|---|
1 | 2.0 | 0.0 | 2.0/2.0 |
2 | 0.0 | 0.0 | 0.0/0.0 |
3 | 3.0 | 0.0 | 3.0/3.0 |
4 | 6.0 | 0.0 | 6.0/6.0 |
5 | 5.0 | 0.0 | 5.0/5.0 |
6 | 12.197338 | 0.398145 | 12.0/13.0 |
7 | 7.802484 | 0.398372 | 7.0/8.0 |
8 | 1.0 | 0.0 | 1.0/1.0 |
9 | 9.923077 | 0.308607 | 9.0/10.0 |
10 | 14.781690 | 0.584871 | 14.0/16.0 |
11 | 20.125000 | 3.383785 | 18.0/28.0 |
12 | 4.0 | 0.0 | 4.0/4.0 |

The analysis of such results led us to conclude that a cluster count below 10 does not give the dataset a justified representation, and the standard deviation spikes. Conversely, if we choose a number above 10, we fail to reduce the standard deviation by any significant amount: a greater number of clusters leads to some sets with very few points in them, and this extra segmentation only increases computation time with no benefit. Hence, we settle on 10 clusters, which is optimal for the proposed work. With an increase in data points, or in the number of danger-index values, this procedure may need to be repeated to find the optimal number of clusters once more.

On calculating the danger index, we receive a set of unique values of the danger index. As the first step, we cluster the data points based on the danger index. The fewer the clusters, the better for the proposed software, but decreasing the number of clusters brings with it the risk of increased standard deviation. Since the problem of finding the safest route is extremely sensitive to even minor changes in standard deviation, we needed to find an optimal number, and the best way to discover it is through experimentation.
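The experimentation described here, sweeping the cluster count and comparing the average per-cluster standard deviation, can be sketched as below. The data is a toy column; the real experiments compared counts around 10 on the full set of danger-index values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy danger-index column for illustration only.
X = np.array([[v] for v in [0, 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 20, 26, 28]],
             dtype=float)

def avg_cluster_std(data, k):
    """Average per-cluster standard deviation, the metric compared above."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return float(np.mean([data[labels == c].std() for c in range(k)]))

for k in (8, 10, 12):
    print(k, round(avg_cluster_std(X, k), 4))
```

Past a certain count, the average deviation barely drops while the number of near-empty clusters grows, which is the trade-off the experimentation above resolves.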

So, we have now clustered the data into 10 clusters. After this first round of clustering, we have 10 cluster groups, each containing points with similar danger-index values. To make these points useful, we must cluster them again, this time based on the geo-coordinates: each of the 10 clusters is separately re-clustered. The number of subclusters in each original cluster depends on the number of points in that cluster; to generalize, we choose (number of points in the original cluster)^0.6 as the number of subclusters in each original cluster. The general approach is (number of points in the original cluster)^0.5, i.e. the square root, but it has not been used here because the square root did not generate enough clusters to represent the data distributed over the unified city. Each subcluster is then associated with two feature points: the geo-coordinates of its centroid, and the mean danger index of the subcluster. We create 10 separate lists, one per value of the danger index, store all the cluster centroids in them, and then dump them to JSON format.
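The second-layer clustering can be sketched as follows. The text's "(number of points) * 0.6" is read here as the exponent n^0.6, since it is contrasted with the square root (n^0.5); the random points, the two danger-index groups and all names below are illustrative assumptions.

```python
import json

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def random_points(n):
    """Toy (lat, lon) points roughly in the New York area."""
    return np.column_stack([rng.uniform(40.5, 40.9, n), rng.uniform(-74.1, -73.7, n)])

# Stand-in for the first layer: danger-index value -> points in that cluster.
clusters = {2.0: random_points(50), 8.0: random_points(30)}

centroid_lists = {}
for danger, points in clusters.items():
    # Second layer: re-cluster on geo-coordinates with n^0.6 subclusters.
    n_sub = max(1, round(len(points) ** 0.6))
    km = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit(points)
    centroid_lists[danger] = [
        {"lat": float(c[0]), "lon": float(c[1]), "danger": danger}
        for c in km.cluster_centers_
    ]

# Dump the per-danger-index centroid lists to JSON, as described above.
blob = json.dumps(centroid_lists)
```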

When the user inputs the origin and the destination, we use the Google Maps API to get all the workable routes between those two points, and for each route we calculate its total danger index separately. Consider Route 1: we store all the intermediate latitudes and longitudes of the route in a separate list and consider each latitude-longitude pair separately. For one pair, we compute the closest subcluster from each of the 10 main clusters; this gives us several geo-points, along with their danger indices, for each latitude-longitude pair on the computed route. Since the danger influence of one point on another decreases with distance, we divide each danger index by the distance between the cluster centroid and the original route point. Each route point therefore has its own cumulative danger index. This is done for all the route points, and their values are summed to give the final danger index of the route. We can then return the route with the least danger index.
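The per-route scoring just described can be sketched as below. The centroid data, route coordinates and the use of the haversine distance are illustrative assumptions; the Google Maps API call that fetches the candidate routes is omitted.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def route_danger(route_points, centroid_lists):
    """For every route point, take the nearest subcluster centroid from each
    main cluster and add its danger index discounted by distance."""
    total = 0.0
    for pt in route_points:
        for centroids in centroid_lists:              # one list per main cluster
            nearest = min(centroids, key=lambda c: haversine_km(pt, c["pos"]))
            dist = haversine_km(pt, nearest["pos"])
            total += nearest["danger"] / max(dist, 1e-6)  # guard divide-by-zero
    return total

# Toy data: one low-danger and one high-danger cluster, two candidate routes.
centroid_lists = [
    [{"pos": (40.70, -74.00), "danger": 2.0}],
    [{"pos": (40.75, -73.99), "danger": 8.0}],
]
route_a = [(40.701, -74.001), (40.702, -74.002)]  # skirts the low-danger area
route_b = [(40.751, -73.991), (40.752, -73.992)]  # skirts the high-danger area
safest = min((route_a, route_b), key=lambda r: route_danger(r, centroid_lists))
```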

In our approach, we implement clustering at two layers, first based on the danger index and second based on the coordinates, to further increase accuracy and allow routes to be marked more precisely with the danger index. To enrich the data, we have taken into account far more crime types than any other implementation, which gives us an upper hand on the danger index and its representation of the likelihood of crime occurring on a route. To give the user all options, we show and mark all the routes between the given coordinates. Finally, the comparative study helped us confirm that our direction for the problem statement is correct.

Thus, through this paper, we suggest a probable approach that could serve as a base for the further development of a full-fledged product with potentially huge application in the social domain. Through this analysis and research into alternative methods and parameters, we have developed a basic website that uses the Google Maps API to fetch the path coordinates between the input locations and individually calculates the danger index for each path, giving the user an overview of how safe it is. The approach employs the data provided by NYC Open Data, consisting of the NYPD Arrest Data and Motor Vehicle Collisions - Crashes datasets, which contain highly reliable information with no particulars of individuals. The data is maintained and updated by the city government, providing consistently updated and exhaustive information. The diagram representing the flow of our application is shown in

Route number | Route colour | Time duration for reaching | Distance (in kilometers) | Danger index |
---|---|---|---|---|
1 | Purple | 3 h 43 min | 333 | 1.75 |
2 | Yellow | 4 h 36 min | 421 | 1.42 |
3 | Green | 5 h 5 min | 401 | 1.47 |

The displayed result comprises the distance, the time required to traverse the route, and the danger index calculated by the proposed algorithm, giving the user an overview of the safety of the path they are about to travel.

We have performed a comparative study of various clustering algorithms and found K-means with 10 cluster centres to be the most optimal approach, with an average standard deviation of only 0.41232. The number of clusters was varied and compared to get the best out of them. Different clustering algorithms, such as Agglomerative Clustering, DBSCAN and Spectral Clustering, were also taken into consideration, and their respective results were listed and compared. A mathematical formulation was designed to compute the danger index. Extensive experiments reveal that the proposed model can be used for real-time safe route detection.

The approach is not final and can be further extended. The factors considered in our approach are the crime rate and the accident statistics, which could be extended to factors such as the liveliness of the route, the availability of streetlights, and the availability of communal facilities such as hospitals. If the dataset is altered, we would need to re-verify whether the approach still suits it. Many parameters, such as the number of clusters and the algorithm used, can be changed and fitted to the problem statement. For instance, we could expand our dataset to an entire country instead of focusing merely on a particular city; this would require managing a whole new scope of variation in the datasets, with the number of data points increased exponentially. Hence, we see that there is immense scope for improvement, as the problem directly affects the day-to-day life of the masses. The results presented in this paper help one not only to perceive why K-means works best but also to understand what the other approaches yield. K-means itself also leaves scope for improvement, as many variations of it exist, and more isolated exploration of these variations could yield even better results and might help in assembling a full-fledged, robust system.

The solution achieved does, therefore, have its own shortcomings. It is highly dependent on the input data, and if the data is altered, a difference in results is expected. More parameters need to be included in the calculation of the danger index to make it a fail-proof system. A significant amount of preprocessing is required, and the complexity could be further reduced by the use of advanced data structures. In conclusion, this is not the ultimate solution but rather a base on which to build a solution that could counter the problem. In the near future, further exploration of the approaches used could give better results and might help in designing a full-fledged, robust system.