Improving libgweather Locations

This page describes the process for improving the set of locations known by libgweather for a given country.

Getting Started

First check the issue tracker to see if someone else has already created a "Country name locations" bug for your country. If not, create it yourself.

libgweather gets its location information from a file called Locations.xml. However, this is generated from a more detailed file in the source tree called Locations.xml.in, which is in turn generated from various other databases of cities and weather information. This means that:

  1. For purposes of deciding whether or not a city is missing (or otherwise incorrect), you need to look at data/Locations.xml.in in the libgweather module in Git, not the Locations.xml file installed on your computer. (The Git page has more information on GNOME's Git setup, but if you just want to check out libgweather, you can do: git clone https://gitlab.gnome.org/GNOME/libgweather.git)

  2. You cannot simply submit a patch to data/Locations.xml.in to add, remove, or modify cities, because that file is generated from other data. This page explains the right way to submit patches and/or suggestions to the locations database.

Please do not:

  1. request that we add your hometown if it is not a "major city" as discussed below. Locations.xml.in is already enormous, and we cannot simply add every city that any GNOME user lives in. Bug 530178 discusses the possibility of making libgweather use an online database so that we can resolve arbitrary city names.

  2. submit a bug report asking us to add just one city. If we fix the location database one city at a time, it will take decades to get everything right, but if we fix it up a whole country at a time, we should be able to get a large percentage of it fixed by GNOME 2.26.

  3. ask us to add cities without also looking for cities to remove (as discussed below in "Getting Rid of Junk"); we really need to clean the cruft out of the database.

Making Sure All Major Cities Are Present

We want to list all of the "major" cities in every country. At the moment, that means:

  1. All cities with more than 100,000 people
  2. All cities that are major political centers (eg, national or state/provincial capitals, but not necessarily the capitals of smaller subdivisions).

  3. If the country is divided into <state>s (which can mean "states", "provinces", "administrative regions", etc), then every "state" needs to have at least one city, so if it doesn't have a city that meets either of the above two criteria, then its largest city should be considered "major".
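
The three criteria above can be read as a small selection procedure. The following is only a sketch of that reading, not code from libgweather: the City record, its field names, and the sample data for the fictional country are all made up for illustration.

```python
from dataclasses import dataclass

POPULATION_THRESHOLD = 100_000  # criterion 1: more than 100,000 people


@dataclass
class City:
    # Hypothetical record; libgweather does not define this type.
    name: str
    state: str
    population: int
    is_capital: bool = False  # national or state/provincial capital


def major_cities(cities):
    """Return the names of the cities that count as "major"."""
    # Criteria 1 and 2: large cities and major political centers.
    major = {c.name for c in cities
             if c.population > POPULATION_THRESHOLD or c.is_capital}

    # Criterion 3: a state with no major city contributes its largest city.
    by_state = {}
    for c in cities:
        by_state.setdefault(c.state, []).append(c)
    for state_cities in by_state.values():
        if not any(c.name in major for c in state_cities):
            largest = max(state_cities, key=lambda c: c.population)
            major.add(largest.name)
    return major


# Made-up data for a fictional country with two states.
sample = [
    City("Alpha", "North", 250_000),
    City("Beta", "North", 5_000),
    City("Gamma", "North", 30_000, is_capital=True),
    City("Delta", "South", 20_000),
    City("Epsilon", "South", 8_000),
]
print(sorted(major_cities(sample)))
```

In the sample, "South" has no city that meets the first two criteria, so its largest city is picked up by the third. The judgement calls described below are deliberately not modelled.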

Wikipedia has lots of pages like "List of Major Cities in Elbonia" or "List of Cities in Equatorial Kundu With More Than 100,000 Inhabitants" that will probably be helpful here.

In some cases it may make sense to consider a city to be major even if it doesn't meet those criteria; these rules are still being figured out.

Note that in some cases, libgweather may be aware of cities in a country, but not know of any weather station around them, in which case they won't be listed as <city> entries in Locations.xml.in; in this case there will be an XML comment at the start of the country entry stating "Could not find weather stations for the following major cities". We are still trying to figure out exactly what to do about this. For the moment though, the important thing is that you don't need to consider those cities "missing"; we know about them, we just don't know what to do with them.

If major cities are missing, they can be added by editing data/major-cities.txt in the source tree. That file has comments describing its format. You will want to add a new section for your country. You'll need to find its two-letter FIPS code by checking the country's <fips-code> entry in Locations.xml.in; this might not be the same as its more-familiar ISO 3166 code. For "major cities", use "2" for the "Level" column.

Eg, the island nation of Nauru currently has no cities listed in Locations.xml.in. So you could fix that by adding these lines to the bottom of data/major-cities.txt:

    # Nauru. According to Wikipedia, the city of Yaren is the "de facto capital".
    NR  -0.55   166.917 2       Yaren

You can then submit your fixes either as a git diff, or by just pasting the additional data into the bug report.
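
Before submitting, it may be worth sanity-checking the new lines. The sketch below assumes the column layout suggested by the Nauru example (FIPS code, latitude, longitude, level, name, separated by whitespace); the comments in data/major-cities.txt are the authoritative description of the format, and parse_major_city_line is a hypothetical helper, not part of libgweather.

```python
def parse_major_city_line(line):
    """Parse one line into (fips, lat, lon, level, name).

    Returns None for blank lines and '#' comments. The column order is
    an assumption based on the Nauru example above.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None

    # City names may contain spaces, so split into at most 5 fields.
    parts = stripped.split(None, 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 columns, got {len(parts)}: {line!r}")

    fips, lat, lon, level, name = parts
    lat, lon, level = float(lat), float(lon), int(level)

    if len(fips) != 2 or not fips.isalpha() or not fips.isupper():
        raise ValueError(f"FIPS code should be two capital letters: {fips!r}")
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return fips, lat, lon, level, name


print(parse_major_city_line("NR  -0.55   166.917 2       Yaren"))
```

A swapped latitude and longitude for Yaren would be caught by the range check, since 166.917 is not a valid latitude.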

Getting Rid of Junk

We also need to get rid of useless cities. These come in a few forms:

  1. Alternate names, neighborhoods/districts, suburbs: in some cases, we will have multiple entries for a given city for some reason; either the city is listed under two different names (or two different romanizations of the same name), or else we list both the city and one of its neighborhoods/districts (eg, including both "New York City" and "Manhattan"), or else we list both a major city and a very small suburb of the city that really doesn't need its own entry. In these cases, we should remove the entry for the smaller "city".
  2. In some cases, we may list a tiny city (or neighborhood/district), while completely ignoring a larger city nearby. In these cases, we should replace the entry for the tiny city with an entry for the larger city instead.
  3. Finally, in some cases there are tiny cities listed "in the middle of nowhere". In these cases, even if there isn't another city nearby, it's not worth listing these tiny cities. We don't currently have a firm definition of when a city is worth getting rid of; this is a judgement call.

We haven't yet figured out how this data will be automatically incorporated into libgweather, so the best way to provide it currently is to just give us a list of the affected cities, and what they should be merged with / renamed to / removed.

Additional Correctness Issues

  • Is every city name correct? (ie, it's the real name of a real city which is actually in that country, not some other country.)
  • For cities known by multiple names:
    • If the city is known by a different name in English than it is locally, is it listed by its English name, with a comment indicating the local name?
    • If the city is known by multiple names locally, are there comments indicating the other local names? (Note that we're mostly only interested in local names; it's generally not interesting to have comments giving alternative foreign names for the city, except in the case of cities like Gdańsk/Danzig or Kaliningrad/Königsberg that have very conspicuously changed names over the centuries.)

    • For cities with multiple local names but no English-specific name, are we using the right local name as the "English" version? (Eg, since our dataset has both Finnish and Swedish names for many cities in Finland, we have to explicitly tell it to use the Finnish names. There may be other countries that have this problem as well.)
  • Likewise, for countries that are divided into <state>s, are the <state> names correct? (Basically all the same issues as the city names above apply here.)

    • Also, are all cities in the correct <state>?

    • One known bug with the handling of <state>s: we can't currently deal with cities that are considered to be their own state-level entities (eg, Beijing, China or Berlin, Germany). For now, those cities need to be placed inside imaginary <state>s of the same name. This will eventually be fixed.

  • Is the use of transliteration / diacritic marks correct and "good"? In some cases we're stripping out diacritic marks to make the names more English-like. (Eg, the source dataset has all sorts of macrons and hooks and stuff in its transliterations of Arabic names, so that "Riyadh" becomes "Riyāḑ", etc. We flatten those back down to ASCII.) It might be the case that in some countries we're stripping out diacritics that we should be leaving there. (Or vice versa, maybe we're leaving in diacritics that should be skipped in the English versions?)
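
To get a feel for what "flattening to ASCII" does to a name, here is a minimal sketch using Unicode decomposition. This is one common way to strip diacritics, not necessarily what the libgweather scripts do; strip_diacritics is a hypothetical helper.

```python
import unicodedata


def strip_diacritics(name):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in ("Riyāḑ", "Gdańsk", "Königsberg"):
    print(name, "->", strip_diacritics(name))
```

Note that this turns "Riyāḑ" into "Riyad", not "Riyadh", so mechanical flattening does not by itself recover the usual English spelling. Letters that have no decomposition (such as "ł" or "ø") pass through unchanged and would need separate handling.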

Really Nitpicky Extra-Credit Stuff

  • Does each city and each weather station location have the correct coordinates?
    • Cities are easy (but time-consuming) to verify; just search for the coordinates in Google Maps. (The coordinates listed don't need to be the exact coordinates of the city center, just something reasonably close.)
    • Weather stations are harder; if the weather station is an airport, you can often tell that the coordinates are correct because you can see the airport's runways in the Google Maps satellite view. Other than that, if the station is named after a city or geographic feature, you can at least check that it really is near that feature. (In particularly egregious cases, you can tell that the coordinates must be wrong because they place the weather station in the ocean, or in the wrong country...)
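
The "is it really near that feature" check can be roughed out numerically before opening a map. The sketch below uses the haversine great-circle formula; the 50 km threshold and the station coordinates are arbitrary illustrations, not values used by libgweather.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_plausible(city, station, max_km=50.0):
    """True if the station lies within max_km of the city (arbitrary cutoff)."""
    return distance_km(*city, *station) <= max_km


yaren = (-0.55, 166.917)                   # from the major-cities.txt example
nearby_station = (-0.547, 166.919)         # hypothetical coordinates
wrong_hemisphere = (0.55, 166.917)         # same point with the sign of the latitude flipped

print(is_plausible(yaren, nearby_station))
print(is_plausible(yaren, wrong_hemisphere))
```

A check like this only flags candidates for a closer look; a station far from its city is not necessarily wrong, and one that is close can still be misplaced.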

Locating Unknown Weather Stations

The upstream weather station data file is not kept up-to-date, and there are many reporting weather stations whose names or locations we don't actually know. We can generally tell what country they are in based on the first two letters of their station code, but there's no way to automatically figure out where they are.

If your country has unknown weather stations, there will be a comment at the top of the <country> in Locations.xml.in stating "Could not find information about the following stations, which may be in Country Name". (Note "may be"; the code that generates these warnings is easily confused. Eg, because there's one US military base with a "KQ" station code in Qatar, it thinks that all of the other missing "KQ" stations are in Qatar too, but really they aren't.)
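
The prefix-based guess is easy to reproduce when triaging a list of unknown stations. The sketch below only groups codes by their first two letters; the station codes are made up, and, as the "KQ" case shows, a shared prefix is a hint about the country rather than proof.

```python
from collections import defaultdict


def group_by_prefix(station_codes):
    """Group station codes by their first two letters."""
    groups = defaultdict(list)
    for code in station_codes:
        groups[code[:2].upper()].append(code)
    return dict(groups)


# Hypothetical codes, for illustration only.
unknown = ["KQAA", "EFXX", "KQAB"]
for prefix, codes in sorted(group_by_prefix(unknown).items()):
    print(prefix, codes)
```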

In order to use these stations, we need to at least know their latitude and longitude (and preferably also their elevation in meters, and some name for the station). If the code corresponds to an airport, you may be able to find it by googling for that code plus "airport". Or you might be able to get a forecast for that station on a site like wunderground.com which will contain additional information...

There are also a handful of weather stations listed at the end of Locations.xml.in that we can't even figure out what country they belong to.

Projects/LibGWeather/ImprovingLocationsOld (last edited 2019-01-13 13:15:45 by AndreKlapper)