Speeding up geocoding performance

Val1 · March 22, 2016, 11:23am

I am still having problems with your explanations.

1. That's correct: with your example it seems to behave OK. But the original address example that I provided, "1314 University Ave, Sewanee TN 37375", tells a different story. Because of the data issue in ThinkGeo outlined in my previous post, the geocoder fails to match anything when the ZIP code is included in the string. If I remove the ZIP code and submit "1314 University Ave, Sewanee TN", the geocoder takes 1028 ms to return, even with the changes you proposed to the geocoder initialization. Here is the sample code that I ran:

var results = geoCoder.Match("1314 University Ave, Sewanee TN");

Since I don't have alternative ZIP codes for this address (this is what I received in the source client data), my only option is to match without the ZIP code, since ThinkGeo does not understand the address with the correct ZIP code anyway. I am trying to geocode about 1.5 million records, of which about 500,000 fail to geocode with the ZIP code. At roughly one second per record, the fallback for the problematic records would take about 140 hours, which is not acceptable.
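As a rough check of that estimate, the arithmetic can be sketched in Python. It uses only the figures quoted above (about 500,000 fallback records, 1028 ms per match) and assumes the records are matched one after another on a single thread, which the post does not state:

```python
# Back-of-the-envelope estimate of the fallback pass described above.
# The two figures come from the post; sequential matching is an assumption.
FALLBACK_RECORDS = 500_000   # records that fail to geocode with the ZIP code
SECONDS_PER_MATCH = 1.028    # 1028 ms measured for one match without a ZIP code


def fallback_hours(records: int, seconds_per_match: float) -> float:
    """Return the total time in hours for a sequential pass over all records."""
    return records * seconds_per_match / 3600


hours = fallback_hours(FALLBACK_RECORDS, SECONDS_PER_MATCH)
print(f"{hours:.1f} hours")  # 142.8 hours, i.e. roughly 6 days
```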

2. From your explanation, streets.dbf was built incorrectly and the logic seems to be faulty. Basically, quite a lot of data is missing from your streets.dbf file. Having different ZIP codes on the left and right sides of a street is quite common in the US, since that is how the Postal Service defines ZIP code boundaries. In the example file that I supplied in my previous post, which covered only one US county, there were 130 cases where the ZIP codes differed between the left and right sides of the street. In addition, one TLID record can match multiple address ranges. Here is another example of the problems in your data set:

Census data:

    TLID       FROMHN  TOHN  SIDE  ZIP    FROMTYP  FROMARMID  TOARMID  ARID           MTFCC  FULLNAME         NAME         PRETYPABRV  SUFTYPABRV  PRETYP
    614844025  4073    4071  R     37345  I        0          0        4002318771642  D1000  John Hunter Hwy  John Hunter              Hwy
    614844025  4073    4071  R     37345  I        0          0        4002318771642  D1000  State Hwy 122    122          State Hwy               579
    614844025  4074    4072  L     37345  I        0          0        4002318771632  D1000  John Hunter Hwy  John Hunter              Hwy
    614844025  4074    4072  L     37345  I        0          0        4002318771632  D1000  State Hwy 122    122          State Hwy               579
    614844025  4142    4100  L     37328  I        0          0        4002318771639  D1000  John Hunter Hwy  John Hunter              Hwy
    614844025  4142    4100  L     37328  I        0          0        4002318771639  D1000  State Hwy 122    122          State Hwy               579
    614844025  4143    4101  R     37328  I        0          0        4002318771626  D1000  John Hunter Hwy  John Hunter              Hwy
    614844025  4143    4101  R     37328  I        0          0        4002318771626  D1000  State Hwy 122    122          State Hwy               579

    (The columns PLUS4, TOTYP, PREDIRABRV, PREQUALABR, SUFDIRABRV, SUFQUALABR and PREDIR are empty in all 8 rows and are left out.)

All 8 of those records get condensed into just one in your dataset. So in this example you are not only losing multiple ZIP codes, you are also losing the fact that a particular TLID translates into multiple street names. Some other things to consider: county names can differ on each side of the street, and so can city names. Think of Lake Cook Road in the Chicago area, for example: that road separates Lake and Cook counties.

So, to answer your last question, whether I want you to improve your index data: the answer is a definite YES, because in its current state it's not exactly usable.

ThinkGeo · March 22, 2016, 11:23am

Val,

We are still working on rebuilding the index data. When the new index data is ready, we will change the logic in the Geocoder core code. I estimate that the next release will be published next week.

Thanks,

Scott,

Val1 · March 22, 2016, 11:23am

That's good news. Looking forward to your updates.

ThinkGeo · March 22, 2016, 11:23am

Val,

The new index data for US streets has been prepared. We are now running the NUnit tests to verify the new index data. The test data is a large dataset, so it will take some time. I estimate we will provide you with the new version of Geocoder tomorrow.

If you have any questions please let me know,

Thanks,

Scott,

ThinkGeo · March 22, 2016, 11:23am

Val,

The new index data for US streets has been prepared. Please contact ThinkGeo support to get the updated index files and try again. If you encounter any problems when using the updated index files in your application, please let us know.

Thanks,

Scott,

Val1 · March 22, 2016, 11:23am

I downloaded the latest file and tried to use it.

So the good news is that the first example, TLID record #608794984, is now correctly in the table with 2 entries and geocodes correctly as well. The bad news is that there is still only one entry in streets.dbf for TLID record #614844025 when there are supposed to be 8 records. Does someone need to provide you with a listing of all TLIDs that are incorrectly recorded in your streets.dbf, or is it something that you can figure out as part of your quality check processes?

ThinkGeo · March 22, 2016, 11:23am

Val,

Maybe I misunderstood you. You said TLID 614844025 has just one record in streets.dbf. I think that is correct: I checked the original data source for this TLID and it has just one ZIP code, 37328. You also told me that this TLID is supposed to have 8 records. I think something may be wrong there, since each TLID has at most 2 records, based on the left ZIP code and the right ZIP code.

Can you show me the data so that I can find the problem you are describing?

Thanks,

Scott,

Val1 · March 22, 2016, 11:23am

Scott, please look at the reply I made on 01-05-2011. In the original source data TLID 614844025 has 8 records. They describe address ranges for 2 roads, 'State Hwy 122' and 'John Hunter Hwy' (I am guessing those are 2 aliases for the same street), and there are 2 ZIP codes, 37345 and 37328, per the original data from TIGER. As I mentioned before, a single TLID can have multiple records (not just 2). It can have multiple road aliases for the same address range as well as different ZIP codes for the left and the right side of the street. In the example with 614844025, there are 2 address ranges with 2 street aliases and different ZIP codes on the left and the right side, so it's 2x2x2 = 8 records in total that show up in TIGER data and should show up in streets.dbf.

Val1 · March 22, 2016, 11:23am

To see an example, look at the TLID query in the database that was created from TIGER data - docs.google.com/leaf?id=0Bx2P-A0O1hJzNzNmMzMyMjAtMTkwYi00NzExLWI0NDgtYWY4YjVkMjdkY2I0&hl=en&authkey=CLDb1agL

ThinkGeo · March 22, 2016, 11:23am

Val,

Can you tell me which TIGER data version you are using? I checked the original source data for streets and there is only one record for TLID 614844025, and the left and right ZIP codes are the same for this TLID. The attachment is the original source street data that includes TLID 614844025. You can open it with Excel or a DBF viewer tool and search for the value "614844025" to locate it in the dbf file.

According to the TIGER 2009 documentation, the "*_edges.dbf" file lists all street records for one county of a state, and you can get the state and county from the dbf file name. For example, the attached file is named tl_2009_47051_edges.dbf: "2009" is the TIGER data version, "47" is the statefp and "051" is the countyfp.

I compared the original dbf file with your Access file and they differ in many of the street fields. In your source data TLID 614844025 has 8 records, but in our source data it has only one record and one ZIP code, 37328. When you locate TLID 614844025 in the attached dbf file you can see that zipl and zipr are the same: 37328.
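The file naming convention described above can be sketched in Python. This is only an illustration: the function and class names are made up for this example, and it assumes the five-digit code in the name is a two-digit state FIPS code followed by a three-digit county FIPS code.

```python
import re
from typing import NamedTuple


class EdgesFileInfo(NamedTuple):
    year: str         # TIGER/Line data version, e.g. "2009"
    state_fips: str   # two-digit state FIPS code (statefp), e.g. "47"
    county_fips: str  # three-digit county FIPS code (countyfp), e.g. "051"


# Assumed pattern: tl_<year>_<state><county>_edges.dbf (or the zipped download).
_EDGES_NAME = re.compile(r"^tl_(\d{4})_(\d{2})(\d{3})_edges\.(?:dbf|zip)$")


def parse_edges_filename(filename: str) -> EdgesFileInfo:
    """Split a county edges file name into version, state and county codes."""
    match = _EDGES_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a county edges file name: {filename!r}")
    return EdgesFileInfo(*match.groups())


info = parse_edges_filename("tl_2009_47051_edges.dbf")
print(info.year, info.state_fips, info.county_fips)  # 2009 47 051
```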

So I now see where the problem between us lies: our source data are not the same. Can you give me the link to your original source data so that I can download it and compare?

Thanks for your post,

Thanks,

Scott,

tl_2009_47051_edges.zip (243 KB)

Val1 · March 22, 2016, 11:23am

Scott,

as mentioned above, the source files came from 2009 TIGER web site: www2.census.gov/cgi-bin/shap...unty=47051

There were two files that I was looking at:

1. Address Ranges Relationship File (addr.dbf)

2. Feature Names Relationship File (featname.dbf).

I have not worked really deeply with TIGER data for about 10 years, but if I had to guess, edges.dbf does not necessarily represent the source of truth for geocoding: it represents the information needed to draw the line segments (and potentially the primary road name). So there is a potential for multiple street/address ranges to be associated with one edge segment.

Some notes from looking at census.gov/geo/www/tiger...RSHP09.pdf:

Excerpt from page 4-15:

• Address ranges in the TIGER/Line Shapefiles may be associated with one or more of the street names that belong to an edge. Caution: Address range overlap conflicts may occur if the address ranges are associated with some street names or route numbers that were not intended for use in locating addresses. A route number may traverse several streets with similar house numbers but different common names that are used for mail delivery.

Excerpt from page 4-17:

Geocoding: To get the best match results, the Census Bureau advises data users to use all of the available address ranges to geo-reference/geocode addresses. A single pair of left- and right-side address ranges may not always provide complete address range coverage. This limitation is also true for the most inclusive address ranges as well. The address ranges in the TIGER/Line Shapefiles may be separated because of ZIP Code differences or to establish gaps created by out-of-sequence addresses located elsewhere. Some address ranges may include embedded alphanumeric characters or hyphens that make them distinct from the other address ranges.

4.2 Address Range-Feature Name Relationships

Address range-to-feature name relationship information is available by county in the following relationship file:

Address Range-Feature Name Relationship File

The Address Range-Feature Name Relationship File contains a record for each address range-linear feature name relationship. The purpose of this relationship file is to identify all street names associated with each address range. An edge can have several feature names; an address range located on an edge can be associated with one or any combination of the available feature names (an address range can be linked to multiple feature names). The address range is identified by the address range identifier (ARID) attribute, which can be used to link to the Address Ranges Relationship File. The linear feature name is identified by the linear feature identifier (LINEARID) attribute that relates the address range back to the Feature Names Relationship File (see Figure 5).

----------

Based on the Figure 5 relationships, to get the full address ranges you should have the following joins: edges to addr on TLID, addr to addrfn on ARID, and addrfn to featname on LINEARID. Once you have all those relationships established (as per the TIGER documentation) you end up with the correct set of address ranges and names. As described previously, for TLID 614844025 there are 8 unique address records.

Val1 · March 22, 2016, 11:23am

One minor correction to the joins I described in the previous post: featnames needs an additional join to edges.shp on TLID (per Figure 5 of the TIGER documentation).
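The chain of joins described in the two posts above can be sketched in Python on small in-memory tables. This is a minimal illustration, not how ThinkGeo builds streets.dbf. The addr rows use the values quoted earlier in this thread; the LINEARID values are placeholders invented for the example, and the addrfn rows assume that every address range is linked to both street names, as the 8 quoted records suggest.

```python
from collections import defaultdict

TLID = 614844025

# edges: one line segment per TLID.
edges = [{"TLID": TLID}]

# addr rows for this TLID (values taken from the records quoted in this thread).
addr = [
    {"TLID": TLID, "FROMHN": "4073", "TOHN": "4071", "SIDE": "R", "ZIP": "37345", "ARID": "4002318771642"},
    {"TLID": TLID, "FROMHN": "4074", "TOHN": "4072", "SIDE": "L", "ZIP": "37345", "ARID": "4002318771632"},
    {"TLID": TLID, "FROMHN": "4142", "TOHN": "4100", "SIDE": "L", "ZIP": "37328", "ARID": "4002318771639"},
    {"TLID": TLID, "FROMHN": "4143", "TOHN": "4101", "SIDE": "R", "ZIP": "37328", "ARID": "4002318771626"},
]

# featnames rows; the LINEARID values are placeholders, not real TIGER identifiers.
featnames = [
    {"TLID": TLID, "LINEARID": "LID-1", "FULLNAME": "John Hunter Hwy"},
    {"TLID": TLID, "LINEARID": "LID-2", "FULLNAME": "State Hwy 122"},
]

# addrfn rows: assumed here to link every address range (ARID) to every name.
addrfn = [{"ARID": a["ARID"], "LINEARID": f["LINEARID"]} for a in addr for f in featnames]


def index_by(rows, key):
    """Group rows by the value of one column, like a simple hash join index."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def address_records(edges, addr, addrfn, featnames):
    """Join edges -> addr -> addrfn -> featnames and return one tuple per match."""
    addr_by_tlid = index_by(addr, "TLID")
    addrfn_by_arid = index_by(addrfn, "ARID")
    names_by_linearid = index_by(featnames, "LINEARID")
    records = []
    for edge in edges:
        for rng in addr_by_tlid.get(edge["TLID"], []):          # edges to addr on TLID
            for link in addrfn_by_arid.get(rng["ARID"], []):    # addr to addrfn on ARID
                for name in names_by_linearid.get(link["LINEARID"], []):  # addrfn to featnames on LINEARID
                    if name["TLID"] != edge["TLID"]:            # featnames to edges on TLID
                        continue
                    records.append((rng["FROMHN"], rng["TOHN"], rng["SIDE"], rng["ZIP"], name["FULLNAME"]))
    return records


records = address_records(edges, addr, addrfn, featnames)
print(len(records))  # 8
```

With the sample rows this yields 4 address ranges times 2 street names, which matches the 8 records quoted for TLID 614844025.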

Val1 · March 22, 2016, 11:23am

Scott, any updates?

ThinkGeo · March 22, 2016, 11:23am

Val,

Sorry for the inconvenience. We found the reason and discussed it. As you know, the current streets.dbf file is more than 2.3 GB and has about 39,067,712 records. We queried the total number of feature name records for the whole USA, and the valid record count is 44,003,151. If we simply rebuild streets.dbf with all feature names, I believe the file will be about 5 GB. I think that is unacceptable and will cause performance issues for geocoding. We are looking for a solution now; we may change the index data structure so that this problem can be resolved smoothly.

Please keep an eye on the official site. If there are any new updates, we will let you know.

Thanks,

Scott,

Val1 · March 22, 2016, 11:23am

Scott, do you have any timelines for updates? This is a pretty critical issue for us to be able to use this product effectively in a production environment.

Also, as part of this process, are you planning to look at the original performance issues (ie, when ZIP code is not supplied as part of the data to geocode?)

ThinkGeo · March 22, 2016, 11:23am

Val,

I’m sorry for the issue. We are currently discussing the design for rebuilding the Geocoder index data. It is a big change to our current structure, so we have to settle the design first and then rebuild. In our design, we include all of the feature names for each street record, and we still take the ZIP code from the *_edges.dbf file, because *_edges.dbf includes the zipl and zipr fields and we can determine both of them. If a street has just one ZIP code in the *_edges.dbf file, we will extract it directly and mark it as the only ZIP code.
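The ZIP code rule described above can be sketched in Python as follows. This is a hypothetical illustration, not ThinkGeo's code: the record is a plain dictionary keyed by the ZIPL and ZIPR field names, the "B" marker for an edge with a single ZIP code is invented for this example, and the second sample edge is made up to show the two-ZIP case.

```python
from typing import Dict, List, Tuple


def zip_codes_for_edge(edge: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (side, zip) pairs for one edge record.

    "L" and "R" mark side-specific ZIP codes. "B" (both sides) is a
    hypothetical marker for an edge that has only one ZIP code.
    """
    left = edge.get("ZIPL", "").strip()
    right = edge.get("ZIPR", "").strip()
    if left and right and left != right:
        return [("L", left), ("R", right)]
    only = left or right
    return [("B", only)] if only else []


# Same ZIP code on both sides, as reported for the attached edges file.
print(zip_codes_for_edge({"ZIPL": "37328", "ZIPR": "37328"}))  # [('B', '37328')]

# Made-up edge with different ZIP codes on each side.
print(zip_codes_for_edge({"ZIPL": "37345", "ZIPR": "37328"}))  # [('L', '37345'), ('R', '37328')]
```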

I just wanted to let you know how we plan to do it. If you have any suggestions or issues, please let me know.

Thanks,

Scott,

Val1 · March 22, 2016, 11:23am

Any updates?

ThinkGeo · March 22, 2016, 11:23am

Val,

Sorry for the delay. We have decided how to design the new index data for streets. The streets.dbf file will be kept and its structure will not change; we will add all of the feature names for each street to it. According to our calculation, streets.dbf will then be about 5 GB, which is over the 4 GB file size limit of FAT32 disks. So we will split streets.dbf into two dbf files according to the first letter of the street name: the first file is streets_a-g.dbf and the second is streets_h-z.dbf. That means streets_a-g.dbf stores all street names from A to G, with their feature names, for both the right and left ZIP codes.
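One way to picture the split is a small routing function that picks the file from the first letter of the street name. The Python sketch below is only an assumption about how such a lookup could work: the two file names come from the post above, but the function name is made up, and the handling of names that do not start with a letter (for example "122") is a guess.

```python
def streets_file_for(street_name: str) -> str:
    """Pick the index file for a street name by its first letter."""
    first = street_name.strip()[:1].lower()
    if "a" <= first <= "g":
        return "streets_a-g.dbf"
    # Assumption: h-z, plus anything that does not start with a-g
    # (digits such as "122", or an empty name), goes to the second file.
    return "streets_h-z.dbf"


print(streets_file_for("John Hunter"))  # streets_h-z.dbf
print(streets_file_for("Elm"))          # streets_a-g.dbf ("Elm" is a made-up name)
```

A lookup then only has to open one of the two files, at the cost of having to decide up front which part of the name the split is keyed on.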

We have added this task to our Gemini issue list. There is still a lot to do: after we build the new index data we also need to change our core matching code and run the NUnit tests. It needs a little more time. We have to make sure everything is OK before we publish it.

If there are any updates we will let you know as soon as possible,

Thanks,

Scott,

Val1 · March 22, 2016, 11:23am

Sorry to bother you again, but it's been almost a month since the last update on this issue. Are you getting closer to releasing an update?

ThinkGeo · March 22, 2016, 11:23am

Val,

We are still working on the new dataset for the Geocoder index. While building the new index data we ran into an issue that took a lot of our time. We have now fixed it and progress is back to normal. I estimate we will release the new index data within one week.

Thanks,

Scott,