This post explains how to access and set up the HLFS tools library for use on HLFS data.
Prerequisites for using the HLFS tools
- You should have access to SQL Server Management Studio in the IDI. This is a tool that enables you to run queries and stored procedures directly on the SQL Server database that holds the IDI data. You can verify this by logging into an IDI session and locating SQL Server Tools in the Start Menu. This tool is usually made available to IDI researchers automatically.
- To use the HLFS tools, the primary prerequisite is access to the HLFS datasets in the IDI. You need to request access to these datasets when you submit your request for an IDI research project, or submit a variation request if you already have an IDI project and need HLFS data added to it.
- Assuming you can already access and use HLFS data (i.e., you can access datasets under the IDI_Clean_YYYYMM.hlfs_clean schema), you should additionally have access to the IDI_UserCode database on the IDI SQL server (prtprdsql36) - this is usually granted automatically to IDI research projects. You can verify access by logging into SQL Server Management Studio (instructions are available in the IDI Wiki) and checking the databases listed in the database viewer on the left side of the window.
How can I obtain access to the HLFS Summarisation tools?
The HLFS summarisation tools are a collection of SQL programs that enable you to easily create estimates (weighted counts, means, rates, comparisons between HLFS quarters, etc.) and confidence intervals without an intimate understanding of replicate weights or the jackknife method of estimation. For those who already know how this works, the collection saves you the effort of doing it in an ad-hoc fashion.
The tools can be accessed either from ID Commons directly (links to the tools are available on the HLFS landing page) or from within the IDI.
You will need to install these SQL programs on the IDI_UserCode database before you can use them. Details on how to install them are described in subsequent sections.
How do I install these tools?
Assuming you've checked and validated that all the prerequisites are satisfied, use the following steps to install the tools:
1. Copy the HLFS summarisation tools folder into a suitable location in your IDI project folder.
2. Open SQL Server Management Studio and log into the SQL server.
3. From the HLFS summarisation tools folder, open the hlfs_data_preparation.sql file in SQL Server Management Studio. This script takes the HLFS data from the hlfs_clean schema in the IDI_Clean version of your choice, applies some cleaning rules, and prepares the dataset for use by the HLFS summarisation tools.
4. From the Query menu at the top of SQL Server Management Studio, select SQLCMD Mode to enable it.
5. After the documentation section of the code file, at the beginning of the code section, you'll notice 4 lines of code that look like this:
:setvar targetdb "{targetdb}"
:setvar targetschema "{targetschema}"
:setvar projprefix "{projprefix}"
:setvar idicleanversion "{idicleanversion}"
- Replace the placeholder {targetdb} with the name of the database where this object will be created - [IDI_UserCode] in most scenarios, giving :setvar targetdb "[IDI_UserCode]".
- Replace {targetschema} with the project schema allocated to your project - something like [DL-MAA20XX-XX].
- Replace {projprefix} with an identifier of your choice - something that lets you easily identify that this table was made by you for a specific purpose. This exists in case you need multiple copies of the object within the same project schema, so that you don't overwrite earlier copies. Examples: dev, final.
- Replace {idicleanversion} with the IDI_Clean version you want to use in your project. For example, if you specify IDI_Clean_202210, this database object will use the HLFS dataset in the IDI_Clean_202210 database as its source.
6. Select everything (Ctrl + A) and click Execute (or press F5) to run the script end-to-end. Note: if you get the error CREATE VIEW must be the first statement in a query batch, comment out the statements use $(targetdb) and DROP VIEW IF EXISTS $(targetschema).$(projprefix)_hlfs_data, then click Execute (or press F5) to run the script again. At the end of execution, a new database object named <projprefix>_hlfs_data will be created under IDI_UserCode.[DL-MAA20XX-XX]. You can verify this by right-clicking the IDI_UserCode database in the viewer, clicking Refresh, and navigating to IDI_UserCode → Views to check for the object. This view holds the cleaned HLFS data that you can use for generating your summaries.
7. Now, from the HLFS summarisation tools folder, open the specific summarisation procedure you want to run. Details regarding each summarisation procedure can be found via the links provided on the HLFS landing page. For instance, the SELevel procedure creates weighted population counts by a set of grouping variables, along with the confidence intervals around each estimate as generated from the replicate weights.
8. Once the stored procedure file is open, you'll notice 3 lines of code that look like this (just like step 5):
:setvar targetdb "{targetdb}"
:setvar targetschema "{targetschema}"
:setvar projprefix "{projprefix}"
- Enable SQLCMD Mode (just like in step 4).
- Replace the placeholder {targetdb} with the name of the database where this object will be created - [IDI_UserCode] in most scenarios.
- Replace {targetschema} with the project schema allocated to your project - something like [DL-MAA20XX-XX], just like step 5.
- Replace {projprefix} with an identifier of your choice, just like step 5, so that you don't overwrite earlier copies of the object within the same project schema. Examples: dev, final.
9. Select everything (Ctrl + A) and click Execute (or press F5) to run the code end-to-end. This will create the stored procedure in the database (for example, IDI_UserCode.[DL-MAA20XX-XX].SELevel). You can verify this by right-clicking the IDI_UserCode database in the viewer, clicking Refresh, and navigating to IDI_UserCode → Programmability → Stored Procedures to check for the object.
10. Repeat steps 7 to 9 for each summarisation tool you want to install. Once you're done, you are ready to start using the tools to generate summaries from HLFS data.
How do I use the HLFS Summarisation tools?
From SQL Server
Assuming you've successfully installed the tools using the directions above, you are now ready to start creating summaries. Each summarisation tool creates a different kind of summary, and accepts a different set of parameters as input. Detailed information on the summaries produced is available on the tool-specific pages (links are available on the landing page).
In general, the syntax for calling the summarisation tool is as follows.
EXEC [DL-MAA20xx-xx].<procedurename> <parameter1>, <parameter2>, ...
The functionality is best explained with an example. Assume that you have already run the hlfs_data_preparation.sql script and created a cleaned HLFS dataset named [DL-MAA2020-47].test_hlfs_data, and have installed the SEMean.sql stored procedure under your project schema [DL-MAA2020-47] with the name test_SEMean. If you wanted to obtain the average usual working hours per week (hlfs_urd_usual_hr_tot_nbr) by region (hlfs_urd_region_code) and sex (hlfs_urd_sex_code) for HLFS quarter 145, you would use the following code snippet. Note that the exact parameters to be supplied vary between HLFS summarisation tools.
USE IDI_UserCode
GO
EXEC [DL-MAA2020-47].test_SEMean
'[DL-MAA2020-47].[test_hlfs_data]',
'hlfs_urd_region_code, hlfs_urd_sex_code',
'where hlfs_urd_quarter_nbr = 145',
'hlfs_urd_usual_hr_tot_nbr'
More detailed examples are provided under the linked pages for each summarisation tool for your reference.
From SAS
/* If you're running this from SAS.*/
proc sql;
connect to odbc(dsn=IDI_UserCode_srvprd);
create table work.mysummarytable as
select * from connection to odbc(
EXEC [DL-MAA2020-47].test_SEMean
'[DL-MAA2020-47].[test_hlfs_data]',
'hlfs_urd_region_code, hlfs_urd_sex_code',
'where hlfs_urd_quarter_nbr = 145',
'hlfs_urd_usual_hr_tot_nbr'
);
disconnect from odbc;
quit;
From R
# If you plan on running this from R.
require(DBI)
conn <- DBI::dbConnect(odbc::odbc()
,Driver = "<SQL Driver name>"
,Server = "<IDI Server name>"
,Database= "IDI_UserCode"
,Trusted_Connection="Yes"
,ConnectionPooling=FALSE)
res <- DBI::dbGetQuery(conn, "EXEC [DL-MAA2020-47].test_SEMean
'[DL-MAA2020-47].[test_hlfs_data]',
'hlfs_urd_region_code, hlfs_urd_sex_code',
'where hlfs_urd_quarter_nbr = 145',
'hlfs_urd_usual_hr_tot_nbr'")
print(res)
DBI::dbDisconnect(conn)
What outputs can I expect from the HLFS Summarisation tools?
Each tool serves a different purpose, and produces a different summary. You can create weighted counts, rates, means, or differences between specific HLFS quarters. The output table also contains subtotals and grand totals by the grouping columns that you supply as input. More details regarding the outputs produced by each procedure can be found on the specific pages dedicated to each tool (available on the HLFS landing page).
How will this data compare with published estimates?
- Data produced using these procedures will match person-level survey estimates as produced in the quarter that was most recently loaded to the IDI.
- Data produced using these procedures will not match seasonally adjusted or trend series. ‘Headline’ statistics published (e.g. the official unemployment rate) are often seasonally adjusted to remove regular seasonal effects from labour market statistics.
- This is done at a total level for the latest and all previous quarters after survey series are produced – indirectly in some cases (e.g. male unemployment and female unemployment levels are seasonally adjusted separately, and then added to calculate total unemployment).
- Data produced using these procedures will not match the most up-to-date survey estimates on Infoshare.
- HLFS benchmarks are based on national population estimates, which incorporate outcomes-based migration measures. Outcomes-based migration measures, while more accurate than historic intentions-based measures, are subject to revision until the data is finalised.
- These migration revisions lead to subsequent revisions to historic HLFS benchmarks, which is reflected in both final and replicate weights. The more recent quarters will have larger revisions, while those 14 or more quarters back should be completely finalised.
- The HLFS IDI dataset contains the final weights for all quarters that were used in the most recent quarter available in that IDI refresh.
- Data produced using these procedures will not match published estimates of total actual hours worked.
- While the HLFS rolling collection period is designed to measure across the quarter, the actual reference period does not always line up with the ideal reference period. For example, a March quarter is theoretically 1st January to 31st March, but the reference period may cover 26th December to 27th March.
- For most variables, the difference between the ideal and actual reference periods may not matter. However, there is evidence that it affects total actual hours estimates, so a model-based approach is used to adjust the national actual hours series as surveyed to the ideal calendar period before publication.
- Data produced using these procedures will match survey estimates in the Excel and CSV tables published as part of that most recent quarter’s release. To test this: for results produced using IDI_Clean_202210, the most recent quarter loaded was June 2022, so rounded estimates of employment, unemployment and not-in-the-labour-force levels by sex produced using the HLFS data tool should match those found in table 3 of the Household labour force survey: June 2022 quarter Excel spreadsheet. A similar test can be performed for any IDI refresh and the original publication of the last quarter in that refresh.
What kind of sample errors does the HLFS data tool produce?
- This code calculates sampling errors using the replication-based delete-a-group jackknife method. These sample errors represent the 95% confidence interval of estimates.
- Within the HLFS dataset, 100 sets of replicate weights have been produced based on subsamples of the full sample, using the same benchmarks as the final weights. This allows users to estimate the overall variance of a parameter estimate from the variability of the 100 replicate estimates produced using the replicate weights. The formula behind this variance calculation is:
\hat{V}(\hat{y}) = \frac{G-1}{G} \sum_{g=1}^{G} \left( \hat{y}_{(g)} - \hat{y} \right)^2
where G is the total number of groups, g the replicate group, \hat{y} the estimate produced using the final weights, and \hat{y}_{(g)} the estimate produced using the replicate weights for group g. For more details, see Household Labour Force Survey sources and methods: 2025.
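As an illustration of the delete-a-group jackknife calculation described above, here is a minimal sketch. The replicate estimates below are simulated, and scaling the standard error by 1.96 to obtain a 95% confidence interval half-width is an assumption about how the tools present sample errors, not taken from the tools themselves.

```python
import math
import random

def jackknife_sample_error(full_estimate, replicate_estimates):
    """Delete-a-group jackknife sampling error (half-width of a 95% CI).

    full_estimate: the estimate produced using the final weights.
    replicate_estimates: the estimates produced using each of the G sets
    of replicate weights (G = 100 in the HLFS).
    """
    g = len(replicate_estimates)
    # variance = (G-1)/G * sum over groups of (replicate - full)^2
    variance = (g - 1) / g * sum(
        (rep - full_estimate) ** 2 for rep in replicate_estimates
    )
    return 1.96 * math.sqrt(variance)  # scale to a 95% confidence interval

# Simulated example: 100 replicate estimates scattered around a full-sample estimate
random.seed(42)
full = 150_000
reps = [full + random.gauss(0, 500) for _ in range(100)]

se = jackknife_sample_error(full, reps)
print(f"estimate: {full:,} ± {se:,.0f}")

# A benchmarked estimate, where every replicate equals the full estimate,
# has zero variance and therefore no sample error:
print(jackknife_sample_error(full, [full] * 100))  # 0.0
```

The second call also illustrates the point below about benchmarked estimates: when every replicate reproduces the same value, the variance collapses to zero.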
- Estimates that equal benchmarks (e.g. the working-age population within a certain region) will not have sample errors, since each set of replicate weights creates an identical benchmarked value, and the variance is 0.
- This code also produces relative sample errors (RSEs), to help put the absolute sample errors into context.
- Larger groups have more room for variance but tend to be more relatively certain. For example, an estimate of 2.5 million may have a sample error of 25,000, while an estimate of 100,000 may have a sample error of 10,000. While there is more absolute variance on the estimate of 2.5 million, its RSE is only 1%, compared with an RSE of 10% on the estimate of 100,000.
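The relative sample error is simply the sample error expressed as a percentage of the estimate. A quick sketch reproducing the figures above:

```python
def relative_sample_error(estimate, sample_error):
    """Relative sample error (RSE) as a percentage of the estimate."""
    return 100 * sample_error / estimate

# The example from the text: more absolute variance, but far more relative certainty
print(relative_sample_error(2_500_000, 25_000))  # 1.0 (%)
print(relative_sample_error(100_000, 10_000))    # 10.0 (%)
```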
How should sample errors be used or expressed in text?
- Sample errors should be interpreted as the level above or below estimates within which users can be relatively confident the true value lies. For example, an estimate of 1,100 with a sample error of 1,000 could be expected to fall between 100 and 2,100, and a rate of 15.0 percent with a sample error of 8.4pp could be expected to fall between 6.6 percent and 23.4 percent. In text, this can be expressed as estimate ± sample error, or by pointing readers to a source of error information in tables or graphs. If errors aren’t included in the text directly, a rule for discussion (e.g. only statistically significant movements will be discussed) should be established.
- Sample errors should be shown in graphs using survey data as error bars or shaded regions. Upper and lower bounds should be based on unrounded estimates and sample errors if users want to perfectly match official estimates; where based on rounded estimates and sample errors, bounds may differ slightly.
- Change procedures should be used to test whether groups significantly differ, rather than simply comparing upper and lower bounds. This is because estimates with normal distributions will tend to cluster at the middle of confidence intervals, and some estimates may significantly differ even when there is some small overlap between bounds.
- Using non-overlapping bounds as the qualifier for significant differences instead is a harsher test – it won’t admit any non-significant differences, but it may exclude some significant ones.
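To see why overlapping bounds are a harsher test than a proper change procedure, consider a sketch with two hypothetical estimates. This assumes the two estimates are independent, purely for illustration; the change procedures in the tools use the replicate weights directly, which also captures any covariance between estimates.

```python
import math

# Hypothetical estimates with 95% CI half-widths (sample errors)
est_a, se_a = 1_000, 50   # interval [950, 1050]
est_b, se_b = 1_080, 50   # interval [1030, 1130]

# The intervals overlap...
bounds_overlap = (est_a + se_a) >= (est_b - se_b)

# ...but the sample error of the difference (assuming independence) is
# sqrt(50^2 + 50^2), roughly 70.7, which is smaller than the difference
# of 80, so the difference is still statistically significant.
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
significant = abs(est_b - est_a) > se_diff

print(bounds_overlap, significant)  # True True
```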
How should you output data produced using this tool?
- Data produced using this tool is in the form of weighted counts and rates – see section 4.2 of the Microdata Output Guide for full details of suppression and rounding.
- Weighted level estimates should be suppressed if less than 1,000 when unrounded, including blank or zero values.
- Level estimates of 1,000 or more should then be rounded to the nearest 100.
- Secondary suppression is not required. However, rates based on weighted numerators of less than 1,000 should be treated with caution from a quality perspective. Users may also choose to suppress these values.
- Percentage values (i.e. rates and RSEs) should be rounded to one decimal place.
- If an estimated level or rate is suppressed, the associated sample errors and changes should be suppressed too.
- Sample errors and changes do not need to be suppressed based on their own values, however. If an unsuppressed estimate has a zero change or a sample error below 1,000, that value should still be rounded, but it can be output even though it falls below 1,000.
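The rules above can be sketched as a small helper. This is a hypothetical illustration only, and the Microdata Output Guide remains the authoritative source for suppression and rounding rules.

```python
import math

def confidentialise_level(estimate):
    """Weighted level estimates: suppress below 1,000 (including blank or
    zero values), otherwise round to the nearest 100 (half rounds up)."""
    if estimate is None or estimate < 1000:
        return None  # suppressed
    return int(math.floor(estimate / 100 + 0.5)) * 100

def round_percentage(value):
    """Percentage values (rates, RSEs) are rounded to one decimal place."""
    return None if value is None else round(value, 1)

print(confidentialise_level(950))      # None (suppressed)
print(confidentialise_level(151_449))  # 151400
print(round_percentage(4.267))         # 4.3
```

A fuller version would also suppress the sample errors and changes associated with any suppressed level or rate, as described above.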