Using regular expressions in DB2 for i

Regular expressions have been supported in DB2 for a while now and, while they are not something I often need, there are times when they come in very handy indeed. A simple example involves cleaning up some data so that it can be exported from an old (poorly maintained) database to a new, shiny one that (I’m promised) absolutely will be used.

The database in question is an employee master and, in order to protect the innocent, I have both simplified it and changed the names for the purpose of this example. There are only two fields to worry about: empnumber and empname, which represent the employee number and name respectively. There are a couple of issues to address. The first is that the empnumber is ten characters in the old database but a five digit numeric in the new, so I need to ensure that all the numbers fit.

The second issue is that this database has been very poorly maintained. If you allow someone to enter a character in a numeric field, they will. And they did.

Here’s an idea of how the database looked:

EMPNUMBER   EMPNAME
00001       Bugs Bunny
OOOO2       Elmer Fudd
0000000003  Daffy Duck
-4          Speedy Gonzales
00005       Sylvester

As you can see, there are lots of ways in which the employee number can be invalid, and these have all been very well explored. And watch out for Elmer Fudd — he’s a bit confused about the difference between letters and numbers.

Fortunately REGEXP_INSTR is your friend:

The REGEXP_INSTR function returns the starting position or the position after the end of the matched substring, depending on the value of the return_option argument.
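
In other words, with the defaults in place (search from position 1, first occurrence, return_option of 0) the function simply reports the position at which a match starts, and returns 0 if there is no match at all; this is why the query below just checks whether a match starts at position 1. A quick sketch, runnable from an interactive SQL session, illustrates both cases:

values regexp_instr('Bugs Bunny', 'Bunny');  -- 'Bunny' starts at position 6, so this returns 6
values regexp_instr('Bugs Bunny', 'Duck');   -- no match, so this returns 0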

So all I need is a regular expression that identifies which employee numbers are five digits followed by spaces, and which aren’t. And here it is:

select * from employees
where regexp_instr(empnumber, '^\d{5}\s+$') != 1

There are several optional parameters, but I only need the source string (empnumber) and the pattern expression. The pattern looks for exactly five digits at the start of the field, followed by nothing but spaces to the end. And this is the result:

EMPNUMBER   EMPNAME
OOOO2       Elmer Fudd
0000000003  Daffy Duck
-4          Speedy Gonzales

I told you that Elmer Fudd was tricksy.

It should also be possible to use REGEXP_REPLACE to automate some of the cleanup, but you really do need your business users to buy in before you go down that particular rabbit hole.
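
As a taste of what that might look like, here is a minimal sketch that strips every non-digit character from the offending employee numbers, using the same pattern as before to identify them. It is very much a first pass rather than a finished fix (it would happily reduce Elmer's OOOO2 to a bare 2), which is exactly why the business users need to own the rules:

update employees
set empnumber = regexp_replace(empnumber, '[^0-9]', '')
where regexp_instr(empnumber, '^\d{5}\s+$') != 1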

Adding variables to ad-hoc SQL queries with CL

Some time ago I mentioned using REXX to build and execute ad-hoc SQL queries. For the sake of completeness, here is the same solution implemented in CL using the RUNSQL command.

The program does exactly the same as the previously described REXX script — it extracts all records flagged with today’s date so that they can be reviewed later. And here it is:

pgm
dcl &date6 *char 6
dcl &date8 *char 8
dcl &statement *char 255

/* First retrieve the current system date (QDATE)                             */
/* and convert it to an 8 character value                                     */
RTVSYSVAL  SYSVAL(QDATE) RTNVAR(&DATE6)
CVTDAT     DATE(&DATE6) TOVAR(&DATE8) FROMFMT(*JOB) TOFMT(*YYMD) TOSEP(*NONE)

/* Next, build the SQL statement                                              */
CHGVAR     VAR(&STATEMENT) +
           VALUE('insert into EXPATPAUL1/INPUTDATA' *bcat +
                 'select * from EXPATPAUL1/PRODDATA where dhdate =' *bcat +
                 &date8)
runsql sql(&statement) commit(*none)

endpgm

Merging data with SQL MERGE

There have been times, many times, when I have needed to update one or more columns in a table with values retrieved from another. My usual approach has been an UPDATE with subselects, but it turns out that there is a better way:

The MERGE statement updates a target (a table or view) using data from a source (result of a table reference). Rows in the target that match the input data may be updated or deleted as specified, and rows that do not exist in the target may be inserted as specified. Updating, deleting, or inserting a row in a view updates, deletes, or inserts the row into the tables on which the view is based if no INSTEAD OF trigger is defined on the view.

In order to play around with this, I built and populated a couple of test tables (imaginatively named TEST01 and TEST02) using the scripts below:

create table test01
 (key_field char(1) not null,
  data01 char(20),
  data02 char(30),
  data03 char(5),
  primary key (key_field));

insert into test01
 (Key_field, data01)
values('A', 'TEXT One');

insert into test01
 (Key_field, data01)
values('B', 'TEXT Two');

insert into test01
 (Key_field, data01)
values('D', 'TEXT Four');

create table test02
 (key_field char(1) not null,
  data01 char(20),
  data02 char(30),
  data03 char(5),
  primary key (key_field));

insert into test02
 (Key_field, data02)
values('A', 'More text one');

insert into test02
 (Key_field, data02, data03)
values('C', 'more text two', 'CODE');

insert into test02
 (Key_field, data01, data02)
values('E', 'TEXT five', 'More text five');

This gives me two tables with one matched row (key_field=’A’). To update the first table with the values in the second is really rather simple:

MERGE INTO test01 a
  USING (SELECT key_field, data01, data02, data03 from test02) b
  ON b.key_field = a.key_field
WHEN MATCHED THEN
   UPDATE SET data02 = b.data02, data03 = b.data03;

And the one matched row is updated in table TEST01 with the data from table TEST02.
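
A quick select confirms it. Given the inserts above, only row A has a counterpart in TEST02, so TEST01 should now look like this (sketching the expected contents, with dashes for nulls):

select * from test01;

-- KEY_FIELD  DATA01     DATA02         DATA03
-- A          TEXT One   More text one  -       (updated from TEST02)
-- B          TEXT Two   -              -       (no match, unchanged)
-- D          TEXT Four  -              -       (no match, unchanged)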

If this were all it could do, I would be happy enough to have replaced a rather clunky approach with a more readable solution. But this is where the MERGE statement really starts to make a difference:

MERGE INTO test01 a
  USING (SELECT key_field, data01, data02, data03 from test02) b
  ON b.key_field = a.key_field
WHEN MATCHED THEN
   UPDATE SET data02 = b.data02, data03 = b.data03
WHEN NOT MATCHED THEN
   INSERT (key_field, data01, data02, data03)
   VALUES (b.key_field, b.data01, b.data02, b.data03)
NOT ATOMIC
  CONTINUE ON SQLEXCEPTION;

The WHEN NOT MATCHED clause means that not only can I update the matched rows, I can also do something about the rows that exist in table TEST02 but have no matching record in table TEST01. In this example I have inserted the missing rows from table TEST02; alternatively, a WHEN MATCHED clause can delete the matched row from table TEST01, and either clause can raise a SIGNAL statement instead.

And that NOT ATOMIC CONTINUE ON SQLEXCEPTION specifies that if an error is encountered on the insert, the MERGE will continue. Bad data happens; this saves me the burden of constantly having to refine my selection criteria to avoid the crap.

And a final note on the subject of selection criteria: both WHEN MATCHED and WHEN NOT MATCHED can be extended with an AND to refine the selection to exactly what you want.
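
For example, here is a sketch of the statement above with one such refinement bolted on: any matched row whose incoming data03 value is 'CODE' (a made-up business rule, purely for illustration) is deleted from TEST01 instead of being updated:

MERGE INTO test01 a
  USING (SELECT key_field, data01, data02, data03 from test02) b
  ON b.key_field = a.key_field
WHEN MATCHED AND b.data03 = 'CODE' THEN
   DELETE
WHEN MATCHED THEN
   UPDATE SET data02 = b.data02, data03 = b.data03
WHEN NOT MATCHED THEN
   INSERT (key_field, data01, data02, data03)
   VALUES (b.key_field, b.data01, b.data02, b.data03);

The WHEN clauses are evaluated in the order they appear, so the more specific condition needs to come first.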

Do go and read up on the MERGE statement. It provides a powerful and flexible approach to updating tables from source data and one that I will be using in the very near future.

Better date conversion with timestamp_format

Some time ago, I discussed the never-ending problem of converting a six digit number into an ISO date. I have since found a better way of doing this and, in order to save myself a search through the IBM Knowledge Center every time I need to do this, I’m putting it up here.

So here’s the problem: I have a table in which the date is represented as a 7 digit number. The first digit is the century (0 indicates 20th century, 1 indicates 21st century) and the next six digits represent the date in YYMMDD format. So 1150831, for example, is 31st August 2015.

The request in hand is for a report, so I want to convert the number into a date so that it looks sensible once exported into Excel. Timestamp_format is your friend.

The TIMESTAMP_FORMAT function returns a timestamp that is based on the interpretation of the input string using the specified format.

It’s that “interpretation of the input string” that makes this so handy. Put simply: you tell the function how to interpret the string and it will do the rest.

Here’s an example. I have a contracts table (CONTRACTS) with three columns: Contract Number (CTNR), Start Date (STDT) and End Date (ENDT). The start and end date are both seven digit numbers as described previously. I need to list all of the currently active contracts (Start Date is before Today and End Date is after Today).

And the SQL looks like this:

select CTNR,
       date(timestamp_format(substr(digits(STDT), 2, 6), 'YYMMDD')),
       date(timestamp_format(substr(digits(ENDT), 2, 6), 'YYMMDD'))
from CONTRACTS
where date(timestamp_format(substr(digits(STDT), 2, 6), 'YYMMDD')) <= current_date
and date(timestamp_format(substr(digits(ENDT), 2, 6), 'YYMMDD')) >= current_date

Current_date is a special register that returns the current date, but that’s not important right now.

The date conversion part involves several nested functions to get to a final date, so here’s the breakdown using a date of 1st September 2015 (1150901):

  • digits(STDT) converts the 7 digit number into a 7 character string. So 1150901 is converted to ‘1150901’.
  • The IBM i can figure out which century it’s in by looking at the year, so I don’t need the century flag, and substr(digits(STDT), 2, 6) strips it out to give me a value of ‘150901’.
  • The timestamp_format function takes the date string and uses the format string of ‘YYMMDD’ to generate a timestamp of 2015-09-01-00.00.00.000000. Since I only passed it a date, the hours, minutes and seconds are all zeroes.
  • And finally, I can use the date function to retrieve the date from the generated timestamp.

And here’s the result:

CTNR      DATE      DATE
C1000001  01/09/15  31/08/16

Pretty, and portable.
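
One last trick: if you ever want to sanity check the conversion chain without touching a real table, a VALUES statement does the job. A quick sketch, using the sample date from the breakdown above (note the dec() cast: a bare literal would be treated as an integer and digits() would pad it to ten characters):

-- dec() keeps digits() at seven characters, matching the column definition
values date(timestamp_format(substr(digits(dec(1150901, 7, 0)), 2, 6), 'YYMMDD'));
-- result: 2015-09-01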

Adding variables to ad-hoc SQL queries with REXX

It’s incredible how easily I can be distracted. All I needed was a quick and dirty way of capturing the input into an interface and now I’m writing a blog post.

Everything starts with an issue and, in this case, the issue is with an interface not behaving as expected in production even though everything works as expected in the test environment. My suspicion is that the incoming data is not correctly formatted, causing transactions to not meet the selection criteria. But to confirm this, I need to see exactly what is waiting to be processed when the interface runs.

Since this interface runs after the end of business I want to be able to submit an SQL query to capture the input data into a separate table so that I can see what was processed when I come in tomorrow morning. And, because much of this data is not destined to pass through the interface in question (this is why we have selection criteria) I want to be able to select today’s transactions for whatever the current value of today happens to be.

In a sane world, this would be a simple case of using the current date but in the real world there are still people using eight digit numbers to represent the date. This leads to some unwieldy date calculations in SQL, which led me to wonder whether I could capture this in a variable to make things a bit more readable. It turns out I can, but not on the version of the IBM i operating system that I am currently using.

What really caught my eye at the above link, however, was this:

but if “ad hoc sql + variables” is the thing you need, you should really try Rexx.

And I thought: “Rexx. That’s something I haven’t used since last century.”

Any excuse to download a couple of manuals is reasonable as far as I’m concerned, so off I went.

And here’s the script (library and table names have been changed):

/* Retrieve today's date as an eight digit number (yyyymmdd) */
datenow = DATE('S')

/* Build the INSERT statement, appending today's date */
statement = 'insert into MYLIB/INPUTDATA' ,
            'select * from PRODLIB/INPUTDATA where dhdate =' datenow

/* Direct subsequent commands to the SQL command environment */
address 'EXECSQL'
EXECSQL 'SET OPTION COMMIT = *NONE, NAMING = *SYS'
EXECSQL 'delete from MYLIB/INPUTDATA'
EXECSQL statement

The DATE function returns the current date, and the 'S' option handily reformats it into an eight digit number (yyyymmdd). The address 'EXECSQL' statement points the subsequent commands at the SQL command environment, and then it’s just a case of executing the statements I need to execute.

It all works very nicely indeed.

On a tangential note, I noticed that Rexx is also available in Arch. I may well install this sometime soon just to have a play around with the language.

Using SQL to update one file from another

With a recent interface change, I was asked to go back and fix all of the affected historical data so that it matched the new requirements. Updating a field in one file with a value from a different file is something I have done several times in the past (far too many ad-hoc queries have been launched, if truth be told) and, while you do need to take a bit of care, the approach is pretty simple.

So here is an example (some names have been changed to protect the proprietary):

update target_file upd
set target_field = (select source_field from source_file
                    where source_key_1 = substr(upd.target_key, 1, 16)
                    and digits(source_key_2) = substr(upd.target_key, 17, 5) )
where exists (select source_field from source_file
              where source_key_1 = substr(upd.target_key, 1, 16)
              and digits(source_key_2) = substr(upd.target_key, 17, 5) )

The script is pretty simple. For each record in target_file for which an associated record can be found in source_file, populate target_field with the value in source_field. Obviously, the select clauses will need to reflect the relevant keys of whatever files you happen to be using.

Inevitably, there is a gotcha: for each record in target_file that you want to update, there must be exactly one record returned by the subquery. Handling this can be split into two parts.

The first part is handled by the where exists clause which ensures that the script will only attempt to update records in target_file if there is a record in source_file with which to update it. This ensures you don’t get caught out by subqueries that return zero records.

The second part involves ensuring that the subquery returns no more than one record for each record in target_file. This, unfortunately, cannot be solved generically – you just need to be a bit careful to ensure that the subquery selection returns unique records. If in doubt, a variation on the SQL below can be used to validate.

select source_key_1, digits(source_key_2), count(*)
from source_file
group by source_key_1, digits(source_key_2)
having count(*) > 1

If you can’t find a unique selection criterion, the distinct clause may help and, if all else fails, try arbitrarily using either max() or min().
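
For completeness, here is what that last-resort fallback might look like: the same update as before, with max() wrapped around the selected field so that duplicate source records can no longer trip it up. A sketch only, and it assumes you genuinely don't care which of the duplicate values wins:

update target_file upd
set target_field = (select max(source_field) from source_file
                    where source_key_1 = substr(upd.target_key, 1, 16)
                    and digits(source_key_2) = substr(upd.target_key, 17, 5) )
where exists (select source_field from source_file
              where source_key_1 = substr(upd.target_key, 1, 16)
              and digits(source_key_2) = substr(upd.target_key, 17, 5) )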

Executing SQL statements from within a CL with RUNSQL

Here’s a CL Command I didn’t know about, RUNSQL, which allows an SQL statement to be run from within a CL program without needing a source file.

Inevitably, I found this because I was looking for a quick solution to a transitory problem. I won’t bore you with the details, but what I wanted to do was create a temporary file, populate it with orders for a (user specified) date and then copy this file into a library that has been made available to an external ftp application.

Here is what I ended up with (field, file and library names have been changed for clarity and to protect the innocent):

  pgm &date
  dcl &date *char 8
  dcl &statement *char 150
  dcl &nbrrcd *dec (10 0)
  
  cpyf fromfile(template/orders) tofile(qtemp/orders) +
       mbropt(*add) crtfile(*yes)
  clrpfm qtemp/orders
  
  chgvar &statement value('insert into qtemp/orders +
                           select * from archlib/orders  +
                           where msgdat = ' *cat &date)
  runsql sql(&statement) commit(*none)
  
  rtvmbrd qtemp/orders nbrcurrcd(&nbrrcd)
  if cond(&nbrrcd *gt 0) then(do)
     cpyf fromfile(qtemp/orders) tofile(ftplib/orders) +
          mbropt(*add) crtfile(*yes)
     enddo
  
  dltf qtemp/orders
  
  endpgm

It’s all pretty simple stuff but being able to embed the SQL statement right in the CL makes a conceptually simple solution very simple to implement.

Using the SQL with clause to group by calculated values

So here’s a question that cropped up recently: In an SQL Select statement, can you group by a calculated field (scalar function)?

The short answer is: No.

The longer answer is: Of course you can.

It did occur to me, last week, to simply post the script that I used to achieve this, but there are a number of other things going on in there and I suspect that the unannotated code would be more confusing than clarifying. It is a useful technique, however, and one that I can see myself using again, so here is a simplified version of the problem along with the solution (file1 and field names have been changed for the sake of clarity).

So here’s the question: Can you provide a report showing the latest invoice number and invoice date for each customer (some selection criteria may apply)?

Since we have a sales file for which the sh_invoice_number field is updated when the transaction is invoiced, the first pass of this report is pretty simple:

select sh_sales_customer, max(sh_invoice_number), max(sh_invoice_date)
from sales_history
group by sh_sales_customer

Now for the wrinkle.

The sh_sales_customer field is the delivery point for the sales transaction. In many cases, this is the same as the invoice account, but there are cases where one invoice account encompasses several delivery points. Finance folk don’t care where goods are shipped; they just want to know where to send the invoice.

We have a file, account_numbers, which maps the delivery point to the invoice account where necessary. So the script needs to be changed so that it checks this file and uses either the retrieved value or the sales_customer, depending on the result of the lookup. The With clause is your friend.

You can’t Group By the calculated field, but you can use the With clause to factor the sh_sales_customer calculation into a subquery. And this is what we end up with:

with
 customer_accounts as 
   (select dp_delivery_point as ca_delivery_point,
           case 
              when an_invoice_account is not null then an_invoice_account 
              else dp_delivery_point
              end as ca_customer
    from delivery_points
    left outer join account_numbers on an_delivery_point = dp_delivery_point)
select ca_customer, max(sh_invoice_number), max(sh_invoice_date)
from sales_history
join customer_accounts on ca_delivery_point = sh_sales_customer
group by ca_customer

The delivery_points file contains all delivery points along with shipping information.
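
As an aside, the CASE expression in the subquery can be written more compactly with coalesce. The logic is identical (take the invoice account if there is one, otherwise fall back to the delivery point), so it is purely a matter of taste:

with
 customer_accounts as
   (select dp_delivery_point as ca_delivery_point,
           coalesce(an_invoice_account, dp_delivery_point) as ca_customer
    from delivery_points
    left outer join account_numbers on an_delivery_point = dp_delivery_point)
select ca_customer, max(sh_invoice_number), max(sh_invoice_date)
from sales_history
join customer_accounts on ca_delivery_point = sh_sales_customer
group by ca_customer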

Footnotes

1 In case any of you youngsters are feeling confused by the terminology: a file (in this context) is a table, a record is a row and a field is a column. I am aware that I have the rather bad habit of using these terms interchangeably, but I’m too lazy to go back and reword this post.

Addenda

The field and file names should be reasonably self explanatory, but if you are struggling with what belongs where, here are some definitions:

TABLE_NAME            COLUMN_NAME           DATA_TYPE         LENGTH   NUMERIC_SCALE 
SALES_HISTORY         SH_SALES_CUSTOMER     CHAR                   8                -
SALES_HISTORY         SH_INVOICE_NUMBER     INTEGER                4               0 
SALES_HISTORY         SH_INVOICE_DATE       DATE                   4                -

DELIVERY_POINTS       DP_DELIVERY_POINT     CHAR                   8                -

ACCOUNT_NUMBERS       AN_INVOICE_ACCOUNT    CHAR                   8                -
ACCOUNT_NUMBERS       AN_DELIVERY_POINT     CHAR                   8                -

The real files contain much more information, of course, but I have stripped these out in order to keep things reasonably simple.

Using an SQL subselect to insert records into a table: An example

The following bit of code should be reasonably self-explanatory, but I was a little surprised to discover that there are no examples of doing this (for IBM i folks, at least) on the interwebnet.

So now there’s one.

insert into ignoreiln                                               
  with
    slsrpt as (select * from ignoreiln where mtyp = 'SLSRPT'), 
    invrpt as (select * from ignoreiln where mtyp = 'INVRPT') 
  select a.mdir, 'INVRPT', a.siln, a.diln, a.riln, 
         a.dept, a.brnd, 'Auto Created' 
  from slsrpt a    
  left outer join invrpt b on b.mdir = a.mdir 
       and b.siln = a.siln and b.diln = a.diln 
       and b.riln = a.riln and b.dept = a.dept 
       and b.brnd = a.brnd 
  where b.siln is null

If you want to know more, I would recommend starting with the IBM i Information Centre.