awk does not recognize ^ to lock pattern to start of line

Asked by James M Stansberry

When I try to match a pattern in awk that should only occur at the beginning of a line,
e.g. /^303 [0-9]/, the pattern is not recognized and awk outputs nothing. (There
is a space between the 3 and the [.)

Looking in the awk man pages, not only does it state that the ^ should work, it even shows
an example of that.

I've tried everything I can think of but I can't resolve the issue.

This is awk on Ubuntu 12.04 LTS with the Gnome Classic desktop.

This pattern works on HP-UX.

Thanks for any help,

James M (Mike) Stansberry

Question information

Language: English
Status: Solved
For: Ubuntu gawk
Assignee: No assignee
Solved by: James M Stansberry

Thomas Krüger (thkrueger) said :
#1

Can you post the whole command line you are using?

Ralph Corderoy (ralph-inputplus) said :
#2

Works for me.

    $ seq -f '%.0f 123' 300 310 | awk '/^303 [0-9]/'
    303 123
    $

James M Stansberry (stansberrymj) said :
#3

Here is the complete command line which is in a file:

   awk 'BEGIN { FS = "\n"; RS = ""; }
   ^/303 [0-9]/ { printf "%s\n\n",$0 }
' $infile > $temp_file

$infile is a file that contains addresses plus other information. I'm trying to
sort by the area code. As long as I leave the ^ off, the script does print out just
the '303' area codes, BUT there can also be other information that has the
same pattern but does not occur at the start of a line.

The above pattern is repeated for other area codes.

I've also tried the following with no luck:

   $0 ~ /[4-6][0-9][0-9] [0-9]/ { printf "%s\n\n",$0 }
   $0 ~ /8[0-9][0-9] [0-9]/ { printf "%s\n\n",$0 }

The patterns above do not seem to be locked to the start of a line either (e.g. 4-6NNspaceN).

Mike Stansberry

Ralph Corderoy (ralph-inputplus) said :
#4

Please compare carefully to mine. The "^" has to be part of the regexp by being after the leading "/". I don't expect what you've given would work on HP-UX as you earlier said.

James M Stansberry (stansberrymj) said :
#5

I typed in the ^ wrong. It's actually

  awk 'BEGIN { FS = "\n"; RS = ""; }
  /^303 [0-9]/ { printf "%s\n\n",$0 }
' $infile > $temp_file

I had removed the ^ from the script and typed it back in wrong for the post above.

Here is the "almost full" script:

#!/bin/bash

temp_file=$(mktemp)
for infile in $(ls *.nts)
do
   awk 'BEGIN { FS = "\n"; RS = ""; }
   /^303 [0-9]/ { printf "%s\n\n",$0 }
   $0 ~ /720 [0-9]/ { printf "%s\n\n",$0 }
   $0 ~ /970 [0-9]/ { printf "%s\n\n",$0 }
   $0 ~ /719 [0-9]/ { printf "%s\n\n",$0 }
   $0 ~ /NO PHONE/ { printf "%s\n\n",$0 }
' $infile > $temp_file

   #put the sorted data back into the infile.
   mv $temp_file $infile
done
exit 0

And here is a sample of the file to be read from:

2210 R HXC XXXXX 10 SOMETOWN AZ SEP 5
JOHN DOE
1234 ANYROAD DR
ANYTOWN CO 80909
303 555 5555
BT
THIS IS SOME MESSAGE IN
THE BODY OF THE MESSAGE
BT
SENDERS NAME
AR

If I remove the ^ from the 303 pattern match, it finds the above message. If
the ^ is included, it does not. (I have not put the ^ back in the other patterns).

A simple file/script gives the desired results:

file:
303 555 5555

script:

#!/bin/bash

awk '
/^303 [0-9]/
' $1

So what in the "almost full" script is messing up the pattern matching?
Thank you,

Mike Stansberry

Ralph Corderoy (ralph-inputplus) said :
#6

^ in awk is the beginning of the string whereas you want /\n303/ if you know it's never the first field in the record or /(^|\n)303/ if it may be either. Can explain more if that's unclear, just say, in a rush at the moment.
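
A minimal sketch of the difference, using a made-up three-line record shaped like the sample message (the name and number are placeholders). With RS = "" the whole record is one string, so ^ can only match before its first line:

```shell
# Hypothetical record: the phone number is on the third line, not the first.
record='JOHN DOE
ANYTOWN CO 80909
303 555 5555
'

# "^" anchors to the start of the record, so this prints nothing.
anchored=$(printf '%s\n' "$record" | awk 'BEGIN { RS = ""; FS = "\n" } /^303 [0-9]/')

# "(^|\n)" accepts the start of the record or the start of any later line.
any_line=$(printf '%s\n' "$record" | awk 'BEGIN { RS = ""; FS = "\n" } /(^|\n)303 [0-9]/')

printf 'anchored: [%s]\n' "$anchored"
printf 'any_line: [%s]\n' "$any_line"
```

The first result should be empty and the second should hold the whole record.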

James M Stansberry (stansberrymj) said :
#7

re:

^ in awk is the beginning of the string whereas you want /\n303/ if you know it's never the first field in the record or /(^|\n)303/ if it may be either. Can explain more if that's unclear, just say, in a rush at the moment.

It's always at the beginning of a line that I want to match, but /^303 [0-9]/ does not find a match.
Your solution does, except that the record is being printed twice. I'll keep at it and go back again to the
awk/sed book AND study regexp some more.

I haven't had too much of a problem with pattern matching and regexp in the past, but this one
does have me wondering just why the ^ will work for a single line or as a pipe from the command
line ( echo '303 555 5555' | awk '/^303 [0-9]/' ) but not in the full script.

Thanks,

Mike Stansberry

Ralph Corderoy (ralph-inputplus) said :
#8

Alternatively, if you know it's always $5 that holds the ZIP code then $5 ~ /^303 [0-9]/ should work.
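
A short sketch of that approach, using the first six lines of the sample message posted earlier. It assumes the phone number is always the fifth line of a record; if the layout varies, this will miss records.

```shell
# First six lines of the sample message from earlier in the thread.
msg='2210 R HXC XXXXX 10 SOMETOWN AZ SEP 5
JOHN DOE
1234 ANYROAD DR
ANYTOWN CO 80909
303 555 5555
BT'

# With FS = "\n" each line is a field, so $5 is the phone line
# and "^" anchors to the start of that field, not of the record.
phone=$(printf '%s\n\n' "$msg" |
    awk 'BEGIN { RS = ""; FS = "\n" } $5 ~ /^303 [0-9]/ { print $5 }')

printf '%s\n' "$phone"
```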

Ralph Corderoy (ralph-inputplus) said :
#9

It's quite simple. :-) A plain /foo/ matches against $0 which is the
whole record. In this multi-line record example below the first $0 is
"a\n1". "^" in a regexp is always start of *string*, not start of line.
$0 starts with an "a", not a digit. Splitting that record into fields
on newlines makes $1 be "a" and $2 be "1". Now $2 does start with a
digit so using "^" against it works.

    $ printf 'a\n1\n\nb\n2\n\n' | awk 'BEGIN {RS = ""; FS = "\n"; ORS = "\n\n"} /^[0-9]/'
    $ printf 'a\n1\n\nb\n2\n\n' | awk 'BEGIN {RS = ""; FS = "\n"; ORS = "\n\n"} /\n[0-9]/'
    a
    1

    b
    2

    $ printf 'a\n1\n\nb\n2\n\n' | awk 'BEGIN {RS = ""; FS = "\n"; ORS = "\n\n"} $2 ~ /^[0-9]/'
    a
    1

    b
    2

    $

When one wants to match "foo" at the start of any field but doesn't know
if it's $1, $2, etc., then "^" against $0 would only be testing $1 at
the start of the string, but "\nfoo" would test against all but $1 as it
doesn't have a \n before it. Thus, /(^|\n)foo/ handles both cases.
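
Another way to cover both cases, sketched here with made-up records, is to loop over the fields and test each one with "^". The `next` is there so that a record is printed only once even if several of its lines match.

```shell
# Two made-up records; only the first has a line that starts with "foo".
out=$(printf 'a\nfoo 1\n\nb\nxfoo 2\n\n' | awk '
BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }
{
    # Test every field (line) of the record against the anchored pattern.
    for (i = 1; i <= NF; i++)
        if ($i ~ /^foo/) { print; next }   # print once, then move to the next record
}')

printf '%s\n' "$out"
```

This is more typing than /(^|\n)foo/ but makes it easy to add per-field logic later, such as remembering which field matched.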

James M Stansberry (stansberrymj) said :
#10

re:

Alternatively, if you know it's always $5 that holds the ZIP code then $5 ~ /^303 [0-9]/ should work.

Interesting idea (and I know you meant 'AREA code'). I just recently learned that each line would be a
different variable to awk. I'll try locking it to $5 and see if it works.

It appears that:

awk 'BEGIN { FS = "\n"; RS = ""; }
   $5 ~ /^303 [0-9]/ { printf "%s\n\n",$0 }
   $5 ~ /^720 [0-9]/ { printf "%s\n\n",$0 }

is working. And even:

   $5 ~ /^8[0-9][0-9] [0-9]/ { printf "%s\n\n",$0 }

seems to find any area codes starting with 8. I actually need to search for that
as well as I'm seeing cellular phone numbers popping up.

But my primary goal was to search for the CO area codes so I can have the
messages sorted in order of the area code so that I can more easily send
them on to the proper place.

So, I would say that you've led me in the right direction and I'm certainly
going to copy and paste the information you have sent me and save it
all into a file AND print it!! You have also taught me several things I
did not realize about awk and regexp.

Thanks again.

Mike Stansberry
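
On the goal of grouping messages by area code: a single awk pass with one rule per area code prints matching records in the order they appear in the file, not grouped. One possible sketch, assuming the first line of a record that looks like "NNN N..." is the phone number, is to collect the records per area code and print the groups in the END block. The input records and the order list ("303 720 970 719") are made up for illustration; records with any other area code are dropped by this sketch.

```shell
# Made-up input: three short records whose last line is the phone number.
input='A
720 555 0001

B
303 555 0002

C
303 555 0003
'

sorted=$(printf '%s\n' "$input" | awk '
BEGIN { RS = ""; FS = "\n" }
{
    # Use the area code of the first phone-like line as the group key.
    key = "NO PHONE"
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9][0-9][0-9] [0-9]/) { key = substr($i, 1, 3); break }
    group[key] = group[key] $0 "\n\n"
}
END {
    # Print the groups in a fixed order (the list is an assumption).
    n = split("303 720 970 719", order, " ")
    for (j = 1; j <= n; j++) printf "%s", group[order[j]]
    printf "%s", group["NO PHONE"]
}')

printf '%s\n' "$sorted"
```

Here the two 303 records should come out first, followed by the 720 record.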