4

I have a list of data, like a CSV but some lines are missing a value. I'd like to generate a value for the missing line based on the lines before and after using linux shell script.
Take this table for instance.

linepersonage
1Adam45
2Bob50
3Cindy47
4*#
5Ed49

What I'd like to do is fill in the "*" in line 4 with "Cindy:Ed" (a concatenation of the nearest, valid data in each direction in column B with a ":" delimiter) and the "#" with 48 (the average of 47 and 49, the nearest valid data points in each direction from column C).

Output:

linepersonage
1Adam45
2Bob50
3Cindy47
4Cindy:Ed48
5Ed49

My data is formatted as a space-delimited text file of arbitrary row count. All rows are three columns.

While I know my way around a For loop and grep etc., I'm at a loss as to how I'd handle this in vanilla linux shell script.

My guess is to make an initial pass to find the lines that have asterisks and hashes. Then make a second pass to replace the asterisks with (awk '{print $2}'):(awk '{print $2}') of the lines before and after, respectively.

If the missing data is on the first or last line, I'm happy to leave it as is. If missing data is on consecutive lines, I'm ok with setting all missing lines to the same "Cindy:Ed" and same average. It'd be even cooler if I could set "Cindy:Ed:1" and Cindy:Ed:2" etc.

An accurate example of worst case scenario raw input: (it's a traceroute with added "#" for the missing latency)

 1 192.168.200.2 1 2 192.168.200.1 1 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 * # 7 11.22.44.66 9 8 * # 9 * # 10 8.8.8.0 25 11 * # 12 * # 13 * # 

What I'd like:

1 192.168.200.2 1 2 192.168.200.1 1 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7 7 11.22.44.66 9 8 11.22.44.66:8.8.8.0 17 9 11.22.44.66:8.8.8.0 17 10 8.8.8.0 25 11 * # 12 * # 13 * # 
4
  • 1
    And for consecutively missing values?
    – FelixJN
    CommentedJan 26, 2022 at 17:31
  • @KamilMaciorowski Sorry. Good catch! When I got to the second part I changed it to a 9 to make the math cleaner and didn't update the original. fixed now. FWIW, I don't care if the "average" is an integer or floating.CommentedJan 26, 2022 at 17:52
  • 1
    @FelixJN I updated post to say consecutively missing values can search outward to the nearest valid data points. Also row count shouldn't go above 30 if that matters. These are traceroutes after all.CommentedJan 26, 2022 at 17:55
  • What should the output be if the first line has a * for the 2nd field?
    – Ed Morton
    CommentedJan 26, 2022 at 20:36

5 Answers 5

3

With awk:

#if a previous line with proper IP has been read oldip != "" { #i is counter for consecutive invalid lines i=0 #if IP is set, just print and go to next record if ($2!="*") { print ; oldip=$2 ; oldlat=$3 ; next } #otherwise get following line and increase counter else { #getline exit status => fails for the last line while (getline > 0) {i++ #check if new line has IP, if so #set IPold:IPnew and average latency value if ($2!="*") { ipfill=oldip":"$2 ; latfill=(oldlat+$3)/2 #print filler lines for all consecutive records without value for (j=1 ; j<=i ; j++) { print NR-i+j-1,ipfill,latfill #alternative printing with oldIP:newIP:counter # print NR-i+j-1,ipfill":"j,latfill } #save current IP+lat and print "good" line oldp=$2; oldlat=$3 print ; next } } } #in case getline failed => all previous lines had no value #just fill them with N/A data as in input for (j=0 ; j<=i ; j++) { print NR-i+j,"*","#" } } #If leading lines have no IP value, print them until IP is found oldip == "" { if ($2=="*") {print ; next} ; oldip=$2 ; oldlat=$3 ; print } 

Input:

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 * # 7 11.22.44.66 10 8 * # 9 * # 10 8.8.8.0 25 11 * # 12 * # 13 * # 

Output:

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7.5 7 11.22.44.66 10 8 11.22.33.55:8.8.8.0 17.5 9 11.22.33.55:8.8.8.0 17.5 10 8.8.8.0 25 11 * # 12 * # 13 * # 

Alternative output with counter for calculated lines:

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66:1 7.5 7 11.22.44.66 10 8 11.22.33.55:8.8.8.0:1 17.5 9 11.22.33.55:8.8.8.0:2 17.5 10 8.8.8.0 25 11 * # 12 * # 13 * # 
5
  • It'll take me a minute to wrap my brain around this but it worked as soon as I figured out how to use it. Thanks!CommentedJan 26, 2022 at 18:58
  • @DariusDauer feel free to ask if stuck.
    – FelixJN
    CommentedJan 26, 2022 at 19:03
  • 1
    @EdMorton I see, corrected to while (getline > 0) - I missed this in the manuals: "If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1". The problem regarding output is a bit confusing regarding OPs requirement of "average", but surely can be handled with printf "%s %s %d\n" as alternative.
    – FelixJN
    CommentedJan 26, 2022 at 20:49
  • 1
    @EdMorton it is working. I just "changed the goalpost" When he copied my original input data it had line 6 ending in 10. but I edited line 6 later on to show 9 to make the math prettier. His script works right with the data he used.CommentedJan 26, 2022 at 20:49
  • 1
    @FelixJN yeah, getline is a slippery one which is why I wrote awk.freeshell.org/AllAboutGetline. It's one of those things that looks easy to use but requires a lot of thought and defensive coding and makes future enhancements harder. It's usually simpler to just not use it and let awk's normal processing loop handle the input.
    – Ed Morton
    CommentedJan 26, 2022 at 20:52
3
$ cat tst.awk $2 == "*" { buf[++bufSz] = $0 next } bufSz > 0 { split(prev,p) rng = p[2] ":" $2 val = ($3 + p[3]) / 2 for (i=1; i<=bufSz; i++) { split(buf[i],flds) print (prev == "" ? buf[i] : flds[1] OFS rng OFS val) } bufSz = 0 } { print prev = $0 } END { for (i=1; i<=bufSz; i++) { print buf[i] } } 

$ awk -f tst.awk file 1 192.168.200.2 1 2 192.168.200.1 1 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7 7 11.22.44.66 9 8 11.22.44.66:8.8.8.0 17 9 11.22.44.66:8.8.8.0 17 10 8.8.8.0 25 11 * # 12 * # 13 * # 
3
  • This fails if the file starts with empty data (i.e. 1 * # as first line). Expected is to see unchanged lines as from the input in this case.
    – FelixJN
    CommentedJan 26, 2022 at 21:02
  • @FelixJN that depends on the OPs currently requirements for that case, hence my question. They may simply never have that situation, idk, but obviously it's not part of the sample input/output they provided for us to test with.
    – Ed Morton
    CommentedJan 26, 2022 at 22:04
  • 1
    @FelixJN actually, I see now the OP did say in the text If the missing data is on the first or last line, I'm happy to leave it as is so I updated my answer to accommodate that.
    – Ed Morton
    CommentedJan 26, 2022 at 22:48
0

With GNU sed using it's extended regex mode (-E)

S='(\S+)'; _re="$S $S" re="^$_re\\n$_re\$" _avg='1k\4 \2+2/f' avg='"$(echo '"'$_avg'"'|dc)"' sed -E ' s/^(\S+ )[*] #(\n.*\n(.*))/\1\3\2/ ta s/\n.*// /[*] #$/b $!N;//!ba :loop ${//q;bb} N;//bloop :b;h s/\n.*\n/\n/ s/^\S+ //Mg'" s#$re#echo '\1:\3' $avg#e x;G :a P;D " file 

Perl using the range operator

perl -lane 'print,next unless my $e = /\d$/ ... /\d$/; push @A,[@F]; next unless $e =~ /E0/ || eof; if (@A>2&&$A[-1][-1] =~ /\d/) { my($str,$avg); for (0,-1) { $avg += $A[$_][2] / 2.0; $str .= $A[$_][1] . ":"; } $str =~ s/.$//; @{$A[$_]}[1,2] = ($str,$avg) for 1..$#A-1; } print "@$_" for splice @A,0,@A-(eof?0:1); @A=(); redo if ! eof; ' file 

python3 -c 'import sys, itertools as it prev = "" p = lambda x: print(*x,sep="",end="") q = lambda x: x.split() g = lambda x: x.endswith("* #\n") with open(sys.argv[1]) as f: for t in it.groupby(f,g): G = list(t[1]) if prev == "": p(G) if not t[0]: prev = G[-1] else: if t[0]: M = G else: a,b = map(q,[prev,G[0]]) x = f"{a[1]}:{b[1]}" y = sum(map(int,[a[2],b[2]]))/2.0 for l in M: for e in q(l)[0]: print(e,x,y) p(G); prev = G[-1] p(M) ' file 

$ cat file 1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 * # 7 11.22.44.66 10 8 * # 9 * # 10 8.8.8.0 25 11 * # 12 * # 13 * # 

Output;

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7.5 7 11.22.44.66 10 8 11.22.44.66:8.8.8.0 17.5 9 11.22.44.66:8.8.8.0 17.5 10 8.8.8.0 25 11 * # 12 * # 13 * # 
    0

    If the file is not huge, we can use the slurp mode (-0777) to solve this:

    perl -MList::Util=sum -0777 -pe ' s{(.*\d\n) ((?-x:.*\* \#\n)+) (?=(.*\d\n)) }[$1 . do{my($s,@A); push @A,(split)[1,2] for $1,$3; $s = join ":", @A[0,2]; @A = ("", $s, sum(@A[1,3])/2); $2 =~ s/^\S+\K.*/@A/gmr; }]xge; ' file 
      -2
      var1=$(awk '{a[++i]=$0}/#/{for(x=NR-1;x<NR;x++)print a[x]}' file.txt | awk '{print $NF}') var2=$(awk '/#/{x=NR+1}(NR==x){print $NF}' file.txt) sed -i "s/#/$var3/g" file.txt sed -i "s/\*/Cindy:Ed/g" file.txt output cat file.txt line person age 1 Adam 45 2 Bob 50 3 Cindy 47 4 Cindy:Ed 48 5 Ed 49 
      1
      • Read the question again, you can’t hard-code “Cindy:Ed”. Where’s var3 set?CommentedJan 28, 2022 at 7:41

      You must log in to answer this question.

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.