Manipulating text using shell script: How can I fill in "missing" lines?

Question

I have a list of data, like a CSV but some lines are missing a value. I'd like to generate a value for the missing line based on the lines before and after using linux shell script.
Take this table for instance.

line	person	age
1	Adam	45
2	Bob	50
3	Cindy	47
4	*	#
5	Ed	49

What I'd like to do is fill in the "*" in line 4 with "Cindy:Ed" (a concatenation of the nearest, valid data in each direction in column B with a ":" delimiter) and the "#" with 48 (the average of 47 and 49, the nearest valid data points in each direction from column C).

Output:

line	person	age
1	Adam	45
2	Bob	50
3	Cindy	47
4	Cindy:Ed	48
5	Ed	49

My data is formatted as a space-delimited text file of arbitrary row count. All rows are three columns.

While I know my way around a For loop and grep etc., I'm at a loss as to how I'd handle this in vanilla linux shell script.

My guess is to make an initial pass to find the lines that have asterisks and hashes. Then make a second pass to replace the asterisks with (awk '{print $2}'):(awk '{print $2}') of the lines before and after, respectively.

If the missing data is on the first or last line, I'm happy to leave it as is. If missing data is on consecutive lines, I'm ok with setting all missing lines to the same "Cindy:Ed" and same average. It'd be even cooler if I could set "Cindy:Ed:1" and Cindy:Ed:2" etc.

An accurate example of worst case scenario raw input: (it's a traceroute with added "#" for the missing latency)

 1 192.168.200.2 1 2 192.168.200.1 1 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 * # 7 11.22.44.66 9 8 * # 9 * # 10 8.8.8.0 25 11 * # 12 * # 13 * #

What I'd like:

1 192.168.200.2 1 2 192.168.200.1 1 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7 7 11.22.44.66 9 8 11.22.44.66:8.8.8.0 17 9 11.22.44.66:8.8.8.0 17 10 8.8.8.0 25 11 * # 12 * # 13 * #

@KamilMaciorowski Sorry. Good catch! When I got to the second part I changed it to a 9 to make the math cleaner and didn't update the original. fixed now. FWIW, I don't care if the "average" is an integer or floating. — Darius Dauer, CommentedJan 26, 2022 at 17:52
@FelixJN I updated post to say consecutively missing values can search outward to the nearest valid data points. Also row count shouldn't go above 30 if that matters. These are traceroutes after all. — Darius Dauer, CommentedJan 26, 2022 at 17:55
What should the output be if the first line has a * for the 2nd field? — Ed Morton, CommentedJan 26, 2022 at 20:36

FelixJN · Accepted Answer · 2022-01-26 20:46:21Z

With awk:

#if a previous line with proper IP has been read oldip != "" { #i is counter for consecutive invalid lines i=0 #if IP is set, just print and go to next record if ($2!="*") { print ; oldip=$2 ; oldlat=$3 ; next } #otherwise get following line and increase counter else { #getline exit status => fails for the last line while (getline > 0) {i++ #check if new line has IP, if so #set IPold:IPnew and average latency value if ($2!="*") { ipfill=oldip":"$2 ; latfill=(oldlat+$3)/2 #print filler lines for all consecutive records without value for (j=1 ; j<=i ; j++) { print NR-i+j-1,ipfill,latfill #alternative printing with oldIP:newIP:counter # print NR-i+j-1,ipfill":"j,latfill } #save current IP+lat and print "good" line oldp=$2; oldlat=$3 print ; next } } } #in case getline failed => all previous lines had no value #just fill them with N/A data as in input for (j=0 ; j<=i ; j++) { print NR-i+j,"*","#" } } #If leading lines have no IP value, print them until IP is found oldip == "" { if ($2=="*") {print ; next} ; oldip=$2 ; oldlat=$3 ; print }

Input:

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 * # 7 11.22.44.66 10 8 * # 9 * # 10 8.8.8.0 25 11 * # 12 * # 13 * #

Output:

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7.5 7 11.22.44.66 10 8 11.22.33.55:8.8.8.0 17.5 9 11.22.33.55:8.8.8.0 17.5 10 8.8.8.0 25 11 * # 12 * # 13 * #

Alternative output with counter for calculated lines:

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66:1 7.5 7 11.22.44.66 10 8 11.22.33.55:8.8.8.0:1 17.5 9 11.22.33.55:8.8.8.0:2 17.5 10 8.8.8.0 25 11 * # 12 * # 13 * #

It'll take me a minute to wrap my brain around this but it worked as soon as I figured out how to use it. Thanks! — Darius Dauer, CommentedJan 26, 2022 at 18:58
@EdMorton I see, corrected to while (getline > 0) - I missed this in the manuals: "If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1". The problem regarding output is a bit confusing regarding OPs requirement of "average", but surely can be handled with printf "%s %s %d\n" as alternative. — FelixJN, CommentedJan 26, 2022 at 20:49
@EdMorton it is working. I just "changed the goalpost" When he copied my original input data it had line 6 ending in 10. but I edited line 6 later on to show 9 to make the math prettier. His script works right with the data he used. — Darius Dauer, CommentedJan 26, 2022 at 20:49
@FelixJN yeah, getline is a slippery one which is why I wrote awk.freeshell.org/AllAboutGetline. It's one of those things that looks easy to use but requires a lot of thought and defensive coding and makes future enhancements harder. It's usually simpler to just not use it and let awk's normal processing loop handle the input. — Ed Morton, CommentedJan 26, 2022 at 20:52

Ed Morton · Accepted Answer · 2022-01-27 12:12:54Z

$ cat tst.awk $2 == "*" { buf[++bufSz] = $0 next } bufSz > 0 { split(prev,p) rng = p[2] ":" $2 val = ($3 + p[3]) / 2 for (i=1; i<=bufSz; i++) { split(buf[i],flds) print (prev == "" ? buf[i] : flds[1] OFS rng OFS val) } bufSz = 0 } { print prev = $0 } END { for (i=1; i<=bufSz; i++) { print buf[i] } }

$ awk -f tst.awk file 1 192.168.200.2 1 2 192.168.200.1 1 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7 7 11.22.44.66 9 8 11.22.44.66:8.8.8.0 17 9 11.22.44.66:8.8.8.0 17 10 8.8.8.0 25 11 * # 12 * # 13 * #

This fails if the file starts with empty data (i.e. 1 * # as first line). Expected is to see unchanged lines as from the input in this case. — FelixJN, CommentedJan 26, 2022 at 21:02
@FelixJN that depends on the OPs currently requirements for that case, hence my question. They may simply never have that situation, idk, but obviously it's not part of the sample input/output they provided for us to test with. — Ed Morton, CommentedJan 26, 2022 at 22:04
@FelixJN actually, I see now the OP did say in the text If the missing data is on the first or last line, I'm happy to leave it as is so I updated my answer to accommodate that. — Ed Morton, CommentedJan 26, 2022 at 22:48

guest_7 · Accepted Answer · 2022-01-30 08:21:49Z

With GNU sed using it's extended regex mode (-E)

S='(\S+)'; _re="$S $S" re="^$_re\\n$_re\$" _avg='1k\4 \2+2/f' avg='"$(echo '"'$_avg'"'|dc)"' sed -E ' s/^(\S+ )[*] #(\n.*\n(.*))/\1\3\2/ ta s/\n.*// /[*] #$/b $!N;//!ba :loop ${//q;bb} N;//bloop :b;h s/\n.*\n/\n/ s/^\S+ //Mg'" s#$re#echo '\1:\3' $avg#e x;G :a P;D " file

Perl using the range operator

perl -lane 'print,next unless my $e = /\d$/ ... /\d$/; push @A,[@F]; next unless $e =~ /E0/ || eof; if (@A>2&&$A[-1][-1] =~ /\d/) { my($str,$avg); for (0,-1) { $avg += $A[$_][2] / 2.0; $str .= $A[$_][1] . ":"; } $str =~ s/.$//; @{$A[$_]}[1,2] = ($str,$avg) for 1..$#A-1; } print "@$_" for splice @A,0,@A-(eof?0:1); @A=(); redo if ! eof; ' file

python3 -c 'import sys, itertools as it prev = "" p = lambda x: print(*x,sep="",end="") q = lambda x: x.split() g = lambda x: x.endswith("* #\n") with open(sys.argv[1]) as f: for t in it.groupby(f,g): G = list(t[1]) if prev == "": p(G) if not t[0]: prev = G[-1] else: if t[0]: M = G else: a,b = map(q,[prev,G[0]]) x = f"{a[1]}:{b[1]}" y = sum(map(int,[a[2],b[2]]))/2.0 for l in M: for e in q(l)[0]: print(e,x,y) p(G); prev = G[-1] p(M) ' file

$ cat file 1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 * # 7 11.22.44.66 10 8 * # 9 * # 10 8.8.8.0 25 11 * # 12 * # 13 * #

Output;

1 * # 2 * # 3 10.10.10.1 1 4 11.22.33.44 2 5 11.22.33.55 5 6 11.22.33.55:11.22.44.66 7.5 7 11.22.44.66 10 8 11.22.44.66:8.8.8.0 17.5 9 11.22.44.66:8.8.8.0 17.5 10 8.8.8.0 25 11 * # 12 * # 13 * #

guest_7 · Accepted Answer · 2022-01-31 08:24:22Z

If the file is not huge, we can use the slurp mode (-0777) to solve this:

perl -MList::Util=sum -0777 -pe ' s{(.*\d\n) ((?-x:.*\* \#\n)+) (?=(.*\d\n)) }[$1 . do{my($s,@A); push @A,(split)[1,2] for $1,$3; $s = join ":", @A[0,2]; @A = ("", $s, sum(@A[1,3])/2); $2 =~ s/^\S+\K.*/@A/gmr; }]xge; ' file

Praveen Kumar BS · Accepted Answer · 2022-01-28 04:24:35Z

var1=$(awk '{a[++i]=$0}/#/{for(x=NR-1;x<NR;x++)print a[x]}' file.txt | awk '{print $NF}') var2=$(awk '/#/{x=NR+1}(NR==x){print $NF}' file.txt) sed -i "s/#/$var3/g" file.txt sed -i "s/\*/Cindy:Ed/g" file.txt output cat file.txt line person age 1 Adam 45 2 Bob 50 3 Cindy 47 4 Cindy:Ed 48 5 Ed 49

Read the question again, you can’t hard-code “Cindy:Ed”. Where’s var3 set? — Stephen Kitt, CommentedJan 28, 2022 at 7:41

Stack Exchange Network

Manipulating text using shell script: How can I fill in "missing" lines?

5 Answers 5

You must log in to answer this question.

Hot Network Questions

Manipulating text using shell script: How can I fill in "missing" lines?

5 Answers 5

You must log in to answer this question.

Related

Hot Network Questions