9

Regarding this information below:

807:Lipstick:Cosmetics:50:250
808:MixerGrinder:Electronics:10:35000
809:MixerGrinder:Electronics:10:35000

I am expecting to display this information below:

808:MixerGrinder:Electronics:10:35000
809:MixerGrinder:Electronics:10:35000

I am using this script, but nothing is returned:

grep -f < (printf ':%s\$\n' $(cut -d ':' -f2,5 catalog.txt | sort | uniq -d)) catalog.txt 

What is the reason my script displays nothing? How do I correct it? Is there another way to do this?

Thank you

  • If your first field has a fixed number of characters and the input is already sorted, you can use uniq -D -s3 catalog.txt.
    – Sundeep
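    Since the sample IDs are all three characters and the duplicate lines in the sample are already adjacent, that would look roughly like this (a sketch, not part of the comment itself):

    $ uniq -D -s3 catalog.txt
    808:MixerGrinder:Electronics:10:35000
    809:MixerGrinder:Electronics:10:35000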

6 Answers

10

The reason your command outputs nothing is that your printf statement outputs

:MixerGrinder:35000\$ 

... which matches nothing in the file, instead of

:MixerGrinder:Electronics:10:35000$ 

... which matches the lines that you wanted to match.

It does that because you have escaped the $ in the single-quoted printf format string, and because you are using 2,5 rather than 2-5 in the field range argument to cut. The $ does not need escaping there: the string is single-quoted, and it would not need escaping in any other case either, since it does not introduce a shell expansion in this position.

You also wrote < (...) instead of <(...) (there should not be a space between < and (...)), which broke the process substitution and should have caused a syntax error (bash: syntax error near unexpected token `(').
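With those three points fixed, the pattern list fed to grep comes out as intended. A quick check (a sketch, assuming the sample catalog.txt, whose fields contain no spaces):

$ printf ':%s$\n' $(cut -d ':' -f2-5 catalog.txt | sort | uniq -d)
:MixerGrinder:Electronics:10:35000$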

An alternative approach, similar to yours but one that also allows for spaces in the fields:

$ grep -f <(cut -d : -f 2-5 catalog.txt | sort | uniq -d | sed 's/.*/:&$/') catalog.txt
808:MixerGrinder:Electronics:10:35000
809:MixerGrinder:Electronics:10:35000

Another option is to use Miller (mlr) to add a temporary count field to each record, holding the number of records that share the same fields 2, 3, 4, and 5. We then filter the resulting data, retaining only those records with a count of more than 1, and remove the count field:

$ mlr --nidx --fs colon count-similar -g 2,3,4,5 then filter '$count>1' then cut -x -f count catalog.txt
808:MixerGrinder:Electronics:10:35000
809:MixerGrinder:Electronics:10:35000

This reads the data as an integer-indexed input file (--nidx), which uses colons for field separators (--fs colon). If the data is actually a header-less CSV file, then use --csv and -N in place of --nidx. This would allow Miller to understand quoted CSV fields.
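For reference, that CSV variant might look like the following; this is only a sketch assembled from the note above, and flag handling can vary between Miller versions:

$ mlr --csv -N --fs colon count-similar -g 2,3,4,5 then filter '$count>1' then cut -x -f count catalog.txt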

    6

    No need to do two passes over the file or to end up potentially doing hundreds of regexp matches per line of input.

    With awk, you can do something like:

    awk -F: -v OFS=: '
      {
        line = $0; $1 = ""; key = $0
        if (count[key]++) {
          print saved[key] line
          delete saved[key]
        } else {
          saved[key] = line ORS
        }
      }' < your-file

    The count[key]++, which increments a counter for each occurrence of the key (the key here being the full line after the first field has been emptied), does a hash table look-up rather than a comparison (let alone a regex match) against each member of the array, so it is going to be orders of magnitude faster.

    That means it can work inline on a pipe as well.

    We do end up storing each unique key value in memory (plus the full line in saved for those without duplicates), so this wouldn't be suitable for an input that has billions of lines, few of them with duplicate keys.

    We can reduce the memory usage by saving only the first field rather than the full line in saved, which complicates the code slightly:

    awk -F: -v OFS=: '
      {
        id = $1; $1 = ""; key = $0
        if (count[key]++) {
          if (key in saved) {
            print saved[key] key
            delete saved[key]
          }
          print id key
        } else {
          saved[key] = id
        }
      }' < your-file
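    On the sample data, both variants print the expected two records. As a quick check of this reduced-memory version (a sketch, with the program text between the single quotes assumed to be saved in a file named dedup.awk, a name chosen here purely for illustration):

    $ awk -F: -v OFS=: -f dedup.awk catalog.txt
    808:MixerGrinder:Electronics:10:35000
    809:MixerGrinder:Electronics:10:35000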
      5
      grep -f <(printf ':%s$\n' $(cut -d ':' -f2-5 catalog.txt | sort | uniq -d) ) catalog.txt 
      • no \ before the $
      • 2-5 instead of 2,5

      You knew that you wanted

      grep :MixerGrinder:Electronics:10:35000$ catalog.txt 

      ... to be executed. To test your proposed command, you would have used

      printf ':%s\$\n' $(cut -d ':' -f2,5 catalog.txt | sort | uniq -d) 

      ... to see whether this produced the intended output to be used with grep.

      • If you quote $(cut -d ':' -f2-5 catalog.txt | sort | uniq -d) then it's one argument that is passed to printf. If you leave it unquoted, then split+glob applies and you should tune split and disable glob.
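      A sketch of what "tune split and disable glob" could look like in practice (this snippet is illustrative and not part of the comment above):

      IFS=$'\n'   # split the command substitution on newlines only
      set -f      # turn off globbing so wildcard characters in keys are not expanded
      grep -f <(printf ':%s$\n' $(cut -d ':' -f2-5 catalog.txt | sort | uniq -d)) catalog.txt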
      4

      The sort and uniq commands may also be of value. Your data is delimited by colons (:), and you can sort on whatever columns you'd like. To sort only on column 2, you would type

      sort -t: -k2,2 

      The -t option sets the field separator (:), and -k gives the key (column number) to sort on. To sort on all columns starting with column 2:

      sort -t: -k2 
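      On the sample data, either form brings the duplicate records next to each other, which is what uniq needs (a sketch, assuming the records are in catalog.txt):

      $ sort -t: -k2 catalog.txt
      807:Lipstick:Cosmetics:50:250
      808:MixerGrinder:Electronics:10:35000
      809:MixerGrinder:Electronics:10:35000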

      Then you can try the uniq command, whose GNU and FreeBSD implementations have the ability to show all of the duplicate lines with their -D option:

       uniq -D -f 1 

      The -D option prints all duplicate lines (assuming the input data is already sorted), and -f 1 skips field 1. However, uniq uses whitespace as its field separator. Do you need the first field printed?

      There are some options here. If you don't need the first field, send your output through cut and eliminate the first field:

      cut -d: -f2- | sort 
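      Continuing that route with uniq -D gives the duplicates without their IDs (a sketch, assuming catalog.txt as the input):

      $ cut -d: -f2- catalog.txt | sort | uniq -D
      MixerGrinder:Electronics:10:35000
      MixerGrinder:Electronics:10:35000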

      It looks like your input data does not contain tabs. If that is the case, you can replace each ':' with a <tab>, run uniq, and then turn the tabs back into colons:

      sort -t: -k2 | tr ':' '\t' | uniq -D -f 1 | tr '\t' ':' 
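      Applied to the sample file, that pipeline should produce exactly the two lines you are after (a sketch, assuming catalog.txt as the input to sort):

      $ sort -t: -k2 catalog.txt | tr ':' '\t' | uniq -D -f 1 | tr '\t' ':'
      808:MixerGrinder:Electronics:10:35000
      809:MixerGrinder:Electronics:10:35000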

      BTW: on the uniq command (on my system, anyway), there is a difference between -d and -D. Per the man page:

      -d, --repeated only print duplicate lines, one for each group

      -D print all duplicate lines

      As the old adage says, "There is more than one way to skin a cat." Try a few different ways and see what works best for you, or which one you understand best.

        3

        Using any awk plus any sort, and so storing close to nothing in memory in awk no matter how large the input is (sort is designed to do demand paging, etc. to handle massive input, so its memory usage is less of a potential issue):

        $ sort -t: -k2 -k1,1n file | awk '
            { key=$0; sub(/^[^:]*/,"",key) }
            key != prev { first=$0 ORS; prev=key; next }
            { print first $0; first="" }
        '
        808:MixerGrinder:Electronics:10:35000
        809:MixerGrinder:Electronics:10:35000
          0

          Using Raku (formerly known as Perl_6)

          ~$ raku -ne 'BEGIN my %hash; %hash.append: .split(":", 2).reverse; END for %hash.grep( *.value.elems > 1 ).invert() { put .key ~":"~ .value; };' file 

          Above is an answer written in Raku, a member of the Perl-family of programming languages. The -ne command line flags enable non-autoprinting linewise (awk-like) operation.

          • We BEGIN by declaring a hash, which will maintain keys unique while accumulating associated values.
          • In the main loop we take the topic (line) and .split it on the colon into 2 pieces, then reverse the order of the text fragments and append them to the hash. This makes the 'predicate' of the line the key, while the first field becomes the value.
          • At the END we grep through the hash for kv-pairs that have more than one element, then invert the values/keys to get back the original number of duplicate input lines, finally restoring the intervening colon (:) in the block.

          Sample Input:

          807:Lipstick:Cosmetics:50:250
          808:MixerGrinder:Electronics:10:35000
          809:MixerGrinder:Electronics:10:35000

          Sample Output:

          808:MixerGrinder:Electronics:10:35000
          809:MixerGrinder:Electronics:10:35000

          https://docs.raku.org/language/hashmap
          https://raku.org
