Extracting multiple strings, according to a pattern, in a bash script

Question

I'm writing a shell script to generate a directory listing.

As input, I receive a long HTML string:

https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw","$type":"com.traver.voyager.feed.actions.Action"}, link to post","url":"https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO","$type": article","$type":"com.traver.voyager.feed.actions.Action"},{"actionType":"SHARE_VIA","text":"Copy link to post","url":"https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T","$type":"com.traver.voyager

To make the output easily customizable, the script should just display a table of URLs:

https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw
https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO
https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T

The pattern to search for: it begins with "https://www.", continues with a variable number of characters, and ends at a " (the quote itself should not be extracted).

My current solution is based on cut -f, but the total input size is dynamic, so it cannot reliably find the pattern.

Is your long string HTML or JSON? If JSON, use jq or a JSON-parsing library for the language of your choice. If it's HTML, you can extract links from it easily with lynx -dump -listonly -nonumbers "$URL" (lynx can also read from a file or from stdin). - cas, commented Sep 8, 2019 at 1:23
It is HTML with hexadecimal and octal codes, which I convert with the recode tool. I tried your command, but it just keeps running and nothing happens. What is the value of $URL? - magique, commented Sep 8, 2019 at 1:34
Like this: lynx -dump -listonly -nonumbers "www" -stdin test07-09-B.txt or lynx -dump -listonly -nonumbers "$URL" -stdin test07-09-B.txt? - magique, commented Sep 8, 2019 at 1:34
My "command" was not meant for you to type exactly. It was an example, showing the lynx options which can be used to extract URLs from HTML data. "$URL" in the example is a stand-in for the original URL you used to fetch the HTML data. Or, as I mentioned, lynx can read a file or stdin. - cas, commented Sep 8, 2019 at 1:43
By the way, your sample "html string" really doesn't look like HTML. It looks like JSON or some very similar structured text. If it's not actually HTML, lynx won't be able to do anything useful with it. - cas, commented Sep 8, 2019 at 1:45

cas · Accepted Answer · 2019-09-08 23:27:27Z

Your sample data looks like a broken fragment of json, so you really should use jq to extract what you need from it before doing whatever it is you did to the original input that caused it to look like that.
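As a rough illustration of the jq route, here is a minimal sketch. It assumes the original, unmangled input is valid JSON and that the links sit in "url" fields, as the fragment suggests. The sample document, the example.com URLs, and the helper name extract_json_urls are all made up for the example, and jq must be installed.

```shell
#!/usr/bin/env bash
# Hypothetical, well-formed stand-in for the original JSON.
# The real structure is unknown; this layout is an assumption.
json='{"actions":[{"actionType":"SHARE_VIA","text":"Copy link to post","url":"https://www.example.com/posts/first-post-1111-aB"},{"actionType":"SHARE_VIA","text":"Copy link to post","url":"https://www.example.com/posts/second-post-2222-cD"}]}'

# extract_json_urls is a hypothetical helper name; it reads JSON on stdin.
extract_json_urls() {
    # ..       walk every value in the document, at any depth
    # objects  keep only JSON objects
    # .url     take the "url" field (null when the object has none)
    # strings  drop anything that is not a string, including those nulls
    # select   keep only values starting with https://www.
    # -r       print raw strings, without the surrounding quotes
    jq -r '.. | objects | .url | strings | select(startswith("https://www."))'
}

printf '%s\n' "$json" | extract_json_urls
```

Because the filter walks the whole document, it does not depend on where the "url" fields are nested, which is useful when the exact layout is not known in advance.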

However, to extract URLs beginning with https://www and not containing a double-quote character from what you have, you can use grep:

$ grep -o 'https://www[^"]*' input.txt
https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw
https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO
https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T
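To use the same grep pattern inside a script rather than at the prompt, something like the following sketch might do. It is only an illustration: the helper name extract_urls, the sample string, and the example.com URLs are invented here. The pattern also escapes the dot after www, which is slightly stricter than the command above. Note that -o is not in POSIX, though GNU and BSD grep both support it.

```shell
#!/usr/bin/env bash
# extract_urls is a hypothetical helper name; it reads text on stdin.
extract_urls() {
    # -o prints only the matched text, one match per line.
    # [^"]* matches everything up to, but not including, the next double quote.
    grep -o 'https://www\.[^"]*'
}

# Shortened, made-up stand-in for the real input.
sample='{"text":"Copy link to post","url":"https://www.example.com/posts/first-post-1111-aB","$type":"x"},{"url":"https://www.example.com/posts/second-post-2222-cD","$type":"x"}'

# Capture all matches, one URL per line.
urls=$(printf '%s\n' "$sample" | extract_urls)

# Walk the matches one by one, e.g. to build a numbered table.
count=0
while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip the empty line produced when nothing matched
    count=$((count + 1))
    printf '%d\t%s\n' "$count" "$url"
done <<EOF
$urls
EOF
```

Feeding the loop from a here-document rather than a pipe keeps it in the current shell, so count is still available after the loop ends.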
