I'm writing a shell script to generate a directory listing.

As input I receive a long HTML string:

https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw","$type":"com.traver.voyager.feed.actions.Action"}, link to post","url":"https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO","$type": article","$type":"com.traver.voyager.feed.actions.Action"},{"actionType":"SHARE_VIA","text":"Copy link to post","url":"https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T","$type":"com.traver.voyager 

To keep the output easy to customize, the script should just print a table of URLs:

https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw
https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO
https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T

The pattern to search for begins with "https://www.", continues with a variable number of characters, and ends at a " (the quote itself should not be extracted).

My current solution was based on cut -f, but the total input size varies, so it cannot reliably find the pattern.

  • Is your long string HTML or JSON? If it's JSON, use jq or a JSON-parsing library for the language of your choice. If it's HTML, you can easily extract the links from it with lynx -dump -listonly -nonumbers "$URL" (lynx can also read from a file or from stdin).
    – cas
    Commented Sep 8, 2019 at 1:23
  • It is HTML containing hexadecimal and octal codes, which I convert with the recode tool. I tried your command, but it just keeps running and nothing happens. What is the value of $URL?
    – magique
    Commented Sep 8, 2019 at 1:34
  • Like this: lynx -dump -listonly -nonumbers "www" -stdin test07-09-B.txt or lynx -dump -listonly -nonumbers "$URL" -stdin test07-09-B.txt
    – magique
    Commented Sep 8, 2019 at 1:34
  • My "command" was not meant to be typed exactly as written. It was an example showing the lynx options that can be used to extract URLs from HTML data. "$URL" in the example is a stand-in for the original URL you used to fetch the HTML data. Or, as I mentioned, lynx can read a file or stdin.
    – cas
    Commented Sep 8, 2019 at 1:43
  • By the way, your sample "HTML string" really doesn't look like HTML. It looks like JSON or some very similar structured text. If it's not actually HTML, lynx won't be able to do anything useful with it.
    – cas
    Commented Sep 8, 2019 at 1:45

1 Answer

Your sample data looks like a broken fragment of JSON, so you really should use jq to extract what you need from the original input, before applying whatever processing turned it into this.

However, to extract URLs beginning with https://www and not containing a double-quote character from what you have, you can use grep:

$ grep -o 'https://www[^"]*' input.txt
https://www.mycompany.com/posts/aureliaflore_china-seoul-startup-activity-6571925510337728512-acAw
https://www.mycompany.com/posts/aureliaflore_reuters-top-news-on-twitter-activity-6571392661482233856-T3dO
https://www.mycompany.com/posts/aureliaflore_are-you-thinking-to-the-benefits-of-digitalization-activity-6570119712154451968-927T
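
If the same post can appear more than once in the input, the grep call can be combined with sort -u inside a script. The following is only a minimal sketch: the sample data is a shortened, hypothetical stand-in for the string in the question (with one URL duplicated on purpose), and the variable names input and urls are made up for illustration.

```shell
#!/bin/sh
# Hypothetical, shortened sample input; the first URL is repeated on purpose.
input=$(mktemp)
cat > "$input" <<'EOF'
{"url":"https://www.mycompany.com/posts/a-1","$type":"x"},{"text":"Copy link to post","url":"https://www.mycompany.com/posts/b-2"},{"url":"https://www.mycompany.com/posts/a-1"}
EOF

# -o prints only the matching part, one match per line.
# [^"]* consumes everything up to, but not including, the next double quote.
# sort -u sorts the matches and drops duplicates.
urls=$(grep -o 'https://www[^"]*' "$input" | sort -u)

printf '%s\n' "$urls"
rm -f "$input"
```

Because the bracket expression stops at the first double quote, the length of each URL and the total size of the input do not matter, which is the limitation that made cut -f unsuitable here.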