I need to extract the file paths from a log file, and I thought I would try this with a regex.

A file path would look like this:

75/751234/751234V0001_test-tag1-tag02-75x75_01.jpg 

I'm not a regex expert, so I only got as far as the second slash with the following expression. I can also match the number at the beginning of the filename, but not the keywords that can follow it.

([0-9]{2})[\/]([0-9]{2,10})[\/] 

I'm still missing the regex for the actual filename. The filename always starts with a number. After that, any number of keywords can follow.

The file extension can be .jpg, .tif, .zip, etc.

So the output should be the file path:

75/751234/751234V0001_test-tag1-tag02-75x75_01.jpg 

Maybe someone has a solution, and/or an improvement to the regex I have so far.

  • It would benefit your question to include some actual lines from the log file. I doubt that you will have to match the actual pathname; you could probably extract the pathnames based on other information in the log file instead.
    – Kusalananda
    Commented Aug 22, 2022 at 13:07
  • grep -Po '\d+/\d+/[\w-]+\.\w+' with GNU grep may or may not be enough; it is hard to tell with the very few requirements you give about what should and shouldn't be matched.
    Commented Aug 22, 2022 at 13:41

1 Answer

It seems that your file paths are organized as follows:

  • The filename starts with a multi-digit number.
  • The path starts with a directory consisting of the first two digits of that number.
  • It continues with a sub-directory consisting of the entire number.
  • The file in question is directly in that sub-directory and, apart from starting with the above mentioned number, has an extension from a limited set of possibilities.

If you have another way of identifying the lines that contain a filename, that might be preferable. If there are filenames of different patterns, and you want to focus on the pattern shown, then the following regex should work (the example uses GNU grep in ERE mode):

grep -E -w -o '([[:digit:]]{2})/(\1[[:digit:]]+)/\2[^[:digit:]][^/]*\.(jpg|tif|zip)' logfile.txt 

This uses backreferences (\1 and \2) to ensure that "the same text" is matched at different places in the string.

  • The string is required to start with two digits and a slash.
  • It then needs to continue with the same two digits as the start, followed by an unspecified number of further digits (you can substitute the + with {2,10} if the number of digits has a fixed range), and a slash.
  • The filename must then start with the same number as the second path element, followed by a non-digit character (to ensure the number really is the same as the second path element) and any number of further characters except the / (to exclude files in sub-directories or guard against multiple file paths being present on the same line), up to a final alternation of filename extensions (the list of which you can adapt to your needs).
  • The -o option ensures that only the matched part of the line (i.e. the file path) is returned. The -w option ensures that the result only matches full strings, i.e. not the sub-string of a possible longer filepath. This requires that filenames not contain spaces (which is a valid character for filenames!).
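
To see the expression at work, here is a minimal sketch that runs it on three made-up log lines. The "GET ... 200" line layout and the two non-matching paths are assumptions for illustration; only the first path comes from the question. It assumes GNU grep.

```shell
# Hypothetical log lines: only the first path is taken from the question.
sample_log='GET 75/751234/751234V0001_test-tag1-tag02-75x75_01.jpg 200
GET 75/751234/999999V0001_other.jpg 200
GET 12/751234/751234V0001_wrongdir.tif 200'

# Same expression as in the command above: \1 ties the sub-directory to the
# two-digit directory, \2 ties the filename prefix to the sub-directory name.
pattern='([[:digit:]]{2})/(\1[[:digit:]]+)/\2[^[:digit:]][^/]*\.(jpg|tif|zip)'

result=$(printf '%s\n' "$sample_log" | grep -E -w -o "$pattern")
printf '%s\n' "$result"
```

Only the first line yields a match. In the second line the filename does not begin with the sub-directory number, and in the third the top-level directory is not the first two digits of that number.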

Note that, strictly speaking, backreferences are only guaranteed to work in basic regular expressions, while alternation is only guaranteed to work in extended regular expressions. GNU grep does allow backreferences in extended regular expressions, so the command works in this case (requiring GNU grep is probably not much of a limitation).
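
If portability matters more than brevity, one possible workaround is to stay in basic regular expressions and replace the alternation with one grep call per extension. This is only a sketch under assumptions: the sample lines and the variable names are invented, the -o and -w options are, as far as I know, not required by POSIX either, and the output is grouped by extension rather than kept in log order.

```shell
# Hypothetical log lines: only the first path is taken from the question.
sample_log='GET 75/751234/751234V0001_test-tag1-tag02-75x75_01.jpg 200
GET 42/429876/429876_scan.tif 200
GET 42/429876/429876_notes.txt 200'

# BRE version of the path part: groups are \( \), intervals are \{ \},
# and + is spelled \{1,\}. The extension is appended per grep call.
prefix='\([[:digit:]]\{2\}\)/\(\1[[:digit:]]\{1,\}\)/\2[^[:digit:]][^/]*'

result=$(
  for ext in jpg tif zip; do
    # "|| true" keeps an extension without matches from counting as a failure.
    printf '%s\n' "$sample_log" | grep -w -o "${prefix}\.${ext}" || true
  done
)
printf '%s\n' "$result"
```

With these sample lines, the .jpg and .tif paths are printed and the .txt line is skipped, because txt is not in the extension list.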
