AWK with RS not matching a pattern (asking again because I accidentally labeled it as solved. Better explanation this time.)

I have an odt file with blank lines between lines of text. I want to search for a term and output the whole group of text in which the term matches. My approach is to treat the blank lines in the odt file as the record separators. Odt files are zip archives with the text contained in content.xml. After unzipping the odt file, I used xmllint --format content.xml to insert newlines (as below), so the "blank" lines are actually lines with no text between > and <. I therefore want to set RS to match any line that has no text between > and <. If the formatted content.xml file is as follows:

<long line of alphanumerics, slashes, single and double quotes><more or the same><and many more>
<office:text>
<text:sequence-decls>
<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
<text:sequence-decl text:display-outline-level="0" text:name="Figure"/>
</text:sequence-decls>
<text:p text:style-name="P1">This is the first line</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">This is the third line</text:p>
<text:p text:style-name="P1">and this is some more text that is to be included</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">This is the sixth. I want it included,</text:p>
<text:p text:style-name="P1">with this line</text:p>
<text:p text:style-name="P1">and this one</text:p>
</office:text>

and the code is

$ awk '/line/' RS='\n[ \t]*<[^>]*>\n' file.xml

The whole file is output. But I only want:

<text:p text:style-name="P1">This is the first line</text:p>
<text:p text:style-name="P1">This is the third line</text:p>
<text:p text:style-name="P1">and this is some more text that is to be included</text:p>
<text:p text:style-name="P1">This is the sixth. I want it included,</text:p>
<text:p text:style-name="P1">with this line</text:p>
<text:p text:style-name="P1">and this one</text:p>

2 Answers

Your approach is fraught with problems. Most importantly, there's no obvious way to restrict the regex match to the body text of the document - in the case of /line/ for example, that's going to match tags such as <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>

(There's also an issue with your RS regex consuming two newlines, which will prevent it from properly handling adjacent separators; RS='\n([ \t]*<[^>]*>\n)+' might fix that but I won't guarantee it).

Instead, what I'd suggest is extracting the document's body text first and then applying awk in "traditional" paragraph mode (i.e. using the empty record separator):

xmlstarlet sel -t -v "//office:body/office:text/text:p" -n content.xml | awk -v RS= '/line/{print $0 ORS}'

or with GNU awk, preserving the actual record separators as parsed:

xmlstarlet sel -t -v "//office:body/office:text/text:p" -n content.xml | gawk -v RS= '/line/{printf "%s%s", $0, RT}'

You could even omit the intermediate file altogether, piping stdout from unzip -p:

unzip -p somefile.odt content.xml | xmlstarlet sel -t -v "//office:body/office:text/text:p" -n - | gawk -v RS= '/line/{printf "%s%s", $0, RT}'
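
To see what paragraph mode does on its own, here is a minimal sketch that skips the XML extraction and feeds awk plain text directly. The sample text is copied from the question; the variable names sample and result are only illustrative, and "sixth" is used as the search term because "line" happens to occur in every group of the example.

```shell
# Sample body text: three groups separated by blank lines (from the question).
sample='This is the first line

This is the third line
and this is some more text that is to be included

This is the sixth. I want it included,
with this line
and this one'

# An empty RS puts awk in paragraph mode: runs of blank lines separate records.
# A bare pattern prints every matching record in full.
result=$(printf '%s\n' "$sample" | awk -v RS= '/sixth/')
printf '%s\n' "$result"
```

Only the last group is printed, as a whole, because it is the only record containing the term.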

Answering my own question, following inspiration from steeldriver: I modified the file before using awk:

sed '/>.*</! s/.*/---/' test.txt > modfile.txt # replaces every line that does NOT match the pattern (no text between > and <) with "---", which I then use as the record separator

Then I was able to extract each entire record that matches $searchterm:

awk "/$searchterm/" RS="---" modfile.txt > results.txt
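
A possible variant of the same idea, offered as a sketch rather than a tested replacement: a multi-character RS such as "---" is handled as a regular expression by gawk and mawk, but POSIX awk only promises to honour the first character. Blanking the separator lines instead and using paragraph mode avoids relying on that. Passing the term with -v also keeps it out of the awk program text, so a "/" in the term cannot break the pattern (backslash escapes in -v values are still interpreted). The trimmed sample input and the names sample and result are assumptions for illustration.

```shell
# Trimmed, hypothetical stand-in for the xmllint-formatted content.xml.
sample='<office:text>
<text:p text:style-name="P1">This is the first line</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">This is the third line</text:p>
<text:p text:style-name="P1">and this is some more text that is to be included</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">This is the sixth. I want it included,</text:p>
<text:p text:style-name="P1">with this line</text:p>
</office:text>'

searchterm='third'

# Step 1: empty every line that has no ">...<" span, i.e. no text between tags.
# Step 2: paragraph mode (RS=) treats the emptied lines as record separators;
#         the term arrives via -v and is matched as a dynamic regex with ~.
result=$(printf '%s\n' "$sample" |
  sed '/>.*</!s/.*//' |
  awk -v RS= -v term="$searchterm" '$0 ~ term')
printf '%s\n' "$result"
```

With this input only the second group (the "third line" paragraph and the line after it) is printed.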
