I am just beginning to learn sed and awk. I have to submit a homework assignment tomorrow, which is a copy-paste from Wikipedia. Just the opportunity to practice some sed scripting!

So I have the document in HTML format. Now I need to replace every [<number>] with nothing. How would I do this?

This is what I tried, but I think it does not even match the pattern I want:

cat content.xml | sed 's/\[\d+\]/ /g' > content2.xml

As a next step, I want to handle patterns like the following, which are hyperlinks, but even the simple pattern above is not being matched:

<a href="https://en.wikipedia.org/wiki/Immune_system">immune system</a>

and then remove the citations:

<a name="cite_ref-Gleeson2007_27-0"/><a href="https://en.wikipedia.org/wiki/Physical_exercise#cite_note-Gleeson2007-27">[27]</a>
daltonfury42
  • "So I have the document in .odt format. I extract the content.xml file from it using Archive Manager." ... What? Why? O.o – muru Aug 13 '15 at 13:42
  • @muru can I run sed scripts inside .odt files? Anyways, I've exported it to html file for simplicity. I've updated the question. – daltonfury42 Aug 13 '15 at 13:44
  • Neither `\d` nor the `+` modifier are recognized in BRE (Basic Regular Expression) syntax AFAIK: try `[0-9]\+` or (POSIX) `[0-9]\{1,\}` or switch to ERE (Extended Regular Expression) using the `-E` or `-r` switch. **However** you should generally try to avoid parsing HTML/XML using regular expressions: *"that way, madness lies"*. – steeldriver Aug 13 '15 at 13:50
  • daltonfury42 well, if you look at the HTML, the references are actually: `[N]`. Also, I second @steeldriver. – muru Aug 13 '15 at 13:53
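
The regex fix suggested in the comments above can be sketched roughly as follows. This is a minimal example, assuming GNU sed; the `sample` text and the `strip_citations` helper name are made up for illustration and are not from the original post. Note that the replacement is empty, so the marker is removed rather than replaced by a space.

```shell
# Hypothetical sample line, standing in for the real document text
sample='Exercise boosts the immune system.[27] It also helps sleep.[3]'

# BRE: \d is not supported, and '+' must be written '\+' (a GNU extension)
printf '%s\n' "$sample" | sed 's/\[[0-9]\+]//g'

# POSIX BRE: the interval \{1,\} means "one or more"
printf '%s\n' "$sample" | sed 's/\[[0-9]\{1,\}]//g'

# ERE via -E: '+' works unescaped; wrapped in a helper (hypothetical name)
strip_citations() {
    sed -E 's/\[[0-9]+]//g'
}
printf '%s\n' "$sample" | strip_citations
```

All three commands should print the sample line with `[27]` and `[3]` removed. To process a file, something like `strip_citations < content.xml > content2.xml` avoids the extra `cat`.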

1 Answer

You went in the wrong direction; you should learn XML/XSLT (XSL Transformations, the XML stylesheet language) instead :), for use with either ODT or XHTML. For ODT, a macro may be better, but I am not familiar with that approach.

Take a look at the accepted answer to: RegEx match open tags except XHTML self-contained tags

The solution in this answer for How to replace all images in Libreoffice with their description should work for you too with a little modification.

user.dz
  • Well, thanks for the suggestion. Actually I wanted to remove citation numbers from a lot of text. The work was due the next day. So I wrote a sed script and it did the job! – daltonfury42 Sep 20 '15 at 17:36
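
The sed script mentioned in the last comment was never posted, so the following is only a guess at what such a one-off cleanup could look like, not the asker's actual script. It assumes GNU sed, that each tag sits on a single line, and that the markup matches the two samples from the question exactly; `clean_html` is a hypothetical helper name. As noted in the comments and the answer, regular expressions are fragile for HTML in general, so treat this as a quick hack rather than a robust parser.

```shell
# Hypothetical helper: strips Wikipedia-style citations and unwraps links
clean_html() {
    # 1. drop the empty <a name="cite_ref-..."/> anchors
    # 2. drop citation links whose href points at a #cite_note target
    # 3. unwrap any remaining link, keeping only its text
    # Order matters: step 3 before step 2 would leave "[27]" behind.
    sed -E \
        -e 's|<a name="cite_ref[^"]*"/>||g' \
        -e 's|<a href="[^"]*#cite_note[^"]*">\[[0-9]+]</a>||g' \
        -e 's|<a href="[^"]*">([^<]*)</a>|\1|g'
}

# Usage on the hyperlink sample from the question
printf '%s\n' 'the <a href="https://en.wikipedia.org/wiki/Immune_system">immune system</a> works' | clean_html
```

The `|` delimiter is used so that the `/` and `#` characters in the URLs do not need escaping.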