Removing superscripts from documents?

Forum for TextAloud version 4

Moderator: Jim Bretti

PaulM
Posts: 5
Joined: Mon Nov 30, 2020 9:20 pm

Removing superscripts from documents?

Post by PaulM »

Hi:

I did see this on the TextAloud 3 forum:

([a-z][.?!])(\d+)([,-]\d+)*

This works OK, except that the text TextAloud generates from epub files sometimes has a quotation mark (") immediately preceding the superscript. I'd imagine this regular expression can be modified to handle this case as well.
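
One way the expression might be extended is sketched below in Python, purely as an illustration: TextAloud's own regex flavor and replacement syntax may differ. The function name `strip_superscripts` is hypothetical, and allowing a curly closing quote (U+201D) alongside the straight one is an assumption about what the extracted text contains. The optional quote is folded into the first group so that it is kept while the number is dropped.

```python
import re

# Sentence-ending letter + punctuation, an optional closing double quote
# (straight or curly), then the superscript number(s) such as 12, 3,4 or 5-7.
SUPERSCRIPT_RE = re.compile(r'([a-z][.?!]["\u201d]?)\d+(?:[,-]\d+)*')

def strip_superscripts(text: str) -> str:
    """Drop note numbers that directly follow sentence punctuation."""
    # Keep group 1 (letter, punctuation, optional quote); discard the digits.
    return SUPERSCRIPT_RE.sub(r'\1', text)

print(strip_superscripts('It ended."12 Next sentence.'))
print(strip_superscripts('As shown.3,4 Then more text.'))
```

Like the original pattern, this cannot tell a note number from ordinary digits that happen to follow a letter and a period, so some false positives are likely.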

However, what generates the text that TextAloud reads? I'd think much of the epub file can be extracted as HTML. If so, is it possible to remove the superscripts at that stage, when they are easy to identify clearly and accurately?

Thanks,

Paul
Jim Bretti
Posts: 1558
Joined: Wed Oct 29, 2003 11:07 am

Re: Removing superscripts from documents?

Post by Jim Bretti »

Hi Paul,

I'll take a look at this. It seems like we should be able to strip superscripts when reading the source HTML.
Jim Bretti
NextUp.com
PaulM
Posts: 5
Joined: Mon Nov 30, 2020 9:20 pm

Re: Removing superscripts from documents?

Post by PaulM »

Hi Jim:

At least for epubs, after you extract the epub archive, it seems you can run a regex replacement like this over the HTML files in the archive:

Search: <sup[^>]*>([^<]*)</sup>
Replacement: [\1]

It might need to be modified depending on the flavor of the regex engine, but it seems to work: it replaces each HTML superscript element with just the superscript number enclosed in square brackets. Then you can simply use a Text Filter to 'Filter text in square brackets'.
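
For anyone who wants to script this step, here is a minimal Python sketch of the same search and replace, under a few assumptions: the function names are hypothetical, the content files are assumed to be UTF-8 with .html, .xhtml or .htm extensions, and the epub is treated as a plain zip archive. It is not how TextAloud itself processes epubs.

```python
import re
import zipfile

# Same pattern as the Search line above; case-insensitive in case of <SUP>.
SUP_RE = re.compile(r'<sup[^>]*>([^<]*)</sup>', re.IGNORECASE)

def bracket_superscripts(html: str) -> str:
    """Replace <sup ...>12</sup> with [12]."""
    return SUP_RE.sub(r'[\1]', html)

def rewrite_epub(src_path: str, dst_path: str) -> None:
    """Copy an epub, bracketing superscripts in its HTML files."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, 'w') as dst:
        # Entries are copied in their original order, reusing each entry's
        # ZipInfo so its compression setting carries over.
        for item in src.infolist():
            data = src.read(item)
            if item.filename.lower().endswith(('.html', '.xhtml', '.htm')):
                text = data.decode('utf-8')  # assumption: UTF-8 content
                data = bracket_superscripts(text).encode('utf-8')
            dst.writestr(item, data)

print(bracket_superscripts('the end.<sup class="fn">12</sup> Next'))
```

Because the pattern uses `[^<]*`, a superscript that wraps another tag (for example a link around the note number) is left untouched and would need a different approach.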

Paul
Jim Bretti
Posts: 1558
Joined: Wed Oct 29, 2003 11:07 am

Re: Removing superscripts from documents?

Post by Jim Bretti »

Hi Paul,

Thanks for the tip, that helps!
Jim Bretti
NextUp.com