Creating an HTML-Based eASLRB

ColinJ

Member
Joined
Dec 19, 2009
Messages
129
Reaction score
107
Location
Calgary
Country
Canada
I've taken the plunge and decided to create an HTML-based eASLRB. Thoughts of adding in some JavaScript have entered my mind, but only briefly, since that is a whole other topic. Though there are rumours of many eASLRBs being created out there, I have started from scratch. I went with HTML because, barring the quirks of various browsers, it makes it easy to display, scale and move the content across different devices and operating systems. Since I've had some hiccups and issues along the way, I am starting this thread to assist others who have recently made the "choice" to go down the eASLRB road. Or maybe to assist them in adding new modules to their existing eASLRB.

It all started with just scanning everything into PDF. Reading the PDFs on tablets was a hassle due to the two-column layout. To fix that, I created two copies of, say, Chapter A. I cropped the first copy to the 3.75" wide left-hand (LHS) column and the second copy to the 3.75" right-hand (RHS) column, then used a collate plugin in Acrobat to merge the two documents. It was a pain, as the columns shifted slightly on different pages. That taught me that scanning the pages manually made for a better scan: the sheet feeder would sometimes put a slight twist on the page, maybe 0.25-0.5 degrees, that screwed up the page cropping. BTW, this was on a large Minolta commercial copier/scanner, not a small home printer. A bit of tweaking here and there made the continuous scroll work well, with minimal jumping left or right between pages. I had to adjust some two-column portions back to full width, but it worked OK. The result was fine for reading but not for searching, and the file sizes were substantial. I added hyperlinks to each section to make navigation easier. Chapters A-G went in one PDF, Chapter H in another, and the remainder in a third.

After a bit, I took the next step: using the Optical Character Recognition (OCR) in Acrobat to convert the document to text. From Acrobat, I saved the document as a Word document, then from Word saved it as HTML. The text conversion was OK, say 95%, but the OCR really struggled with the graphics and with text in italics. What an F'in mess. I just ended up saving the mess to a text file and reading through the sections with the rule book on one half of the screen and the text document on the other. I did corrections and search/replace as I went, and after many hours had an accurate text version of Chapter A. When a search and replace had 100 hits, it was great; when there were 1 or 2, not so much. After a bit, I figured out what was likely a big issue and what wasn't. BTW, I used Notepad++ as the text editor.
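For anyone who would rather script the search/replace pass than run it by hand, here is a minimal Python sketch. It is not the workflow described above (that was done interactively in Notepad++); the patterns in `FIXES` are hypothetical examples of OCR confusions, not a list taken from this thread. The hit count per pattern shows which fixes are worth keeping (100 hits) and which are one-offs.

```python
import re

# Hypothetical OCR fixes as (pattern, replacement) pairs; illustrative only.
FIXES = [
    (r"\bA11\b(?= units| hexes)", "All"),  # "All" misread as "A11"
    (r"\bl(?=\d)", "1"),                   # lowercase L before a digit -> one
    (r"(?<=\d)O\b", "0"),                  # capital O after a digit -> zero
]

def apply_fixes(text, fixes=FIXES):
    """Apply each fix in order and report how many hits each one had."""
    report = []
    for pattern, replacement in fixes:
        text, hits = re.subn(pattern, replacement, text)
        report.append((pattern, hits))
    return text, report

if __name__ == "__main__":
    fixed, report = apply_fixes("A11 units within l2 hexes roll 1O")
    print(fixed)
    for pattern, hits in report:
        print(f"{hits:4d}  {pattern}")
```

Keeping the fixes in a list means the same pass can be rerun on the next chapter, and a pattern that starts producing bad replacements can be dropped in one place.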

Next up, the joys of converting the text document to HTML.

Colin
 

Kijug

Senior Member
Joined
May 30, 2008
Messages
418
Reaction score
390
Location
Texas
First name
Matt
Country
United States
I took the challenge years ago. I OCR'd the text and created an HTML eASLRB: only text, no pics. I did hand-create most tables (I can do manual HTML no problem). What started me, however, was the challenge of whether I could write a program that would read the raw text, find the paragraph headings, bold them, find all glossary terms and hyperlink 'em, then find all rule numbers and hyperlink those. Not all numbers are rules, BTW...ha! I got about 99%. A few hand HTML fixes and it works well. Again, just text and links all over the place from the glossary to the text.
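A minimal Python sketch of the heading and glossary steps of such a program. The post does not say what language or approach the original used, so everything here is an assumption: the `GLOSSARY` entries, the `glossary.html` page name, the anchor ids, and the idea that a heading is an all-caps run ending in a colon.

```python
import re

# Hypothetical glossary: term -> anchor id on a (hypothetical) glossary page.
GLOSSARY = {"Good Order": "good-order", "Observer": "observer"}

def link_glossary(text, glossary=GLOSSARY, page="glossary.html"):
    """Wrap every glossary term found in plain text in a link."""
    # Longest terms first so a short term cannot split a longer one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    return pattern.sub(
        lambda m: f'<a href="{page}#{glossary[m.group(1)]}">{m.group(1)}</a>',
        text,
    )

def bold_heading(paragraph):
    """Bold a leading 'N.N HEADING:' run, e.g. '1.2 RADIO CONTACT ATTEMPT:'."""
    return re.sub(
        r"^(\d+(?:\.\d+)*\s+[A-Z][A-Z /()&-]*:)", r"<b>\1</b>", paragraph
    )

if __name__ == "__main__":
    print(bold_heading("1.2 RADIO CONTACT ATTEMPT: OBA may be called"))
    print(link_glossary("a Good Order leader, or an Observer"))
```

Both functions expect plain text; running `link_glossary` over text that already contains markup could match terms inside attributes.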

I did this about ten years ago. Even using the eASLRB on the computer for VASL, I still find myself going back to the ASLRB. The reason is that I remember the rule I need is on the bottom left corner of page.... so it is easy to flip and find that rule more quickly than even using the eASLRB. Also, each sub-chapter was an HTML page so it would load quickly.

I use this mostly on my devices when I'm on the go and want to look something up (e.g., in the car, on a trip, etc.) Otherwise, I'm back to the real ASLRB (home) and pASLRB (away games).

Good luck! It was really fun, I have to admit...but it did take a bit of effort...and my program was rather successful.
 

Vinnie

See Dummies in the index
Joined
Feb 9, 2005
Messages
17,425
Reaction score
3,364
Location
Aberdeen, Scotland
Country
United Kingdom
Good luck to you.
Learn some basic CSS to help with the layout.
I recommend Notepad++, as the ability to create macros for simple repetitive tasks really helps.
 

Pacman Ghost

Senior Member
Joined
Feb 25, 2017
Messages
590
Reaction score
298
Location
A maze of twisty little passages, all alike
Country
Australia
FWIW, I built an eASLRB last year, with a search engine on the RB index only, and it works really well. If you're looking for, say, "morale", you don't want the hundreds of hits in the main body of the text, you just want the rules pertaining specifically to morale.

So, I just spent the time making sure the OCR was good for the index - you'll go mad if you try to correct it for the whole RB! - scanned the RB to a PDF and added bookmarks for each rule section, so that clicking on a search result jumps to the correct place in the PDF.

Yes, there are no hyperlinks within the body of the text, but I find that I don't really need it, and on the odd occasion where I do want to jump to another rule, I can just do another search (the search index also supports searching for rule numbers).
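A minimal Python sketch of an index-only search along these lines. It is not the actual implementation; the `INDEX` entries are illustrative, and apart from C1.2 (quoted elsewhere in this thread) the rule numbers are placeholders that have not been checked against the real index.

```python
# Hypothetical slice of an OCR'd index: entry -> rule references.
# Rule numbers other than C1.2 are placeholders for illustration.
INDEX = {
    "Morale": ["A10"],
    "Morale Check": ["A10.1"],
    "Radio Contact": ["C1.2"],
}

def search_index(query, index=INDEX):
    """Return (entry, refs) pairs whose entry text or rule number matches."""
    q = query.strip().lower()
    hits = []
    for entry, refs in index.items():
        # Match on the entry wording, or on the start of a rule number.
        if q in entry.lower() or any(r.lower().startswith(q) for r in refs):
            hits.append((entry, refs))
    return sorted(hits)

if __name__ == "__main__":
    for entry, refs in search_index("morale"):
        print(entry, ", ".join(refs))
```

Each returned rule number can then be mapped to a PDF bookmark or an HTML anchor, whichever the rulebook copy uses.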

BTW, ABBYY FineReader is worth checking out, especially for correcting OCR.
 

Vinnie

Hyperlinks are really easy to add using Notepad++. Search through the text and replace.
I think it's ([a-z]+)([0-9]+).([0-9]+) replace with <A href="../\1/\1.html#\1\2.\3">\1\2.\3</a>

This is from memory but should pick up all references to links in the format D3.56 and so on.
It assumes you have a base directory with chapters as folders containing their own HTML files, and each paragraph having an id made up of the chapter letter, section and paragraph number.
 

Pacman Ghost

Vinnie said:
I think it's ([a-z]+)([0-9]+).([0-9]+) replace with <A href="../\1/\1.html#\1\2.\3">\1\2.\3</a>

Trying to do this with regex's is really messy :)

For example, you want to detect A-Z, case-insensitive, and probably only a single character, i.e. "[A-Z]", not "[a-z]+". The dot should be a literal dot, not an "any character" dot, and so needs to be escaped. The regex above also doesn't detect things like "C3" or "D.4" or "A1.2.3".

And that's before you consider the OCR doing things like reading "All" as "A11", or vice versa.
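A minimal Python sketch of a tightened pattern covering the points above: one chapter letter, an escaped dot, and the "C3", "D.4" and "A1.2.3" forms. The `../D/D.html#D3.56` link layout is the one Vinnie described; the function name and the optional `known` set of valid rule ids are assumptions added for illustration. The pattern is upper-case only, on the assumption that the OCR output preserves case.

```python
import re

# One chapter letter, then either "3", "3.56", "1.2.3" or the ".4" form.
RULE_REF = re.compile(r"\b([A-Z])(\d+(?:\.\d+)*|\.\d+)(?!\w)")

def link_rule_refs(text, known=None):
    """Link rule references; if `known` is given, link only ids found in it."""
    def repl(m):
        chapter, ref = m.group(1), m.group(0)
        if known is not None and ref not in known:
            return ref  # leave unverified matches as plain text
        return f'<a href="../{chapter}/{chapter}.html#{ref}">{ref}</a>'
    return RULE_REF.sub(repl, text)

if __name__ == "__main__":
    print(link_rule_refs("See D3.56 and C3."))
    print(link_rule_refs("A11 units", known={"A1.1"}))
```

A regex cannot tell a genuine "A11" from a misread "All"; checking matches against a list of rule ids that actually exist only catches references to rules that do not exist, so proofreading is still needed.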

Source: beating my head against a wall trying to do this using regexes for my own eASLRB.
 

Vinnie

Personally, I OCR each page into text and format it as I go along. The automated search function just makes it easier if I'm converting from one format to another. There is no substitute for just going through the text word by word. It takes me about 8 hours to OCR the average CG rules. Then another 8 to insert graphics and proof.
 

Sparafucil3

Forum Guru
Joined
Oct 7, 2004
Messages
11,335
Reaction score
5,070
Location
USA
First name
Jim
Country
United States
Man, I am glad I did this over time. All done these days. Cross-indexed. Fully linked. All the pictures. At one point, I was integrating the Perry Sez and Q&A. After all that, the only reason I still use it is because the pocket ASLRB still doesn't have chapter H. If it did, I don't know that I would use it much. -- jim
 

Pacman Ghost

Vinnie said:
It takes me about 8 hours to OCR the average CG rules. Then another 8 to insert graphics and proof.

Chapters A-D are some 160 pages, E-G are about another 100 pages, and that's before you get onto any modules and TPP products. That's an insane amount of work! And that's just OCR'ing - I remember just scanning the bloody pages being a major PITA.

And I have no doubt whatsoever that had I gone down that road, the day I finished it, MMP would announce the 3rd Edition of the ASLRB, surely with an electronic version available.
 

Vinnie

I had a document feed scanner when I got myself a new rulebook. That made the scanning so much easier! Add to that, I found a version where much of the work had been done, so I started with that.
 

ColinJ

Converting the text document used regex, as Vinnie mentioned above. I'm certainly no expert in regex (look-ahead, look-behind, etc.), and it took some time to learn, but it worked. There are some shortcuts, such as "\d" for digits [0-9] and "\u" for upper-case letters [A-Z], which helped shorten the setup.
How to id each of the sections for the hyperlinks? I ended up using L#_#, so A12.15 has the id "A12_15" in A.html.
Since rule references come in several forms (A1.1-A1.2, A1., and A1.11), it took some work to complete the conversion, and the passes needed to be done in the order shown. Matching a "\(" or a space before the reference helped avoid doubling up on the hyperlinks. Search for "([\( ])A([\d]+)\." and replace with \1<a href="#A\2">\2.</a>. For rules sections B and up, I added another term ([B-Z]) and changed the replacement terms accordingly: "([\( ])([B-Z]+)([\d]+)\." with \1<a href="#\2\3">\3.</a>.
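As an alternative to several ordered passes, a single pattern with a callback can build the `L#_#` ids in one go. This is a Python sketch, not the Notepad++ procedure described above. It assumes every chapter lives in a file named `<letter>.html` (the thread shows at least one exception, `H_DYO.html`), and the function name and `current_chapter` parameter are hypothetical.

```python
import re

# Skip anything already inside a link: preceded by "#" (href) or ">" (link text).
REF = re.compile(r"(?<![\w#>])([A-Z])(\d+)(?:\.(\d+))?(?!\w)")

def link_refs(text, current_chapter="A"):
    """Turn A12.15 into <a href="#A12_15">A12.15</a> (same page) or
    E7.6 into <a href="E.html#E7_6">E7.6</a> (other chapter)."""
    def repl(m):
        chapter, section, sub = m.groups()
        anchor = f"{chapter}{section}" + (f"_{sub}" if sub else "")
        # Same-chapter links need no file name; others assume "<letter>.html".
        page = "" if chapter == current_chapter else f"{chapter}.html"
        return f'<a href="{page}#{anchor}">{m.group(0)}</a>'
    return REF.sub(repl, text)

if __name__ == "__main__":
    once = link_refs("(A12.15) and E7.6", current_chapter="A")
    print(once)
    print(link_refs(once, current_chapter="A") == once)
```

The look-behind plays the same role as matching "\(" or a space before the reference: text that is already part of a link is not linked again, so the function can be rerun safely.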
Also used regex to add in paragraph tags and id attributes for each rules section.
For the paragraphs, barring typos, this works:
\r\n([\d]+)\.([\d]+) and replaced with </p>\r\n<p id="\1_\2"><span class="h4ch">\1\.\2

Then I followed up with additional search/replace passes to close out the span, usually replacing ":" with "</span>:", thus bolding the text from the section number to the colon. Of course, a few areas don't have colons, just a numbered section, so I always had to do multiple passes and checks. I also had some major F'ups from missing a bracket or something and had to go back several steps to fix it. Lots of backups and saves were done!
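The paragraph tagging and the span close-out can also be done in one scripted pass. The Python sketch below is an assumption-laden variant, not the Notepad++ procedure above: it prefixes the chapter letter to the id, treats text before the first colon as a heading only when it is all upper case, and otherwise bolds just the section number. The function name is hypothetical.

```python
import re

NUMBERED = re.compile(r"(\d+)\.(\d+)\b")

def tag_sections(text, chapter="A"):
    """Wrap each 'N.N ...' line in <p id="..."> and bold its heading."""
    out, open_p = [], False
    for line in text.splitlines():
        m = NUMBERED.match(line)
        if not m:
            out.append(line)  # continuation of the current paragraph
            continue
        if open_p:
            out.append("</p>")
        head, colon, rest = line.partition(":")
        if not (colon and head.isupper()):
            # No heading: bold only the section number itself.
            head, colon, rest = m.group(0), "", line[m.end():]
        out.append(
            f'<p id="{chapter}{m.group(1)}_{m.group(2)}">'
            f'<span class="h4ch">{head}</span>{colon}{rest}'
        )
        open_p = True
    if open_p:
        out.append("</p>")
    return "\n".join(out)

if __name__ == "__main__":
    raw = ("1.1 OBA represents a battery.\n"
           "1.2 RADIO CONTACT ATTEMPT: OBA may be called.")
    print(tag_sections(raw, chapter="C"))
```

A line of body text that happens to start with a number like "6.5" would be mistaken for a section, so the output still needs a proofreading pass.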
It may have been a better idea to keep the heading and the section numbers separate. This I did not do, though I may have to go back and tweak it.

Colin
 

Vinnie

For the ids on the headings, I'd include the chapter letter. It just makes it that bit easier to bring together.
 

Vinnie

An example.

<hr>
<p>
<span id="C1" class="headder">1. OFFBOARD ARTILLERY (OBA)</span>


<p>
<h2 id="C1.1" class="start">1.1</h2><p class="start"> OBA represents a battery of Guns outside the area represented by the mapboard, using radio-directed Indirect Fire to fire HE or SMOKE (or IR; <a href="../E/E.html#E1.93">E1.93</a>) onto designated areas of the mapboard. OBA availability is usually symbolized by the presence of a radio counter in the scenario OB and is further defined by SSR or DYO purchase as to type. Each radio in the OB represents one predesignated available OBA battery (aka module). Each battery may produce a variable number of Fire Missions, but only via the one radio representing it in the scenario. If that radio is lost, so is the opportunity to contact that battery; another radio may not be used to call in the remaining Fire Missions of that battery.


</p>
<p>
<img src="../C/Images/C1_2.GIF"><h2 id="C1.2" class="start">1.2 RADIO CONTACT ATTEMPT: </h2><p class="start">OBA may be called-for/placed/Corrected/Converted/voluntarily-Cancelled only if the friendly player currently has Radio Contact and Battery Access. Only an Observer (i.e., a Good Order leader possessing a functioning radio/field-phone counter [<a href="../C/C.html#C1.6">1.6</a>], or an OP tank [<a href="../H/H.html#H1.46">H1.46</a>], or an Offboard Observer [<a href="../C/C.html#C1.63">1.63</a>; <a href="../E/E.html#E7.6">E7.6</a>]) may attempt Radio Contact [EXC: not if he is a Rider, or a Guard whose US# is &lt; that of his prisoners], and may do so only at the start of the PFPh/DFPh<sup><a href="../C/C.html#CF1">1</a></sup> (as given in the Advanced Sequence of Play) as his sole action for that phase aside from other allowed OBA activities. Radio Contact is established with a DR &le; the Radio Contact value printed on the radio counter (&#9651;). If the Radio Contact DR is failed, neither the radio nor the Observer may attempt Radio Contact again until the start of his next PFPh/DFPh (whichever comes first; see also <a href="../C/C.html#C1.61">1.61</a>). The Contact value may vary with nationality and, in some cases, time frame of the scenario. If a radio counter has a multiple Contact value, the dates for their use are listed on the reverse side.
</p>
<p>
<strong>EX:</strong><span class="example"> The Russian Contact value of 6/7/8 indicates a Radio Contact value of 6 through June '42, 7 from July '42 through June '43, and 8 thereafter.
</span>
 

ColinJ

Turns out I did add the chapter letter into the ID! Blurb edited above! Here is my version of the above section C1.2. Thanks for noting the code for the Greek delta. I found some info, and &Delta; does the same.

<p id="C1_2"><span class="h4ch">1.2 RADIO CONTACT ATTEMPT</span>: OBA may be called for/<wbr>placed/<wbr>Corrected/<wbr>Converted/<wbr>voluntarily Cancelled only if the friendly player currently has Radio Contact and Battery Access.&nbsp; Only an Observer (i.e., a Good Order leader possessing a functioning radio/<wbr>field phone counter [<a href="#C1_6">1.6</a>], or an OP tank [<a href="H_DYO.html#H1_46">H1.46</a>], or an Offboard Observer [<a href="#C1_63">1.63</a>; <a href="E.html#E7_6">E7.6</a>]) may attempt Radio Contact <span class="ASLEXC">[EXC: not if he is a Rider, or a Guard whose US# is &lt; that of his prisoners]</span>, and may do so only at the start of the PFPh/<wbr>DFPh<a href="#CFN_1"><sup>1</sup></a> (as given in the Advanced Sequence of Play) as his sole action for that phase aside from other allowed OBA activities.&nbsp; Radio Contact is established with a DR &le; the Radio Contact value printed on the radio counter (&Delta;).&nbsp; If the Radio Contact DR is failed, neither the radio nor the Observer may attempt Radio Contact again until the start of his next PFPh/<wbr>DFPh (whichever comes first; see also <a href="#C1_61">1.61</a>).&nbsp; The Contact value may vary with nationality and, in some cases, time frame of the scenario.&nbsp; If a radio counter has a multiple Contact value, the dates for their use are listed on the reverse side.</p>

Ends up looking like so:
[attached screenshot of the rendered section]

Since we are at it: <wbr> marks a spot where the browser may break the line if needed, so long runs like for/<wbr>placed/<wbr>Corrected/<wbr>Converted/<wbr>voluntarily Cancelled can wrap at a <wbr>. I chose to add the <wbr> if more than 6 characters were involved: MF/MP is left as is, and PFPh/<wbr>DFPh will wrap.
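The <wbr> rule above is easy to script. This is a minimal Python sketch, assuming "more than 6 characters" refers to the whole slash-joined term (which fits MF/MP staying as is and PFPh/DFPh wrapping); the function name and the `max_plain` parameter are hypothetical. It should be run on plain text before any markup is added, since it would also split paths inside href attributes.

```python
import re

# A run of non-space text containing at least one "/" between two parts.
SLASHED = re.compile(r"[^\s/<>]+(?:/[^\s/<>]+)+")

def add_wbr(text, max_plain=6):
    """Insert <wbr> after each "/" in slash-joined terms longer than max_plain."""
    def repl(m):
        term = m.group(0)
        if len(term) <= max_plain:
            return term  # short terms such as MF/MP are left alone
        return term.replace("/", "/<wbr>")
    return SLASHED.sub(repl, text)

if __name__ == "__main__":
    print(add_wbr("MF/MP and PFPh/DFPh"))
    print(add_wbr("called-for/placed/Corrected"))
```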
 

ColinJ

Excellent! Little gems like these, which others can use without the pain of finding them, are the primary reason I started this thread!
The soft hyphen, HTML "&shy;", may be helpful for hyphenating longer words.
 