<?xml version="1.0" encoding="US-ASCII" standalone="yes"?>
<!DOCTYPE paper [
<!ENTITY % docbook3 PUBLIC "-//Norman Walsh//DTD DocBk30 XML V0.7//EN"
"http://nwalsh.com/docbook/xml/0.7/db3xml07.dtd">
%docbook3;

<!-- same content model as a DocBook <chapter>, except <title> could
     be in <docinfo> -->
<!ELEMENT paper (docinfo?, title?, titleabbrev?, tocchap?,
                 (((calloutlist | glosslist | itemizedlist |
                    orderedlist | segmentedlist | simplelist |
                    variablelist | caution | important | note | tip |
                    warning | literallayout | programlisting |
                    programlistingco | screen | screenco | screenshot
                    | synopsis | cmdsynopsis | funcsynopsis |
                    formalpara | para | simpara | address | blockquote
                    | graphic | graphicco | informalequation |
                    informalexample | informaltable | equation |
                    example | figure | table | msgset | procedure |
                    sidebar | anchor | bridgehead | comment |
                    highlights | abstract | authorblurb | epigraph |
                    indexterm | beginpage)+,
                   (sect1* | refentry* | simplesect*)) |
                  (sect1+ | refentry+ | simplesect+)),
                 (index | glossary | bibliography)*)>
]>
<paper>
  <docinfo>
    <title>Initial Audio Characteristics for
<acronym>XSL</acronym></title>
    <author>
      <firstname>Christopher</firstname>
      <othername>R.</othername>
      <surname>Maden</surname>
      <affiliation>
        <jobtitle>Senior Tools Specialist</jobtitle>
        <orgname>O'Reilly &amp; Associates, Inc.</orgname>
        <address>
          <street>90 Sherman Street</street>
          <city>Cambridge</city>
          <state>MA</state>
          <country>USA</country>
          <postcode>02140</postcode>
          <email>crism@oreilly.com</email>
        </address>
      </affiliation>
    </author>
    <orgname>World-Wide Web Consortium XSL Working Group</orgname>
    <revhistory>
      <revision>
        <revnumber>0.1</revnumber>
        <date>7 May 1998</date>
        <revremark>Initial draft.</revremark>
      </revision>
    </revhistory>
    <subjectset>
      <subject>
        <subjectterm>Proposed Audio Characteristics for XSL</subjectterm>
      </subject>
    </subjectset>
    <keywordset>
      <keyword>XSL</keyword>
      <keyword>ACSS</keyword>
      <keyword>Aural Cascading Style Sheets</keyword>
      <keyword>XML</keyword>
      <keyword>accessibility</keyword>
      <keyword>blind</keyword>
      <keyword>style</keyword>
      <keyword>audio flow objects</keyword>
      <keyword>audio formatting objects</keyword>
    </keywordset>
    <abstract>
      <para>In keeping with the <acronym>W3C</acronym>'s mission of
accessibility, <acronym>XSL</acronym> must strive to make
accessible documents as easy to create as possible. The more
difficult accessibility is to achieve, the fewer documents will be
accessible, as authors will not perform extra work to benefit what is
often perceived as an insignificant minority. I propose adding audio
characteristics derived from Aural Cascading Style Sheets to the
otherwise visual flow objects in the July 1998 draft of
<acronym>XSL</acronym>.</para>
    </abstract>
  </docinfo>
  <sect1>
    <title>Introduction</title>
    <para>One of the greatest problems that faces accessibility
advocates on the Web today is the perception by many Web designers
that accessibility requires extra work on their part, or even that it
requires lower-quality designs for their sighted audience. Because of
this perception, I feel very strongly that <acronym>XSL</acronym> must
allow <quote>easy accessibility</quote> from the very beginning of its
design.</para>
    <para>This must be done in a way that does not impede future
capabilities enabling high-quality audio stylesheets, high-quality
visual stylesheets, or high-quality multimedia stylesheets. The goal
is that careful designers should be able to create works of beauty in
any medium, but that a lazy designer's visual stylesheet should have a
well-defined audio analog.</para>
    <para>It is arguable that enabling easy accessibility will justify
continued laziness on the part of designers. However, empirical
evidence on the Web indicates that most Web designers will perform the
minimal necessary work to achieve a desired effect for most of the
audience. Recent application of the Americans with Disabilities Act to
Internet information resources is beginning to have an effect on this,
but this may only affect public resources such as government
information pages. Unless accessibility becomes much easier to
achieve, the visually impaired will continue to be second-class
citizens of the Web, and if accessibility itself is a second-class
priority of the <acronym>XSL</acronym> design effort, that will only
worsen the situation.</para>
    <para>The <acronym>W3C</acronym> working draft on Aural Cascading
Style Sheets (<acronym>ACSS</acronym>)<footnote>
        <para><systemitem
            role="url">http://www.w3.org/TR/WD-acss-970630</systemitem></para>
      </footnote> defines a set of audio characteristics
(or properties) for use with <acronym>CSS</acronym> level 1.<footnote>
        <para><systemitem
            role="url">http://www.w3.org/TR/REC-CSS1-961217</systemitem></para>
</footnote> Some of these are still preliminary, and the
<acronym>ACSS</acronym> draft contains unanswered questions about
their exact nature. Other properties seem robust and simple enough to
be included in the initial <acronym>XSL</acronym> draft.</para>
  </sect1>
  <sect1>
    <title>Applicability of Audio Characteristics</title>
    <para>Most of the audio characteristics alter the manner in which
the content of an element is read, while others provide background
noise or sound effects. The applicability of those affecting textual
presentation seems obvious, while the difference between
<acronym>XSL</acronym> and <acronym>CSS</acronym> has an interesting
effect on sound effects.</para>
    <sect2>
      <title>Characteristics Affecting Textual Delivery</title>
      <para>These characteristics are only applicable to flow objects
whose content is a series of character flow objects. The
characteristics that inherit can be meaningful when applied to larger
flow objects, naturally (<foreignphrase>e.g.</foreignphrase>, an
entire series of paragraphs could have a slightly higher
volume).</para>
      <para>It is interesting to note that while visual formatting can
be a function of individual character flow objects (with some
interaction between neighboring characters in certain scripts), speech
rendering is a function of a series of characters, possibly across
flow object boundaries, necessitating a slight change in
terminology.</para>
      <para>These characteristics are:<itemizedlist>
          <listitem>
            <para>azimuth</para>
          </listitem>
          <listitem>
            <para>elevation</para>
          </listitem>
          <listitem>
            <para>pitch</para>
          </listitem>
          <listitem>
            <para>pitch-range</para>
          </listitem>
          <listitem>
            <para>richness</para>
          </listitem>
          <listitem>
            <para>speak-numeral</para>
          </listitem>
          <listitem>
            <para>speak-punctuation</para>
          </listitem>
          <listitem>
            <para>speech-rate</para>
          </listitem>
          <listitem>
            <para>stress</para>
          </listitem>
          <listitem>
            <para>voice-family</para>
          </listitem>
          <listitem>
            <para>volume</para>
          </listitem>
        </itemizedlist></para>
    </sect2>
    <sect2>
      <title>Backgrounds and Cues</title>
      <para><acronym>ACSS</acronym> provides properties that generate
sound effects before, during, or after the text of an element is
read. The ability of <acronym>XSL</acronym> to generate multiple flow
objects for a single element removes the need to express each of these
events as a separate characteristic, but it introduces a new problem.</para>
      <para>Flow objects that provide only a sound effect can be
thought of as flow objects of some length with no content, but with a
background sound. However, <acronym>ACSS</acronym> allows the sound
effect to be specified by <acronym>URI</acronym>. There are a few ways
we could provide this functionality in <acronym>XSL</acronym>.</para>
      <para>The first is to provide a set of characteristics roughly
equivalent to the value and keywords of <acronym>ACSS</acronym>'s
play-during property. One characteristic would take a
<acronym>URI</acronym> as its value, another would specify whether or
not to repeat the sound effect, and a third would state whether the
sound effect should combine with or replace any background currently
in place.</para>
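      <para>As a rough illustration, the three pieces of information
named above are carried by a single <acronym>ACSS</acronym>
declaration. This sketch assumes the mix and repeat keywords of the
<acronym>ACSS</acronym> draft; the selector and the sound file name
are hypothetical.</para>
      <programlisting>/* hypothetical selector and file name */
blockquote {
  play-during: url(murmur.au) mix repeat;
  /* url(...): which sound to play
     mix:      combine with any background already in place
     repeat:   loop the sound for the duration of the element */
}</programlisting>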
      <para>The second would be to have an audio flow object. This
object would have a <acronym>URI</acronym> as a characteristic,
along with controls such as volume. The flow object itself would
create a single iteration of the specified sound; repetition of the
sound would be handled by another flow object <quote>tiling</quote>
the audio flow object for its own duration.</para>
      <para>The second option is cleaner: it can be used for sound
effects before or after (or even during) an element's audio rendering,
and could possibly be generalized to video or other non-textual
events, or serve as a template for other specialized flow
objects. Sound effect cues under the first model would need to be
silent flow objects of a certain duration with a background sound,
since the <acronym>ACSS</acronym> specification states that the
background sound shall be truncated to the length of the element being
rendered. The cue flow object would need to know the length of the
sound effect in order to provide clearance, which seems unnecessarily
cumbersome.</para>
      <para>On the other hand, providing clearance for the background
sounds nicely provides a mechanism for pauses; see the discussion of
debatable characteristics.</para>
      <para>These characteristics affect the duration of an element's
rendering, by events before or after, or continual
effects:<itemizedlist>
          <listitem>
            <para>pause</para>
          </listitem>
          <listitem>
            <para>pause-after</para>
          </listitem>
          <listitem>
            <para>pause-before</para>
          </listitem>
          <listitem>
            <para>play-during</para>
          </listitem>
        </itemizedlist></para>
    </sect2>
  </sect1>
  <sect1>
    <title>Audio Characteristic Definitions</title>
    <para>Each characteristic specified as a property in
<acronym>ACSS</acronym> is discussed below. Two are proposed as
(relatively) uncontroversial candidates for inclusion in the July
draft of <acronym>XSL</acronym>; eight others are proposed as more
controversial candidates, with pros and cons discussed. Five more are
considered inappropriate for inclusion, and five are worth
consideration, but are too immature for the July draft.</para>
    <sect2>
      <title>Characteristics Retained from
<acronym>ACSS</acronym></title>
      <sect3>
        <title>volume</title>
        <para>The volume of a spoken element seems like an obvious and
uncontroversial characteristic. When applied to any flow object with
character flow object children, it governs the volume used to speak
the words composed of those characters.</para>
        <para><acronym>ACSS</acronym> specifies that this
characteristic should take a percentage, where 0% is the quietest
audible volume, and 100% is the loudest comfortable volume. It also
specifies five named volumes, plus <quote>silent</quote>, which means
that the element's rendering occupies sufficient time for it to be
spoken, but that no sound is actually produced.</para>
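        <para>A minimal sketch of these values in
<acronym>ACSS</acronym> syntax follows. The selectors are
hypothetical, and the named value is assumed to be spelled as in the
<acronym>ACSS</acronym> draft.</para>
        <programlisting>/* hypothetical selectors */
p      { volume: 50% }     /* midway between quietest audible and loudest comfortable */
strong { volume: loud }    /* one of the five named volumes */
.aside { volume: silent }  /* occupies the time needed to speak, but produces no sound */</programlisting>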
        <para>The default value for an <acronym>XSL</acronym> flow
object with no explicit or inherited specification should be
50%.</para>
      </sect3>
      <sect3>
        <title>speak-punctuation</title>
        <para>This characteristic governs whether punctuation within
the rendered element should be spoken specifically, or merely used to
guide inflection of the text. In code listings or grammar texts where
the exact punctuation is important, this is a useful
characteristic.</para>
        <para>In <acronym>ACSS</acronym>, this characteristic takes
the values <quote>code</quote> or <quote>none</quote>, with a default
value of <quote>none</quote>. I feel that a Boolean value is more
appropriate, and that the default should be false.</para>
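        <para>For reference, the two <acronym>ACSS</acronym> values
described above might be used as follows. The selectors are
hypothetical; only the property name and its two values come from
<acronym>ACSS</acronym>.</para>
        <programlisting>/* hypothetical selectors */
code { speak-punctuation: code }   /* punctuation is spoken explicitly */
p    { speak-punctuation: none }   /* punctuation only guides inflection (the default) */</programlisting>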
      </sect3>
    </sect2>
    <sect2>
      <title>Debatable Characteristics</title>
      <sect3>
        <title>pause-before, pause-after, pause</title>
        <para>Given <acronym>XSL</acronym>'s ability to generate
multiple flow objects from a single element, pause-before and
pause-after are not strictly necessary if pause is present. However,
by analogy with space-before and space-after in visual stylesheets,
they are conceptually associated with the element being rendered, and
probably deserve their own characteristics.</para>
        <para>In addition, providing these characteristics makes
specification of audio behavior easier for visual flow objects, as
that behavior can be specified as defaults on these characteristics;
otherwise the specification must be left to prose description. See the
section on user agent behavior.</para>
        <para>The pause characteristic itself seems meritorious, to
provide sonic whitespace. It is listed as debatable because it depends
on whether an audio flow object is adopted; if one is created,
the pause could be presented as an audio flow object of a certain
duration with no background.</para>
      </sect3>
      <sect3>
        <title>play-during</title>
        <para>As in the discussion about audio flow objects,
play-during could be handled as a <acronym>URI</acronym> directly, or
its value could be an audio flow object. All of this assumes that we
think that it is important enough to include in the initial
draft.</para>
      </sect3>
      <sect3>
        <title>azimuth, elevation</title>
        <para>These are simple and unambiguous; they specify the
location from which the sound of an element should appear to come. The
question is whether they are too simple and might block the
development of a clean solution for more complex audio location
capabilities, such as a speaker who moves while speaking.</para>
      </sect3>
      <sect3>
        <title>speech-rate</title>
        <para>This controls the rate at which an element is presented
to the listener. As with location characteristics, the question is
whether this characteristic represents an over-simplification that
might become clutter once a more complete solution emerges, such as
maturation of the voice-family characteristic.</para>
      </sect3>
      <sect3>
        <title>stress</title>
        <para>Initially, I had included this as uncontroversial, and
feel that it's the strongest of the debatable characteristics for
inclusion in an initial draft. However, its interaction with currently
immature characteristics such as pitch and pitch-range, and also with
speech-rate, raises the question of whether it is sufficiently
well understood to include in an initial draft.</para>
      </sect3>
    </sect2>
    <sect2>
      <title>Characteristics Intentionally Dropped</title>
      <sect3>
        <title>cue, cue-before, cue-after</title>
        <para>The cues can definitely be dropped in favor of
generation of a series of flow objects.  Whether the generated flow
objects will be audio flow objects with the desired sound effect or if
they will be pauses with background effects is uncertain, but either
is preferable to these characteristics; the <acronym>ACSS</acronym>
specification mentions that the :before and :after pseudo-elements
would be preferable to these properties.</para>
      </sect3>
      <sect3>
        <title>speak-date, speak-time</title>
        <para>This covers the same ground as discussion on the
<acronym>W3C</acronym>'s www-html list. The date and time should be
text generated by the user agent as appropriate for the language of the
document; once generated, it can be read as any other text can
be. Special characteristics to govern this, possibly at odds with the
native language of the document, are undesirable in my opinion.</para>
      </sect3>
    </sect2>
    <sect2>
      <title>Immature Characteristics</title>
      <sect3>
        <title>voice-family, pitch, pitch-range, richness</title>
        <para>According to the <acronym>ACSS</acronym> specification,
these characteristics aren't fully understood, and so are unsuitable
for inclusion in <acronym>XSL</acronym> right now. Full specification
of speaking voice is desirable in the long run, but is too complicated
for the initial draft.</para>
      </sect3>
      <sect3>
        <title>speak-numeral</title>
        <para>The implications of this characteristic are unclear from
the <acronym>ACSS</acronym> specification. Obviously, it controls
whether to read a number digit-by-digit or as text, but the
specification also provides for <quote>none</quote> as the
default value, which doesn't seem logical. I also think that
speak-digits would be a better name, with a Boolean value; more work
is needed on this characteristic.</para>
      </sect3>
    </sect2>
  </sect1>
  <sect1>
    <title>New Units</title>
    <para>The <acronym>ACSS</acronym> specification introduces three
new types of units, whose acceptance should be tied to acceptance of
characteristics that need them as values.</para>
    <sect2>
      <title>Angular Measurement</title>
      <para><acronym>ACSS</acronym> specifies three units of
measurement for angles: degrees, given as <quote>deg</quote>,
gradians, given as <quote>grad</quote>, and radians, given as
<quote>rad</quote>.</para>
      <para>Angular units are needed for these
characteristics:<itemizedlist>
          <listitem>
            <para>azimuth</para>
          </listitem>
          <listitem>
            <para>elevation</para>
          </listitem>
        </itemizedlist></para>
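      <para>To make the three units concrete, the sketch below writes
the same right angle in each of them: a full circle is 360 degrees,
400 gradians, or roughly 6.2832 radians. The selectors are
hypothetical.</para>
      <programlisting>/* hypothetical selectors; the first three rules give the same azimuth */
.speaker-a { azimuth: 90deg }
.speaker-b { azimuth: 100grad }
.speaker-c { azimuth: 1.5708rad }   /* approximately pi/2 */

/* elevation takes the same units */
.announcer { elevation: 45deg }</programlisting>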
    </sect2>
    <sect2>
      <title>Time</title>
      <para>In order to specify duration, <acronym>ACSS</acronym>
introduces the millisecond (<quote>ms</quote>) and the second
(<quote>s</quote>). They are used by<itemizedlist>
          <listitem>
            <para>pause-before,</para>
          </listitem>
          <listitem>
            <para>pause-after, and</para>
          </listitem>
          <listitem>
            <para>pause</para>
          </listitem>
        </itemizedlist>but it seems, in light of other discussion,
that some time measurement will be needed regardless of the exact
means of specifying pauses in <acronym>XSL</acronym>.</para>
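      <para>A brief sketch of the two time units in use follows; the
selectors and durations are hypothetical. One second is 1000
milliseconds, so the first two rules request the same pause.</para>
      <programlisting>/* hypothetical selectors and durations */
h1 { pause-before: 1s }
h2 { pause-before: 1000ms }   /* same duration as 1s */

p  { pause-after: 500ms }     /* half a second of silence after each paragraph */</programlisting>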
    </sect2>
    <sect2>
      <title>Frequency</title>
      <para><acronym>ACSS</acronym> provides hertz and kilohertz
(<quote>Hz</quote> and <quote>kHz</quote>) to measure frequency. The
measurements are used only by pitch, and so can probably safely be
left out of the initial <acronym>XSL</acronym> draft.</para>
    </sect2>
  </sect1>
  <sect1>
    <title>User Agent Behavior</title>
    <para>It is expected that a user agent with both audio and visual
rendering capabilities will give the user a choice of either or both
presentations. A completely blind user will not care about visuals
(and may not even have his monitor on), while a sighted user in an
office will probably not want any sound.  However, a nearsighted user
might want to have a page read, while still seeing the layout, and a
driver of a car might want a quick look at pictures while the text is
read.</para>
    <para>While in a single mode, or if the user agent is incapable of
visual or audio output, the user agent should at user option perform
some rendering of flow objects exclusive to the other mode. For
example, in visual-only mode, a user may still want to know about
sound cues or background sounds, so that she may request their
playing; in audio-only mode, a user may want to know about silent
elements that are present. The default behavior should be to follow
the stylesheet designer's instructions, though a user should be able
to change the default on her system.</para>
    <para>Whether or not pause-before and pause-after are kept as
characteristics, the default audio rendering of visual objects can be
specified in terms of what their defaults would be if they were
kept. For instance, the sequence and literal flow objects should not
have any pause before or after, by default, but the paragraph flow
object should have some brief pause before, perhaps one second. Table
cells, similarly, should provide some pause. If audio objects are
included in the July draft, a default separation should be specified
for every flow object.</para>
  </sect1>
  <sect1>
    <title>Braille</title>
    <para>Braille displays or embossers are the other primary means of
access for the visually impaired. However, Braille display is
computationally difficult and highly language-dependent, and should
not be addressed for the July draft.</para>
  </sect1>
</paper>

