Joho the Blog

Editing audio by editing text

Jon Udell talks about his interview of Dan Bricklin, in which Dan talks about his experience entering the world of audio. Jon says:

When I embarked on my personal audio adventure a few years ago, I naively thought that our fancy new digital technologies would make the whole process very simple. Boy, was I wrong about that.

As a coda, Jon uses the story of the production of that very interview as an example of the routine complexities of audio.

Too true. I’m often tempted to record an interview but then I remember just what a pain in the butt it would be to edit it, even with my very low standards for audio quality.

So, is there something wrong with the idea of writing software that:

1. Converts spoken audio into text (presumably using existing tools)

2. Lets you use an editor to delete pieces of the text and move other pieces around, as you would with a low-end word processor

3. Uses the edited text to edit and output the audio

Even if Step 1 worked only moderately well, this application would turn editing spoken audio into a trivial task, no harder than (in fact, exactly the same as) editing a text file.
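As a rough sketch of how Steps 2 and 3 might fit together: assume Step 1 hands back each word with start and end times in seconds. The `Word` type, the `segments_for_edit` function, and the sample timings below are all hypothetical, made up for illustration and not taken from any real speech-recognition tool. The edited text is represented as the original word positions in their new order, and the function turns that into the spans of the original audio to keep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits in the original recording."""
    text: str
    start: float  # seconds
    end: float    # seconds


def segments_for_edit(words, kept_indices):
    """Turn an edited transcript into (start, end) spans of the original audio.

    `kept_indices` lists original word positions in their new order;
    deleted words are simply absent. Words that are still adjacent and in
    their original order are merged into a single span to keep cuts few.
    """
    segments = []
    prev = None
    for i in kept_indices:
        w = words[i]
        if prev is not None and i == prev + 1:
            # Contiguous with the previous kept word: extend the current span.
            segments[-1] = (segments[-1][0], w.end)
        else:
            segments.append((w.start, w.end))
        prev = i
    return segments


# Hypothetical recognizer output for "so um I think audio is hard".
words = [
    Word("so", 0.0, 0.3),
    Word("um", 0.3, 0.9),
    Word("I", 0.9, 1.0),
    Word("think", 1.0, 1.4),
    Word("audio", 1.4, 1.9),
    Word("is", 1.9, 2.1),
    Word("hard", 2.1, 2.6),
]

# The user deleted "um" and moved "audio is hard" to the front.
edited = [4, 5, 6, 0, 2, 3]
print(segments_for_edit(words, edited))
```

The point of the sketch is that the text editor never touches audio: it only needs to keep each word tied to its original position, and the audio work reduces to copying a short list of time spans.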

Does this software exist? Is there a good reason why it doesn’t, shouldn’t or couldn’t?

10 Responses to “Editing audio by editing text”

  1. Hi David,

    Although this is conceivable, the degradation introduced by speech-to-text and then text-to-speech processing would be severe.

    In any case, the bottleneck — at least for me — isn’t really the editing, that’s pleasant and straightforward. Rather I struggle, like most folks not expert in the audio domain, with /recording/ issues.

    – Jon

  2. Rather than speech to text to speech as I think Jon mentions, it should be (as I read David’s suggestion) speech to text with exact time-codes, then keep track of the edits (like track changes in a word processor) and then go back to the original sound with the edited points marked with easy delete/move and a nice trim control for tuning the edges if you want. That would let you do searches for when a person said “Oh, S…” when they spill coffee and cut that little section out of your G-rated hour-long podcast. How about a normal sound editor with the words shown underneath, like a music score with words?

    I think a lot of the editing, as I see it, for simple interviews is getting the volume levels even (which Levelator does) and deleting extra stuff on the ends and problems in the middle (like interruptions or something you can’t say or the SEC will get you).

    I agree with Jon that it is recording that is the problem. Almost all of my podcasts, including the most successful, had no editing (other than Levelator) or only adjusting the ends and adding an intro. Getting acceptable sound quality in an on-location environment is what is hard.

  3. Dan, thanks for putting better what I meant. Sorry for the confusion.

    My personal solution for the sound quality problem is to lower my standards. It’s just too hard for me. That leaves me with the editing problem. For that, having the text under the audio would help, but having a purely text-based editing option would, for the likes of me, be even better (because for me better=easier).

  4. Dan’s idea would work up to a point. Speech-to-text in a marked-up format with timecodes for each word. You edit the text in a special editor window — not complicated — then save the file and push the “render” button. If you’re going to have a “trim control” then you might as well render it in a format for some editor like Audacity and have the edit points appear as application-specific markers. OTOH, if you’re going to fine-tune the edit points, why not just use something like Audacity to begin with? Best for some to just use the approximate edit points and live with some awkward cuts.

    Jon and Dan are also correct that the real problem is in making the recording in the first place. I think you’ll find that a poor recording isn’t good enough for speech-to-text anyway, so this whole thing is moot.

    On-site recordings will remain a challenge, but for those who want to conduct telephone interviews without any hardware or software, I recommend the TalkShoe service. Everyone calls into a conference call and you end up with an MP3 file. And they use a higher-quality switch and recorder than most others.
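Returning to this comment's suggestion of handing the edit points to an editor like Audacity as markers, here is a minimal sketch. It assumes the plain-text label layout of one label per line (start seconds, tab, end seconds, tab, label text); the `to_audacity_labels` name and the sample cuts are hypothetical.

```python
def to_audacity_labels(cuts):
    """Format (start, end, name) cuts as a tab-separated label track.

    Assumption: one label per line, written as
    start seconds <TAB> end seconds <TAB> label text.
    """
    lines = []
    for start, end, name in cuts:
        lines.append(f"{start:.6f}\t{end:.6f}\t{name}")
    return "\n".join(lines) + "\n"


# Hypothetical edit points produced by a text-based editing pass.
cuts = [(0.3, 0.9, "cut: um"), (12.5, 14.0, "cut: interruption")]
print(to_audacity_labels(cuts), end="")
```

Saved to a text file, output like this could be loaded as labels so the approximate edit points show up alongside the waveform for fine-tuning.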

  5. “… speech to text with exact time-codes, then keep track of the edits (like track changes in a word processor) and then go back to the original sound with the edited points marked with easy delete/move and a nice trim control for tuning the edges if you want…..”

    Interesting idea, pragmatic… use word-processing skills for an audio extract, that’s the core, right?

    We’ve got the first part of that already, in the preview release of Soundbooth up on Adobe Labs:

    The next part would be a text editor, or a post-processing utility from your favorite text editor, and then reassembly of the audio from that text-based “edit decision list”. Interesting idea.
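The reassembly step could look something like the sketch below, which uses only Python's standard `wave` module. It assumes the "edit decision list" is just a list of (start, end) spans in seconds and that the source is uncompressed PCM WAV; `render_edl` and `make_wav` are hypothetical helper names, and the tiny 10-frames-per-second file exists only to make the result easy to check.

```python
import io
import wave


def render_edl(src_wav, edl):
    """Build a new WAV (as bytes) from (start, end) second spans of `src_wav`.

    `src_wav` is a path or binary file-like object holding PCM WAV audio.
    Spans are copied in list order, so the list can drop and reorder audio.
    """
    out_buf = io.BytesIO()
    with wave.open(src_wav, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        with wave.open(out_buf, "wb") as out:
            out.setnchannels(src.getnchannels())
            out.setsampwidth(src.getsampwidth())
            out.setframerate(rate)
            for start, end in edl:
                first = min(int(round(start * rate)), total)
                count = max(0, int(round(end * rate)) - first)
                src.setpos(first)
                out.writeframes(src.readframes(count))
    return out_buf.getvalue()


def make_wav(frames, rate=10):
    """Create a tiny mono, 8-bit WAV in memory for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)
        w.setframerate(rate)
        w.writeframes(frames)
    buf.seek(0)
    return buf


# Ten one-byte frames numbered 0..9, i.e. one second at 10 frames/second.
source = make_wav(bytes(range(10)))
# Keep 0.5s-0.8s first, then 0.0s-0.3s: a deletion plus a reorder.
result = render_edl(source, [(0.5, 0.8), (0.0, 0.3)])
with wave.open(io.BytesIO(result), "rb") as check:
    print(list(check.readframes(check.getnframes())))
```

This makes hard cuts at the span boundaries with no crossfade, so it corresponds to living with approximate edit points and not to a polished edit.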


  6. Film editing is following a similar pattern, where film is transferred to video, edited digitally in video, and then digital edits are used to mechanically cut and reassemble the film frames.

    Also audio editing relative to animation addresses some similar issues, where a script is matched up to a storyboard, and then an actor’s recorded words are matched back to a text script, matched back to a storyboard, matched to animation cells, then all matched-up to a final video.

    For audio, I have a “beat detective” program that will generate reference points in relationship to dynamics in sound–it’s designed for extracting tempos and meters out of a rhythm, but it would recognize the distinctions between pauses and words in a recording of someone speaking.

    So, if you don’t need a literal transcription, you could think about edits in relationship to specific pauses, rather than in relationship to specific words.
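Editing relative to pauses needs no transcription at all. The following is a minimal sketch that finds quiet stretches with a plain amplitude threshold; it is not how any particular beat-detection product works, and the `find_pauses` name, the threshold, and the minimum pause length are assumptions chosen for illustration.

```python
def find_pauses(samples, rate, threshold=500, min_seconds=0.25):
    """Return (start, end) times in seconds where |sample| stays below threshold.

    `samples` is a sequence of signed integer PCM samples (mono).
    Quiet runs shorter than `min_seconds` are ignored so that brief
    gaps inside words are not reported as pauses.
    """
    min_len = int(min_seconds * rate)
    pauses = []
    run_start = None
    for i, s in enumerate(samples):
        quiet = abs(s) < threshold
        if quiet and run_start is None:
            run_start = i  # a quiet run begins here
        elif not quiet and run_start is not None:
            if i - run_start >= min_len:
                pauses.append((run_start / rate, i / rate))
            run_start = None
    # Handle a quiet run that lasts until the end of the recording.
    if run_start is not None and len(samples) - run_start >= min_len:
        pauses.append((run_start / rate, len(samples) / rate))
    return pauses


# Synthetic signal at 100 samples/second: speech, a 0.4s pause,
# speech, a 0.1s gap (too short to count), then speech again.
rate = 100
samples = [1000] * 50 + [0] * 40 + [1000] * 30 + [0] * 10 + [1000] * 20
print(find_pauses(samples, rate))
```

Real recordings have background noise, so the threshold would need tuning per recording; the detected pauses could then serve as the candidate cut points.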

  7. Sure wish that good speaker-independent speech-to-text capability existed. Then all of you audio (and video) enthusiasts on the net could provide transcripts for those of us who are hearing impaired and can’t listen to either.

  8. Doug, “Dan’s idea”??? “Dan’s idea”???? Humph!!!

    Now, on issues less important than Pride of Ownership (as all issues are), some of you commenters are too good at using audio tools. You think they’re easy. Hah! I’ve used a few, and generally use Audacity. As a casual podcast sort of guy, I promise you that the vast majority of computer users would rather cut and paste a text transcript than find the beginning and end of a phrase in Audacity, set markers, listen to find the place in the audio stream where you’d like to put that phrase, and then cut and paste. That’s daunting and time consuming. It’d dramatically lower the bar to the editing of spoken audio files if we could just use something like a text editor.

    Dan’s idea (yes, this one is the estimable Dan’s) of at least displaying the text rendition in sync with the audio waves in, say, Audible (like the lyrics under the music) would be a vast improvement. But letting us just edit the text without having to enter an audio editor would change the paradigm.

    JD, I just want to be double clear: Once we have a text rendition, all we want to let users do is delete words (and phrases and pauses and ums) and move them around, so we definitely don’t want to let users do this in a standard word processor. Also, of course, the word processor would have to maintain the hidden timecode info. So, I think we’d want a super-minimal, homegrown text processor.

  9. Check out PortalVideo.

    Encodes video, ships it up to a host, where people transcribe it for you, then you and/or your client edit the transcript, and by editing the text, the video gets edited.

    You can then grab the edits and load them into Final Cut Pro and finish the edit locally.

    That’s how I understand it. I have not seen this yet, only heard it described by Len Sitomer from Portal Video.

  10. David,
    As someone who first learned to edit audio on 1/4 inch tape with razor blades and scotch tape, my first reaction to this was one of horror, thinking of the kind of awful chopped up audio one could potentially produce with such a tool. It might engender a new kind of weird absurdist audio poetry, but not listenable podcasts based on interviews.

    But I can see how restrained use as Dan Bricklin’s comment describes could be useful (i.e., just the ability to cut out the few egregious waste-of-time, profane or irrelevant chunks out of an otherwise ok interview).

    The similar tool I want first (maybe it exists? anyone know??) is something that allows me to quote audio as easily as the transcript, (i.e., highlight, copy and paste) so in addition to sending people to the link for the whole NPR piece I’m critiquing (which many of them won’t care about) I can cut and paste the one interesting/offending sentence into my blog and they can click to hear it.
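Quoting audio by highlighting transcript text comes down to looking up the time span of the highlighted words. Here is a minimal sketch, assuming a transcript stored as (text, start, end) tuples with times in seconds; the `find_quote_span` name and the sample transcript are hypothetical.

```python
def find_quote_span(words, quote):
    """Return (start, end) seconds of the first occurrence of `quote`.

    `words` is a list of (text, start, end) tuples from a timecoded
    transcript. Matching ignores case and surrounding punctuation.
    Returns None when the quote is not found.
    """
    def norm(token):
        return token.strip(".,!?;:\"'").lower()

    target = [norm(t) for t in quote.split()]
    texts = [norm(t) for t, _, _ in words]
    n = len(target)
    if n == 0:
        return None
    for i in range(len(texts) - n + 1):
        if texts[i:i + n] == target:
            # Span runs from the first matched word's start to the last one's end.
            return (words[i][1], words[i + n - 1][2])
    return None


# Hypothetical timecoded transcript.
transcript = [
    ("Recording", 10.0, 10.6),
    ("is", 10.6, 10.8),
    ("the", 10.8, 10.9),
    ("hard", 10.9, 11.3),
    ("part.", 11.3, 11.8),
]
print(find_quote_span(transcript, "the hard part"))
```

The returned span is all a player or clipping tool would need in order to play or extract just the quoted sentence.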
