What is Corpus Linguistics?




Date: 23.06.2013



  • Prepared by: Group A & B Students:

  • Muhammed Bakir Suleiman.

  • Shaswar Kamal

  • Jihan Nizammadin

  • Arsalan Ali

  • Ammanj Hassan


What is corpus linguistics?


  • Corpus linguistics has come to enjoy much greater popularity, both as:

  • A means to explore actual patterns of language use, and

  • A tool for developing materials for classroom language instruction.

  • Definition according to Schmitt

  • Corpus linguistics uses large collections of both spoken and written natural texts (corpora or corpuses, singular corpus) that are stored on computers.



  • According to Crystal, a corpus is:

  • A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language; work of this kind is known as corpus linguistics (Crystal, 2003: 112).

  • By using a variety of computer-based tools, corpus linguistics can explore different questions about language use. One of the major contributions of corpus linguistics is in the area of exploring patterns of language use.



  • Corpus linguistics provides an extremely powerful tool for the analysis of natural language and can provide tremendous insights into how language use varies in different situations, such as spoken versus written, or formal versus casual conversation.

  • The term ‘corpus’ in its present-day sense is pretty much synonymous with computerized corpora and methods, although before the advent of computers many empirical linguists who were interested in function and use did essentially what we now call corpus linguistics.




  • An empirical approach to linguistic analysis is one based on naturally occurring spoken or written data as opposed to an approach that gives priority to introspection.

  • Advances in technology have led to a number of advantages for corpus linguistics, including the following:

  • Collection of ever larger language samples.

  • Much faster and more efficient text processing and access.

  • The availability of easy-to-learn computer resources for linguistic analysis.



  • As a result of these advances, four features are typically seen as characteristic of corpus-based analyses of language:

  • It is empirical, analyzing the actual patterns of use in natural texts.

  • It utilizes a large and principled collection of natural texts, known as a ‘corpus’, as the basis for analysis.

  • It makes extensive use of computers for analysis, using both automatic and interactive techniques.

  • It depends on both quantitative and qualitative analytical techniques (from Biber, Conrad and Reppen, 1998: 4).




  • A corpus refers to a large principled collection of natural texts. The use of natural texts means that language has been collected from naturally occurring sources rather than from surveys or questionnaires.

  • The text collection process for building a corpus needs to be principled, so as to ensure representativeness and balance. The linguistic features or research questions being investigated will shape the collection of texts used in creating the corpus.



  • For example, if the research focus is to characterize the language used in business letters, the researcher would need to collect a representative sample of business letters. After considering the task of representing all of the various types of business and various kinds of correspondence that are included in the category of ‘business letters’, the researcher might decide to focus on how small businesses communicate with each other. The researcher can then set about the task of contacting small businesses and collecting inter-office communication.



Corpus Design and Compilation


  • A corpus is a large and principled collection of texts stored in electronic format.

  • An early standard size set by the creators of the Brown Corpus was one million words, and there is a general assumption that larger corpora are more valuable.

  • Another feature of modern-day corpora is that they are usually made available to other researchers, most commonly for a modest fee and occasionally free of charge. This enables researchers all over the world to access the same sets of data, which first encourages a higher degree of accountability in data analysis and secondly permits collaborative work and follow-up studies by different researchers.



  • Because such a wide range of corpora is accessible to individual teachers and researchers, it is not necessary for those interested in corpus linguistics and its applications to build their own corpus. It is nevertheless important to know how corpora are designed and compiled in order to evaluate existing corpora and understand what sorts of analysis they are best suited for.

  •  Types of Corpora

  • General corpora, such as the Brown Corpus, the LOB Corpus or the BNC, aim to represent language in its broadest sense and to serve as a widely available resource for baseline or comparative studies of general linguistic features.



  • General corpora are designed to be quite large. For example, the BNC, designed in the 1990s, contains 100 million words, and the American National Corpus, which is in the planning stages, is attempting to replicate the BNC’s model.

  • The early general corpora like Brown and LOB, at a mere one million words, seem tiny by today’s standards, but they continue to be used by both applied and computational linguists, and research has shown that one million words is sufficient to obtain reliable, generalizable results for many research questions.



  • A general corpus is designed to be balanced and to include language samples from a wide range of registers or genres, including both fiction and nonfiction in all their diversity.

  • Most of the early general corpora were limited to written language, but because of advances in technology and increasing interest in spoken language among linguists, many of the modern general corpora include a spoken component, which similarly encompasses a wide variety of speech types, from casual conversation among friends and family to academic lectures and national radio broadcasts.



  • Because written texts are vastly easier and cheaper to compile than transcripts of speech, very few of the large corpora are balanced in terms of speech and writing. The compilers of the BNC had originally planned to include equal amounts of speech and writing, but eventually settled for a spoken component of ten million words, or ten percent of the total.

  • A few corpora exclusively dedicated to spoken discourse have been developed, but they are inevitably much smaller than modern general corpora like the BNC; one example is the Cambridge and Nottingham Corpus of Discourse in English (CANCODE).




  • Specialized corpora, those designed with more specific research goals in mind, may be the most crucial growth area for corpus linguistics, as researchers increasingly recognize the importance of register-specific descriptions and investigations of language.

  • Specialized corpora may include both spoken and written components, as do the International Corpus of English (ICE), a corpus designed for the study of national varieties of English, and the TOEFL-2000 Spoken and Written Academic Language Corpus.



  • A specialized corpus focuses on a particular spoken or written variety of language. Examples include historical corpora such as the ARCHER corpus, and corpora of newspaper writing, fiction or academic prose.

  • Registers of speech that have been the focus of specialized spoken corpora include academic speech (the Michigan Corpus of Academic Spoken English; MICASE), teenage language (COLT) and child language (the CHILDES database).

  • The learner corpus, which contains spoken or written language produced by language learners, is becoming increasingly important for language teachers. The best-known example is the International Corpus of Learner English (ICLE).



Issues in Corpus Design

  • One of the most important factors in corpus linguistics is the design of the corpus (Biber, 1990). This factor affects all of the analyses that can be carried out with the corpus and has serious implications for the reliability of the results.

  • The composition of the corpus should reflect the anticipated research goals.

  • A corpus that is intended to be used for exploring lexical questions needs to be very large to allow for accurate representation of a large number of words and of the different senses, or meanings, that a word might have.




  • It is essential that the overall design of the corpus reflects the issues being explored. For example, if a researcher is interested in comparing patterns of language found in spoken and written discourse, the corpus has to encompass a range of possible spoken and written texts, so that the information derived from the corpus accurately reflects the variation possible in the patterns being compared across the two registers.

  • A well-designed corpus should aim to be representative of the types of language included in it, but there are many different ways to conceive of and justify representativeness.



    • 1- You can try to be representative primarily of different registers (for example, fiction, non-fiction, casual conversation, service encounters, broadcast speech) as well as discourse modes (monologic, dialogic, multi-party interactive) and topics (national versus local news, arts versus science).
    • 2- Another category of representativeness involves the demographics of the speakers or writers (nationality, gender, age, education level, social class, native language/dialect).
    • 3- A third issue to consider in devising a representative sample is whether or not it should be based on production or reception. For example, e-mail messages constitute a type of writing produced by many people, whereas best sellers and major newspapers are produced by relatively few people, but read, or consumed, by many.


  • All these issues must be weighed when deciding how much of each category (genre, topic, speaker type, etc.) to include. It is possible that certain aspects of all of these categories will be important for creating a balanced, representative corpus. However, striving for representativeness in too many categories would necessitate an enormous corpus in order for each category to be meaningful. Once the categories and the target number of texts and words from each category have been decided upon, it is important to incorporate a method of randomizing the texts or speakers and speech situations in order to avoid sampling bias on the part of the compilers.
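  • The randomizing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_by_category`, the category names, the text identifiers and the target counts are all hypothetical, and a real project would usually also set word-count targets per category.

```python
import random

def sample_by_category(candidates, targets, seed=0):
    """Randomly pick the target number of texts from each category.

    candidates: dict mapping category name -> list of text identifiers
    targets:    dict mapping category name -> number of texts wanted
    """
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    selected = {}
    for category, wanted in targets.items():
        pool = candidates[category]
        if wanted > len(pool):
            raise ValueError(f"not enough candidate texts for {category!r}")
        # rng.sample draws without replacement, so no text is chosen twice
        selected[category] = sorted(rng.sample(pool, wanted))
    return selected

# Hypothetical pool of available texts, grouped by register
candidates = {
    "fiction": [f"fic_{i:03d}" for i in range(40)],
    "news": [f"news_{i:03d}" for i in range(60)],
    "conversation": [f"conv_{i:03d}" for i in range(25)],
}
targets = {"fiction": 5, "news": 5, "conversation": 5}

chosen = sample_by_category(candidates, targets, seed=42)
for category, texts in chosen.items():
    print(category, texts)
```

  • Leaving the choice to a random number generator, rather than to the compilers, is what guards against the sampling bias mentioned above; recording the seed lets other researchers reproduce the selection.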



  • In thinking about the research goals of a corpus, compilers must bear in mind the intended distribution of the corpus.

  • If access to the corpus is to be limited to a relatively small group of researchers, their own research agenda will be the only factor influencing corpus design decisions.

  • If the corpus is to be freely or widely available, a decision might be made to include more categories of information, in anticipation of the goals of other researchers who might use the corpus.




  • Of course no corpus can be everything to everyone; the point is that in creating more widely distributed resources, it is worthwhile to think about potential future users during the design phase. Many of the decisions made about the design of a corpus have to do with practical considerations of funding and time.

  • Some of the questions that need to be addressed are:

  • How much time can be allotted to the project?

  • Is there a dedicated staff of corpus compilers, or are they full-time academics?

  • How much funding is available to support the collection and compilation of the corpus?

  • In the case of a spoken corpus, budget is especially critical because of the tremendous amount of time and skilled labour involved in transcribing speech accurately and consistently.



Corpus Compilation

  • In creating a corpus, data collection involves obtaining or creating electronic versions of the target texts and storing and organizing them.

  • Data collection differs for the two types of corpus:

  • 1- Written corpora.

  • 2- Spoken corpora.

  • Written Corpora

  • Data collection for a written corpus most commonly means using a scanner and optical character recognition (OCR) software to scan paper documents into electronic text files.



  • Materials for a written corpus may also be keyboarded manually (e.g. corpora of handwritten letters).

  • Optical character recognition is not error-free; therefore, when documents are scanned, some degree of manual proofreading and error correction is necessary.

  • The tremendous wealth of resources now available on the World Wide Web provides an additional option for the collection of some types of written corpora or some categories of documents. For example, most newspapers and many popular periodicals are now produced in both print and electronic versions.



  • Other types of documents readily available on the web, such as government documents, may make up small specialized corpora or sub-sections of larger corpora.

  • There is a danger, however, in relying exclusively on electronically produced texts: it is possible that the format itself engenders particular linguistic characteristics that differentiate the language of electronic texts from that of texts produced for print.



2- Spoken corpora

  • The data collection phase of building a spoken corpus is lengthy and expensive. It involves the following steps:

  • 1- The first step is to decide on a transcription system.

  • Most spoken corpora use an orthographic transcription system that does not attempt to capture prosodic details or phonetic variation.

  • 2- The second step is to decide how the interactional characteristics of the speech will be represented in the transcripts: overlapping speech, backchannels, pauses and non-verbal contextual events are all features of interactive speech that may be represented to varying degrees of detail in a spoken corpus.



  • Permission to use the data must also be obtained. This usually involves informing speakers or copyright owners about the purposes of the corpus, how and to whom it will be available and, in the case of spoken corpora, what measures will be taken to ensure anonymity.

  • It is therefore usually impractical to use existing recordings or transcripts as part of a new spoken corpus, unless the speakers can still be contacted.







  • Word lists derived from corpora can be useful for vocabulary instruction and test development.

  • In addition to frequency lists, concordancing packages can provide additional information about lexical co-occurrence patterns.

  • A concordance program can also provide information about words that tend to occur together in the corpus.

  • Words that commonly occur with, or in the vicinity of, a target word (with greater probability than random chance) are called ‘collocates’, and the resulting sequences or sets of words are called ‘collocations’, which provide important information about grammatical and semantic patterns of use for individual lexical items.

  • Through the use of corpus analyses we can discover patterns of use that previously were unnoticed.
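  • The three tools mentioned above (a frequency list, a concordance and a list of collocates) can be sketched with the Python standard library. This is a minimal illustration, not how any particular concordancing package works: the tokenizer is deliberately crude, the sample sentences are invented, and the collocate counts are raw co-occurrence counts, so deciding whether a pairing is more frequent than chance would need an association measure on top.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, target, width=3):
    """Return key-word-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def collocates(tokens, target, window=2):
    """Count words occurring within `window` tokens of the target word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample text standing in for a corpus
sample = (
    "She made a strong case for strong tea. "
    "He drinks strong tea every morning, and strong coffee at night."
)
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
for line in concordance(tokens, "strong"):
    print(line)
print(collocates(tokens, "strong", window=1).most_common(3))
```

  • In this toy sample ‘tea’ turns up next to ‘strong’ more often than any other word, which is the kind of pattern a collocation listing is meant to surface.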



  • Words and grammatical structures that seem synonymous often have strong patterns of association or preferences for use with certain structures.

  • For example, begin and start have the same grammatical potential, but corpus-based investigation has shown that start has a strong preference for an intransitive pattern.

  • Lexical phrases or lexical bundles are another area of collocational studies that has come to light through corpus linguistics.

  • Like collocations, lexical phrases or lexical bundles are patterns that occur with a greater than random frequency.
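  • One simple way to look for candidate lexical bundles is to count recurring word sequences of a fixed length. The sketch below does this for three-word sequences; the function name `lexical_bundles`, the invented utterances and the threshold of two occurrences are assumptions made for illustration, and published studies typically use much larger corpora and stricter frequency and dispersion criteria.

```python
from collections import Counter

def lexical_bundles(utterances, n=3, min_count=2):
    """Return n-word sequences that occur at least `min_count` times.

    Sequences are counted within each utterance only, so no bundle
    spans the boundary between two utterances.
    """
    counts = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [(bundle, c) for bundle, c in counts.most_common() if c >= min_count]

# Invented utterances standing in for a spoken corpus
utterances = [
    "I do not know what you mean",
    "I do not think so",
    "well I do not know what to say",
]

bundles = lexical_bundles(utterances, n=3, min_count=2)
for bundle, count in bundles:
    print(count, bundle)
```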



  • Markup: the use of codes to provide additional information about the origins, authors, speakers, structure or contents of texts.

  • Structural Markup: refers to the use of codes in the texts to identify structural features of the text. For example, in a written corpus, it may be desirable to identify and code structural entities such as titles, authors, paragraphs, subheadings, chapters.



  • In a spoken corpus, turns and speakers are almost always identified and coded, but there are a number of other features that may be encoded as well, including, for example, contextual events or paralinguistic features.

  • Header: a block of information attached to the beginning of a text, or stored in a separate database, that describes the contents and creation of each text.



  • The information that may be encoded in a header includes, for spoken corpora, demographic information about the speakers (such as gender, social class, occupation, age, native language or dialect), when and where the event took place, relationships among the participants and so forth. For written corpora, demographic information about the author(s) as well as title and publication details may be encoded in a header.




  • For both spoken and written corpora, headers sometimes include classification of the text into categories, such as register, genre, topic domain, discourse mode or formality.
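  • To make the ideas of a header and structural markup concrete, here is a small sketch of a marked-up spoken text and how it might be read with Python's standard XML parser. The markup scheme is an assumption: every element and attribute name (`text`, `header`, `speaker`, `turn`, `event` and so on) is invented for this example and does not follow any particular corpus encoding standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup scheme: the element and attribute names below are
# invented for illustration and do not follow any particular corpus standard.
SAMPLE = """
<text id="conv001">
  <header>
    <register>casual conversation</register>
    <setting place="kitchen"/>
    <speaker id="S1" gender="female" age="34"/>
    <speaker id="S2" gender="male" age="36"/>
  </header>
  <body>
    <turn who="S1">did you feed the goat</turn>
    <turn who="S2">not yet <event desc="door closes"/> I will do it now</turn>
  </body>
</text>
"""

root = ET.fromstring(SAMPLE.strip())

# Header: classification of the text and demographic details of the speakers
register = root.findtext("header/register")
speakers = {s.get("id"): dict(s.attrib) for s in root.iter("speaker")}

# Structural markup: each turn is attributed to a speaker
turns = []
for turn in root.iter("turn"):
    # itertext() skips the tags themselves; split/join tidies the spacing
    words = " ".join("".join(turn.itertext()).split())
    turns.append((turn.get("who"), words))

# Contextual events encoded inside the turns
events = [e.get("desc") for e in root.iter("event")]

print(register)
print(speakers["S1"]["gender"], speakers["S2"]["age"])
print(turns)
print(events)
```

  • Keeping the header separate from the body means that an analysis can first select texts by register or speaker attributes and only then look at the words themselves.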



  • Annotation: a number of different kinds of linguistic processing or annotation can be carried out to make the corpus a more powerful resource.

  • Part-of-speech tagging: the most common kind of linguistic annotation. It involves assigning a grammatical category tag to each word in the corpus. For example, the sentence ‘A goat can eat shoes’ could be coded as follows: A (indefinite article) goat (noun, singular) can (modal) eat (main verb) shoes (noun, plural).
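  • In a corpus file, tags of this kind are usually stored alongside the words in some machine-readable form. The sketch below shows one simple possibility, a word and its tag joined by an underscore. The tags are assigned by hand from the example sentence, and the short labels (ART, N-SG, MODAL, VERB, N-PL) and the underscore format are assumptions made for illustration, not an established tagset.

```python
# Hand-assigned tags copied from the example sentence above; the short tag
# labels (ART, N-SG, ...) are invented here and are not a standard tagset.
tagged_sentence = [
    ("A", "ART"),        # indefinite article
    ("goat", "N-SG"),    # noun, singular
    ("can", "MODAL"),    # modal
    ("eat", "VERB"),     # main verb
    ("shoes", "N-PL"),   # noun, plural
]

def to_inline(tagged, sep="_"):
    """Write a tagged sentence as word_TAG items separated by spaces."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in tagged)

def from_inline(line, sep="_"):
    """Read a word_TAG line back into (word, tag) pairs."""
    return [tuple(item.rsplit(sep, 1)) for item in line.split()]

inline = to_inline(tagged_sentence)
print(inline)  # A_ART goat_N-SG can_MODAL eat_VERB shoes_N-PL
print(from_inline(inline) == tagged_sentence)  # True
```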



  • Other types of annotation: prosodic and phonetic annotation are not uncommon; syntactic parsing is much less common and is used especially, though not exclusively, by computational linguists.



  • 1- A tagged corpus allows researchers to explore and answer different types of questions.

  • 2- It shows which grammatical structures co-occur.

  • 3- It addresses the problem of words that have multiple meanings or functions.
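  • Points 2 and 3 can be illustrated with a tiny hand-tagged sample. The sentences, the tag labels and the function names below are all made up for this sketch; the point is only that once every word carries a tag, a program can tell the modal ‘can’ from the noun ‘can’ and can count which grammatical categories follow one another.

```python
from collections import Counter

# A tiny hand-tagged sample; tags and sentences are made up for illustration.
corpus = [
    [("A", "ART"), ("goat", "NOUN"), ("can", "MODAL"), ("eat", "VERB"), ("shoes", "NOUN")],
    [("She", "PRON"), ("opened", "VERB"), ("a", "ART"), ("can", "NOUN")],
    [("They", "PRON"), ("can", "MODAL"), ("start", "VERB")],
]

def tag_counts_for(word, sentences):
    """Count how often `word` appears with each tag (multiple functions of one form)."""
    return Counter(
        tag for sent in sentences for w, tag in sent if w.lower() == word
    )

def tag_bigrams(sentences):
    """Count adjacent tag pairs, a crude view of which categories co-occur."""
    pairs = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        pairs.update(zip(tags, tags[1:]))
    return pairs

print(tag_counts_for("can", corpus))
print(tag_bigrams(corpus).most_common(2))
```

  • A search on the untagged text would return all three instances of ‘can’ together; with tags, the two modal uses and the one noun use can be retrieved separately.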


































