Groups > comp.lang.python > #64673 > unrolled thread
| Started by | theguy <kvxdelta@gmail.com> |
|---|---|
| First post | 2014-01-24 02:05 -0800 |
| Last post | 2014-01-28 17:31 +1000 |
| Articles | 17 — 14 participants |
Need Help with Programming Science Project theguy <kvxdelta@gmail.com> - 2014-01-24 02:05 -0800
Re: Need Help with Programming Science Project Peter Otten <__peter__@web.de> - 2014-01-24 12:07 +0100
Re: Need Help with Programming Science Project bob gailer <bgailer@gmail.com> - 2014-01-24 18:38 -0500
Re: Need Help with Programming Science Project Chris Angelico <rosuav@gmail.com> - 2014-01-25 11:34 +1100
Re: Need Help with Programming Science Project Ben Finney <ben+python@benfinney.id.au> - 2014-01-25 11:59 +1100
Re: Need Help with Programming Science Project Roy Smith <roy@panix.com> - 2014-01-24 20:38 -0500
Re: Need Help with Programming Science Project Terry Reedy <tjreedy@udel.edu> - 2014-01-24 20:10 -0500
Re: Need Help with Programming Science Project kvxdelta@gmail.com - 2014-01-24 18:42 -0800
Re: Need Help with Programming Science Project Rustom Mody <rustompmody@gmail.com> - 2014-01-24 19:06 -0800
Re: Need Help with Programming Science Project theguy <kvxdelta@gmail.com> - 2014-01-24 20:58 -0800
Re: Need Help with Programming Science Project Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-01-25 20:30 +1300
Re: Need Help with Programming Science Project Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-25 11:31 +0000
Re: Need Help with Programming Science Project Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-01-25 09:42 -0500
Re: Need Help with Programming Science Project Rustom Mody <rustompmody@gmail.com> - 2014-01-25 08:15 -0800
Re: Need Help with Programming Science Project Dave Angel <davea@davea.name> - 2014-01-25 01:38 -0500
Re: Need Help with Programming Science Project Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-01-25 20:25 +1300
Re: Need Help with Programming Science Project alex23 <wuwei23@gmail.com> - 2014-01-28 17:31 +1000
| From | theguy <kvxdelta@gmail.com> |
|---|---|
| Date | 2014-01-24 02:05 -0800 |
| Subject | Need Help with Programming Science Project |
| Message-ID | <b1831c17-d9f9-4576-9488-7463e76ccf3b@googlegroups.com> |
I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given. I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding. I would post the code, but I don't know if it's fine to put it here, as it contains pieces from books. I do believe that would go against copyright laws. If I can figure out a way to put it in without the bits from the stories, then I'll do so, but as of now, any help is appreciated. I understand I'm not exactly making it easy since I'm not putting up any code, but I'm kind of desperate for help here, as I can't seem to find anybody or anything else helpful in any way. Thank you.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-01-24 12:07 +0100 |
| Message-ID | <mailman.5934.1390561622.18130.python-list@python.org> |
| In reply to | #64673 |
theguy wrote:
> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws. If I can figure
> out a way to put it in without the bits from the stories, then I'll do so,
> but as of now, any help is appreciated. I understand I'm not exactly
> making it easy since I'm not putting up any code, but I'm kind of desperate
> for help here, as I can't seem to find anybody or anything else helpful
> in any way. Thank you.
If I were to speculate what your program might look like:

text_samples = {
    "Suzanne Collins": "... some text by collins ...",
    "J. K. Rowling": "... some text by rowling ...",
    #...
}
unknown = "... sample text by unknown author ..."

def calc_match(text1, text2):
    import random
    return random.random()

guessed_author = None
guessed_match = None
for author, text in text_samples.items():
    match = calc_match(unknown, text)
    print(author, match)
    if guessed_author is None or match > guessed_match:
        guessed_author = author
        guessed_match = match

print("The author is", guessed_author)

The important part of this script is neither the text samples nor the loop to
determine the best match -- it's the algorithm used to determine how well
two texts match.

In the above example that algorithm is encapsulated in the calc_match()
function, and it's really bad: it gives you random numbers between 0 and 1.

For us to help you, it should be sufficient to post the analog of this
function in your code together with a description in plain English of how it
is meant to calculate the similarity between two texts.

Alternatively, instead of the copyrighted texts, grab text samples with
expired copyright from Project Gutenberg.

Make sure that the resulting post is as short as possible -- long text
samples don't make the post clearer than short ones.
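
As an illustration of what a non-random calc_match() could look like, here is a minimal sketch that compares two texts by average word length and average words per sentence. It assumes Python 3; the choice of features, the helper name text_features(), and the scoring formula are assumptions made for this example, not anything specified in the thread.

```python
import re


def text_features(text):
    """Return (average word length, average words per sentence) for a text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    return (avg_word_len, avg_sentence_len)


def calc_match(text1, text2):
    """Return a similarity score in (0, 1]; 1.0 means identical features."""
    f1 = text_features(text1)
    f2 = text_features(text2)
    # Sum of absolute feature differences: 0 when the features agree.
    distance = sum(abs(a - b) for a, b in zip(f1, f2))
    return 1.0 / (1.0 + distance)
```

A function of this shape could replace the random calc_match() in the loop above without changing anything else, since a higher score still means a better match.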
| From | bob gailer <bgailer@gmail.com> |
|---|---|
| Date | 2014-01-24 18:38 -0500 |
| Message-ID | <mailman.5956.1390609805.18130.python-list@python.org> |
| In reply to | #64673 |
On 1/24/2014 5:05 AM, theguy wrote:
> I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given. I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding. I would post the code, but I don't know if it's fine to put it here, as it contains pieces from books. I do believe that would go against copyright laws.

AFAIK copyright laws apply to reproducing something for profit. I doubt that posting it here will matter.

In any case do post your code; you could trim the fat out of the text if you need to.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-25 11:34 +1100 |
| Message-ID | <mailman.5957.1390610061.18130.python-list@python.org> |
| In reply to | #64673 |
On Sat, Jan 25, 2014 at 10:38 AM, bob gailer <bgailer@gmail.com> wrote:
> On 1/24/2014 5:05 AM, theguy wrote:
>>
>> I have a science project that involves designing a program which can
>> examine a bit of text with the author's name given, then figure out who the
>> author is if another piece of example text without the name is given. I so
>> far have three different authors in the program and have already put in the
>> example text but for some reason, the program always leans toward one
>> specific author, Suzanne Collins, no matter what insane number I try to put
>> in or how much I tinker with the coding. I would post the code, but I don't
>> know if it's fine to put it here, as it contains pieces from books. I do
>> believe that would go against copyright laws.
>
> AFAIK copyright laws apply to reproducing something for profit. I doubt that
> posting it here will matter.

Incorrect; posting not-for-profit can still be a violation of copyright. But as Peter said, the text itself isn't critical. Post with placeholder text, as he suggested, and we can look at the code.

ChrisA
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2014-01-25 11:59 +1100 |
| Message-ID | <mailman.5959.1390611612.18130.python-list@python.org> |
| In reply to | #64673 |
bob gailer <bgailer@gmail.com> writes:

> On 1/24/2014 5:05 AM, theguy wrote:
> > I would post the code, but I don't know if it's fine to put it here,
> > as it contains pieces from books. I do believe that would go against
> > copyright laws.
> AFAIK copyright laws apply to reproducing something for profit.

That's a common misconception that has never been true.

<URL:http://www.faqs.org/faqs/law/copyright/myths/part1/>

Copyright is a legal monopoly in a work, reserving a large set of actions to the copyright holders. Without license from the copyright holders, or an exemption under the law, you cannot legally perform those actions.

Paying money may sometimes help one acquire a license to perform some reserved actions (though frequently the license is severely restricted, and frequently the license you need isn't available for any price).

But neither “I'm not seeking a profit” nor “I didn't get any money for it” is grounds for a copyright exemption under any jurisdiction I've ever heard of.

--
“People are very open-minded about new things, as long as they're exactly like the old ones.” -- Charles F. Kettering
Ben Finney
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-01-24 20:38 -0500 |
| Message-ID | <roy-E86E5A.20385224012014@news.panix.com> |
| In reply to | #64711 |
In article <mailman.5959.1390611612.18130.python-list@python.org>, Ben Finney <ben+python@benfinney.id.au> wrote:

> bob gailer <bgailer@gmail.com> writes:
>
> > On 1/24/2014 5:05 AM, theguy wrote:
> > > I would post the code, but I don't know if it's fine to put it here,
> > > as it contains pieces from books. I do believe that would go against
> > > copyright laws.
>
> > AFAIK copyright laws apply to reproducing something for profit.
>
> That's a common misconception that has never been true.
>
> <URL:http://www.faqs.org/faqs/law/copyright/myths/part1/>
>
> Copyright is a legal monopoly in a work, reserving a large set of
> actions to the copyright holders. Without license from the copyright
> holders, or an exemption under the law, you cannot legally perform those
> actions.

[The rest of this post is based on my "I am not a lawyer" understanding of the law. Also, this is based on US copyright law; things may be different elsewhere, and I haven't the foggiest idea what law applies to an international forum such as this]

On the other hand (where Ben Finney's post is the first hand), there is the Fair Use Doctrine (FUD), which grants certain exemptions. The US Copyright Office has a page (http://www.copyright.gov/fls/fl102.html) about this.

As a real-life example, I believe I can safely invoke the FUD to quote the leading paragraphs from today's New York Times and New York Post articles about the same event and give their Flesch-Kincaid Reading Ease and Grade Level scores, if I was comparing the writing style of the two newspapers:

----------------------------------------------
NY Times:

The crime gripped the public’s imagination, for both its magnitude and its moxie: In the predawn hours of Dec. 11, 1978, a group of masked gunmen seized about $6 million in cash and jewels from a cargo building at Kennedy International Airport.

Reading Ease Score: 56.6
Grade Level: 10.6
----------------------------------------------
NY Post:

On Dec. 11, 1978, armed mobsters stole $5 million in cash and nearly $1 million in jewels from a Lufthansa airlines vault at JFK Airport, in what would be for decades the biggest-ever heist on US soil.

Reading Ease Score: 76.2
Grade Level: 7.3
----------------------------------------------

The scores above were computed by http://www.readability-score.com/

In my opinion, this meets all of the requirements of the FUD. I'm quoting short passages, and using them to critique the writing styles of the two papers.

In the OP's case, he's analyzing published works as input to a text analysis algorithm. In my personal opinion, posting samples of those texts, for the purpose of discussing how his algorithm works, would be well within the bounds of Fair Use.
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-01-24 20:10 -0500 |
| Message-ID | <mailman.5962.1390612504.18130.python-list@python.org> |
| In reply to | #64673 |
On 1/24/2014 7:34 PM, Chris Angelico wrote:
> On Sat, Jan 25, 2014 at 10:38 AM, bob gailer <bgailer@gmail.com> wrote:
>> On 1/24/2014 5:05 AM, theguy wrote:
>>>
>>> I have a science project that involves designing a program which can
>>> examine a bit of text with the author's name given, then figure out who the
>>> author is if another piece of example text without the name is given. I so
>>> far have three different authors in the program and have already put in the
>>> example text but for some reason, the program always leans toward one
>>> specific author, Suzanne Collins, no matter what insane number I try to put
>>> in or how much I tinker with the coding. I would post the code, but I don't
>>> know if it's fine to put it here, as it contains pieces from books. I do
>>> believe that would go against copyright laws.
>>
>> AFAIK copyright laws apply to reproducing something for profit. I doubt that
>> posting it here will matter.
>
> Incorrect; posting not-for-profit can still be a violation of
> copyright. But as Peter said, the text itself isn't critical. Post
> with placeholder text, as he suggested, and we can look at the code.

In the US, short quotations are allowed for 'fair use'.

--
Terry Jan Reedy
| From | kvxdelta@gmail.com |
|---|---|
| Date | 2014-01-24 18:42 -0800 |
| Message-ID | <ba404807-08a2-48e6-bfa2-dd8ea8d1acca@googlegroups.com> |
| In reply to | #64673 |
Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
#D.J. Machale - Pendragon
#Pendragon: Book Six - The Rivers of Zadaa
#Page 98
#The sample sentences for this author. I put each sentence into a separate variable because I knew no other way to divide the sentence. I also removed spaces so they wouldn't be counted.
djmachale_1 = 'WheretonowIaskedLoor'
djmachale_2 = 'ToaplacewherewewillnotbedisturbedbyBatuorRokadorsheanswered'
djmachale_3 = 'WelefttheroomfollowingLoorthroughthetwistingtunnelthatIhadwalkedthroughseveraltimesbeforeonvisitingtoZadaa'
djmachale_4 = 'Shortlyweleftthesmallertunneltoenterthehugecavernthatonceheldanundergroundriver'
djmachale_5 = 'WhenSpaderandIwerefirstheretherewasafour-storywaterfallononesideoftheimmensecavernthatfedadeepragingriver'
djmachale_6 = 'Nowtherewasonlyadribbleofwaterthatfellfromarockymouthintoapathetictrickleofastreamatthebottomofthemostlydryriverbed'
djmachale_7 = 'WhathappenedhereAlderasked'
djmachale_8 = 'ThereisalottotellLooranswered'
djmachale_9 = 'Later'
djmachale_10 = 'Alderacceptedthat'
djmachale_11 = 'Hewasaneasyguy'
djmachale_12 = 'Loorledustotheopeningthatwasoncehiddenbehindthewaterfallbutwasnowinplainsight'
djmachale_13 = 'Weclimbedafewstonestairssteppedthroughtheportalandenteredaroomthatheldthewater-controldeviceIhavedescribedtoyoubefore'
djmachale_14 = 'Toremindyouguysthisthinglookedlikeoneofthosegiantpipe-organsthatyouseeinchurch'
djmachale_15 = 'Butthesepipesranhorizontallydisappearingintotherockwalloneithersideoftheroom'
djmachale_16 = 'Therewasaplatforminfrontofitthatheldanamazingarrayofswitchesandvalves'
djmachale_17 = 'WhenIfirstcameheretherewasaRokadorengineeronthatplatformfeverishlyworkingthecontrolslikeanexpert'
djmachale_18 = 'Ihadnoideawhatthedevicedidotherthanknowingithadsomethingtodowithcontrollingtheflowofwaterfromtherivers'
djmachale_19 = 'Theguyhadmapsanddiagramsthathereferredtowhilehequicklymadeadjustmentsandtoggledswitches'
djmachale_20 = 'Nowtheplatformwasempty'
#djmwords contains the amount of words in each sentence
#djmwords_total is the total word count between all the samples
djmwords = [6, 15, 22, 17, 26, 29, 5, 8, 1, 3, 5, 19, 25, 18, 16, 17, 20, 25, 18, 5]
djmwords_total = sum(djmwords)
avgWORDS_per_SENTENCE_DJMACHALE = (djmwords_total/20)
#Each variable becomes the total number of letters in each sentence
djmachale_1 = len(djmachale_1)
djmachale_2 = len(djmachale_2)
djmachale_3 = len(djmachale_3)
djmachale_4 = len(djmachale_4)
djmachale_5 = len(djmachale_5)
djmachale_6 = len(djmachale_6)
djmachale_7 = len(djmachale_7)
djmachale_8 = len(djmachale_8)
djmachale_9 = len(djmachale_9)
djmachale_10 = len(djmachale_10)
djmachale_11 = len(djmachale_11)
djmachale_12 = len(djmachale_12)
djmachale_13 = len(djmachale_13)
djmachale_14 = len(djmachale_14)
djmachale_15 = len(djmachale_15)
djmachale_16 = len(djmachale_16)
djmachale_17 = len(djmachale_17)
djmachale_18 = len(djmachale_18)
djmachale_19 = len(djmachale_19)
djmachale_20 = len(djmachale_20)
#DJMACHALE_TOTAL is the total letter count between all the samples
DJ_Machale = [djmachale_1, djmachale_2, djmachale_3, djmachale_4, djmachale_5, djmachale_6, djmachale_7, djmachale_8, djmachale_9, djmachale_10, djmachale_11, djmachale_12, djmachale_13, djmachale_14, djmachale_15, djmachale_16, djmachale_17, djmachale_18, djmachale_19, djmachale_20]
DJMACHALE_TOTAL = (djmachale_1+djmachale_2+djmachale_3+djmachale_4+djmachale_5+djmachale_6+djmachale_7+djmachale_8+djmachale_9+djmachale_10+djmachale_11+djmachale_12+djmachale_13+djmachale_14+djmachale_15+djmachale_16+djmachale_17+djmachale_18+djmachale_19+djmachale_20)
avgLETTERS_per_SENTENCE_DJMACHALE = (DJMACHALE_TOTAL/20)
avgLETTERS_per_WORD_DJMACHALE = (DJMACHALE_TOTAL/djmwords_total)
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Suzanne Collins - The Hunger Games
#The Hunger Games
#Page 103
suzannecollins_1 = 'AsIstridetowardtheelevatorIflingmybowtoonesideandmyquivertotheother'
suzannecollins_2 = 'IbrushpastthegapingAvoxeswhoguardtheelevatorsandhitthenumbertwelvebuttonwithmyfist'
suzannecollins_3 = 'ThedoorsslidetogetherandIzipupward'
suzannecollins_4 = 'Iactuallymakeitbacktomyfloorbeforethetearsstartrunningdownmycheeks'
suzannecollins_5 = 'IcanheartheotherscallingmefromthesittingroombutIflydownthehallintomyroomboltthedoorandflingmyselfontomybed'
suzannecollins_6 = 'ThenIreallybegintosob'
suzannecollins_7 = 'NowIvedoneit'
suzannecollins_8 = 'NowIveruinedeverything'
suzannecollins_9 = 'IfIdevenstoodaghostofachanceitvanishedwhenIsentthatarrowflyingattheGamemakers'
suzannecollins_10 = 'Whatwilltheydotomenow'
suzannecollins_11 = 'Arrestme'
suzannecollins_12 = 'Executeme'
suzannecollins_13 = 'CutmytongueandturnintoanAvoxsoIcanwaitonthefutretributesofPanem'
suzannecollins_14 = 'WhatwasIthinkingshootingattheGamemakers'
suzannecollins_15 = 'OfcourseIwasntIwasshootingatthatapplebecauseIwassoangryatbeingignored'
suzannecollins_16 = 'Iwasnttryingtokilloneofthem'
suzannecollins_17 = 'IfIweretheydbedead'
suzannecollins_18 = 'Ohwhatdoesitmatter'
suzannecollins_19 = 'ItsnotlikeIwasgoingtowintheGamesanyway'
suzannecollins_20 = 'Whocareswhattheydotome'
suzcwords = [19, 19, 8, 16, 6, 4, 4, 20, 7, 2, 2, 19, 8, 18, 8, 6, 5, 11, 7]
suzcwords_total = (19+19+8+16+6+4+4+20+7+2+2+19+8+18+8+6+5+11+7)
avgWORDS_per_SENTENCE_SUZANNECOLLINS = (suzcwords_total/20)
suzannecollins_1 = len(suzannecollins_1)
suzannecollins_2 = len(suzannecollins_2)
suzannecollins_3 = len(suzannecollins_3)
suzannecollins_4 = len(suzannecollins_4)
suzannecollins_5 = len(suzannecollins_5)
suzannecollins_6 = len(suzannecollins_6)
suzannecollins_7 = len(suzannecollins_7)
suzannecollins_8 = len(suzannecollins_8)
suzannecollins_9 = len(suzannecollins_9)
suzannecollins_10 = len(suzannecollins_10)
suzannecollins_11 = len(suzannecollins_11)
suzannecollins_12 = len(suzannecollins_12)
suzannecollins_13 = len(suzannecollins_13)
suzannecollins_14 = len(suzannecollins_14)
suzannecollins_15 = len(suzannecollins_15)
suzannecollins_16 = len(suzannecollins_16)
suzannecollins_17 = len(suzannecollins_17)
suzannecollins_18 = len(suzannecollins_18)
suzannecollins_19 = len(suzannecollins_19)
suzannecollins_20 = len(suzannecollins_20)
Suzanne_Collins = [suzannecollins_1, suzannecollins_2, suzannecollins_3, suzannecollins_4, suzannecollins_5, suzannecollins_6, suzannecollins_7, suzannecollins_8, suzannecollins_9, suzannecollins_10, suzannecollins_11, suzannecollins_12, suzannecollins_13, suzannecollins_14, suzannecollins_15, suzannecollins_16, suzannecollins_17, suzannecollins_18, suzannecollins_19, suzannecollins_20]
SUZANNECOLLINS_TOTAL = (suzannecollins_1+suzannecollins_2+suzannecollins_3+suzannecollins_4+suzannecollins_5+suzannecollins_6+suzannecollins_7+suzannecollins_8+suzannecollins_9+suzannecollins_10+suzannecollins_11+suzannecollins_12+suzannecollins_13+suzannecollins_14+suzannecollins_15+suzannecollins_16+suzannecollins_17+suzannecollins_18+suzannecollins_19+suzannecollins_20)
avgLETTERS_per_SENTENCE_SUZANNECOLLINS = (SUZANNECOLLINS_TOTAL/20)
avgLETTERS_per_WORD_SUZANNECOLLINS = (SUZANNECOLLINS_TOTAL/suzcwords_total)
#-----------------------------------------------------------------------------------------------------------------------------------------
#Richard Peck - The Last Safe Place on Earth
#The Last Safe Place on Earth
#Page 1-2
richardpeck_1 = 'HalloweensaweekandahalfawayHomecomingtheweekendafter'
richardpeck_2 = 'ItsthattimeofyearandcominghomeImthinkingWhatagreateveningtobegoingsomewherewithagirlmyarmdrapedoverhersoftshoulderthetwoofusscuffingthroughtheleaves'
richardpeck_3 = 'ImseeinggirlseverywhereIlooksomeofthemrealmostnot'
richardpeck_4 = 'Iseegirlsintheshapesthetreetrunksmakeandintheformationsoftheclouds'
richardpeck_5 = 'Iseealotofgirlsthisfall'
richardpeck_6 = 'Imnotobsessed'
richardpeck_7 = 'Imintenthgrade'
richardpeck_8 = 'SoIwascominghomeonfoot'
richardpeck_9 = 'Therewereacoupleofbooksinmybackpack'
richardpeck_10 = 'OnewasRayBradburysFahrenheit451whichweweresupposedtobereadingforMrsLenkysclass'
richardpeck_11 = 'Iplannedtobuckledownonschoolworkandreallyhitthebooksnextyearsenioryearatthelatest'
richardpeck_12 = 'MeanwhileIwastakingeverydayasitcametryingtogetatoeholdonhighschool'
richardpeck_13 = 'ButthefactisIdidntreallythinkhighschoolwashappeninguntilIfoundagirl'
richardpeck_14 = 'ItwasapostcardeveningalongTranquilyLanetheactualnameofourstreet'
richardpeck_15 = 'Thehazewaslikebonfiresmoke,thoughwecantburnleaveswithinthevillagelimits'
richardpeck_16 = 'Itwasared-and-goldworldwithpurpleeveningcomingon'
richardpeck_17 = 'OurhouseisthebigwhitebrickwiththegreenshutterslikeahouseonaChristmascard'
richardpeck_18 = 'Weusedtoliveinthewesternsuburbs'
richardpeck_19 = 'ButwhenDianaandIwereinsixthgradethejuniorhighouttherehadacoupleofknifefightsthatmadethenews'
richardpeck_20 = 'Thegangsweremovinginsowemovedout'
richwords = [11, 36, 12, 17, 8, 3, 4, 7, 9 , 17, 19, 17, 17, 14, 15, 12, 18, 9, 23, 9]
richwords_total = (11+36+12+17+8+3+4+7+9+17+19+17+17+14+15+12+18+9+23+9)
avgWORDS_per_SENTENCE_RICHARDPECK = (richwords_total/20)
richardpeck_1 = len(richardpeck_1)
richardpeck_2 = len(richardpeck_2)
richardpeck_3 = len(richardpeck_3)
richardpeck_4 = len(richardpeck_4)
richardpeck_5 = len(richardpeck_5)
richardpeck_6 = len(richardpeck_6)
richardpeck_7 = len(richardpeck_7)
richardpeck_8 = len(richardpeck_8)
richardpeck_9 = len(richardpeck_9)
richardpeck_10 = len(richardpeck_10)
richardpeck_11 = len(richardpeck_11)
richardpeck_12 = len(richardpeck_12)
richardpeck_13 = len(richardpeck_13)
richardpeck_14 = len(richardpeck_14)
richardpeck_15 = len(richardpeck_15)
richardpeck_16 = len(richardpeck_16)
richardpeck_17 = len(richardpeck_17)
richardpeck_18 = len(richardpeck_18)
richardpeck_19 = len(richardpeck_19)
richardpeck_20 = len(richardpeck_20)
Richard_Peck = [richardpeck_1, richardpeck_2, richardpeck_3, richardpeck_4, richardpeck_5, richardpeck_6, richardpeck_7, richardpeck_8, richardpeck_9, richardpeck_10, richardpeck_11, richardpeck_12, richardpeck_13, richardpeck_14, richardpeck_15, richardpeck_16, richardpeck_17, richardpeck_18, richardpeck_19, richardpeck_20]
RICHARDPECK_TOTAL = (richardpeck_1+richardpeck_2+richardpeck_3+richardpeck_4+richardpeck_5+richardpeck_6+richardpeck_7+richardpeck_8+richardpeck_9+richardpeck_10+richardpeck_11+richardpeck_12+richardpeck_13+richardpeck_14+richardpeck_15+richardpeck_16+richardpeck_17+richardpeck_18+richardpeck_19+richardpeck_20)
avgLETTERS_per_SENTENCE_RICHARDPECK = (RICHARDPECK_TOTAL/20)
avgLETTERS_per_WORD_RICHARDPECK = (RICHARDPECK_TOTAL/richwords_total)
#---------------------------------------------------------------------------------------------------------
#EXAMPLE SLOT
example1 = 'Wepulledthefilmfortheten-thirtynewstohearhowtheWarriorshaddoneagainsttheLakeVillaVikinsontheVikingshomefield'
example2 = 'WedlostbutitwascloseandC.E.andIwentbacktotheDracula'
example3 = 'Itwasgettinglatewhenthephonerang'
example4 = 'DeepinhispopcornworldDaddidntanswerit'
example5 = 'Ipickedupinthedenanditwasawoman'
example6 = 'IwavedatC.E.toturndownthesoundsbecausethewomanwascrying'
example7 = 'Whoisthis'
example8 = 'ItwasMrsCunningham'
example9 = 'Icantfindmydaughtershesaid'
example10 = 'IcantfindPace'
example11 = 'SheshereIsaid'
example12 = 'Shesupstairswithmysister'
example13 = 'AmomentofsilencethenandMrsCunninghamsvoiceshuddered'
example14 = 'IssheYoutellhertostayrightthereImcomingover'
example15 = 'SoweneverdidseehowtheDraculafilmended'
example16 = 'HeyPaceIsaidupthestairs'
example17 = 'Yourmomscomingover'
example18 = 'ThisbroughteverybodytothefronthallPacefirst'
example19 = 'DianawasbehindherandMominherrobeandMarnieinherpajamas'
example20 = 'BeforeDrandMrsCunninghamgothereDadwasinthefronthalltooinhisapron'
examplewords = [25, 15, 8, 9, 11, 14, 3, 4, 6, 4, 4, 5, 10, 7, 4, 9, 14, 17]
examplewords_total = sum(examplewords)
avgWORDS_per_SENTENCE_EXAMPLE = (examplewords_total/20)
example1 = len(example1)
example2 = len(example2)
example3 = len(example3)
example4 = len(example4)
example5 = len(example5)
example6 = len(example6)
example7 = len(example7)
example8 = len(example8)
example9 = len(example9)
example10 = len(example10)
example11 = len(example11)
example12 = len(example12)
example13 = len(example13)
example14 = len(example14)
example15 = len(example15)
example16 = len(example16)
example17 = len(example17)
example18 = len(example18)
example19 = len(example19)
example20 = len(example20)
example = [example1, example2, example3, example4, example5, example6, example7, example8, example9, example10, example11, example12, example13, example14, example15, example16, example17, example18, example19, example20]
EXAMPLE_TOTAL = (example1+example2+example3+example4+example5+example6+example7+example8+example9+example10+example11+example12+example13+example14+example15+example16+example17+example18+example19+example20)
avgLETTERS_per_SENTENCE_EXAMPLE = (EXAMPLE_TOTAL/20)
avgLETTERS_per_WORD_EXAMPLE = (EXAMPLE_TOTAL/examplewords_total)
#------------------------------------------------------------------------------------------------------------------------------
#Tests for similarities and prints (displays) the author whom the program believes to have written the example text
#I used a scoreboard system of sorts to determine which author was most similar to the example. Each time the program finds a match to one in each of the tests, it adds a point to that author here.
DJMachalePossibility = 0
SuzanneCollinsPossibility = 0
RichardPeckPossibility = 0
#Matches average letters/sentence in example with most likely author
#I attempted to find the closest value by subtracting the example's value from each of the authors. The author with the smallest distance from the example would be marked up one point.
avgLPS_DJ_EXAMPLE = (avgLETTERS_per_SENTENCE_DJMACHALE-avgLETTERS_per_SENTENCE_EXAMPLE)
avgLPS_SUZC_EXAMPLE = (avgLETTERS_per_SENTENCE_SUZANNECOLLINS-avgLETTERS_per_SENTENCE_EXAMPLE)
avgLPS_RICH_EXAMPLE = (avgLETTERS_per_SENTENCE_RICHARDPECK-avgLETTERS_per_SENTENCE_EXAMPLE)
LPS_Comparisons = [avgLPS_DJ_EXAMPLE, avgLPS_SUZC_EXAMPLE, avgLPS_RICH_EXAMPLE]
avgLPS_Match = min(LPS_Comparisons)
if avgLPS_Match == avgLPS_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)
if avgLPS_Match == avgLPS_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
if avgLPS_Match == avgLPS_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)
#Matches average words/sentence in example with most likely author
avgWPS_DJ_EXAMPLE = (avgWORDS_per_SENTENCE_DJMACHALE-avgWORDS_per_SENTENCE_EXAMPLE)
avgWPS_SUZC_EXAMPLE = (avgWORDS_per_SENTENCE_SUZANNECOLLINS-avgWORDS_per_SENTENCE_EXAMPLE)
avgWPS_RICH_EXAMPLE = (avgWORDS_per_SENTENCE_RICHARDPECK-avgWORDS_per_SENTENCE_EXAMPLE)
WPS_Comparisons = [avgWPS_DJ_EXAMPLE, avgWPS_SUZC_EXAMPLE, avgWPS_RICH_EXAMPLE]
avgWPS_Match = min(WPS_Comparisons)
if avgWPS_Match == avgWPS_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)
if avgWPS_Match == avgWPS_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
if avgWPS_Match == avgWPS_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)
#Matches average letters/word in example with most likely author
avgLPW_DJ_EXAMPLE = (avgLETTERS_per_WORD_DJMACHALE-avgLETTERS_per_WORD_EXAMPLE)
avgLPW_SUZC_EXAMPLE = (avgLETTERS_per_WORD_SUZANNECOLLINS-avgLETTERS_per_WORD_EXAMPLE)
avgLPW_RICH_EXAMPLE = (avgLETTERS_per_WORD_RICHARDPECK-avgLETTERS_per_WORD_EXAMPLE)
LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
avgLPW_Match = min(LPW_Comparisons)
if avgLPW_Match == avgLPW_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)
if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
if avgLPW_Match == avgLPW_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)
AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]
#The author with the most points on them would be considered the program's guess.
Match = max(AUTHOR_SCOREBOARD)
print AUTHOR_SCOREBOARD
if Match == DJMachalePossibility:
    print "The author should be D.J. Machale."
if Match == SuzanneCollinsPossibility:
    print "The author should be Suzanne Collins."
if Match == RichardPeckPossibility:
    print "The author should be Richard Peck."
------------------------------------------------------------------------------
Hopefully, there won't be any copyright issues. Like someone said, this should be fair use. The problem I'm having is that it always gives Suzanne Collins, no matter what example is put in. I'm really sorry that the code isn't very clean. Like I said, it was rushed and I have little experience. I'm just desperate for help as it's a bit too late to change projects, so I have to stick with this. Also, if it's of any importance, I have to be able to remove or add any of the "average letters per word/average letters per sentence/average words per sentence things" to test the program at different levels of strictness. I would GREATLY appreciate any help with this. Thank you!
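
One thing worth checking in the comparison step: min() over signed differences (author average minus example average) picks the most negative value, not the value closest to zero, so the author with the smallest average tends to win a test regardless of the example. The sketch below, written for Python 3, uses made-up averages (not computed from the samples above) and hypothetical helper names to contrast the two approaches. Note also that the LPW_Comparisons list above contains avgLPW_SUZC_EXAMPLE twice and never includes avgLPW_RICH_EXAMPLE.

```python
# Hypothetical averages, for illustration only.
author_averages = {
    "D.J. Machale": 65.0,
    "Suzanne Collins": 40.0,
    "Richard Peck": 55.0,
}
example_average = 58.0


def closest_signed(averages, target):
    """Mimics min() over signed differences (author - example)."""
    return min(averages, key=lambda name: averages[name] - target)


def closest_absolute(averages, target):
    """Picks the author whose average is nearest to the target."""
    return min(averages, key=lambda name: abs(averages[name] - target))


print(closest_signed(author_averages, example_average))    # Suzanne Collins
print(closest_absolute(author_averages, example_average))  # Richard Peck
```

With the signed version, changing example_average never changes the winner, because subtracting the same number from every author's average leaves their order unchanged.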
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-01-24 19:06 -0800 |
| Message-ID | <d87429ec-7143-40a4-8079-456a2e33a040@googlegroups.com> |
| In reply to | #64717 |
On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:

<snipped>

Ewwww!

If you (or anyone with basic python experience) rewrites that code, it will become 1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, ini, csv etc

If you need further help in understanding/choosing, post back
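
To make the data-file suggestion concrete, here is a minimal Python 3 sketch in which the sample sentences live in JSON and the statistics are computed in a loop. The JSON layout, the placeholder author names and sentences, and the helper sentence_stats() are all assumptions for this example; a string stands in for the contents of a hypothetical samples.json file so the snippet is self-contained.

```python
import json

# Stand-in for the contents of a hypothetical samples.json file.
SAMPLES_JSON = """
{
  "Author A": ["First sample sentence.", "Second one."],
  "Author B": ["A rather longer sample sentence for the other author."]
}
"""


def sentence_stats(sentences):
    """Return (average words per sentence, average letters per word)."""
    word_lists = [s.split() for s in sentences]
    total_words = sum(len(words) for words in word_lists)
    # Count alphabetic characters only, so punctuation is not counted.
    total_letters = sum(
        sum(ch.isalpha() for ch in word) for words in word_lists for word in words
    )
    return (total_words / len(sentences), total_letters / total_words)


samples = json.loads(SAMPLES_JSON)
stats = {author: sentence_stats(sentences) for author, sentences in samples.items()}
for author, (words_per_sentence, letters_per_word) in stats.items():
    print(author, round(words_per_sentence, 2), round(letters_per_word, 2))
```

Because the sentences keep their spaces, the word counts are derived from the text itself instead of being typed in by hand, and adding an author only means adding an entry to the data.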
| From | theguy <kvxdelta@gmail.com> |
|---|---|
| Date | 2014-01-24 20:58 -0800 |
| Message-ID | <1eeb0e4b-ff9a-4b5e-86ec-773ca98fbf1b@googlegroups.com> |
| In reply to | #64718 |
On Friday, January 24, 2014 7:06:55 PM UTC-8, Rustom Mody wrote:
> On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> > Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
>
> <snipped>
>
> Ewwww!
>
> If you (or anyone with basic python experience) rewrites that code, it will become
> 1/50th the size and all that you call 'code' will reside in data files.
>
> That can mean one of json, xml, yml, ini, pickle, ini, csv etc
>
> If you need further help in understanding/choosing, post back

I know. I'm kind of ashamed of the code, but it does the job I need it to up to a certain point, where it for some reason continually gives me Suzanne Collins as the author. It always gives three points to her name in the AUTHOR_SCOREBOARD list. The code, though, is REALLY bad. I'm trying to simply get it to do the things needed for the program. If I could get it to actually calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems would be solved. Luckily, I'm not being graded on the elegance or conciseness of my code. Thank you for the constructive criticism, though I am really seeking help with my little problem involving that dang scoreboard. Thank you.
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2014-01-25 20:30 +1300 |
| Message-ID | <bkh7hiFkehfU1@mid.individual.net> |
| In reply to | #64720 |
theguy wrote:
> If I could get it to actually calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems would be solved.

Have you tried getting it to print out the values it's getting for the scores, and comparing them with what you calculate by hand?

--
Greg
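A minimal sketch of that kind of check might look like the following. It assumes the program has one average per author plus one for the unknown sample; the dictionary name, the helper function, and all the numbers are invented for illustration.

```python
# Invented numbers standing in for the averages the program computes.
exemplar_avg_lpw = {
    "D.J. Machale": 4.1,
    "Suzanne Collins": 4.3,
    "Richard Peck": 3.9,
}
sample_avg_lpw = 4.25

def show_differences(exemplars, sample):
    """Print every intermediate value so it can be checked by hand."""
    diffs = {}
    for author in sorted(exemplars):
        diffs[author] = abs(exemplars[author] - sample)
        print("%-16s exemplar=%.3f sample=%.3f diff=%.3f"
              % (author, exemplars[author], sample, diffs[author]))
    return diffs

diffs = show_differences(exemplar_avg_lpw, sample_avg_lpw)
```

If the printed differences disagree with a hand calculation, the fault is in the statistics; if they agree, it is in the scoring that follows.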
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-01-25 11:31 +0000 |
| Message-ID | <lc079s$5fh$1@dont-email.me> |
| In reply to | #64720 |
On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:
> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point
OK, well first of all take a step back and look at the problem.
You have n exemplars, each from a known author.
You analyse each exemplar, and determine some statistics for it.
You then take your unknown sample, determine the same statistics for the
unknown sample.
Finally, you compare each exemplar's stats with the sample's stats to try
and find a best match.
So, perhaps you want a dictionary of { author: statistics }, and a
function to analyse a piece of text, which might call other functions to
get e.g. avg words / sentence, avg letters / sentence, avg word length, and
the standard deviation of each, and the short word ratio (words <= 3 chars
vs words >= 4 chars) and some other statistics.
Given the statistics for each exemplar, you might store these in your
dictionary as a tuple.
this isn't python, it's a description of an algorithm, it just looks a
bit pythonic:
# tuple of weightings applied to different stats
stat_weightings = ( 1.0, 1.3, 0.85, ...... )

def get_some_stat( t ):
    # calculate some numerical statistic on a block of text
    # return it

def analyse( f ):
    text = read_file( f )
    return ( get_some_stat( text ), ...... )

exemplar_data = {}
for exemplar_file in exemplar_files:
    exemplar_data[author] = analyse( exemplar_file )

sample_data = analyse( sample_file )

scores = {}
x = 0

# score for a piece of work is sum of ( abs diff of stat * weighting )
# for all the stats, lower score = closer match
for author in exemplar_data:
    tmp = 0    # start each author's score from zero
    for i in range( len( exemplar_data[ author ] ) ):
        # abs() keeps each difference non-negative
        tmp = tmp + abs( exemplar_data[ author ][ i ] -
                         sample_data[ i ] ) * stat_weightings[ i ]
    scores[ author ] = tmp
    if tmp > x:
        x = tmp

# x now holds the largest score; search downwards for the smallest
names = []
for author in scores:
    if scores[ author ] < x:
        x = scores[ author ]
        names = [ author ]
    elif scores[ author ] == x:
        names.append( author )

print "the best matching author(s) is/are: ", names
Then all you have to do is find enough ways to calculate stats, and the
magic coefficients to use in the stat_weightings
--
Denis McMahon, denismfmcmahon@gmail.com
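A runnable sketch of the outline above, under loudly stated assumptions: the two statistics, the sentence-splitting regex, the weightings, and the toy texts are all placeholders chosen for illustration, and it is written for Python 3 rather than the Python 2 used elsewhere in the thread. It is one possible reading of the algorithm, not the poster's own implementation.

```python
import re

def analyse(text):
    """Return (avg words per sentence, avg letters per word) for a text."""
    # Crude sentence split on ., ! or ?; real prose needs something better.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_words_per_sentence = float(len(words)) / len(sentences)
    avg_letters_per_word = float(sum(len(w) for w in words)) / len(words)
    return (avg_words_per_sentence, avg_letters_per_word)

def score(exemplar_stats, sample_stats, weightings):
    """Weighted sum of absolute differences; lower means a closer match."""
    return sum(abs(e - s) * w
               for e, s, w in zip(exemplar_stats, sample_stats, weightings))

def best_matches(exemplar_texts, sample_text, weightings):
    """Return (sorted authors sharing the lowest score, all scores)."""
    sample_stats = analyse(sample_text)
    scores = {}
    for author, text in exemplar_texts.items():
        scores[author] = score(analyse(text), sample_stats, weightings)
    best = min(scores.values())
    return sorted(a for a in scores if scores[a] == best), scores

# Toy exemplars, invented for this sketch.
exemplar_texts = {
    "Author A": "Short words. Tiny bits. Go now.",
    "Author B": "Considerably lengthier sentences containing polysyllabic "
                "vocabulary dominate everywhere.",
}
weightings = (1.0, 1.3)
names, scores = best_matches(exemplar_texts,
                             "Run fast. Stop here. Sit down.", weightings)
print("the best matching author(s) is/are:", names)
```

Using absolute differences is a substitution made here so that a negative difference cannot cancel a positive one; squared differences would serve the same purpose.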
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2014-01-25 09:42 -0500 |
| Message-ID | <mailman.5977.1390660939.18130.python-list@python.org> |
| In reply to | #64718 |
On Fri, 24 Jan 2014 19:06:55 -0800 (PST), Rustom Mody
<rustompmody@gmail.com> declaimed the following:
>On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
>> Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
>
><snipped>
>
>Ewwww!
I think my reaction was more guttural -- <barf!>
>
>If you (or anyone with basic python experience) rewrites that code, it will become
>1/50th the size and all that you call 'code' will reside in data files.
>
>That can mean one of json, xml, yml, ini, pickle, ini, csv etc
>
>If you need further help in understanding/choosing, post back
Heck, at the very least turn all those xxxx_99 variables into single
lists.... The posted code looks like something from 1968 K&K BASIC.
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-01-25 08:15 -0800 |
| Message-ID | <2ebec4b9-66dd-4ddc-94a5-1b431e7b0edf@googlegroups.com> |
| In reply to | #64745 |
On Saturday, January 25, 2014 8:12:20 PM UTC+5:30, Dennis Lee Bieber wrote:
> Heck, at the very least turn all those xxxx_99 variables into single lists.... The posted code looks like something from 1968 K&K BASIC.

Yes, that's correct. My suggestion of data files is a second step. A first step is just converting to using internal (python) data structures. [And not 1968 BASIC scalars!]
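A minimal sketch of that first step, assuming the per-author variables in the posted code (such as `avgLPW_DJ_EXAMPLE`) each hold that author's difference from the sample. That interpretation, the dictionary names, and the values are assumptions made for illustration.

```python
# Invented values standing in for avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE
# and avgLPW_RICH_EXAMPLE from the posted code.
avg_lpw_diff = {
    "D.J. Machale": 0.42,
    "Suzanne Collins": 0.17,
    "Richard Peck": 0.31,
}

# One scoreboard instead of three separate ...Possibility counters.
scoreboard = dict.fromkeys(avg_lpw_diff, 0)

# The closest author is the key with the smallest difference.
closest = min(avg_lpw_diff, key=avg_lpw_diff.get)
scoreboard[closest] += 1

print(scoreboard)
```

Keying by author also removes the need to list each variable by hand, which is where the quoted `LPW_Comparisons` list names `avgLPW_SUZC_EXAMPLE` twice.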
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2014-01-25 01:38 -0500 |
| Message-ID | <mailman.5966.1390631789.18130.python-list@python.org> |
| In reply to | #64717 |
kvxdelta@gmail.com Wrote in message:
> Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
.........
>
>
> LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
> avgLPW_Match = min(LPW_Comparisons)
>
> if avgLPW_Match == avgLPW_DJ_EXAMPLE:
>     DJMachalePossibility = (DJMachalePossibility+1)
>
> if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
>     SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
>
> if avgLPW_Match == avgLPW_RICH_EXAMPLE:
>     RichardPeckPossibility = (RichardPeckPossibility+1)
>
> AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]
>
> #The author with the most points on them would be considered the program's guess.
> Match = max(AUTHOR_SCOREBOARD)
>
> print AUTHOR_SCOREBOARD
>
> if Match == DJMachalePossibility:
>     print "The author should be D.J. Machale."
>
> if Match == SuzanneCollinsPossibility:
>     print "The author should be Suzanne Collins."
>
> if Match == RichardPeckPossibility:
>     print "The author should be Richard Peck."
>
>
> ------------------------------------------------------------------------------
> Hopefully, there won't be any copyright issues. Like someone said, this should be fair use. The problem I'm having is that it always gives Suzanne Collins, no matter what example is put in. I'm really sorry that the code isn't very clean. Like I said, it was rushed and I have little experience. I'm just desperate for help as it's a bit too late to change projects, so I have to stick with this. Also, if it's of any importance, I have to be able to remove or add any of the "average letters per word/average letters per sentence/average words per sentence things" to test the program at different levels of strictness. I would GREATLY appreciate any help with this. Thank you!
>
1. When you calculate averages, you should be using floating
   point divide.
       avg = float(a) / b
2. When you subtract two values, you need an abs, because
   otherwise min() will home in on the negative values.
3. Realize that having Match agree with more than one is not
   that unlikely.
4. If you want to vary what you call strictness, you're really
   going to need to learn about functions.
--
DaveA
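The four points above can be sketched together. This is an illustration only: the function names, the `enabled` parameter, the stat names, and all the numbers are hypothetical, and the posted program is structured differently.

```python
def average(total, count):
    """Point 1: force floating point division (matters on Python 2)."""
    return float(total) / count

def closest_authors(exemplars, sample_value):
    """Points 2 and 3: compare absolute differences and return every tie."""
    diffs = dict((a, abs(v - sample_value)) for a, v in exemplars.items())
    best = min(diffs.values())
    return sorted(a for a in diffs if diffs[a] == best)

def tally(stats_by_name, sample, enabled):
    """Point 4: a function whose `enabled` list selects which stats count."""
    scoreboard = {}
    for stat in enabled:
        for author in closest_authors(stats_by_name[stat], sample[stat]):
            scoreboard[author] = scoreboard.get(author, 0) + 1
    return scoreboard

# Invented figures for illustration only.
stats_by_name = {
    "letters_per_word": {
        "D.J. Machale": 4.0, "Suzanne Collins": 4.5, "Richard Peck": 5.0},
    "words_per_sentence": {
        "D.J. Machale": 12.0, "Suzanne Collins": 15.0, "Richard Peck": 9.0},
}
sample = {
    "letters_per_word": average(17, 4),
    "words_per_sentence": average(27, 2),
}

loose = tally(stats_by_name, sample, ["letters_per_word"])
strict = tally(stats_by_name, sample,
               ["letters_per_word", "words_per_sentence"])
print(loose)
print(strict)
```

The figures were chosen so that two authors tie on each statistic, which shows why a chain of separate `if Match == ...` tests can print more than one name.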
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2014-01-25 20:25 +1300 |
| Message-ID | <bkh772Fkc9iU1@mid.individual.net> |
| In reply to | #64673 |
theguy wrote:
> I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding.

It's obvious what's happening here: all the other authors have heavily borrowed from Suzanne Collins. You've created a plagiarism detector! :-)

--
Greg
| From | alex23 <wuwei23@gmail.com> |
|---|---|
| Date | 2014-01-28 17:31 +1000 |
| Message-ID | <lc7md4$bq0$1@dont-email.me> |
| In reply to | #64673 |
On 24/01/2014 8:05 PM, theguy wrote:
> I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given.

This sounds like exactly the sort of thing NLTK was made for. Here's an example of using it for this requirement:

http://www.aicbt.com/authorship-attribution/